Conformal Prediction


Current state of the projectOverview of the working papersDownloads

Current state of the project

The research described on this web page has resulted in a book, Algorithmic learning in a random world by Vovk, Gammerman and Shafer, published by Springer, New York, in 2005. The book's web site has now become the main site for this project; it also contains information about newer developments (since 2005).

Overview of the Working Papers

The basic applied problem that we consider is on-line prediction: at each trial n=1,2,..., given the first n-1 observations we would like to say something about the nth observation. (Often the observations will consist of two parts, the object and the label, and we might be asked to give a prediction for the nth label given the nth object, but this case can be treated in a similar way.) The observations are usually assumed to be generated independently from the same probability distribution (the iid model); only in Working Paper 8 is this assumption relaxed. One setting that we consider is that of region prediction: the prediction algorithm is required to output a set of possible values for the nth observation. We show that there exists a simple class of prediction algorithms (called Transductive Confidence Machines, abbreviated TCM, on this web site, and called conformal predictors on the new web site) which are automatically calibrated: when given a "significance level" δ, they make errors with frequency at most δ (precisely δ if randomisation is allowed). Moreover, the algorithm is calibrated in a very strong non-asymptotic sense: in the randomised case, the probability of error is δ at each trial and errors are made independently at different trials.

Working paper 1 proves the calibration result for the iid model and in the "structured" situation, where each observation consists of an object and a label. In principle, it is easy to be calibrated: since an error is made, by definition, when the prediction region does not contain the true label, errors can be avoided if the regions are made large enough. Another goal in region prediction is to make as few uncertain predictions as possible, where a prediction region is called "uncertain" if it contains more than one possible label (assuming the set of possible labels is finite: the case of classification). The justification for the use of TCM given in Working Paper 1 was that it produces reasonable results on standard benchmark data sets, whereas the standard methods (combining algorithms for point prediction with PAC bounds on the probability of error) are impractical even on such a clean data set as the USPS data set of hand-written digits.

Working Paper 3 provides theoretical evidence for the efficiency of TCM (again in the case of the iid model): there is a simple TCM based on the Nearest Neighbours algorithm whose long-run frequency of uncertain predictions does not exceed the frequency of uncertain predictions of any other calibrated algorithm (even an algorithm that is optimised for the data-generating distribution).

The basic on-line scenario, when the true label is disclosed immediately after the prediction is made, is not realistic: typically there is a delay before the true label becomes known, and some (or most) labels are never disclosed. Working Papers 6 and 7 show that calibration continues to hold even if only a small fraction of labels is eventually disclosed (perhaps with a delay).

Working Paper 8 extends conformal prediction to a wide class of statistical models called on-line compression models (and closely connected with Martin-Löf's repetitive structures and Friedman's summarizing statistics). The standard approach to modelling uncertainty is to choose a family of probability distributions (statistical model) one of which is believed to be the true distribution generating, or explaining in a satisfactory way, the data. (In some applications of probability theory, the true distribution is assumed to be known, and so the statistical model is a one-element set. In Bayesian statistics, the statistical model is complemented by another element, a prior distribution on the distributions in the model.) All modern applications of probability depend on this scheme.

In 1963–1970 Andrei Kolmogorov suggested a different approach to modelling uncertainty based on information theory; its purpose was to provide a more direct link between the theory and applications of probability. On-line compression models are a natural adaptation of Kolmogorov's programme to the technique of conformal prediction. Working Paper 8 introduces three on-line compression models: exchangeability (equivalent to the iid model), Gaussian and Markov.

The content of all other Working Papers is briefly described in the next section.

Downloads

You can download from here the following Working Papers:

  1. Choose pdf (252 KB) or zipped postscript (176 KB). This paper introduces on-line TCM and its computationally efficient modification, Inductive Confidence Machine (ICM), and shows that they are automatically calibrated. The extended abstract appeared in FOCS'02 Proceedings.
  2. Choose pdf (282 KB) or zipped postscript (176 KB). This paper shows that for any power distribution generating the examples there is an asymptotically optimal TCM. Its conference version appeared in ALT'02 Proceedings.
  3. Choose pdf (372 KB) or zipped postscript (209 KB). There is a TCM that makes asymptotically as few uncertain predictions as any other calibrated prediction algorithm, even if the probability distribution generating the examples in not known.
  4. Choose pdf (275 KB) or zipped postscript (156 KB). A simple modification of TCM with better conditional properties is constructed.
  5. Choose pdf (430 KB) or zipped postscript (226 KB). TCM is used for testing the iid model.
  6. Choose pdf (154 KB) or zipped postscript (118 KB). TCM remains calibrated in probability if most labels are never disclosed or disclosed with a delay.
  7. Choose pdf (144 KB) or zipped postscript (107 KB). TCM remains calibrated almost surely if most labels are never disclosed or disclosed with a delay.
  8. Choose pdf (247 KB) or zipped postscript (223 KB). TCM is generalised to work in an arbitrary on-line compression model.
  9. Choose pdf (241 KB) or zipped postscript (136 KB). This paper is concerned with probability forecasting (instead of "p-values forecasting", as done by TCM). It constructs the "Venn probability machine" (VPM) for the iid model and shows that: (a) every VPM is well-calibrated, in the strongest possible non-asymptotic sense; (b) there exists a VPM that is asymptotically optimal under any (unknown) data-generating iid distribution.
These are slides of some of my talks:

Industrial collaboration: Universal Prediction.

This page is maintained by Vladimir Vovk.   Last modified on 30 December 2009