Current state of the project — Overview of the working papers — Downloads

The research described on this web page has resulted in a book,
*Algorithmic learning in a random world*
by Vovk, Gammerman and Shafer,
published by Springer, New York, in 2005.
The book's web site
has now become the main site for this project;
it also contains information about newer developments (since 2005).

The basic applied problem that we consider is on-line prediction:
at each trial *n* = 1, 2, ...,
given the first *n*-1 observations,
we would like to say something about the *n*th observation.
(Often the observations will consist of two parts, the object and the label,
and we might be asked to give a prediction for the *n*th label
given the *n*th object,
but this case can be treated in a similar way.)
The observations are usually assumed to be generated independently
from the same probability distribution (the *iid model*);
only in Working Paper 8 is this assumption relaxed.
One setting that we consider is that of *region prediction*:
the prediction algorithm is required to output
a set of possible values for the *n*th observation.
We show that there exists a simple class of prediction algorithms
(called *Transductive Confidence Machines*, abbreviated TCM,
on this web site,
and called *conformal predictors*
on the new web site)
which are automatically *calibrated*:
when given a "significance level" δ,
they make errors with frequency at most δ
(and with frequency exactly δ if randomisation is allowed).
Moreover, the algorithm is calibrated in a very strong non-asymptotic sense:
in the randomised case, the probability of error is δ at each trial
and errors are made independently at different trials.
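As an illustration of region prediction, here is a minimal conformal predictor for classification, sketched in Python. This is our own toy construction, not code from the working papers: the nearest-neighbour nonconformity score, the function names and the two-cluster data are all assumptions made for the example.

```python
import numpy as np

def nonconformity(xs, ys, i):
    # Nearest-neighbour nonconformity score of example i: distance to the
    # nearest example with the same label divided by the distance to the
    # nearest example with a different label (larger = stranger).
    d = np.linalg.norm(xs - xs[i], axis=1)
    d[i] = np.inf                       # exclude the example itself
    return d[ys == ys[i]].min() / d[ys != ys[i]].min()

def prediction_region(train_x, train_y, new_x, labels, delta):
    # Try each possible label for the new object; keep every label whose
    # p-value exceeds the significance level delta.
    region = set()
    for y in labels:
        xs = np.vstack([train_x, new_x])
        ys = np.append(train_y, y)
        scores = np.array([nonconformity(xs, ys, i) for i in range(len(ys))])
        # p-value: fraction of examples at least as nonconforming as the new one
        if np.mean(scores >= scores[-1]) > delta:
            region.add(y)
    return region

# Two well-separated clusters of ten points each
train_x = np.array([[0.1 * i, 0.0] for i in range(10)] +
                   [[5.0 + 0.1 * i, 5.0] for i in range(10)])
train_y = np.array([0] * 10 + [1] * 10)
region = prediction_region(train_x, train_y, np.array([0.05, 0.0]),
                           labels=[0, 1], delta=0.05)
print(region)  # {0}: label 1 is rejected at the 5% level
```

With 21 examples in the bag, the wrong label's p-value here is 1/21 < 0.05, so it is excluded; note that with fewer than 20 examples no label could ever be rejected at the 5% level, and the region would be uncertain.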

Working Paper 1 proves the calibration result for the iid model
in the "structured" situation,
where each observation consists of an object and a label.
In principle, it is easy to be calibrated:
since an error is made, by definition,
when the prediction region does not contain the true label,
errors can be avoided if the regions are made large enough.
Another goal in region prediction is to make as few *uncertain* predictions
as possible,
where a prediction region is called "uncertain" if it contains more than one possible label
(assuming the set of possible labels is finite: the case of *classification*).
The justification for the use of TCM given in Working Paper 1
was that it produces reasonable results on standard benchmark data sets,
whereas the standard methods
(combining algorithms for point prediction
with PAC bounds on the probability of error)
are impractical even on such a clean data set
as the USPS data set of hand-written digits.

Working Paper 3 provides theoretical evidence for the efficiency of TCM (again in the case of the iid model): there is a simple TCM based on the Nearest Neighbours algorithm whose long-run frequency of uncertain predictions does not exceed the frequency of uncertain predictions of any other calibrated algorithm (even an algorithm that is optimised for the data-generating distribution).
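The on-line protocol and the calibration claim can be checked empirically with a short simulation. Again, this is our own self-contained construction with an assumed nearest-neighbour score: examples arrive one by one, a region is predicted for each new label from the examples seen so far, the true label is then disclosed, and errors and uncertain predictions are counted.

```python
import numpy as np

def nn_score(xs, ys, i):
    # Nonconformity: nearest same-label distance / nearest other-label distance
    d = np.linalg.norm(xs - xs[i], axis=1)
    d[i] = np.inf
    return d[ys == ys[i]].min() / d[ys != ys[i]].min()

def conformal_region(xs, ys, x, labels, delta):
    out = set()
    for y in labels:
        exs, lbs = np.vstack([xs, x]), np.append(ys, y)
        scores = np.array([nn_score(exs, lbs, i) for i in range(len(lbs))])
        if np.mean(scores >= scores[-1]) > delta:   # p-value > delta
            out.add(y)
    return out

rng = np.random.default_rng(1)
delta, trials = 0.1, 200
# iid examples: two Gaussian clusters, one per label; a warm-up of ten
# examples in which both labels are guaranteed to be present
labels = np.concatenate([[0, 1] * 5, rng.integers(0, 2, size=trials)])
points = rng.normal(0.0, 0.5, size=(trials + 10, 2)) + 4.0 * labels[:, None]
errors = uncertain = 0
for n in range(10, trials + 10):
    gamma = conformal_region(points[:n], labels[:n], points[n], [0, 1], delta)
    errors += labels[n] not in gamma     # true label outside the region
    uncertain += len(gamma) > 1          # region keeps more than one label
print(errors / trials, uncertain / trials)
```

By the calibration property, the long-run error frequency stays at δ (or below it, without randomisation) whatever the data-generating distribution; how rarely the region is uncertain, by contrast, depends on how good the nonconformity score is for the data at hand.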

The basic on-line scenario, when the true label is disclosed immediately after the prediction is made, is not realistic: typically there is a delay before the true label becomes known, and some (or most) labels are never disclosed. Working Papers 6 and 7 show that calibration continues to hold even if only a small fraction of labels is eventually disclosed (perhaps with a delay).

Working Paper 8 extends conformal prediction
to a wide class of statistical models
called on-line compression models
(and closely connected with Martin-Löf's repetitive structures
and Friedman's summarizing statistics).
The standard approach to modelling uncertainty
is to choose a family of probability distributions
(*statistical model*) one of which is believed to be the true distribution generating,
or explaining in a satisfactory way, the data.
(In some applications of probability theory,
the true distribution is assumed to be known,
and so the statistical model is a one-element set.
In Bayesian statistics, the statistical model is complemented by another element,
a prior distribution on the distributions in the model.)
All modern applications of probability depend on this scheme.

In 1963–1970 Andrei Kolmogorov suggested a different approach to modelling uncertainty based on information theory; its purpose was to provide a more direct link between the theory and applications of probability. On-line compression models are a natural adaptation of Kolmogorov's programme to the technique of conformal prediction. Working Paper 8 introduces three on-line compression models: exchangeability (equivalent to the iid model), Gaussian and Markov.

The content of all other Working Papers is briefly described in the next section.

You can download from here the following Working Papers:

- Working Paper 1. Choose pdf (252 KB) or zipped postscript (176 KB). This paper introduces on-line TCM and its computationally efficient modification, the Inductive Confidence Machine (ICM), and shows that they are automatically calibrated. The extended abstract appeared in FOCS'02 Proceedings.
- Working Paper 2. Choose pdf (282 KB) or zipped postscript (176 KB). This paper shows that for any power distribution generating the examples there is an asymptotically optimal TCM. Its conference version appeared in ALT'02 Proceedings.
- Working Paper 3. Choose pdf (372 KB) or zipped postscript (209 KB). There is a TCM that makes asymptotically as few uncertain predictions as any other calibrated prediction algorithm, even if the probability distribution generating the examples is not known.
- Working Paper 4. Choose pdf (275 KB) or zipped postscript (156 KB). A simple modification of TCM with better conditional properties is constructed.
- Working Paper 5. Choose pdf (430 KB) or zipped postscript (226 KB). TCM is used for testing the iid model.
- Working Paper 6. Choose pdf (154 KB) or zipped postscript (118 KB). TCM remains calibrated in probability if most labels are never disclosed or are disclosed with a delay.
- Working Paper 7. Choose pdf (144 KB) or zipped postscript (107 KB). TCM remains calibrated almost surely if most labels are never disclosed or are disclosed with a delay.
- Working Paper 8. Choose pdf (247 KB) or zipped postscript (223 KB). TCM is generalised to work in an arbitrary on-line compression model.
- Working Paper 9. Choose pdf (241 KB) or zipped postscript (136 KB). This paper is concerned with probability forecasting (instead of "p-value forecasting", as done by TCM). It constructs the "Venn probability machine" (VPM) for the iid model and shows that: (a) every VPM is well-calibrated, in the strongest possible non-asymptotic sense; (b) there exists a VPM that is asymptotically optimal under any (unknown) data-generating iid distribution.

- Talk at the NeuroCOLT Workshop Generalisation Bounds Less than 0.5 (Windsor, UK, 1 May 2002); click here (153 KB).
- Talk at the EURANDOM Workshop Statistical Learning in Classification and Model Selection (Eindhoven, The Netherlands, 15 January 2003); click here (175 KB).
- Talk at the Dagstuhl Centennial Seminar on Kolmogorov Complexity and Applications (Schloss Dagstuhl, Germany, 29 April 2003); click here (119 KB).

Industrial collaboration: Universal Prediction.