This paper shows how the Oracle™ Cost Based Optimizer uses pervasively
a fundamental statistical formula for cardinality estimation of distinct, group-by and join operations.
The formula gives the expected number of distinct values when selecting from a bag without replacement;
the paper illustrates the use of this formula when the selection is made from tables instead of bags
as a consequence of using filtering predicates in SQL statements.
Both single-column and multiple-columns scenarios are illustrated.
Particular emphasys is placed on the multi-column join operation, that implements additional statistical
concepts to estimate the cardinality. The traditional algorithm is investigated, and also the new algorithm
introduced in 10g (and backported to the latest patchsets of 9i) commonly known as "10g multicolumn join selectivity sanity check".
The supporting material is here.