Speculation about the CBO reasoning

Why the CBO uses a such complicated, and seemingly counterintuitive, formula, especially when the range is completely inside one of the two lateral bands ?

In my opinion, that's because
(a) the CBO uses the knowledge that one distinct value is placed in min_x, and another in max_x, to build a near-histogram;
(b) the CBO expects that the data near min_x and max_x may not reflect perfectly the recorded statistics.

Fact 1: if we select around min_x (ie including min_x in the requested range), the CBO predicts a cardinality close to num_rows/num_distinct, regardless of the operator type (<, <=, >, <=), and even for a range of infinitesimal width (where we'd expect a cardinality near to 0); using again our num_distinct=4 case, num_rows=4000000, num_rows/num_distinct = 1,000,000, min_x = 0 (range_sel_speculation.sql):
SQL> select x from t where x >  0 - 1e-6 and x <  0 + 1e-6;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x >= 0 - 1e-6 and x <  0 + 1e-6;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x >= 0 - 1e-6 and x <= 0 + 1e-6;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x >  0 - 1e-6 and x <= 0 + 1e-6;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
This is the behaviour we'd expect if we had collected a frequency histogram, and one value of the histogram were inside the range (check "Cost Based Oracle", page 165).

It's true that we have collected no histogram, but we haven't prevented the CBO from knowing that one distinct value is placed in min_x and another in max_x. We have indeed concealed the information about the row count over these two distinct values, and so the CBO assumes that every distinct value has the same number of rows associated : num_rows/num_distinct.

Fact 2 : if we select near min_x or max_x and inside the min_x <--> max_x interval, we get the same result (cardinality equal to num_rows/num_distinct over an infinitesimal range, independency from the actual operator type (<, <=, >, <=)) :
SQL> select x from t where x > 0 + 1e-6 and x < 0 + 1e-6 + 1e-9;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x >= 0 + 1e-6 and x < 0 + 1e-6 + 1e-9;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x >= 0 + 1e-6 and x <= 0 + 1e-6 + 1e-9;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x > 0 + 1e-6 and x <= 0 + 1e-6 + 1e-9;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
Why the CBO returns 1000K and not 0 (rounded to the usual 1) when we select near min_x, but we don't include min_x in the range ? Well, that's consistent with the new behaviour of 10g ("Cost Based Oracle", page 130): when you select (fact 3) outside but near the min_x <--> max_x interval, you get the same result again:
SQL> select x from t where x > 0 - 1e-6 - 1e-9 and x < 0 - 1e-6;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x >= 0 - 1e-6 - 1e-9 and x < 0 - 1e-6;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x >= 0 - 1e-6 - 1e-9 and x <= 0 - 1e-6
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
SQL> select x from t where x > 0 - 1e-6 - 1e-9 and x <= 0 - 1e-6;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|  2929K|  1436   (6)| 00:00:18 |
|*  1 |  TABLE ACCESS FULL| T    |  1000K|  2929K|  1436   (6)| 00:00:18 |
--------------------------------------------------------------------------
If the CBO trusted perfectly the recorded statistics, it should predict a cardinality of 0 (as it did in 9i), since we're selecting outside the minx <--> max_x interval; but it doesn't, probably because in many cases, the data changes "rapidly" near min_x and/or max_x (think about column x as "insert_date", "last_update_time", "sequence-driven PK", etc), and so the min_x/max_x cannot be considered completely representative of the actual data - they are somewhat "weaker" statistics, the real value is probably close but not exactly the same.

And so - the CBO assumes that when we select near the recorded statistic min_x, we are actually selecting around the real min_x - and the same reasoning is made for fact 2.

So to recap - my interpretation is that the CBO threats a selection near min_x/max_x as a selection around min_x/max_x, and a selection around min_x/max_x as selecting exactly one value of a frequency histogram with row count equal to num_rows / num_distinct.

Then of course, as the range drifts away from min_x and/or max_x - the CBO interpolates to the "average" values inside [ num_rows * (high_x - low_x) / (max_x - min_x) ], and outside [0] minx <--> max_x (data from range_sel_speculation.sql; here's the data for the red, green, blue, violet lines) :

but we can see the presence of the near-histogram for min_x = 0 and max_x = 10.

Here's a link back to the conclusion.

For corrections / feedback:
[email protected]