CBO: range predicate cardinality, conclusion

Conclusion

Starting from Jonathan Lewis' test case "Discrete Dangers" on page 126 of his book "Cost Based Oracle",
we have investigated (and illustrated with graphs) the CBO cardinality estimation algorithm for range-based predicates:

  select x from t where x >  low_x and x <  high_x  (open  , open  )
  select x from t where x >= low_x and x <= high_x  (closed, closed)
  select x from t where x >  low_x and x <= high_x  (open,   closed)
  select x from t where x >= low_x and x <  high_x  (closed, open)

  template:
  select x from t where x #low low_x and x #high high_x
  #low  in (">", ">=")
  #high in ("<", "<=")

when column x has no associated histogram.

Letting

  min_x = min (x) over all rows
  max_x = max (x) over all rows
  B = (max_x - min_x) / num_distinct (Band width)

and defining

  left band      : min_x     < x < min_x + B
  central region : min_x + B < x < max_x - B
  right band     : max_x - B < x < max_x

we have seen that

For ranges completely contained in one of the two bands
```
     min_x     <= low_x < high_x <= min_x + B
  or max_x - B <= low_x < high_x <= max_x
```
the cardinality can depend on both low_x and high_x, only low_x, only high_x, or none (= constant), depending on the value of #low and #high, in a somewhat counterintuitive way. We have found the relevant formulae.

In all the other cases, a slightly modified (modifications underlined) version of Jonathan Lewis' "standard formula" applies:

  cardinality = num_rows * (high_x_effective - low_x_effective) / (max_x - min_x)
              + num_rows * correction_for_or_equal_operators
              - num_rows * correction_for_special_case

  if #low = ">=" and low_x inside the left band then 
    low_x_effective = min_x + B
  else
    low_x_effective = low_x
  end

  if #high = "<=" and high_x inside the right band
    high_x_effective = max_x - B
  else
    high_x_effective = high_x
  end

  correction_for_or_equal_operators = 0
    + 1/num_distinct if #low  = ">="
    + 1/num_distinct if #high = "<="

  correction_for_special_case = 0 
    + 1/num_distinct if low_x  = min_x and #low  = ">"    
    + 1/num_distinct if high_x = max_x and #high = "<"

This fitness test verifies the formula against the CBO, using its pl/sql implementation.

We have also graphically illustrated, hopefully for a better intuitive understanding, the case of selection over both an infinitesimal and a finite range for the (open,open) and (closed, closed) cases.

The discussion starts here.

And here's my speculation about the rationale behind this strange (but quite sound in hindsight) behaviour.

For corrections / feedback:
[email protected]