## Commentary on Simon Ogden’s Out of range histogram adjustments

[latexpage]
Simon Ogden recently wrote up an excellent post regarding Out of Range Histogram Adjustments. While he explains in detail how the out of range selectivity is computed, I think we can make it a little bit clearer. I hope Simon doesn’t mind! 🙂

The output of optdiag shows two unique values in the stored statistics (histogram) for the a table. :
Step Weight Value
1 0.00000000 < "C " 2 0.01000000 = "C " 3 0.00000000 < "N " 4 0.99000001 = "N "[/text] Remember the weight values for the unique values. Think of the weight values as percentage of the table not counting out of range values. So if we have the following query then we could expect 1% of the table to contain that value: [sql gutter="false"]select * from table a where a1 = "C"[/sql] Let's move on to out of range values. The general equation to determine the selectivity of a single out of range unique value is: $$\label{eq:selectivity} s = \cfrac{\cfrac{1}{u + n}} {1 + \cfrac{1}{u + n}}$$ We can simplify that using a bit of algebra: $$\label{eq:selectivitysimple} s = \cfrac{1}{u + 2n}$$ $u$ = the number of unique values in the histogram $n$ = the number of unique values outside of the histogram $s$ = selectivity of a single out of range unique value Let's assume we've added data to the table but haven't run update statistics yet. The stored statistics won't contain any metrics on the new data. So, ASE will estimate what it thinks the selectivity could be for the values we are searching on. In the following query, we are searching for T and Y. [sql highlight_lines="4"]select * from a where a1 in ('T', 'Y')[/sql] So how does ASE determine the selectivity for T and Y? We can use the selectivity query (\ref{eq:selectivitysimple}) replacing $u$ and $n$ to compute $s$: $0.1666667 = \cfrac{1}{2 + 2 \times 2}$ So, the selectivity for T is 0.1666667 and Y is 0.1666667. Since the sum of the histogram and the out of histogram selectivity (weight) must equal 1, we need to adjust the histogram weight values. We have two unique values in the histogram: C and N. The revised histogram weight ($h$) is the percentage of what is left for in histogram unique values: $$\label{eq:histpercent} h = 1-ns \]$$ $0.6666667 = 1 - 2 \times 0.1666667$ As you can see, we have 0.6666667 left over. If you remember from the histogram, the value C composes 1% of the table and the value N 99% of the table. All we have to do is plug those in: $0.0066667 = 0.01 \times 0.6666667$ $0.6600000 = 0.99 \times 0.6666667$ So we get the results that we're looking for: [text]'C' 0.0066667 'N' 0.6600000 'T' 0.1666667 'Y' 0.1666667[/text]