12c New Feature: Histogram Enhancements

Today i want to explain the new Histograms Enhancements in 12c. But first i want to explain some basic about basic about histograms, because i know that they are not well known by some DBA`s.The Optimizer normaly thinks that the data in a column is uniformly distributed. A example of this is a primary key column which is populated by a sequence. It stores all numbers from 1 up to the sequences current value exactly once. In such case to find out the numbers filtered out by a predicated is simple. For example if you use a where clause like where val2 between 2 and 6 the optimizer only needs the normal object statistics.

But what if the data distribution is something different. How can the optimizer make a estimation for a predicate val2=5 when the data is distributed like in the next example?

```SQL> select val2,count(*) from hist_sample group by val2;

VAL2   COUNT(*)
---------- ----------
1         9
2        27
3        68
4       132
5       501
6       232
7       101
8        40
9        88
10        99
11       501
12        92```

The answer is simple. It cannot without additional information. This additional information for the optimizer are called histograms

When to Use Histograms

Histograms are often used on columns that are used in where clauses of queries that have a high skewed data distribution. Oracle normaly determine automaticly the best Way to gather histograms because the METHOD_OPT is set to AUTO. The optimizer uses histograms to compute the selectivity of filter and join predicates. Please Note that Oracle only stores the first 32 bytes of any string values.

When to Not Use Histograms

Also, be aware that histograms, as well as all optimizer statistics, are static. If the data distribution of a column changes frequently, it is necessary to recompute the histogram for a given column. Histograms are not useful for columns with the following characteristics:

• the column data is uniformly distributed
• the column is not used in WHERE clauses of queries
• the column is unique and is used only with equality predicates

Which kind of histograms are avaible before 12c

FREQUENCY

Frequency histograms work fine if you have a low number of distinct values in your table column, but when the number exceeds the maximum number of buckets, you cannot create a bucket for each value. In this case the Optimizer creates Height balanced histograms. A frequency histogram was created only if the number of distinct values for the column was less than the number of histogram buckets specified (up to 254).

The histogram stored in the database is similar to the picture above. The difference is that in the dictionary the cumulative frequency is used. The get the Information from the database you can use a query like this.

```select endpoint_value
,endpoint_number
,endpoint_number - lag(endpoint_number,1,0) over (order by endpoint_number) as freqeuncy
from  user_tab_histograms
where table_name = 'HIST_SAMPLE'
and column_name = 'VAL2'
order by endpoint_number;
ENDPOINT_VALUE ENDPOINT_NUMBER    FREQEUNCY
-------------- --------------- ----------
1             9            9
2            36           27
3           104           68
4           236          132
5           737          501
6           969          232
7          1070          101
8          1110           40
9          1198           88
10          1297           99
11          1798          501
12          1890           92```

HEIGHT BALANCED

If you have more Number of Distinct Values (NDV) then avaible buckets than Oracle creates a Height Balanced Histogram. When a Height Balanced Histogram is created, first a Frequency Histogram is created. The Values from this Histogram are stacked up into a pile and  is then divided by the buckets. I hope this short description is enough. If not leave me a comment and i will try to describe it more.

Histogram Enhancements

So after this short introduction what is new in Oracle 12c….

In Oracle Database 12c, there are two new types of histograms introduced for columns, which have more than 254 distinct values. This Histograms  improve the optimizer estimates. These new histograms have a more accurate frequency of endpoint values, and enable the optimizer to choose better plans. If Oracle Database 12c creates new histograms, and if the sampling percentage is AUTO, they are either frequency or hybrid, but not height-balanced. If you upgrade the database from Oracle Database11g to 12c, any height-based histograms created before the upgrade remains in use.

Top Frequency Histograms

Is the same as the normal frequency histogram except that the database ignores statistically insignificant unpopular values and doesn`t store them. The benefit is that the database has only highly popular values stored in the histogram.

This Top Frequency Histograms are created when:

• The sampling percentage is AUTO. Please Note that the Database builds the histogram from a FULL Table Scan.
• The distinct Values in the column is greater then the buckets
• The percentage of rows occupied by the top number of frequent values is equal to or greater than threshold. The threshold is t=(1-(1/number of buckets))*100

Hybrid histogram

A hybrid histogram combines both the height-based histogram and frequency histogram. A hybrid histogram stores the endpoint repeat count value and the cumulative frequency of values that are smaller than the corresponding endpoint value for each bucket in the histogram. By using these two pieces of information, the optimizer can obtain accurate estimates for almost popular values.

This Hybrid Histograms are created when:

• The sampling percentage is AUTO.
• The top frequency Histograms does not apply.
• Number of distinct values is greater than the number of buckets.

You can check the histogram type with this SQL:

```SELECT column_name,num_distinct,histogram
FROM user_tab_col_statistics
WHERE table_name = 'HIST_SAMPLE'
AND column_name = 'VAL2'

COLUMN_NAME       NUM_DISTINCT HISTOGRAM
----------------- ------------ ---------------
VAL2              28           HYBRID```

```SELECT endpoint_number,
endpoint_value,
endpoint_repeat_count
FROM user_tab_histograms
WHERE table_name = 'HIST_SAMPLE'
AND column_name = 'VAL2'
ORDER by endpoint_number;
ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
--------------- -------------- ---------------------
17          1              17
1469          5            1001
2132          7             200
2861         11             501
3744         52             500
4673        111             500
4803        182              39
4890        192              87
4988       1102              98
5579       2111             500```