Today i want to explain the new Histograms Enhancements in 12c. But first i want to explain some basic about basic about histograms, because i know that they are not well known by some DBA`s.The Optimizer normaly thinks that the data in a column is uniformly distributed. A example of this is a primary key column which is populated by a sequence. It stores all numbers from 1 up to the sequences current value exactly once. In such case to find out the numbers filtered out by a predicated is simple. For example if you use a where clause like where val2 between 2 and 6 the optimizer only needs the normal object statistics.
But what if the data distribution is something different. How can the optimizer make a estimation for a predicate val2=5 when the data is distributed like in the next example?
SQL> select val2,count(*) from hist_sample group by val2; VAL2 COUNT(*) ---------- ---------- 1 9 2 27 3 68 4 132 5 501 6 232 7 101 8 40 9 88 10 99 11 501 12 92
The answer is simple. It cannot without additional information. This additional information for the optimizer are called histograms
When to Use Histograms
Histograms are often used on columns that are used in where clauses of queries that have a high skewed data distribution. Oracle normaly determine automaticly the best Way to gather histograms because the METHOD_OPT is set to AUTO. The optimizer uses histograms to compute the selectivity of filter and join predicates. Please Note that Oracle only stores the first 32 bytes of any string values.
When to Not Use Histograms
Also, be aware that histograms, as well as all optimizer statistics, are static. If the data distribution of a column changes frequently, it is necessary to recompute the histogram for a given column. Histograms are not useful for columns with the following characteristics:
- the column data is uniformly distributed
- the column is not used in WHERE clauses of queries
- the column is unique and is used only with equality predicates
Which kind of histograms are avaible before 12c
Frequency histograms work fine if you have a low number of distinct values in your table column, but when the number exceeds the maximum number of buckets, you cannot create a bucket for each value. In this case the Optimizer creates Height balanced histograms. A frequency histogram was created only if the number of distinct values for the column was less than the number of histogram buckets specified (up to 254).
The histogram stored in the database is similar to the picture above. The difference is that in the dictionary the cumulative frequency is used. The get the Information from the database you can use a query like this.
select endpoint_value ,endpoint_number ,endpoint_number - lag(endpoint_number,1,0) over (order by endpoint_number) as freqeuncy from user_tab_histograms where table_name = 'HIST_SAMPLE' and column_name = 'VAL2' order by endpoint_number; ENDPOINT_VALUE ENDPOINT_NUMBER FREQEUNCY -------------- --------------- ---------- 1 9 9 2 36 27 3 104 68 4 236 132 5 737 501 6 969 232 7 1070 101 8 1110 40 9 1198 88 10 1297 99 11 1798 501 12 1890 92
If you have more Number of Distinct Values (NDV) then avaible buckets than Oracle creates a Height Balanced Histogram. When a Height Balanced Histogram is created, first a Frequency Histogram is created. The Values from this Histogram are stacked up into a pile and is then divided by the buckets. I hope this short description is enough. If not leave me a comment and i will try to describe it more.
So after this short introduction what is new in Oracle 12c….
In Oracle Database 12c, there are two new types of histograms introduced for columns, which have more than 254 distinct values. This Histograms improve the optimizer estimates. These new histograms have a more accurate frequency of endpoint values, and enable the optimizer to choose better plans. If Oracle Database 12c creates new histograms, and if the sampling percentage is AUTO, they are either frequency or hybrid, but not height-balanced. If you upgrade the database from Oracle Database11g to 12c, any height-based histograms created before the upgrade remains in use.
Top Frequency Histograms
Is the same as the normal frequency histogram except that the database ignores statistically insignificant unpopular values and doesn`t store them. The benefit is that the database has only highly popular values stored in the histogram.
This Top Frequency Histograms are created when:
- The sampling percentage is AUTO. Please Note that the Database builds the histogram from a FULL Table Scan.
- The distinct Values in the column is greater then the buckets
- The percentage of rows occupied by the top number of frequent values is equal to or greater than threshold. The threshold is t=(1-(1/number of buckets))*100
A hybrid histogram combines both the height-based histogram and frequency histogram. A hybrid histogram stores the endpoint repeat count value and the cumulative frequency of values that are smaller than the corresponding endpoint value for each bucket in the histogram. By using these two pieces of information, the optimizer can obtain accurate estimates for almost popular values.
This Hybrid Histograms are created when:
- The sampling percentage is AUTO.
- The top frequency Histograms does not apply.
- Number of distinct values is greater than the number of buckets.
You can check the histogram type with this SQL:
SELECT column_name,num_distinct,histogram FROM user_tab_col_statistics WHERE table_name = 'HIST_SAMPLE' AND column_name = 'VAL2' COLUMN_NAME NUM_DISTINCT HISTOGRAM ----------------- ------------ --------------- VAL2 28 HYBRID
To get more Information about the histogram you can select the <user|dba>_tab_histograms.
SELECT endpoint_number, endpoint_value, endpoint_repeat_count FROM user_tab_histograms WHERE table_name = 'HIST_SAMPLE' AND column_name = 'VAL2' ORDER by endpoint_number; ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT --------------- -------------- --------------------- 17 1 17 1469 5 1001 2132 7 200 2861 11 501 3744 52 500 4673 111 500 4803 182 39 4890 192 87 4988 1102 98 5579 2111 500