Oracle-Developer.com
| Navigation:
Home | Discussion Forums (Get expert advice) |
Scripts |
About Us | Links | Job Openings
|
Oracle 11g Multi Column CBO Statistics
By V.J.
Jain, January 2008 (varun.jain@varunjaininc.com)
Varun Jain, Inc. (an Oracle Partner)
| Oracle 11g
has introduced extended statistics that help the cost based optimizer (CBO) to
make better decisions using statistics on groups of columns when multiple
columns for a single table are specified in the where clause of the
query. In addition to the standard statistics, histograms can also be created
on these column groups for improved cost estimates when a skew exists in the
data distribution of the column group.
In order for the CBO to make its execution decisions for a query, it uses the available statistics to calculate cost estimates for possible access paths. If column histograms exist, the optimizer uses these histograms to compute the selectivity of predicate values. This selectivity is a factor in estimating costs and consequently choosing the best access path. Prior to 11g, statistics could only be created for individual columns. The inability of the CBO to realize the relationships between multiple columns on a single table is a significant limitation on the accuracy of cost estimates.
Consider a table that lists employee ID’s, the years they have been employed, and their annual bonus in thousands. In this company, the annual bonus of an employee is a calculated variable based on their years employed (years * $1000). Traditional column histograms can provide the optimizer with information about the distribution of values in each of these columns. However, querying the table by specifying both years employed and annual bonus, the CBO doesn’t have any means of realizing the correlation between these two columns. If we wanted to find employees who have been employed for 1 year but have a 20,000 bonus, the CBO would not realize that these two predicates yield an empty result set. Instead, the CBO would use statistics on each individual column possibly leading to an inefficient execution plan. For example, if the majority of employees at this company had been employed for only one year, the optimizer would likely favor a full table scan over an index access. If the optimizer knew that combining the two predicate values would return no records, the CBO could use that information to make better cost calculations.
The new MultiColumn statistics feature addresses this limitation by allowing for the creation of column groups within a single table. These column groups behave similar to a new column on the table. Their statistics are gathered using the DBMS_STATS package. Additionally, dictionary information for column groups is available from the same dictionary tables as regular columns. Included in the column group statistics is the number of distinct values and histograms (also known as hybrid histograms). As a result of these features, the optimizer can use the column group statistics to generate better cost estimates than previous Oracle releases by correlating statistics for multiple columns.
To demonstrate this new feature, look at the following test case using ALL_OBJECTS. Under an Oracle 11g base installation, certain queries on this table can result in inefficient execution plans when using multiple columns in the where clause. With the base installation, the majority of objects are owned by PUBLIC and the second most common object type is JAVA CLASS.
SQL> select owner, count(*) from all_objects group by owner 2 order by count(*) desc 3 /
OWNER COUNT(*) ------------------------------ ---------- PUBLIC 26418 SYS 23166 ORDSYS 2099 XMLTEST 1656 MDSYS 856 OLAPSYS 239 XDB 215 FLOWS_020200 147 WMSYS 117 CTXSYS 87 EXFSYS 80
SQL> select object_type, count(*) from all_objects group by object_type 2 order by count(*) desc 3 /
OBJECT_TYPE COUNT(*) ------------------- ---------- SYNONYM 26418 JAVA CLASS 21956 VIEW 1793 TYPE 1658 JAVA RESOURCE 818 TABLE 612 INDEX 487 PACKAGE 428 LOB 289 FUNCTION 228 TRIGGER 145
While the predicates for OWNER= ‘PUBLIC’ and OBJECT_TYPE= ‘JAVA CLASS’ each represent a large portion of the total rows (over 40%), there are no java classes owned by PUBLIC. This provides a good situation for creating a demonstration. Using a copy of this dictionary table shows how the optimizer uses multiple column statistics. Start by creating a test table using the data in ALL_OBJECTS. Create an index on owner and object_type and finally gather stats on the table.
SQL> create table allobjects_test1 2 as 3 select * from all_objects 4 /
Table created.
SQL> create index TEST1_IDX1 on allobjects_test1 (owner, object_type);
Index created.
SQL> begin 2 DBMS_STATS.GATHER_TABLE_STATS(NULL,'ALLOBJECTS_TEST1', method_opt => 'for all columns size skewonly'); 3 end; 4 /
PL/SQL procedure successfully completed.
View the column statistics on the new test table…
SQL> select column_name, num_distinct, histogram from user_tab_col_statistics 2 where 3 table_name = 'ALLOBJECTS_TEST1' 4 /
COLUMN_NAME NUM_DISTINCT HISTOGRAM ------------------------------ ------------ --------------- OWNER 13 FREQUENCY OBJECT_NAME 31096 HEIGHT BALANCED SUBOBJECT_NAME 0 NONE OBJECT_ID 55095 NONE DATA_OBJECT_ID 1181 HEIGHT BALANCED OBJECT_TYPE 27 FREQUENCY CREATED 2983 HEIGHT BALANCED LAST_DDL_TIME 2619 HEIGHT BALANCED TIMESTAMP 3020 NONE STATUS 1 FREQUENCY TEMPORARY 2 FREQUENCY GENERATED 2 FREQUENCY SECONDARY 1 FREQUENCY NAMESPACE 11 FREQUENCY EDITION_NAME 1 FREQUENCY
15 rows selected.
Set autotrace on and run a query to find all java classes owned by public…
SQL> set autotrace on SQL> select * from allobjects_test1 where owner = 'PUBLIC' and object_type = 'JAVA CLASS' 2 /
no rows selected
Execution Plan ---------------------------------------------------------- Plan hash value: 1539201261
-------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 10291 | 1025K| 233 (1)| 00:00:03 | |* 1 | TABLE ACCESS FULL| ALLOBJECTS_TEST1 | 10291 | 1025K| 233 (1)| 00:00:03 | --------------------------------------------------------------------------------------
Predicate Information (identified by operation id): ---------------------------------------------------
1 - filter("OBJECT_TYPE"='JAVA CLASS' AND "OWNER"='PUBLIC')
Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 832 consistent gets 0 physical reads 0 redo size 1116 bytes sent via SQL*Net to client 370 bytes received via SQL*Net from client 1 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 0 rows processed
In this example, the CBO is choosing not to use the index TEST_IDX1 (OWNER, OBJECT_TYPE). If you think about data distribution of the individual columns, this makes sense. Since approximately 48% of the rows have OWNER=’PUBLIC’ and 40% of the rows have TYPE= ‘JAVA CLASS’, it is logical for the optimizer to prefer a full table scan when considering these columns independently. To analyze the decisions made by the CBO that led to this execution plan, generate a 10053 trace file for this query which shows the following…
*************************************** SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for ALLOBJECTS_TEST1[ALLOBJECTS_TEST1] Column (#1): NewDensity:0.000092, OldDensity:0.000009 BktCnt:5410, PopBktCnt:5408, PopValCnt:11, NDV:13 Column (#6): NewDensity:0.000092, OldDensity:0.000009 BktCnt:5410, PopBktCnt:5408, PopValCnt:18, NDV:27 ColGroup (#0, Index) TEST1_IDX1 Col#: 1 6 CorStregth: -1.00 ColGroup Usage:: PredCnt: 2 Matches Full: Partial: Table: ALLOBJECTS_TEST1 Alias: ALLOBJECTS_TEST1 Card: Original: 55095.000000 Rounded: 10291 Computed: 10290.84 Non Adjusted: 10290.84 Access Path: TableScan Cost: 233.14 Resp: 233.14 Degree: 0 Cost_io: 232.00 Cost_cpu: 25496803 Resp_io: 232.00 Resp_cpu: 25496803 ColGroup Usage:: PredCnt: 2 Matches Full: Partial: ColGroup Usage:: PredCnt: 2 Matches Full: Partial: Access Path: index (AllEqRange) Index: TEST1_IDX1 resc_io: 386.00 resc_cpu: 9233056 ix_sel: 0.186784 ix_sel_with_filters: 0.186784 Cost: 386.41 Resp: 386.41 Degree: 1 Best:: AccessPath: TableScan Cost: 233.14 Degree: 1 Resp: 233.14 Card: 10290.84 Bytes: 0 ***************************************
The optimizer trace file shows that the cost for using the TEST1_IDX1 has been calculated as higher than using a full table scan. The trace file also shows that the statistics for the columns OWNER and OBJECT_TYPE have been included in the cost comparisons. The index selectivity for TEST1_IDX1 is based on the ratio of distinct values in the index to total rows in the table (10291:55095). This is a poor value for index selectivity and therefore results in a higher cost than using a full table scan.
However, if the optimizer had statistics on these columns as a group, it would know that the specified predicates have a very low cardinality. In order to collect these statistics, create a column group using DBMS_STATS.CREATE_EXTENDED_STATS
SQL> declare 2 cg_name varchar2(30); 3 begin 4 cg_name := dbms_stats.create_extended_stats(NULL,'ALLOBJECTS_TEST1', '(owner,object_type)'); 5 end; 6 /
PL/SQL procedure successfully completed.
Confirm the column group by using:
SQL> Select extension_name, extension 2 from user_stat_extensions 3 where table_name='ALLOBJECTS_TEST1';
EXTENSION_NAME EXTENSION ------------------------------ --------------------------- SYS_STUXJ8K0YTS_5QD1O0PEA514IY ("OWNER","OBJECT_TYPE")
In order to allow the optimizer to begin using this new column group, gather the statistics for these columns using DBMS_STATS.
SQL> begin 2 dbms_stats.gather_table_stats(null,'ALLOBJECTS_TEST1', 3 method_opt =>'for all columns size skewonly for columns (owner,object_type) skewonly'); 4 end; 5 /
PL/SQL procedure successfully completed.
Now re-query the column statistics for the table to show that the newly created column group now appears with statistics and a histogram (also referred to as a hybrid histogram).
SQL> select column_name, num_distinct, histogram from user_tab_col_statistics 2 where 3 table_name = 'ALLOBJECTS_TEST1' 4 /
COLUMN_NAME NUM_DISTINCT HISTOGRAM ------------------------------ ------------ --------------- OWNER 13 FREQUENCY OBJECT_NAME 31096 HEIGHT BALANCED SUBOBJECT_NAME 0 NONE OBJECT_ID 55095 NONE DATA_OBJECT_ID 1181 HEIGHT BALANCED OBJECT_TYPE 27 FREQUENCY CREATED 2983 HEIGHT BALANCED LAST_DDL_TIME 2619 HEIGHT BALANCED TIMESTAMP 3020 NONE STATUS 1 FREQUENCY TEMPORARY 2 FREQUENCY GENERATED 2 FREQUENCY SECONDARY 1 FREQUENCY NAMESPACE 11 FREQUENCY EDITION_NAME 1 FREQUENCY SYS_STUXJ8K0YTS_5QD1O0PEA514IY 94 HEIGHT BALANCED
16 rows selected.
This new column group represents the combined values of OWNER and OBJECT_TYPE. This query shows that a height balanced histogram has been created on this column group. Also, the number of distinct values for the column group is listed as 94. In other words, there are 94 distinct values when combining both columns in the table. This can be verified by running…
SQL> select count(*) 2 from 3 (select distinct owner, object_type from allobjects_test1) 4 /
COUNT(*) ---------- 94
Now, executing the same query as before, the execution plan and I/O is greatly improved due to the additional statistics provided by the new column group.
SQL> set autotrace on SQL> select * from allobjects_test1 where owner = 'PUBLIC' and object_type = 'JAVA CLASS' 2 /
no rows selected
Execution Plan ---------------------------------------------------------- Plan hash value: 1510138874
------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 57 | 6498 | 3 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| ALLOBJECTS_TEST1 | 57 | 6498 | 3 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | TEST1_IDX1 | 57 | | 1 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id): ---------------------------------------------------
2 - access("OWNER"='PUBLIC' AND "OBJECT_TYPE"='JAVA CLASS')
Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 2 consistent gets 0 physical reads 0 redo size 1116 bytes sent via SQL*Net to client 370 bytes received via SQL*Net from client 1 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 0 rows processed
Analysis of the new 10053 optimizer trace file shows that…
*************************************** SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for ALLOBJECTS_TEST1[ALLOBJECTS_TEST1] Column (#1): NewDensity:0.000271, OldDensity:0.000009 BktCnt:5537, PopBktCnt:5537, PopValCnt:12, NDV:13 Column (#6): NewDensity:0.000090, OldDensity:0.000009 BktCnt:5537, PopBktCnt:5535, PopValCnt:17, NDV:27 Column (#16): NewDensity:0.001037, OldDensity:0.006993 BktCnt:75, PopBktCnt:68, PopValCnt:4, NDV:94 ColGroup (#0, VC) SYS_STUXJ8K0YTS_5QD1O0PEA514IY Col#: 1 6 CorStregth: -1.00 ColGroup Usage:: PredCnt: 2 Matches Full: Using density: 0.001037 of col #16 as selectivity of unpopular value pred #0 Partial: Sel: 0.0010 Table: ALLOBJECTS_TEST1 Alias: ALLOBJECTS_TEST1 Card: Original: 55095.000000 Rounded: 57 Computed: 57.14 Non Adjusted: 57.14 Access Path: TableScan Cost: 233.06 Resp: 233.06 Degree: 0 Cost_io: 232.00 Cost_cpu: 23708255 Resp_io: 232.00 Resp_cpu: 23708255 ColGroup Usage:: PredCnt: 2 Matches Full: Using density: 0.001037 of col #16 as selectivity of unpopular value pred #0 Partial: Sel: 0.0010 ColGroup Usage:: PredCnt: 2 Matches Full: Using density: 0.001037 of col #16 as selectivity of unpopular value pred #0 Partial: Sel: 0.0010 Access Path: index (AllEqRange) Index: TEST1_IDX1 resc_io: 3.00 resc_cpu: 58754 ix_sel: 0.001037 ix_sel_with_filters: 0.001037 Cost: 3.00 Resp: 3.00 Degree: 1 Best:: AccessPath: IndexRange Index: TEST1_IDX1 Cost: 3.00 Degree: 1 Resp: 3.00 Card: 57.14 Bytes: 0
***************************************
The density is related to the probability of finding particular value in the rows of a table. When the statistics are collected on the table, column density is calculated and stored in the column statistics. If a column histogram is available when the optimizer creates estimates for different access paths, it uses the specified predicate values to recalculate density based on the information in the histogram.
Since the predicates used in the query return no rows, the histogram on the column group has revealed that the predicates contain unpopular predicate values and therefore have a low density. The optimizer dump file above shows that the values in the predicates have been calculated to produce a density of 0.001037 for the column group.
This dump file also shows that the column group is automatically being used to adjust cost calculation for this query. It does this by using the density of the column group as a new selectivity value for the index. The selectivity represents a fraction of rows from a row set. Therefore, index selectivity influences the number of blocks that will need to be read when scanning an index; the lower the selectivity, the fewer blocks that need to be read, resulting in a lower cost. In the example above, the estimated cost of using the index has been greatly lowered since the index selectivity has been reduced from ix_sel: 0.186784 to ix_sel: 0.001037. In other words, the old selectivity value of 0.186784 meant that the optimizer estimated 18.6784% of the total rows would be returned. Likewise, the new selectivity value of 0.001037 means that 0.1037% of the rows will be returned. This new selectivity value results in a much lower cost estimation for the access path using the index TEST1_IDX1. Based on these cost estimates, the optimizer chose the plan using the index access path. The use of column groups in the example above improved the I/O for this query by several orders of magnitude.
By supplementing the statistics that are available to the CBO with column group statistics, the optimizer is able to make more intelligent decisions through understanding the correlation between multiple columns. These column groups are used to determine more accurate selectivity estimates when multiple columns for a single table are specified in the where clause of the query. Even when the distribution of data in the column group is skewed, the column group histograms allow for accurate cost estimates. This enhancement addresses one of the primary limitations of the CBO in previous Oracle releases and greatly improves the capability of the CBO to accurately compute the costs and determine the best execution plans.
|
Owned and Operated by Varun Jain, Inc, an Oracle Consulting Firm
Copyright ©2007 Oracle-Developer.com