Oracle-Developer.com
| Navigation:
Discussion Forums (Get expert advice) |
Publications |
Scripts |
About Us | Links |
Sudden Database Performance Degradation with
10g
by V.J. Jain, July 2007 (varun.jain@varunjaininc.com)
Varun Jain, Inc. (an Oracle
Partner)
One of the many advantages of upgrading to 10g is that the Cost Based Optimizer
(CBO) has been improved to allow for better performance by using a cost model
that accounts for I/O and CPU (link).
This cost model allows the optimizer to choose the best execution plans based on
statistics of the data (link).
However, the decisions made by the CBO depend on the statistics that are
available at the time that the execution plan is parsed. One of the most common
causes of performance issues in 10g databases is bad statistics.
Luckily, Oracle 10g also
introduces several extremely useful tools for monitoring database performance
and activity including Active Workload Repository (AWR) and Automatic Database
Diagnostic Monitor (ADDM). These tools can be used to pinpoint where bad
statistics might be causing the optimizer to make bad decisions.
The obvious question is "what constitutes bad statistics?". Unfortunately, the
answer to that is not as obvious. In fact, unless you understand the nuts and
bolts of how the CBO chooses its execution plan and how parsed execution plans
are used in the library cache, it is unlikely that you will even know when bad
statistics are causing your database to fail. The term "database failure" can
have different meanings but what I am referring to is when CPU, I/O, and wait
times spike and stay elevated until the database is unusable usually
necessitating a database bounce.
I am going to discuss one case where a client required a root cause analysis for
reoccurring database failures. Each time their database failed, they noticed the
same symptoms - suddenly the CPU, I/O, and wait times spiked and stayed
elevated. If the database failure occurred during production hours, the DBAs
would bounce the database almost immediately and rely on Statspack reports to
figure out what happened. But if the database failed outside normal business
hours, the management rightfully insisted on a root cause determination prior to
any corrective action. Since this client is a fairly large company, this meant
that there would be a phone conference consisting of approximately 20 people
trying to figure out what changed while the database was inaccessible for 6 to
18 hours. This team consisted of network administrators, database developers,
DBA's, consultants, and managers. I witnessed this happen numerous times in the
few weeks I was at this client. The first time, the team decided to double the
CPU's on the machine to combat the spiking CPU consumption. After this didn't
solve the problem, the team of 20 determined that the corrective action was to
create indexes and assign them to the keep pool. After the corrective action was
taken and the database was restarted and the response times would become normal.
These intermittent failures were a top priority for this client since each
failure had a substantial cost in resources and lost production time. Adding to
the client's frustration, they had recently completed an engagement with a
consulting company for performance tuning that same database.
My general approach to determining a root cause was similar to the client's
approach. I wanted to find out what changed and then determine the cause for
that change. Using the Active Workload Repository (AWR) to analyze the database
CPU and I/O, I was able to pinpoint the exact snapshot where the sudden resource
spike occurred.
| BEGIN|SNAPSHOT | CPU|BUSY | PHYSICAL|READS | LOGICAL|READS |
| 39240 | 10.83 | 2,632,960 | 5,540 |
| 39241 | 10.35 | 2,389,283 | 5,333 |
| 39242 | 9.84 | 1,928,372 | 4,765 |
| 39243 | 12.07 | 2,358,480 | 319,280 |
| 39244 | 19.41 | 367,416,720 | 76,615 |
| 39245 | 41.47 | 4,383,027,600 | 191,815 |
| 39246 | 41.79 | 5,344,862,800 | 350,030 |
| 39247 | 48.62 | 7,070,087,680 | 599,760 |
| 39248 | 44.97 | 2,385,390,000 | 420,200 |
| 39249 | 53.50 | 6,154,475,760 | 138,420 |
| 39250 | 63.46 | 8,277,101,680 | 135,990 |
| 39251 | 67.05 | 8,269,566,480 | 200,240 |
| 39252 | 86.25 | 4,821,330,360 | 230,178 |
| 39253 | 75.40 | 5,879,349,080 | 420,036 |
| 39254 | 86.19 | 7,777,096,448 | 719,712 |
| 39255 | 92.51 | 2,623,929,000 | 504,240 |
| 39256 | 89.90 | 6,769,923,336 | 166,104 |
| 39257 | 13.35 | 383,027,600 | 91,815 |
Using the data above, it is obvious that something changed during snapshot 39244. The source of resource consumption on a segment can be ascertained by understanding all of the operations taking place in the database at a particular time. The next action in the diagnosis was to look at the SQL stats in AWR for the top SQL during the trouble snapshot. I focused on the major consumers identified in the top SQL of the Statspack report during the trouble period, verified by running an Automatic Database Diagnostic Monitor (ADDM) report (link). Starting with the greatest consumer, we see that the logical and physical read deltas had greatly increased while the executions remained fairly consistent.
|
|
| BEGIN|SNAPSHOT | EXEC|DELTA | Buffer|Gets|DELTA | Cncr|WAIT|DELTA | CPU|TIME|DELTA | Elpsd|TIME|DELTA |
| 39240 | 3,658 | 9,854 | 0 | 329,982 | 329,982 |
| 39241 | 3,473 | 8,978 | 0 | 311,709 | 311,709 |
| 39242 | 3,161 | 8,630 | 0 | 293,170 | 293,170 |
| 39243 | 3,470 | 9,414 | 0 | 324,012 | 324,012 |
| 39244 | 1,014 | 20,939,367 | 5,225 | 717,640,129 | 859,772,865 |
| 39245 | 3,857 | 81,967,204 | 133,178 | 2,754,815,969 | 3,478,521,604 |
| 39246 | 3,976 | 88,320,579 | 11,639 | 2,973,254,882 | 3,751,332,591 |
| 39247 | 4,237 | 98,588,734 | 1,719,105 | 3,266,404,245 | 4,415,732,409 |
| 39248 | 3,643 | 88,857,830 | 906,682 | 2,937,868,965 | 3,831,412,484 |
| 39249 | 4,427 | 112,303,040 | 5,374,334 | 3,645,190,672 | 5,224,145,417 |
| 39250 | 4,460 | 117,456,518 | 31,105,732 | 3,770,210,494 | 6,212,396,813 |
| 39251 | 4,535 | 123,818,077 | 35,026,159 | 3,908,183,700 | 6,647,598,948 |
| 39252 | 5,171 | 148,715,473 | 893,610,223 | 4,390,247,982 | 18,029,668,378 |
| 39253 | 2,546 | 75,187,326 | 25,417,157 | 2,343,609,592 | 4,284,188,925 |
| 39254 | 3,657 | 112,091,347 | 1,531,886,898 | 3,285,867,054 | 20,014,193,638 |
| 39255 | 4,884 | 156,864,666 | 3,182,023,828 | 4,571,435,731 | 34,772,659,966 |
| 39256 | 2,289 | 91,288,199 | 1,281,384,017 | 2,123,128,775 | 19,283,382,187 |
| 39257 | 0 | 0 | 0 | 0 | 0 |
This situation held true for each SQL statement in the top SQL. Furthermore, I noticed that the text of each of the top SQL had a similar predicate on the same table. Naturally, the next question was, "what caused the marginal cost of the same SQL query to increase exponentially?". This question can be answered by querying the AWR's SQL stats again for additional information.
|
|
| BEGIN|SNAPSHOT | PLAN_HASH_VALUE |
| 39240 | 3058762805 |
| 39241 | 3058762805 |
| 39242 | 3058762805 |
| 39243 | 3058762805 |
| 39244 | 6038888 |
| 39245 | 6038888 |
| 39246 | 6038888 |
| 39247 | 6038888 |
| 39248 | 6038888 |
| 39249 | 6038888 |
| 39250 | 6038888 |
| 39251 | 6038888 |
| 39252 | 6038888 |
| 39253 | 6038888 |
| 39254 | 6038888 |
| 39255 | 6038888 |
| 39256 | 6038888 |
| 39257 | 6038888 |
Immediately, we notice that the sql plan hash value had changed. In other words,
the execution plans for the same SQL has changed during the trouble snapshot.
Comparing the two execution plans using AWR, I saw that the original plan used
an efficient index (using an ID) while the new execution plan used a very
inefficient plan (using a DATE).
|
PLAN_HASH_VALUE |
SQL_TEXT |
Execution Plan |
Index Detail | Session Stats (Sample) | |||||||||||||
|
3058762805 (original) |
SELECT 1 FROM LEADS L
WHERE L.ID = :B4 AND L.CONTACTED_FLAG
|
|
L_IDX1 (CONTACTED_FLAG, PRO_ID) |
17 Blocks Read per execution |
|||||||||||||
|
6038888 (new) |
SELECT 1 FROM LEADS L
WHERE L.ID = :B4 AND L.CONTACTED_FLAG
|
|
L_IDX5 (ENTERED, CREATION_TS) |
38,208 Blocks Read per execution |
|||||||||||||
The field CREATION_TS is a date field that shows the time that the record was
first created. This particular table has several thousands of rows
inserted daily. Why would the optimizer choose to range scan a time period
with thousands of records to return 1 record, rather than using a selective
index that only has 1-3 values. To any reasonable person viewing these two
plans, it is obvious that the old plan is much more efficient. So what is
causing the CBO to choose such a poor plan? Also, what is causing the execution
plan to be reparsed causing the plans to change?
Since my objective was to determine the root cause for the database failure, my
primary concern was the cause for the CBO using a poor execution plan. There are
a number of reasons that the plan might be reparsed but an infrequent
invalidation from the library cache should not cause a major problem. However,
in our case, this invalidation was causing a new execution plan that was
extremely inefficient leading to the failure of the database. Given this
situation, I could easily have recommended using plan stability, but this would
have been only treating the symptom instead of curing the cause. There are a few
ways to determine the optimizer's cost based decisions. Arguably, the most
effective is a CBO trace. With this particular situation, the root problem was
signified by more circumstantial data. Specifically, all of the TOP sql in the
ADDM and Statspack reports during the trouble snapshot showed the exact same
join on the same table within their SQL text. As shown above, the optimizer was
driving decisions based on two predicates - a date column and an ID column. For
this table, there were only a few values for each ID while there are thousands
of records for each date. The optimizer changed execution plans from using an
index on the ID column to using an index on the DATE column.
Comparing the two plans, the cost for the plan using ID was being calculated
equal to than the cost using the DATE. However, after executing each plan, the
session stats show that the plan using ID has only scanned 17 blocks while the
plan using DATE had scanned over 38,000 blocks. Why is the cost for the poor
plan being evaluated as lower than a much more efficient plan?
This client has an OLTP environment. Each of the queries that were identified as
the top consumers were all querying for the most recent day's data. In other
words, the value of the DATE range is between TRUNC(SYSDATE) and SYSDATE. Given
all of this diagnostic data, my suspicion was that there was a column histogram
on the DATE field causing the CBO to calculate costs based on old statistics.
Sure enough, I was able to confirm that there was a height balanced column
histogram on the DATE field.
|
SQL> HEIGHT BALANCED |
With statistics being gathered weekly to bi-weekly, it was
logical that the CBO believed the cost of using the most recent dates would be
very low. In other words, when reparsing the SQL, the CBO believed that there
were no values in the specified date range since the column histogram showed no
values for that bucket. A CBO trace verified this situation and showed
that the CBO calculated the cost for using that date range to be very low and
chose the DATE index in the execution plan. In reality, this proved to be
catastrophic to the database since this new plan consumed massive amounts of I/O
and was executed many thousands of times per hour.
This situation was easy to demonstrate by using an old date range compared with
a new date range. All queries using a date range prior to the date of the
latest statistics formed execution plans using the index on the ID column while
any date ranges after the latest statistics used the index on the DATE field.
This demonstration showed that any plans using these predicates that were
reparsed after the date of the latest statistics would favor using the index on
the DATE field, resulting in very inefficient execution plans.
According to Oracle, it is only proper to use column histograms when the values
of the column are highly skewed (link).
Obviously a DATE field with a fairly consistent distribution of values does not
fit this prescription. After asking the client's DBAs why they were gathering
histograms for all indexed columns, they said that their developers had
requested this option.
Finally, we had the root cause determination: the use of histograms on DATE
fields in an OLTP database where the queries on this field were specifying the
newest dates. This caused the CBO to miscalculate costs based on its belief that
there was no data for the specified dates, hence resulting in fewer blocks to
scan. The triggering event for the sudden degradation of performance was the
reparsing of the SQL much after the date of the latest statistics. By removing
the histogram on this column and reparsing the SQL, an immediate database-wide
improvement and lasting stability was realized.
The really remarkable part about this root cause analysis is that the time of the analysis was completely independent from the date of the actual failure. In other words, Oracle 10g AWR and ADDM have made the once crucial need for analysis between Crash and Recovery a thing of the past. These tools are instrumental in analyzing database issues while allowing for the minimum downtime.
Sponsored by Varun Jain, Inc. Oracle Applications and Database Consulting
Copyright ©2007 Oracle-Developer.com