Optimising your Oracle Infrastructure

Jon Cowling 11-Oct-2017 11:24:07

“Optimisation simply means that you take a bunch of little things that don’t seem to be very difficult to change and all of those little changes have a multiplying effect and you get a giant change from all of those tiny changes.”

Tony Robbins, 2017


The above Tony Robbins quote rings as true for an Oracle infrastructure as it does for everyday life.

The amount of information stored in Oracle databases is mind-boggling and growing exponentially. This might be the age of big data but if that data is to be useful then it needs to be fast data, too!

Tuning Oracle is considered one of the dark arts of database administration. That notoriety is partly down to the complexity of business workloads and IT platforms, which is not getting any simpler now that Oracle Cloud, AWS, and Azure are part of the equation, often in a hybrid cloud set-up. Despite that complexity, the truth is that tuning Oracle is mostly science with a little bit of art; the art lies in the fact that there are often many different solutions to the same performance bottleneck.

Start with 'What?'

There are two questions that must be nailed to successfully optimise Oracle. They are 'What?' and 'What?'.

  • The first 'What' is, “What is slow, specifically?”
  • The second 'What' is, “What is the bottleneck?”

These two questions focus our attention on precisely the right place and, as a result, deliver maximum results fast.

Let’s consider these questions in turn…

What is Slow?

When you say slow, what is it, specifically, that is slow?

Is the whole system slow, e.g. does the system feel like it’s hanging? Or is it one specific priority, e.g. if the payroll run doesn’t finish by 4 pm no one gets paid, or item lookups are taking 60 seconds and customers are leaving our website for a competitor’s? Getting this kind of clarity narrows the field of search and greatly reduces the complexity of a tuning engagement.

What is the Bottleneck?

Whole system slow-downs are treated differently from single business process slow-downs. However, the second question is still the same, namely, ‘What is the bottleneck?’

Notice that until this point we haven’t even mentioned logging into a system! This is often where inexperienced performance analysts go wrong: the temptation is to jump onto a system and start tuning before gathering the necessary context. Only now is it time to roll up our sleeves and log in, because to find the bottleneck we need to collect data.

Metrics

As mentioned earlier, whole system slow-downs are treated differently from single business process slow-downs: metrics are collected on a system-wide basis for the former and on a per-business-process basis for the latter.

Of course, to collect metrics one needs a monitoring system in place that stores them. It is a bad idea to rely only on real-time performance metrics: if you collect data only when you ‘need’ it, the performance problem has to persist or repeat before you can capture it, which is no use at all for transient or intermittent issues. Another important reason to store your performance data is to establish historical performance baselines that you can compare against any current performance issue.
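
For example, where the Diagnostics Pack is licensed, AWR can preserve a known-good period as a named baseline for later comparison. A minimal sketch; the snapshot IDs and baseline name are hypothetical:

    -- A minimal sketch, assuming a Diagnostics Pack licence and that AWR
    -- snapshots 1234 and 1240 bracket a period of known-good performance.
    -- The snapshot IDs and baseline name are hypothetical.
    BEGIN
      DBMS_WORKLOAD_REPOSITORY.CREATE_BASELINE(
        start_snap_id => 1234,
        end_snap_id   => 1240,
        baseline_name => 'payroll_good_run');
    END;
    /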

Among the publicly available tools that we use at DSP are CA Universal Infrastructure Monitoring (UIM), Enterprise Manager with the tuning and diagnostic packs, and Statspack, as well as a suite of internally developed tools. We collect and store tens of thousands of metrics every day with little overhead so we can not only reactively optimise performance issues but, more importantly, spot the trends and prevent them from happening in the first place.
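
For sites not licensed for the Diagnostics Pack, Statspack remains a free way to collect and store these metrics. A minimal sketch of taking a snapshot manually (in practice snapshots are usually scheduled):

    -- Take a Statspack snapshot; run as the PERFSTAT user once Statspack
    -- is installed. Snapshots are normally scheduled rather than manual.
    EXEC statspack.snap;

    -- Later, report on the interval between two snapshots with the
    -- supplied script (it prompts for begin and end snapshot IDs):
    -- @?/rdbms/admin/spreport.sql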

One of our latest tools is a set of preventative analytics that correlates data from several different monitoring products with our service desk incident and change data to predict, and therefore prevent, customer problems. The same data also allows us to benchmark our customers’ databases against each other, so we know what good and bad look like across comparable systems.

Whole System Slow-Downs

With whole system slow-downs, we are looking for resources that are exhausted, e.g. 100% utilisation of CPU, disk, memory, or network. We are also looking for latches, locks, or other contended resources running out or blocking, by examining wait statistics and log files. This is another place where the right monitoring set-up helps, because on a modern or cloud-based system there can be hundreds of log files in all manner of locations, and manually eyeballing each one would be prohibitively time-consuming and prone to human error.
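
For the database side of a system-wide investigation, the instance’s cumulative wait statistics are a quick first look at where time is going. A minimal sketch, assuming Oracle 12c or later for the FETCH FIRST syntax:

    -- Top non-idle wait events since instance startup: a crude but quick
    -- first look at where the whole system is spending its time.
    SELECT event,
           wait_class,
           total_waits,
           ROUND(time_waited_micro / 1e6) AS seconds_waited
    FROM   v$system_event
    WHERE  wait_class <> 'Idle'
    ORDER  BY time_waited_micro DESC
    FETCH FIRST 10 ROWS ONLY;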

Business Process Slow-Down

For the slow-down of a single business process, it is more important to collect data at the business process level than at the whole system level. Performance can often look great at a system level because the average across all processes hides the one slow outlier. How many times has a user heard the phrase, “the system looks fine”, even as their frustration reaches boiling point waiting on their own slow process? Often this mismatch in perception comes down to collecting performance data at a system level when your user cares only about the performance of their critical business process. It reminds me of a quote from my old maths teacher: “Then there is the man who drowned crossing a stream with an average depth of six inches.” Collect your data at a session level rather than a system level, lest our users suffer the same fate.
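
Where the Diagnostics Pack is licensed, Active Session History makes session-level collection straightforward. A minimal sketch, with :sid standing for the session of interest:

    -- Approximate activity profile of one session over the last hour,
    -- sampled by Active Session History (Diagnostics Pack required).
    SELECT NVL(event, 'ON CPU') AS activity,
           COUNT(*)             AS samples
    FROM   v$active_session_history
    WHERE  session_id = :sid
    AND    sample_time > SYSTIMESTAMP - INTERVAL '1' HOUR
    GROUP  BY NVL(event, 'ON CPU')
    ORDER  BY samples DESC;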

Often, the slow business process will have many parts to it and so, to locate the bottleneck, it is important to identify if any one unit is taking up most of the time. I remember a time when a user complained of poor performance as the regulatory reports were two hours late and, after a few searching questions, it transpired that it was nothing to do with the Oracle database but that the printer was switched off! I switched on the printer and was both a hero and a gentleman by not mentioning the on button.

The search method is to keep dividing the business process until the slowest unit is isolated in its smallest form. This may require some creativity as business processes can span many layers of the system, from printers to batch managers. Until this point, one has pretty much only needed the wall clock as a metric. However, once the long-running unit has been found, deep-dive metrics will be needed – typically from a database process trace. The process/session data can also be collected in Active Session History, AWR, or Statspack tables depending on what you are licensed to use and what monitoring you have set up.
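
One common way to capture that deep-dive data is Oracle’s extended SQL trace, enabled per session through the DBMS_MONITOR package. A sketch, assuming you have identified the session’s SID and SERIAL# (the values and trace file name below are hypothetical):

    -- Enable extended SQL trace (with wait events and bind values) for one
    -- session, reproduce the slow unit of work, then disable the trace.
    EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123, serial_num => 4567, waits => TRUE, binds => TRUE);
    -- ... reproduce the slow business process unit here ...
    EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 123, serial_num => 4567);

    -- Then summarise the raw trace file with tkprof, slowest statements first:
    -- tkprof ORCL_ora_12345.trc slow_unit.txt sort=exeela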

Again, the objective is to keep diving down into the root of the bottleneck. Now at this depth, we are looking for the slowest system call(s) or slowest SQL statements. This laser focus on the bottleneck ensures we are optimising the parts of the system or workload that will make the biggest difference. Ideally, we will have an SQL statement or a number of SQL statements that are taking too long and the components of the system where they are spending too much of their time, e.g. 70% waiting on IO, 90% spent on the CPU, or 30% waiting on an internal latch, etc. At this point, we have successfully diagnosed the bottleneck and now we can get on with optimising it.
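
A quick way to surface those candidate statements is the shared pool’s cumulative execution statistics. A minimal sketch, again assuming Oracle 12c or later for the FETCH FIRST syntax:

    -- Top statements by total elapsed time currently in the shared pool.
    SELECT sql_id,
           executions,
           ROUND(elapsed_time / 1e6, 1)                           AS total_secs,
           ROUND(elapsed_time / GREATEST(executions, 1) / 1e3, 1) AS avg_ms
    FROM   v$sql
    ORDER  BY elapsed_time DESC
    FETCH FIRST 10 ROWS ONLY;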

Tuning

There are four effective ways to tune a bottleneck:

  1. Use less of the bottleneck’s resources, e.g. if the bottleneck is IO then tune the SQL statement to use less IO (see the sketch after this list).
  2. Give more resources to the bottleneck, e.g. if the bottleneck is IO then add more disks to the SAN to increase the bandwidth of the IO system.
  3. Make the resource faster at the bottleneck, e.g. if the bottleneck is IO then swap the hard disks in the IO subsystem for faster Solid State Disks.
  4. Use the bottleneck’s resources when it is less busy, e.g. if the bottleneck is IO then can low priority, IO hungry, batch jobs be moved to quiet overnight periods, freeing up resources at peak times?
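
To make the first option concrete, here is a hypothetical illustration: if the bottleneck is IO because a frequent lookup full-scans a large table, a suitable index lets the statement read a few index blocks instead of the whole table. The table and column names are invented for the example:

    -- Hypothetical example of option 1: reduce the IO a statement needs.
    -- Before the index, every lookup full-scans the ORDERS table.
    SELECT order_id, order_date
    FROM   orders
    WHERE  customer_id = :cust;

    -- A suitable index turns the full scan into a few index block reads.
    CREATE INDEX orders_cust_idx ON orders (customer_id);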

There is a hierarchy of Oracle tuning, depending on where the changes are made.

  1. Changes in the application design, e.g. changes in the schema design or connection pooling.
  2. Changes in the application SQL, e.g. changes to the SQL statements or PL/SQL code.
  3. Changes in Oracle, e.g. changes to the init.ora parameters to increase resources such as memory or latches (illustrated after this list).
  4. Changes in the hardware configuration, e.g. upgrading the IO subsystem, adding more CPUs, etc.
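
As a hypothetical illustration of a level-3 change, an instance parameter can grant more memory to the bottleneck; the value below is invented and should be sized from your own metrics:

    -- Hypothetical level-3 change: grow the SGA memory target.
    -- Written to the spfile; takes effect at the next instance restart.
    ALTER SYSTEM SET sga_target = 8G SCOPE = SPFILE;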

Changes at the top of the hierarchy are typically much more difficult to make once the system is live, but often have the greatest impact on performance. The moral of this hierarchy is that performance should be designed and built into a system as it is developed, because it is much harder to make deep design changes after go-live. Tuning should be part of the life cycle of an application, right through the analysis, design, and coding phases, not just the production and maintenance stages.

Reactive vs Proactive

It’s fair to say that, on a lot of new Oracle systems we look at, tuning is only performed reactively in response to a user complaint. This is too late. The call to action of this paper is that DBAs need to understand the key business priorities of the Oracle database and application, and then continually monitor these critical processes against performance baselines to ensure performance is maintained and SLAs are met. At DSP, we adopt the philosophy of continuous maintenance. It works hand-in-hand with the preventative analytics we run on all our customers’ Oracle systems, which predict the changes needed to keep a system fully optimised and so prevent performance issues from happening. The proactive changes recommended by our preventative analytics are then made, sometimes automatically to pre-prod systems, ensuring our customers’ systems do not decay over time.

Let DSP handle it

Likely to be the safest and most reliable option: let us take care of your database. Our servers are already fully optimised to handle whatever your business needs, and our constant maintenance means they stay that way without dropping in efficiency. We’ve designed our databases for maximum access with no bottlenecks, and if you want to change the scale of your database, we’re set up to accommodate you without any reduction in server performance.

If you’d like to know more about how we can make your database safer and more efficient, get in touch using our contact page.