First aid with SQL
To extract data from a database, you can use SQL. It is fairly quick to learn and incredibly powerful to use. In addition, the language is almost universal: SQL is used in many applications to access, structure and manage data. However, what we often see happen, not only to first-time users but also to more experienced analysts and developers, is that the performance of SQL operations leaves much to be desired. In this article we describe a few common causes of poor performance and show you how best to address them.
What is the problem?
As mentioned, SQL is incredibly useful for accessing, editing and structuring data. Perhaps its most important tasks are the structured storage of data and determining how data is retrieved: running an SQL query. As long as an SQL query produces results, the user seems to have nothing to worry about.
However, when data is retrieved, records are physically read from disk, a relatively expensive and slow operation, so it should obviously be done efficiently. Even for more experienced users, there is no reason to stop reading now: processing millions of records in a data warehouse is of a different order than updating a single record. In our experience, there are therefore many places where gains can be made.
Do I have a problem?
The moment the run time of an SQL statement is undesirably long, we speak of a performance problem. For performance problems, it is useful to start with the approach the software takes to retrieve or process the data. Depending on your software, this is called the execution plan or explain plan. What I always check is whether the estimated number of records is accurate: if the plan expects many more records than you do, a suboptimal approach may be chosen. This can be due to the structure of the SQL query, the joins between tables, the keys used, or something else. Some programs even give you hints on how to improve performance. Below we provide five examples and tips for performance improvement, based on our daily practice.
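As a small illustration of reading a plan, the sketch below uses SQLite's EXPLAIN QUERY PLAN (every engine has its own syntax; Oracle uses EXPLAIN PLAN FOR, PostgreSQL uses EXPLAIN). The `orders` table and its columns are hypothetical, invented for the example.

```python
import sqlite3

# In-memory SQLite database with a hypothetical "orders" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(i, i % 100, i * 1.5) for i in range(1000)])

# Without an index, the plan shows a full table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()
print(plan[0][3])  # e.g. "SCAN orders"

# After adding an index, the same query is answered via an index search.
con.execute("CREATE INDEX ix_orders_cust ON orders (customer_id)")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()
print(plan[0][3])  # e.g. "SEARCH orders USING INDEX ix_orders_cust ..."
```

The plan text differs per database, but the habit is the same: read the plan first, then decide where to intervene.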
SORT-operations
What makes SQL so powerful is its ability to sort data and create groups. The ability to quickly gather data across multiple records and group the results is under the hood of many dashboards, for example. It has several manifestations, with GROUP BY and DISTINCT being the most familiar. The reason to be vigilant here is that the amount of work (for the database) grows faster than the number of records: sorting n records takes on the order of n log n work, so the cost per record keeps rising as the table grows. This means that as the database grows, managing performance and the cost of resources may eventually become problematic. One solution is to increase processing power or memory, easy to do these days when working in a cloud environment. But this only shifts the problem to other resources and does not address its core.
To illustrate that growth, consider the following example. From a deck of cards, take out all the clubs and put them in order: A, K, ..., 3, 2. Label this activity as one unit of work. Then shuffle the complete deck and do the same for each suit. Four times as many cards, and yet more than four times the work: the cards first had to be filtered by suit. You can imagine how much more work, and how much more complex, this becomes with two or more decks of cards.
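The card analogy can be made measurable with a quick sketch: count how many pairwise comparisons Python's sort needs for one "suit" of 13 items versus a full shuffled "deck" of 52. This is an illustration of super-linear growth, not a database benchmark.

```python
import functools
import random

# Count the pairwise comparisons sorted() performs on a shuffled list.
def comparisons_to_sort(n, seed=0):
    counter = {"cmp": 0}
    def cmp(a, b):
        counter["cmp"] += 1
        return (a > b) - (a < b)
    items = list(range(n))
    random.Random(seed).shuffle(items)
    sorted(items, key=functools.cmp_to_key(cmp))
    return counter["cmp"]

small = comparisons_to_sort(13)   # one "suit"
large = comparisons_to_sort(52)   # the whole shuffled deck
print(small, large)               # the ratio comfortably exceeds 4
```

Four times the input, but well over four times the comparisons: exactly the effect to watch for with GROUP BY, DISTINCT and ORDER BY on growing tables.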
So assess whether you really need the GROUP BY or DISTINCT. A tip I got a very long time ago concerns the columns and expressions in the GROUP BY clause. When I want to select all kinds of detail columns alongside the primary key (PK), I only need to include the PK in the GROUP BY and can still show the details by wrapping them in a MIN() function. After all, for the same PK, all the details are identical. The advantage of this approach is that the (internal) workspace of the SQL interpreter becomes smaller, resulting in performance gains. It sounds more complicated than it is.
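A minimal sketch of that tip, using a hypothetical customer/orders schema in SQLite: group on the PK alone and fetch the (identical) detail columns via MIN(), instead of listing every detail column in the GROUP BY.

```python
import sqlite3

# Hypothetical schema, invented for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INT, amount REAL);
INSERT INTO customer VALUES (1, 'Alice', 'Utrecht'), (2, 'Bob', 'Delft');
INSERT INTO orders   VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Variant 1: every detail column in the GROUP BY (wide grouping key).
wide = con.execute("""
    SELECT c.id, c.name, c.city, SUM(o.amount)
    FROM customer c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name, c.city
""").fetchall()

# Variant 2: group on the PK only; name and city are identical within
# a group, so MIN() returns the same values with a smaller sort key.
narrow = con.execute("""
    SELECT c.id, MIN(c.name), MIN(c.city), SUM(o.amount)
    FROM customer c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
""").fetchall()

print(sorted(narrow) == sorted(wide))  # same result, less grouping work
```

Both queries return identical rows; the second gives the interpreter a much smaller grouping workspace on wide tables.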
Indexes or Partitioning
An index is a fairly efficient search structure based on the ordering of a key (one column, or a combination of columns). Looking up a record by key takes a small number of steps and yields a reference to the internal, physical address of the record. Ideal when a very small set of records needs to be looked up.
Data warehouse environments often involve large(r) data sets. If processing runs through an index, the index is consulted with multiple steps per record, and the overhead becomes very large. For Oracle, the tipping point lies at roughly 2%: an index only pays off when less than about 2% of all records in the table are selected. For this reason, I am certainly not in favor of forcing large queries through indexes.
Ideally, I use table partitioning: dividing the overall table based on prior knowledge, an attribute or a predicate, such as year of entry or a product group. Put all the red marbles in bin 1 and all the green ones in bin 2, and you know exactly where to look when you need a green marble; the other bins can be skipped entirely. This deliberate skipping of parts of a table is also called pruning. Preferably use a property for partitioning that often appears in joins. If two tables are partitioned on the same property, then the join (and the rest of the query) can also be done in pieces (partition-wise joining). This can counteract the steeply growing resource requirement mentioned with GROUP BY. Note that pruning and partition-wise joining can hardly be used, if at all, when "non-operational" properties are used for partitioning, such as a date or a hash.
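The marble-bin idea can be sketched in a few lines. This toy example only illustrates the principle of pruning; a real database declares partitions in the DDL (in Oracle, for instance, with a PARTITION BY RANGE clause) and prunes automatically when a query filters on the partition key.

```python
from collections import defaultdict

# Toy "table" of marbles: (colour, id) rows.
rows = [("green", 1), ("red", 2), ("green", 3), ("blue", 4), ("red", 5)]

# Partition the table by colour up front, like bins of marbles.
partitions = defaultdict(list)
for colour, marble_id in rows:
    partitions[colour].append(marble_id)

# Pruning: a predicate on the partition key touches a single bucket;
# the red and blue bins are never read at all.
green = partitions["green"]
print(green)  # → [1, 3]
```

The same logic explains partition-wise joining: when two tables share the same buckets, each bucket pair can be joined independently.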
Joins
It may sound trivial, but make sure that when joining tables (and views, etc.) the full join condition is used: specify the join type, the tables and the keys. This is probably the best way to keep the total number of records (the cardinality) under control. To keep join conditions clearly in view, I favor ANSI notation, where the JOIN is specified in the FROM clause rather than in a table list with the join conditions in the WHERE clause. Necessary tools here are the definitions of primary keys and unique keys, and a good data model in which the relationships between entities or tables are clear.
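The contrast below shows why ANSI notation helps. The product/sale tables are hypothetical; the point is that in the old comma-list style, a forgotten WHERE condition silently produces a Cartesian product, while the ANSI JOIN forces you to write the condition next to the join itself.

```python
import sqlite3

# Hypothetical schema, invented for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sale    (id INTEGER PRIMARY KEY, product_id INT, qty INT);
INSERT INTO product VALUES (1, 'bolt'), (2, 'nut');
INSERT INTO sale    VALUES (1, 1, 5), (2, 1, 2), (3, 2, 9);
""")

# ANSI notation: join type and key are explicit in the FROM clause.
ansi = con.execute("""
    SELECT p.name, s.qty
    FROM product p
    INNER JOIN sale s ON s.product_id = p.id
""").fetchall()

# Old style with the condition forgotten: every product is paired
# with every sale, and the cardinality explodes.
cartesian = con.execute(
    "SELECT p.name, s.qty FROM product p, sale s"
).fetchall()

print(len(ansi), len(cartesian))  # → 3 6
```

Three correct rows versus six meaningless ones here; with production-sized tables the difference is millions of rows and a query that never finishes.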
Distrust generated queries
With performance problems, don't focus only on the query itself; also look at the environment. I have worked on several assignments with a so-called ETL tool with a client-server architecture. Notable on all of these jobs was the prescription to "not write your own SQL"; the tool "writes" the SQL itself. In other words: get all the data to the tool, which loads it into working memory and does its thing. Often this is not efficient.
A recent example is an application in which two sets of records are matched, each twenty million records in size. For each record in one set, the matching record in the other set must be found. The generated query, however, chooses to sort all records: an expensive, time-consuming approach. By writing the query yourself and using existing indexes and partitioning, this can be done much more effectively.
Testing
If you are making improvements and want performance gains to be measurable, pay close attention. With great regularity, (potential) performance improvements are tested by comparing the run times of the old and the new code. However, the results can be strongly influenced by the situations below:
- Other processes may claim resources from the servers and the network, so that the measurements are not directly comparable;
- Temporary storage of data in memory (caching) can eliminate the need for many reads from the physical disk (the physical IO).
In general, I value the differences in the explain plans more than this kind of idle-system testing. When it really cannot be avoided, I try to start all SQL versions at the same time, so that the load on the machine is more or less equal.
Parallel execution of SQL
If there are no more performance gains to be made in the SQL queries themselves, one option remains: executing a query in parallel. The advantage of executing SQL in parallel is the reduction of processing time. The option has a limited shelf life: it works until other resources become the bottleneck, such as write buffers or memory. However, parallel execution also has disadvantages; the short-term load on the server is higher, since there is more coordination and communication to do.
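The principle behind parallel query execution can be sketched outside the database: split the data into chunks, aggregate each chunk concurrently, then combine the partial results. The combine step is exactly the extra coordination mentioned above. (In a real database you would enable this declaratively, e.g. with Oracle's PARALLEL hint, rather than in application code.)

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data set and four "parallel workers", each handling a chunk.
data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, chunks))   # per-chunk aggregation

total = sum(partials)  # the coordination/combine step
print(total == sum(data))  # → True
```

The partial sums arrive sooner, but scheduling four workers and merging their results is overhead the serial version never pays: the trade-off the summary below warns about.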
To briefly summarize the tips for performance gains:
- Limit the use of GROUP BY, DISTINCT or ORDER BY.
- Use ANSI joins, do not join unnecessary tables and use inner joins as much as possible.
- Write out your join conditions completely in the FROM statement and keep your total number of records (cardinality) under control.
- Use indexes where they work well; for larger sets, prefer partitioned tables (ideally multiple tables partitioned on the same property).
- Distrust generated queries; they often contain limited knowledge of the data model and are suboptimal.
- Consider parallel execution of SQL queries; it places a greater short-term load on the environment but delivers results faster.