Thursday, February 7, 2013

Optimizing your Database Queries using MySql, Explain and Indexes

Problem:

Imagine you run a successful website or app. Month after month your traffic grows, until you notice page views getting slower and slower and the user experience suffering. A little investigation reveals that the database is the bottleneck.

Solutions:

This is a common scenario, and there are several lines of attack you can follow to alleviate the problem:
1) Caching. All enterprise-level, high-traffic websites use caching in one form or another to operate at in-memory speeds and provide a lightning-quick user experience. Memcached or Terracotta BigMemory would be a good solution here.
2) NoSql databases. Adding a NoSql database tier such as Cassandra or Solr/Lucene can provide huge advantages in scalability and specialized search, such as geospatial search. This doesn't mean ditching your relational database: many enterprise web architectures have Sql and NoSql database tiers running in harmony, with the relational database as the authoritative data repository while something like Solr satisfies the majority of user search queries.
3) Optimize your database queries.
In reality you probably want to employ a combination of all three approaches, but in this article I'm going to focus on approach 3.

Optimize your database queries

All databases (even NoSql ones) have the notion of indexes. Proper index design is absolutely essential to maintaining performance as a table grows in size, and to be effective, indexes need to be closely aligned with the actual queries the database executes. I'm also going to cover several important nuances of designing effective indexes.

Align your Indexes to your Queries

Start by capturing the exact queries being sent to the database. There are several ways to do this:
1) Enable the MySql General Query Log. This causes MySql to write every query it receives to a log file.
To activate the log, use MySql Workbench or the MySql Monitor to issue the command:
SET GLOBAL general_log = 1;
Then locate the log on the machine running MySql, e.g.
On Windows: C:\ProgramData\MySQL\MySQL Server 5.5\data\<hostname>.log
On Linux: /var/lib/mysql/<hostname>.log
2) Enable the Slow Query Log.
This causes MySql to log every query that takes longer than long_query_time (10 seconds by default) to execute.
# log all queries, regardless of how long they take
SET GLOBAL long_query_time = 0;
# activate the log
SET GLOBAL slow_query_log = 1;
Find the log:
Windows: C:\ProgramData\MySQL\MySQL Server 5.5\data\<hostname>-slow.log
Linux: /var/lib/mysql/<hostname>-slow.log
If you're unsure of the exact paths, see the sketch after this list.
3) Application Logging
If you build your queries in your application, you can print them to your application logs before they execute.
If you're using an ORM such as Hibernate you can enable SQL debugging; setting these properties is often a good start:
<property name="hibernate.show_sql" value="true"/>
<property name="hibernate.format_sql" value="true"/>
<property name="hibernate.use_sql_comments" value="true"/>

Which method is best?
I recommend method (2) to capture all queries to the database over a period of time. For example, you could enable it on a test server during a 15-minute load simulation to capture a reasonable sample of database queries. Make sure your load simulation accurately covers the queries your users typically execute; if you don't have that confidence, you can't beat capturing a log from your live database server, as the logging is lightweight and shouldn't have a noticeable negative effect. Note: you can even capture slow query logs from Amazon RDS cloud databases.

Processing a Slow Query Log

MySql ships with its own tool for this (mysqldumpslow), but I recommend Percona's excellent toolkit, specifically the tool pt-query-digest.
After you have installed it, process your captured slow log with a command like:
~/ptools/bin/pt-query-digest /var/lib/mysql/mysql-slow.log > ~/pqd.txt
Here pt-query-digest processes mysql-slow.log to produce the file pqd.txt, an aggregation of all the queries in the slow log. It shows you the top 20 slow-running queries, enabling you to target the lowest-hanging fruit first.

Replacing Subselects with Joins

If you notice your slowest queries use subselects (a dependent SELECT statement within the WHERE clause of the main SELECT), I'd generally advise rewriting the query using joins. Subselects often scale horribly on tables containing lots of data, whereas the database optimizer can use joins to narrow down a result set very quickly. There is often a difference of orders of magnitude between the two techniques, despite identical output.
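
As a hedged sketch of the rewrite (the user table and its id column are hypothetical, invented for illustration; message is the table used in the EXPLAIN example below), consider finding every user who has an undeleted message:
# subselect version: the dependent SELECT scales badly as message grows
SELECT * FROM user AS u
WHERE u.id IN (SELECT m.dest_user FROM message AS m WHERE m.del_by_user = 0);
# join version with identical output: the optimizer can narrow the result set quickly
SELECT DISTINCT u.*
FROM user AS u
JOIN message AS m ON m.dest_user = u.id
WHERE m.del_by_user = 0;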

Using Explain

Prefixing a query with the EXPLAIN keyword causes MySql to explain how it resolves the query. This is critical in devising the correct indexes to speed it up.
For example, suppose we have a social networking site with a table containing messages between users. When a user checks his message inbox, it may cause the following query to run:
SELECT * FROM message AS msg WHERE msg.dest_user=489764 AND msg.del_by_user=0 ORDER BY msg.priority DESC, msg.created DESC LIMIT 10
The query is running slow but you have no idea why. EXPLAIN can help; issue the command:
EXPLAIN SELECT * FROM message AS msg WHERE msg.dest_user=489764 AND msg.del_by_user=0 ORDER BY msg.priority DESC, msg.created DESC LIMIT 10
MySql responds with:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,msg,ALL,NULL,NULL,NULL,NULL,1618964,"Using where; Using filesort"
 
This shows us two main things:
1) No key (index) was used (or was available) to satisfy this query.
2) 1618964 rows were scanned to deliver a result of at most 10 messages. This is not efficient!
Running SELECT COUNT(*) FROM message; also returns 1618964, meaning we did a full table scan. This is the worst-case scenario: the database had to read every record in the table to determine whether it belonged in the result set. This really won't scale as the message table gets bigger.
Fortunately, Indexes can really help us here…

Index Selectivity

Indexes are most effective when they have high selectivity, meaning the indexed column has a high number of distinct values relative to the number of records.
It's defined as:
Selectivity = number of distinct values in a column / number of records, i.e. S = d/n
Thus a good candidate for an index in the above example would be dest_user. In our example app every new user receives a welcome message, and we don't actually remove a message from the database when a user deletes it (hence the del_by_user flag), so it's possible there's a message in the table for each of the 1210778 users in existence. Our selectivity is therefore:
S = 1210778 / 1618964 ≈ 0.75
As the number of distinct values in a column can never exceed the number of records, the highest possible selectivity is 1, so 0.75 makes dest_user a pretty good candidate for an index.
Contrast this with putting an index on the del_by_user column. It's Boolean, so there are only 2 possible values. Our selectivity is:
S = 2 / 1618964 ≈ 0.0000012
Very low, so not a particularly effective index on its own, although it would still halve the number of records to be scanned if deleted and undeleted messages were evenly distributed.
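
Incidentally, you don't have to estimate these figures; a quick sketch for measuring selectivity directly with standard aggregate functions:
# measure the selectivity of the candidate columns directly
SELECT COUNT(DISTINCT dest_user) / COUNT(*) AS s_dest_user,
       COUNT(DISTINCT del_by_user) / COUNT(*) AS s_del_by_user
FROM message;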

The Overhead of Indexes

Indexes do have an overhead: every time a query changes the data being indexed, both the table data and its associated indexes must be modified. This is a price worth paying for an effective index that speeds up your most frequent queries, but it's also a compelling argument against blindly applying indexes to every column of every table!
When executing a query, the database optimizer decides whether it's better to use an existing index or not.
If you notice via EXPLAIN that an index exists for a column but is never actually used, it's usually advisable to remove that index and reclaim the overhead.
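
A brief sketch of the housekeeping involved (unused_idx is a hypothetical index name, standing in for whichever index never appears in the 'key' column of your EXPLAIN output):
# list the indexes defined on a table, with cardinality estimates
SHOW INDEX FROM message;
# drop an index that is never chosen by the optimizer
DROP INDEX unused_idx ON message;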

Creating Indexes

Given the selectivity calculations above, dest_user is looking like a great candidate for an index, so let's create one by issuing this command to MySql:
CREATE INDEX msgidx ON message (dest_user);
Now we run explain again, and get the following result:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,msg,ref,msgidx,msgidx,8,const,1282,"Using where"
 
This is a huge improvement: MySql only had to scan 1282 records to produce the result set, which is vastly faster than scanning the whole table. In fact it tells us there are 1282 messages in the database addressed to this user, and MySql then applies the next criterion in the WHERE clause, del_by_user, to narrow the result set down further. It could be worthwhile to index this column too.

Compound Indexes

A compound index is a single index containing multiple columns. Knowing that the query above has a WHERE clause of:
WHERE msg.dest_user=489764 AND msg.del_by_user=0
A sensible compound index to match this query would be:
CREATE INDEX msgidxDestusrDel ON message (dest_user, del_by_user);
Running EXPLAIN on the query again now gives:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,msg,ref,"msgidx,msgidxDestusrDel",msgidxDestusrDel,9,"const,const",141,"Using where"
 
This was really worth doing: we've got the rows scanned down to 141, which is exactly the number of rows matched by the WHERE clause of this query (excluding the LIMIT). This is always our goal when using EXPLAIN, and it's the point at which we should move on to optimizing the next query.
You may be wondering why I chose a compound index here rather than just adding a new standalone index for del_by_user. Let's try that: I remove the compound index msgidxDestusrDel and add:
CREATE INDEX msgidxDel ON message (del_by_user);
Running EXPLAIN gives:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,msg,ref,"msgidx,msgidxDel",msgidx,8,const,1282,"Using where"
 
Although two indexes now exist, the MySql optimizer in this case decides not to make use of our new index msgidxDel; this is down to the B-tree implementation MySql uses to represent indexes. Index implementations differ between database vendors, so this outcome is MySql-specific. Given the overhead I described earlier, we'd be better off without this new index.
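So the tidy-up here is simply:
# remove the ignored index and reclaim its write overhead
DROP INDEX msgidxDel ON message;
# reinstate the compound index that got the scan down to 141 rows
CREATE INDEX msgidxDestusrDel ON message (dest_user, del_by_user);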
 

Index Ordering Matters

Given a compound index of:
CREATE INDEX msgidxDestusrDel ON message (dest_user, del_by_user);
This would be useful for a WHERE clause of:
WHERE msg.dest_user=489764 AND msg.del_by_user=0
and equally for:
WHERE msg.del_by_user=0 AND msg.dest_user=489764
i.e. MySql is smart enough to select the best index regardless of the ordering of criteria within the WHERE clause.
However, if the message table also contains a Boolean msg_read column and the only compound index we create is:
CREATE INDEX msgidxDestusrDel ON message (dest_user, msg_read, del_by_user);
This index would not be useful for the WHERE clause:
WHERE msg.dest_user=489764 AND msg.del_by_user=0
MySql would in fact revert to using the single index we defined on the dest_user column. This is because the column order of our new compound index is not compatible with this WHERE clause: msg_read sits between the two columns we are actually searching on.
Had the order of the compound index been:
CREATE INDEX msgidxDestusrDel ON message (dest_user, del_by_user, msg_read);
This would be used, and the last msg_read component would be ignored.
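A useful corollary (assuming MySql's standard leftmost-prefix behaviour for compound indexes): an index can also serve queries that only reference its leading columns, which arguably makes the standalone msgidx index redundant here. For example:
# served by the leftmost prefix (dest_user) of the compound index
SELECT * FROM message WHERE dest_user=489764;
# served by the prefix (dest_user, del_by_user)
SELECT * FROM message WHERE dest_user=489764 AND del_by_user=0;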

Indexes are also used for Sorting results

If you include your ORDER BY columns after your search columns within a compound index, your query will be even faster. For example, in the context of our social networking site, imagine we have a table of ratings made by users, with the following compound index:
CREATE INDEX srcRegRatingDate ON rating (src_user, rating_value, updated);
For the query:
SELECT * FROM rating AS rating0_ WHERE rating0_.src_user=8019 AND rating0_.rating_value=100 ORDER BY rating0_.updated DESC LIMIT 15
MySql can return the results in index order rather than performing a separate sorting operation (no "Using filesort" in EXPLAIN).

The damaging effect of Range operators

Using a range operator (LIKE, '>' etc.) on a column in a compound index means that any criteria on subsequent columns of that index cannot benefit from it.
For example, with reference to the 'rating' example above: the subtle difference of using a >= range operator on rating_value rather than '=' would mean MySql could not use the 'updated' column of the index for the sort.
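
The difference is a single character, but EXPLAIN tells the story; I'd expect the second query to report "Using filesort":
# '=' on rating_value: the index satisfies both the WHERE and the ORDER BY
SELECT * FROM rating AS rating0_ WHERE rating0_.src_user=8019 AND rating0_.rating_value=100 ORDER BY rating0_.updated DESC LIMIT 15;
# '>=' on rating_value: the 'updated' part of the index can no longer drive the sort
SELECT * FROM rating AS rating0_ WHERE rating0_.src_user=8019 AND rating0_.rating_value>=100 ORDER BY rating0_.updated DESC LIMIT 15;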

Covering Indexes

Covering indexes are indexes that contain all the columns necessary to satisfy a query entirely from the index itself. In this case MySql can produce the results without reference to the data table at all, which provides a huge speed boost.
If you name the columns to be returned by your SELECT statement, you can selectively ensure those columns appear within your compound index.
The position of the returned columns within the compound index is not important; they merely have to be present.
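A sketch using the rating example above: if we only need the timestamps, the srcRegRatingDate index already covers the query, and I'd expect EXPLAIN to report "Using index":
# every referenced column (src_user, rating_value, updated) lives in the index,
# so MySql can answer without reading the rating table at all
SELECT updated FROM rating WHERE src_user=8019 AND rating_value=100 ORDER BY updated DESC LIMIT 15;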
If you're using an ORM such as Hibernate and returning complex objects into your application, your chances of using a covering index are more limited. This is because Hibernate fetches all columns in every SELECT statement in order to fully populate the returned object. If speed is crucial you may choose to take this responsibility away from Hibernate and write your own database fetch routines to maximise the opportunity for covering indexes.

Conclusion

As you can see, designing the most efficient indexes for an application requires real insight into the queries actually being executed by the database. You need to design your indexes carefully around your most important queries; there is no effective catch-all approach.
The performance gains from effective indexing can be enormous. A little engineering effort here is often far more cost-effective than simply scaling up or scaling out your database hardware.