Problem:
Imagine you run a successful website or app, month after
month your traffic grows until you notice page views are getting slower and
slower and the user experience is suffering. A little investigation reveals the
database is the bottleneck.
Solutions:
This is a common scenario and there are several lines of
attack you can follow to alleviate this problem:
1)
Caching. All enterprise level, high traffic
websites use caching in one form or another to operate at in-memory speeds and
provide a lightning quick user experience. Memcached or Terracotta BigMemory
would be a good solution here.
2)
NoSql databases. Adding a NoSql database tier
such as Cassandra or Solr/Lucene can provide huge advantages in terms of
scalability and specialized search, such a geospatial search. This doesn’t mean ditching your relational
database, many enterprise web-architectures have both Sql and NoSql database
tiers running in harmony together, with the relational database as the authoritative
data repository while something like Solr satisfies the majority of the user
search queries.
3)
Optimize your database queries.
In reality you probably want to employ a combination of all
3 approaches, but for this article I’m going to focus on approach 3.
Optimize your
database queries
All databases (even NoSql ones) have the notion of indexes.
Proper index design is absolutely essential to maintain performance as a table
grows in size. Indexes need to be closely aligned to the actual queries being
executed by the database to be effective. I’m also going to cover several
important nuances of designing effective indexes.
Align your Indexes to
your Queries
Start by capturing the exact queries being sent to the
database. There are several alternative ways:
1)
Enable the MySql General Query Log. This causes MySql to write all queries it
receives to a log.
To activate the log use MySql Workbench or MySql
Monitor to issue the command:
SET
GLOBAL general_log = 1;
Then locate the log on the computer running MySql ,
e.g.
On Windows: C:\ProgramData\MySQL\MySQL
Server 5.5\data\<hostname>.log
On Linux: /var/lib/mysql/
<hostname>.log
2)
Enable the Slow Query Log.
This causes MySql to log all queries taking
longer than a default of 10 secs to execute.
# activate for all queries
# activate log
SET GLOBAL slow_query_log = 1;
Find the log:
Windows:
C:\ProgramData\MySQL\MySQL Server 5.5\data\<hostname>-slow.log
Linux:
/var/lib/mysql/ <hostname>-slow.log
3)
Application Logging
If you are building your queries in your
application, you’ll be able to print them to your application logs before they
execute.
If you’re using an ORM such as Hibernate you
can enable SQL debugging, setting these properties is often a good start:
<property name="hibernate.show_sql"
value="true"/>
<property name="hibernate.format_sql"
value="true"/>
<property name="hibernate.use_sql_comments"
value="true"/>
Which method is best?
I recommend method (2) to capture all queries to the
database over a period of time, e.g. you could enable this on a test server during
a load simulation for 15 mins to capture a reasonable sample of database
queries. You should ensure your load simulation provides an accurate coverage
of queries that are typically executed by your users, if you don’t have this
confidence then you cannot beat capturing a log from your live database server
as the logging is pretty lightweight and should not have a negative effect.
Note: you can even capture slow query logs from Amazon RDS cloud databases.
Processing a Slow
Query Log.
MySql now provides a tool to do this but I recommend using
Percona’s excellent toolkit, specifically the tool pt-query-digest.
After you have installed this, process your captured slow
log with a command like:
~/ptools/bin/pt-query-digest
/var/lib/mysql/mysql-slow.log > ~/pqd.txt
Here pt-query-digest processes
your mysql-slow.log to produce the file: pqd.txt , this is an aggregation of
all the queries in the slow log , it shows you the top 20 slow running queries
thus enabling you to target the lowest hanging fruit first.
Replacing SubSelects
with Joins
If you notice your slowest queries use subselects (a dependent select statement within the where
clause of the main select) then I’d generally advise you to rewrite the query
using Joins. Subselects often scale horribly on tables that contain lots of
data, whereas the database optimizer can use Joins to its benefit to narrow
down a result set very quickly. There is often an orders of magnitude
difference between the 2 techniques with identical outputs.
Using Explain
Prefixing a query with the EXPLAIN keyword causes MySql to
explain how it resolves a query. This is critical in devising the correct
indexes to speed up the query.
For example, suppose we have a social networking site with a
table containing messages between users.
When a user checks his message inbox, this may cause the following query
to run:
SELECT * FROM message AS msg WHERE msg.dest_user=489764 AND
msg.del_by_user=0 ORDER BY msg.priority DESC, msg.created DESC LIMIT 10
The query is running slow, but you have no idea why, EXPLAIN
can help, issue the command:
EXPLAIN SELECT * FROM message AS msg WHERE
msg.dest_user=489764 AND msg.del_by_user=0 ORDER BY msg.priority DESC,
msg.created DESC LIMIT 10
MySql responds with:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,msg,ALL,NULL,NULL,NULL,NULL,1618964,"Using
where; Using filesort"
This shows us 2 main things;
1)
No Key (Index) was used (or available) to
satisfy this query.
2)
1618964 rows were scanned to deliver this result
of up to 10 messages, this is not efficient!
Running SELECT COUNT(*) FROM message;
also returns 1618964 , meaning we did a full-table scan. This is the
worst case scenario, it means the database had to read every record in the
table to determine if it should form part of the result set. This really won’t
scale as the message table gets bigger.
Fortunately, Indexes can really help us here…
Index Selectivity
Indexes are most effective when they have high selectivity, this means that the
column we are indexing has a high number of possible values.
It’s defined as:
Selectivity = of distinct values in a column / number of
records ➔ S = d/n
Thus a good candidate for an index in the above example
would be dest_user. In our example app every new user receives a welcome
message and additionally we don’t actually remove a message from the database
when a user deletes it (hence the del_by_user flag), so it’s possible there’s a
message in the table for each of the 1210778 users in existence. Thus our
selectivity is:
S = 1210778 / 1618964
S is 0.75
As ‘distinct values in
a column’ can never exceed ‘number of
records’ the highest selectivity possible is 1, so 0.75 is a pretty good
candidate for an index on this column.
Contrast this with putting an index on the del_by_user
column. It’s Boolean so there are only 2 possible values. Thus our selectivity
is:
S = 2 / 1618964
S is 0.0000012
Very low, not a particularly effective index, however it
would narrow down the number of records to be scanned to half of the total if
there’s an even distribution of read messages.
The Overhead of
Indexes
Indexes do have an overhead, every time there’s a query that
changes the data being indexed then both the table data and its associated
indexes must be modified. This is a
price worth paying for an effective index that speeds up your most frequent
queries but it’s also a compelling argument against blindly applying indexes
against all columns of every table!
When executing a query the database optimizer decides
whether it’s better to use an existing index or not.
If you notice by using explain
that an index exists for a column but is never actually used, it’s usually
advisable to remove that index to clear the overhead.
Creating Indexes
Given the selectivity calculations above, dest_user is
looking like a great candidate for an index, lets create one by issuing this
command to MySql:
CREATE INDEX msgidx ON message (dest_user);
Now we run explain again, and get the following result:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,message0_,ref, msgidx,
msgidx,8,const,1282,"Using where"
This is a huge
improvement, it means MySql only had to scan 1282 records to produce the result
set, this is going to be infinitely faster than scanning the whole table, in fact
it tells us there are 1282 messages in the database to this user and MySql will
then apply the next criteria in the WHERE clause to narrow the result set down
further. The next criteria is
del_by_user. It could be worthwhile to index this column too.
Compound Indexes
A compound index is one index containing multiple columns.
So knowing the query above has a WHERE clause of:
WHERE msg.dest_user=489764 AND msg.del_by_user=0
A sensible compound index to match this query would be:
CREATE INDEX msgidxDestusrDel ON message (dest_user,
del_by_user);
Running EXPLAIN on the query again now gives:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,message0_,ref," msgidx,msgidxDestusrDel",msgidxDestusrDel,9,"const,const",141,"Using
where"
This was really worth doing, we’ve got the rows scanned down
to 141, which is exactly the number of rows returned by the WHERE clause of
this query (excluding the LIMIT). This is always our goal when using EXPLAIN
and this should be the point at which we move on to optimizing the next query.
You may be wondering why I chose a compound index here
rather than just adding a new standalone index for del_by_user, let’s try that,
I remove the compound index msgidxDestusrDel and add:
CREATE INDEX msgidxDel ON message (del_by_dest);
Running EXPLAIN gives:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,message0_,ref, “msgidx,
msgidxDel”, msgidx,8,const,1282,"Using where"
Although 2 indexes exist the MySql optimizer in this case
decides not make use of our new index msgidxDel, this is due to the tree
implementation MySql uses to represent indexes. Index implementation differs
between database vendors, so this outcome is MySql specific. Due to the
overhead I described earlier we’d be better off without this new index.
Index Ordering
Matters
Given a compound index of:
CREATE INDEX msgidxDestusrDel ON message (dest_user,
del_by_user);
This would be useful for a WHERE clause of:
WHERE msg.dest_user=489764 AND msg.del_by_user=0
And
WHERE msg.del_by_user=0 AND msg.dest_user=489764
i.e. MySql is smart enough to select the best index
regardless of the ordering within the WHERE clause.
However, if the message table also contains a Boolean msg_read
column and the only compound index
we create is:
CREATE INDEX msgidxDestusrDel ON message (dest_user, msg_read,
del_by_user);
This index would not be useful for the WHERE clause:
WHERE msg.dest_user=489764 AND msg.del_by_user=0
MySql would in fact revert back to using the single index we
have defined on the dest_user column, this is because the order of our new
compound index is not compatible with this WHERE clause, msg_read is between
the two column we are actually searching on.
Had the order of the compound index been:
CREATE INDEX msgidxDestusrDel ON message (dest_user,
del_by_user, msg_read);
This would be used, and the last msg_read component would be
ignored.
Indexes are also used
for Sorting results
If you include your ORDER BY columns after your search
columns within a compound index your query will be even faster, e.g.
In the context of our social networking site, imagine we
have a table of ratings made by
users. It has the following compound index:
CREATE INDEX srcRegRatingDate ON rating (src_user, rating_value,
updated);
For the query:
SELECT * FROM rating AS rating0_ WHERE rating0_.src_user=8019
AND rating0_. rating_value=100 ORDER BY rating0_.updated DESC LIMIT 15
MySql can use the index order for the results rather than conducting
a separate sorting operation.
The damaging effect
of Range operators
Using a range
(LIKE or ‘>’ etc) operator on a column in a compound index means that any additional
criteria we have for subsequent columns cannot benefit from the index.
e.g. With reference to the above ‘rating’ example. The subtle difference of using a >= range
operator on the rating_value rather than just ‘=’ would mean MySql could not
use the ‘updated’ column of the index for the sort.
Covering Indexes
Covering indexes
are indexes that contain all the columns necessary to completely satisfy the
return requirements of the query solely using the index. In this case MySql can
provide results without reference to the data table itself. This provides a
huge speed boost.
If you name the columns to be returned by your select statement you can selectively
ensure these columns appear within your compound index.
The position of the returned columns within the compound
index is not important, the return columns merely have to be present.
If you are using an ORM such as hibernate and returning
complex objects into your application your chances of using a covering index
are more limited. This is because hibernate fetches all columns in every Select
statement in order to fully populate the return object. If speed is crucial you
may choose to take this responsibility away from hibernate and write your own
database fetch routines to maximise the opportunity of covering indexes.
Conclusion
As you can see, designing the most efficient indexes for an
application requires great insight to the queries actually being executed by
the database. You need to carefully design your indexes around your most
important queries, there is no effective catch-all situation.
The performance gains to be made by implementing effective
indexing can be enormous. A little engineering effort here is often much more
cost-effective than simply scaling-up or scaling-out database hardware.