
As any experienced SQL user knows, there are usually many different ways to extract the data you want from your database, and each approach often yields different performance characteristics. In this post I'm going to show three different ways to SELECT data from MySQL when a join is used in conjunction with a GROUP BY clause.

Playing Along at Home

You will need a MySQL server to play along, plus a join_ex1.sql file built from the statements below. To run the SQL file all at once, log into the MySQL console and type:
mysql> source /path/to/file/join_ex1.sql

Database Creation and Usage

create database join_ex1;
use join_ex1;

Tables and An Index

This example has three tables: site, hit and site_freq. The first table, site, represents a website (ex: Slashdot, Digg). The second table, hit, represents a 'hit' on that website. The third table, site_freq, is used by the Stored Procedure to determine the distribution of hits to sites.
create table site (
  id integer auto_increment not null,
  name varchar(100),
  primary key (id)
) engine=innodb;

create table hit (
  id integer auto_increment not null,
  site_id integer,
  primary key (id)
) engine=innodb;

create table site_freq (
  site_id integer
) engine=innodb;

create index index1 on hit (site_id);

Dummy Data

Just a little bit of data. To adjust the distribution of site hits generated by the Stored Procedure, alter the relative frequencies for each site.id. The configuration below should generate hits with the following distribution: Slashdot (50%), Digg (25%), Engadget (12.5%), Gizmodo (12.5%).
insert into site (name) values ('Slashdot');
insert into site (name) values ('Digg');
insert into site (name) values ('Engadget');
insert into site (name) values ('Gizmodo');

insert into site_freq (site_id) values (1), (1), (1), (1), (2), (2), (3), (4);
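
As a sanity check on the distribution claimed above, here is a short Python sketch (an illustration only, not part of the SQL file) that computes the hit distribution implied by the site_freq rows:

```python
from collections import Counter

# The site_id values inserted into site_freq above.
site_freq = [1, 1, 1, 1, 2, 2, 3, 4]

# Each row has an equal chance of being picked at random, so a site's
# probability is its row count divided by the total number of rows.
counts = Counter(site_freq)
distribution = {site_id: n / len(site_freq) for site_id, n in counts.items()}

print(distribution)
# site 1 (Slashdot) -> 0.5, site 2 (Digg) -> 0.25, sites 3 and 4 -> 0.125 each
```

This matches the stated distribution: Slashdot 50%, Digg 25%, Engadget 12.5%, Gizmodo 12.5%.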

Stored Procedure

This Stored Procedure is responsible for generating the large data set needed for the query performance differences to become apparent.
DROP PROCEDURE IF EXISTS insertRandomData;
DELIMITER //
CREATE PROCEDURE insertRandomData(toInsert INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE myId INT;
  WHILE (i < toInsert) DO
    -- Pick a random site_id; ids that appear more often in
    -- site_freq are proportionally more likely to be chosen.
    SELECT site_id INTO myId
    FROM site_freq ORDER BY rand() LIMIT 1;

    INSERT INTO hit (site_id) VALUES (myId);
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;
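
The `ORDER BY rand() LIMIT 1` step picks one site_freq row uniformly at random, which is just weighted sampling. A minimal Python sketch of the same idea (purely illustrative, with a made-up sample size):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable
site_freq = [1, 1, 1, 1, 2, 2, 3, 4]

# Mimic "SELECT site_id FROM site_freq ORDER BY rand() LIMIT 1":
# each row is equally likely, so sites with more rows win more often.
hits = [random.choice(site_freq) for _ in range(10_000)]

share = hits.count(1) / len(hits)
print(f"Slashdot share: {share:.3f}")  # should land close to 0.5
```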

Dummy Data Generation

As mentioned above, a sizable amount of data is needed for the query performance differences to become discernible. Let's start with 500,000 records. Feel free to insert as many records as needed (I used a 12,000,000 record set for my tests).
CALL insertRandomData(500000);

Queries

Our goal is to write a query which displays each site's name and its respective hit count. First, the most obvious query choice:
SELECT name, count(*)
FROM hit, site
WHERE hit.site_id = site.id
GROUP BY name;
+----------+----------+
| name     | count(*) |
+----------+----------+
| Digg     |  3003700 | 
| Engadget |  1491784 | 
| Gizmodo  |  1502488 | 
| Slashdot |  6002028 | 
+----------+----------+
4 rows in set (16.48 sec)

Let's try the same query with the Join performed in the FROM clause:
SELECT name, count(*)
FROM hit JOIN site ON hit.site_id = site.id
GROUP BY name;
+----------+----------+
| name     | count(*) |
+----------+----------+
| Digg     |  3003700 | 
| Engadget |  1491784 | 
| Gizmodo  |  1502488 | 
| Slashdot |  6002028 | 
+----------+----------+
4 rows in set (16.63 sec)

Pretty much the same result. Now for a slightly more involved query:
SELECT name, mycount FROM (
  SELECT site_id, count(*) mycount FROM hit GROUP BY site_id
) a, site
WHERE site.id = a.site_id
ORDER BY name;
+----------+---------+
| name     | mycount |
+----------+---------+
| Digg     | 3003700 | 
| Engadget | 1491784 | 
| Gizmodo  | 1502488 | 
| Slashdot | 6002028 | 
+----------+---------+
4 rows in set (4.98 sec)
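
The rewrite can be sketched outside MySQL as well. Here is a small SQLite-based Python script (an illustration of the pattern with tiny made-up data, not the original benchmark) showing that both query shapes return identical counts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE hit (id INTEGER PRIMARY KEY, site_id INTEGER)")
cur.executemany("INSERT INTO site (id, name) VALUES (?, ?)",
                [(1, "Slashdot"), (2, "Digg"), (3, "Engadget"), (4, "Gizmodo")])
# A small fixed set of hits: 4x site 1, 2x site 2, 1x each of sites 3 and 4.
cur.executemany("INSERT INTO hit (site_id) VALUES (?)",
                [(1,)] * 4 + [(2,)] * 2 + [(3,), (4,)])

# Join first, then GROUP BY -- the "obvious" query shape.
join_then_group = cur.execute(
    "SELECT name, COUNT(*) FROM hit JOIN site ON hit.site_id = site.id "
    "GROUP BY name ORDER BY name").fetchall()

# GROUP BY in a derived table first, then join -- the faster pattern.
group_then_join = cur.execute(
    "SELECT name, mycount FROM "
    "(SELECT site_id, COUNT(*) AS mycount FROM hit GROUP BY site_id) a "
    "JOIN site ON site.id = a.site_id ORDER BY name").fetchall()

assert join_then_group == group_then_join
print(join_then_group)
```

The derived table shrinks the aggregation's input to one row per site before the join runs, which is exactly why the third MySQL query above is so much cheaper.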

Conclusion

For this data set, the query that performs the GROUP BY in the sub-SELECT and then joins in the outer SELECT executes in roughly 30% of the time of the first two queries (4.98 s versus about 16.5 s). I cannot guarantee the same results in your particular situation; however, you should see a marked improvement by applying this pattern to your queries.