Playing Along at Home
You will need the following items to play along:- MySQL 5 - A Stored Procedure is used to generate dummy data
- MySQL Console
- This file - Contains all SQL used here
mysql> source /path/to/file/join_ex1.sql
Database Creation and Usage
create database join_ex1;
use join_ex1;
use join_ex1;
Tables and An Index
This example has three tables: site, hit and site_freq. The first table, site, represents a website (ex: Slashdot, Digg). The second table, hit, represents a 'hit' on that website. The third table, site_freq, is used by the Stored Procedure to determine the distribution of hits to sites.
create table site (
id integer auto_increment not null,
name varchar(100),
primary key (id)
) type=innodb;
create table hit (
id integer auto_increment not null,
site_id integer,
primary key (id)
) type=innodb;
create table site_freq (
site_id integer
) type=innodb;
create index index1 on hit (site_id);
id integer auto_increment not null,
name varchar(100),
primary key (id)
) type=innodb;
create table hit (
id integer auto_increment not null,
site_id integer,
primary key (id)
) type=innodb;
create table site_freq (
site_id integer
) type=innodb;
create index index1 on hit (site_id);
Dummy Data
Just a little bit of data. To adjust the distribution of site hits generated by the Stored Procedure, alter the relative frequencies for each site.id. The configuration below should generate hits with the following distribution: Slashdot (50%), Digg (25%), Engadget (12.5%), Gizmodo (12.5%).
insert into site (name) values ('Slashdot');
insert into site (name) values ('Digg');
insert into site (name) values ('Engadget');
insert into site (name) values ('Gizmodo');
insert into site_freq (site_id) values (1), (1), (1), (1), (2), (2), (3), (4);
insert into site (name) values ('Digg');
insert into site (name) values ('Engadget');
insert into site (name) values ('Gizmodo');
insert into site_freq (site_id) values (1), (1), (1), (1), (2), (2), (3), (4);
Stored Procedure
This Stored Procedure is responsible for generating the large data set needed for the query performance differences to become apparent.
DROP PROCEDURE IF EXISTS insertRandomData;
DELIMITER //
CREATE PROCEDURE insertRandomData(toInsert INT)
BEGIN
DECLARE i INT DEFAULT 0;
DECLARE myId INT;
WHILE (i < toInsert) DO
SELECT site_id INTO myId
FROM site_freq ORDER BY rand() LIMIT 1;
INSERT INTO hit (site_id) VALUES (myId);
SET i=i+1;
END WHILE;
END;
//
DELIMITER ;
DELIMITER //
CREATE PROCEDURE insertRandomData(toInsert INT)
BEGIN
DECLARE i INT DEFAULT 0;
DECLARE myId INT;
WHILE (i < toInsert) DO
SELECT site_id INTO myId
FROM site_freq ORDER BY rand() LIMIT 1;
INSERT INTO hit (site_id) VALUES (myId);
SET i=i+1;
END WHILE;
END;
//
DELIMITER ;
Dummy Data Generation
As mentioned above, a sizable amount of data is needed for the query performance differences to become discernible. Let's start with 500,000 records. Feel free to insert as many records as needed (I used a 12,000,000 record set for my tests).
CALL insertRandomData(500000);
Queries
Our goal is to write a query which displays each site's name and its respective hit count. First, the most obvious query choice:
SELECT name, count(*)
FROM hit, site
WHERE site_id = site.id
GROUP BY name;
FROM hit, site
WHERE site_id = site.id
GROUP BY name;
+----------+----------+ | name | count(*) | +----------+----------+ | Digg | 3003700 | | Engadget | 1491784 | | Gizmodo | 1502488 | | Slashdot | 6002028 | +----------+----------+ 4 rows in set (16.48 sec)
Let's try the same query with the Join performed in the FROM clause:
SELECT name, count(*)
FROM hit JOIN site ON site_id = site.id
GROUP BY name;
FROM hit JOIN site ON site_id = site.id
GROUP BY name;
+----------+----------+ | name | count(*) | +----------+----------+ | Digg | 3003700 | | Engadget | 1491784 | | Gizmodo | 1502488 | | Slashdot | 6002028 | +----------+----------+ 4 rows in set (16.63 sec)
Pretty much the same results, now for a slightly more involved query:
SELECT name, mycount FROM (
SELECT site_id, count(*) mycount FROM hit GROUP BY site_id
) a, site
WHERE site.id = a.site_id
ORDER BY site_id;
SELECT site_id, count(*) mycount FROM hit GROUP BY site_id
) a, site
WHERE site.id = a.site_id
ORDER BY site_id;
+----------+---------+ | name | mycount | +----------+---------+ | Digg | 3003700 | | Engadget | 1491784 | | Gizmodo | 1502488 | | Slashdot | 6002028 | +----------+---------+ 4 rows in set (4.98 sec)