I recently encountered a situation where users were required to perform various
configurable tasks every week and be notified if they failed to meet their
requirements. In addition, the system needed reporting capabilities to help
administrators discover incomplete and not-started tasks. The catch in this situation
is that if the user never began working on their tasks, the data didn't exist. There
are obviously many ways to deal with this situation, but I thought it would be interesting
to write a query to meet this requirement. But how do you query data that doesn't exist?
Enter generate_series
As it turns out Postgres has a nifty function called
generate_series
which you can use to generate integer sequences. You supply the range (min, max) and
generate_series produces a set containing the integer sequence which you can interact with as
though it were a table. A couple of examples:
SELECT * FROM generate_series(-1, 3);
generate_series |
-1 |
0 |
1 |
2 |
3 |
SELECT gs1.a AS n, SUM(gs2.a)
FROM generate_series(1, 10) AS gs1(a),
generate_series(1, 10) AS gs2(a)
WHERE gs2.a <= gs1.a
GROUP BY gs1.a
ORDER BY gs1.a
n |
Summation of all integers in [1, n] i.e. n*(n+1)/2 |
1 |
1 |
2 |
3 |
3 |
6 |
4 |
10 |
5 |
15 |
6 |
21 |
7 |
28 |
8 |
36 |
9 |
45 |
10 |
55 |
Simple enough; now onto something a bit more interesting...
Something a Bit More Interesting :)
Below you will find DDL and inserts which models a simpler version of the original
problem. In short, the expectation table represents the expectation that a report will
be created exactly every 7 days starting on start_date and ending on end_date (or up to 6 days before).
The report table simply contains all reports which have been created, but not necessarily submitted.
CREATE TABLE expectation (
id bigserial not null,
start_date date not null,
end_date date null,
PRIMARY KEY (id)
);
CREATE TABLE report (
id bigserial not null,
expectation_id bigint not null,
due_date date not null,
submitted boolean not null,
data text null,
PRIMARY KEY (id)
);
ALTER TABLE report
ADD FOREIGN KEY (expectation_id) REFERENCES expectation (id);
INSERT INTO expectation (start_date, end_date)
VALUES ('2010-06-01', '2010-07-01');
INSERT INTO expectation (start_date, end_date)
VALUES ('2010-08-12', '2010-09-01');
INSERT INTO expectation (start_date, end_date)
VALUES ('2010-11-01', NULL);
Fabricating All Possible Report Records
The first step to determining missing report records is to simply generate all possible report
records based on expectation configuration. The below query isn't all that complicated,
and as you can see the integer sequence represents the previous and future 5,200 weeks
which ought to be more than sufficient for the purpose of this demonstration.
SELECT NULL AS id, expectation_id, due_date,
false AS submitted, NULL AS data
FROM
(
SELECT
id AS expectation_id,
start_date, end_date,
(start_date + 7*series.a) AS due_date
FROM
expectation,
generate_series(-5200, 5200) AS series(a)
) a
WHERE
due_date >= start_date
AND (due_date <= end_date OR end_date IS NULL)
AND due_date <= current_date
i |
id |
expectation_id |
due_date |
submitted |
data |
1 |
(null) |
1 |
2010-06-01 |
false |
(null) |
2 |
(null) |
1 |
2010-06-08 |
false |
(null) |
3 |
(null) |
1 |
2010-06-15 |
false |
(null) |
4 |
(null) |
1 |
2010-06-22 |
false |
(null) |
5 |
(null) |
1 |
2010-06-29 |
false |
(null) |
6 |
(null) |
2 |
2010-08-12 |
false |
(null) |
7 |
(null) |
2 |
2010-08-19 |
false |
(null) |
8 |
(null) |
2 |
2010-08-26 |
false |
(null) |
9 |
(null) |
3 |
2010-11-01 |
false |
(null) |
10 |
(null) |
3 |
2010-11-08 |
false |
(null) |
11 |
(null) |
3 |
2010-11-15 |
false |
(null) |
12 |
(null) |
3 |
2010-11-22 |
false |
(null) |
13 |
(null) |
3 |
2010-11-29 |
false |
(null) |
Removing Existing Report Records
Now that we have a query to produce all possible report records it would probably
be prudent to prune existing records. But wait, we don't have any report records.
Let's go ahead and create a few and then move right on to the new query which contains
an 'AND NOT EXISTS (' clause to tidy matters right up.
INSERT INTO report (expectation_id, due_date, submitted, data)
VALUES (1, '2010-06-01', 'true', 'blah blah blah...');
INSERT INTO report (expectation_id, due_date, submitted, data)
VALUES (1, '2010-06-15', 'true', 'blah blah blah...');
INSERT INTO report (expectation_id, due_date, submitted, data)
VALUES (1, '2010-06-22', 'true', 'blah blah blah...');
SELECT NULL AS id, expectation_id, due_date,
false AS submitted, NULL AS data
FROM
(
SELECT
id AS expectation_id,
start_date, end_date,
(start_date + 7*series.a) AS due_date
FROM
expectation,
generate_series(-5200, 5200) AS series(a)
) a
WHERE
due_date >= start_date
AND (due_date <= end_date OR end_date IS NULL)
AND due_date <= current_date
AND NOT EXISTS (
SELECT 'x'
FROM report
WHERE report.expectation_id = a.expectation_id
AND report.due_date = a.due_date
)
i |
id |
expectation_id |
due_date |
submitted |
data |
1 |
(null) |
1 |
2010-06-08 |
false |
(null) |
2 |
(null) |
1 |
2010-06-29 |
false |
(null) |
3 |
(null) |
2 |
2010-08-12 |
false |
(null) |
4 |
(null) |
2 |
2010-08-19 |
false |
(null) |
5 |
(null) |
2 |
2010-08-26 |
false |
(null) |
6 |
(null) |
3 |
2010-11-01 |
false |
(null) |
7 |
(null) |
3 |
2010-11-08 |
false |
(null) |
8 |
(null) |
3 |
2010-11-15 |
false |
(null) |
9 |
(null) |
3 |
2010-11-22 |
false |
(null) |
10 |
(null) |
3 |
2010-11-29 |
false |
(null) |
Now What?
That query meets the original requirements and isn't particularly hideous.
Shall we call it a day? Na. For the sake of completeness see the below query
which intermingles imaginary data with real data to produce a complete
data set for the report table. You could easily create a view with this
query and use it for all your imaginary and real report records
querying needs.
(
SELECT id, expectation_id, due_date, submitted, data
FROM report
)
UNION
(
SELECT NULL AS id, expectation_id, due_date,
false AS submitted, NULL AS data
FROM
(
SELECT
id AS expectation_id,
start_date, end_date,
(start_date + 7*series.a) AS due_date
FROM
expectation,
generate_series(-5200, 5200) AS series(a)
) a
WHERE
due_date >= start_date
AND (due_date <= end_date OR end_date IS NULL)
AND due_date <= current_date
AND NOT EXISTS (
SELECT 'x'
FROM report
WHERE report.expectation_id = a.expectation_id
AND report.due_date = a.due_date
)
)
ORDER BY expectation_id, due_date
i |
id |
expectation_id |
due_date |
submitted |
data |
1 |
1 |
1 |
2010-06-01 |
true |
blah blah blah... |
2 |
(null) |
1 |
2010-06-08 |
false |
(null) |
3 |
2 |
1 |
2010-06-15 |
true |
blah blah blah... |
4 |
3 |
1 |
2010-06-22 |
true |
blah blah blah... |
5 |
(null) |
1 |
2010-06-29 |
false |
(null) |
6 |
(null) |
2 |
2010-08-12 |
false |
(null) |
7 |
(null) |
2 |
2010-08-19 |
false |
(null) |
8 |
(null) |
2 |
2010-08-26 |
false |
(null) |
9 |
(null) |
3 |
2010-11-01 |
false |
(null) |
10 |
(null) |
3 |
2010-11-08 |
false |
(null) |
11 |
(null) |
3 |
2010-11-15 |
false |
(null) |
12 |
(null) |
3 |
2010-11-22 |
false |
(null) |
13 |
(null) |
3 |
2010-11-29 |
false |
(null) |