Cycle detection in PostgreSQL
A few months ago, we decided to build a referral program for Mergify. This is a well-known, classical way of bringing more people on your product.
To build this program, we add to define a data model that allowed us to store a list of referrers and referees, with some constraints. Such a list boils down to building a direct acyclic graph (DAG) of your users. A DAG's particularity is to be a directed tree that, therefore, cannot have any loop in it.
There are multiple solutions to detect a loop in a DAG. The straightforward method is to use a depth-first search algorithm: start from a node, keep browsing the tree from node to node, writing down each node encountered. If one of the discovered nodes is already in your stack, then you have a loop.
It is easy enough to find plenty of solutions around this issue on the web, depending on your data structure and programming language.
In our case, we knew we were going to store this data in PostgreSQL, meaning we needed to make sure the database did the check directly on insertion.
Solution Architecture
To make this work in PostgreSQL, we needed to split the problem in three parts:
- The data structure;
- Write a function that can detect if a cycle is present in your data structure;
- Create a trigger that calls the loop detection function on insertion and update.
Data Structure
We went ahead and implemented a straightforward data structure using a single table containing 2 IDs — the vertices of our DAG. Hence, each row in the table stores the link between our vertices (the referrer and its referee).
The PostgreSQL table can be created with this command:
CREATE TABLE public.referral (
referee_id integer NOT NULL PRIMARY KEY,
referrer_id integer NOT NULL,
CONSTRAINT check_referrer_not_referee CHECK ((referrer_id <> referee_id))
);
This table already has a simple built-in loop detection: the check_referrer_not_referee
check makes sure that a user cannot refer itself, which, in tree jargon, mean no vertex can link back to itself.
This check is not enough: while A can't link to A, A could link to B which could link back to A, creating the loop A -> B -> A. Not great.
Another case that is prevented with this table is that since the referee_id
is a primary key, it can't be used multiple times, avoiding the scenario where a referee would have multiple referrer.
Loop Detection Function
The initial table creation solved simple cases where the links between vertices are easy to check. We now need to write a function that walks the entire tree when inserting links in the general case.
This is where you need a PostgreSQL function to detect the loop. The PL/pgSQL language is powerful enough to write such a function:
CREATE OR REPLACE FUNCTION detect_cycle()
RETURNS TRIGGER
LANGUAGE plpgsql AS
$func$
BEGIN
IF EXISTS (
WITH RECURSIVE list_referrers(referrer) AS (
-- Get the parent from the NEW referrer referral to see
-- if the referrer is not already a parent
SELECT r.referrer_id
FROM referral AS r
WHERE r.referee_id = NEW.referrer_id
UNION ALL
SELECT r.referrer_id
FROM referral AS r, list_referrers as lr
WHERE r.referee_id = lr.referrer
) SELECT * FROM list_referrers WHERE list_referrers.referrer = NEW.referee_id LIMIT 1
)
THEN
RAISE EXCEPTION 'Loop detected';
ELSE
RETURN NEW;
END IF;
END
$func$;
The function works in a pretty simple way. It will be used as an INSERT
and UPDATE
trigger, which means it can use the special NEW
table to get access to the newly inserted row.
Then, it uses the WITH RECURSIVE
clause to run a recursive SQL query that can refer to its own output. In that case, PostgreSQL first runs:
SELECT r.referrer_id FROM referral AS r WHERE r.referee_id = NEW.referrer_id
This first query returns the parent, i.e., the vertex that links back to the newly referee (child) node. If the referrer has no parent, it returns nothing, and the query finishes with 0 results.
Otherwise, the second query uses this list to run that same query over and over again:
SELECT r.referrer_id FROM referral AS r, list_referrers as lr WHERE r.referee_id = lr.referrer
This selects the parent of the parent (if any) and continues to do so until there is no more new result. At this stage, all the results are combined together and sent to the new query:
SELECT * FROM list_referrers WHERE list_referrers.referrer = NEW.referee_id LIMIT 1
This query selects the row in the previous list of parents where the parent is the referee. If such a result exists, then it means a loop exists, as the newly added referee is in the referrer's parent list. The LIMIT 1
statement is an optimization to indicate that one result is enough since our outer check is IF EXISTS
.
The IF EXISTS / THEN / RAISE
statement takes care of raising an exception if a loop is detected, preventing the new row from being added.
Setting up the trigger
Once you get this function right for you, there's only a need to add it to your table with this:
DROP TRIGGER IF EXISTS detect_referral_cycle ON referral;
CREATE CONSTRAINT TRIGGER detect_referral_cycle
AFTER INSERT OR UPDATE ON referral
FOR EACH ROW EXECUTE PROCEDURE detect_cycle();
Pretty easy! With all of that setup, there is no need to do any kind of check in your application, and you can simply insert the rows as you wish. Catching the PostgreSQL exception that can be raised by the trigger allows you to report back to the user that they're trying to do something invalid.
PostgreSQL 14 and native cycle detection
Starting with version 14, PostgreSQL added a feature that adds the SEARCH
and CYCLE
clauses to recursive queries to be able to do produce breadth- or depth-first search orders and detect cycles. Hubert Lubaczewski has a great blog post that shows a few examples on how to use those features, which should make writing such loop checks even easier!