As I'm sure a lot of people know already, PostgreSQL 9.4 will eventually be released without "native" support for UPSERT. That said, it's still one of the most common questions on the excellent #postgresql IRC channel, and people are sometimes tempted to try and solve it via TRIGGERs (hell, even RULEs), unsuccessfully.
Even though people are often correctly told to "use a function" and are pointed to the example in the documentation, I often find myself needing some more information about the specific case to give a satisfactory answer. The tricky part is that due to a race condition, a UNIQUE constraint is necessary to correctly implement UPSERTesque operations in Postgres, but unique violations abort the entire transaction unless a subtransaction is created. Unfortunately, entering a subtransaction is often said to be expensive, so the trick is to try and identify cases where you can still implement the operation correctly without entering a subtransaction unless absolutely necessary.
So I got curious about the cases where you can get substantial
performance benefits from not entering one, and benchmarked
different concurrency-safe solutions to three of the common
operations I've identified. The three operations are:
- INSERT or IGNORE: INSERT, but if the row already exists, don't do anything. No result required other than "success".
- INSERT or SELECT: INSERT .. RETURNING, but if the row already exists, return data from the existing row as the result instead.
- INSERT or UPDATE: INSERT .. RETURNING, but if the row already exists, UPDATE .. RETURNING instead.
For testing, I pre-populated a table with N % of rows in a range, and then called the UPSERTy function once for every value in that range. The idea was to get a rough approximation of the performance in "the row exists in N % of operations" for different values of N. You can find all the code I used here. Have eye bleach (or a fork) handy if you plan on looking at the bash scripts, though.
INSERT or IGNORE
For INSERT or IGNORE, I came up with three different solutions. The first variant first sees if the row exists, and if it doesn't it tries the INSERT in a subtransaction. This approach avoids having to enter a subtransaction if the value clearly already exists. The hypothesis was that this approach would be fastest if the value exists at least on every second attempt.
The second approach simply always enters a subtransaction and attempts to INSERT the record, ignoring the unique_violation exception if the row already exists. This is probably the fastest approach if most of the time the row does not already exist.
Finally, the last variant does an INSERT .. WHERE NOT EXISTS in a subtransaction. I expected this to be slightly faster than the previous approach in some cases, but slower in almost all cases.
Here are the results:
As you can see, at ~40% of rows existing avoiding the subtransaction is starting to become a performance gain. Perhaps slightly surprisingly, INSERT .. WHERE NOT EXISTS in a subtransaction doesn't seem to ever make sense.
INSERT or SELECT
To implement INSERT or SELECT, I came up with a total of four solutions. The first three are the equivalents of INSERT or IGNORE, as described above, and the last one uses a LOOP instead of copying the lookup directly into the exception handler block. I had no hypothesis on which one would be faster between the first one and the last one; I was just curious to see if there was any difference at all. The fourth version also supports DELETEs, unlike any of the other versions.
The ideas here look very similar to the INSERT or IGNORE case, though skipping the subtransaction makes sense even earlier, at around 30% of rows pre-existing in the target table. No clear difference between methods one and four, not surprisingly.
INSERT or UPDATE
Very similar results here as well, though methods one and four don't get that big of a benefit anymore from the row existing compared to how fast they are against an empty table. Subtransactions are still expensive, though.
According to these results, you might want to consider avoiding the subtransaction if you expect the target row to exist roughly 30% of the time. Definitely consider that approach if you expect the row to exist in anywhere above 40% of cases.
Please remember that these results can only provide a general guideline. In particular, the behaviour was not tested under any actual concurrency (other than manual tests to verify correctness). If performance is very important for you, it's always up to you to measure different solutions and choose the one that works best for your application!
Update: Joel Jacobson suggested a slightly different variant in the comments, and I've added it to the graphs in the git repository. The results further highlight the fact that it's not the act of entering a subtransaction that's expensive, it's failing one and recovering from it. That also means that benchmarking these variants under concurrent workloads is likely going to give slightly different results.