It Came Back
A few weeks ago we closed out a nasty incident: Prisma's TypeScript-based "rustless" engine was causing intermittent transaction failures under load. We rolled back to the Rust-based engine, errors vanished, incident report filed.
A week later, the same class of failures came back.
The rustless engine was already rolled back, so the obvious suspect was off the table. Something else had been wrong the whole time. We just hadn't seen it until load pushed hard enough.
Reproducing the Unreproducible
The tricky part with this issue was that it only manifested under load. You could poke around in a dev environment all day and never see it. Luckily, our team had already built a load-testing system using k6 for exactly this kind of scenario.
I fired up our auction bid "breaking point" test. It spins up simulated users, starts creating bids, and ramps concurrency until the system buckles.
There it was: a clean, repeatable reproduction.
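For flavor, here's the shape of that kind of k6 test. This is a stripped-down sketch, not our actual suite; the endpoint, payload, and stage targets are placeholders:

```ts
// k6 ramp test: keep adding virtual users until the system buckles.
// Endpoint, payload, and stage targets are illustrative placeholders.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // warm up
    { duration: '2m', target: 200 },  // steady load
    { duration: '2m', target: 600 },  // push toward the breaking point
  ],
};

export default function () {
  const res = http.post(
    'https://staging.example.com/api/bids',
    JSON.stringify({ auctionId: 'auction-123', amount: 100 }),
    { headers: { 'Content-Type': 'application/json' } },
  );
  check(res, { 'bid accepted': (r) => r.status === 201 });
  sleep(1);
}
```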
Down the Rabbit Hole
With a reliable reproduction in hand, the next step was finding the actual root cause. Was this genuinely a Prisma issue, or was it time to take that next ego hit and accept that maybe we didn't have something configured correctly?
We pulled on a few investigation threads:
- Retries amplifying pressure: our retry system for Prisma failures could be re-queuing operations under load and piling more work onto an already-stressed connection pool.
- Overly broad transactions: we lean heavily on database transactions for data integrity. Were some of them wrapping too much and holding connections longer than necessary?
- Under-provisioned pool: the classic one. Were we just not giving the app enough connections for peak concurrency? (Prisma's pool knobs live on the connection string; see the snippet after this list.)
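For context, here's where those knobs live. The values are illustrative, not ours; connection_limit and pool_timeout are the real Prisma URL parameters:

```
DATABASE_URL="postgresql://app:secret@db.internal:5432/app?connection_limit=20&pool_timeout=10"
```

If connection_limit is unset, Prisma defaults to (number of physical CPUs × 2) + 1 connections per engine instance, which is easy to outgrow under a concurrency ramp.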
We poked at each. Tinkered. Tweaked. Made progress, but nothing felt like the answer.
So, I took a break.
The Moment of Discovery
When I came back, I pulled up our database connection string in the environment vault, just to stare at it, really, the way you do when you're out of better ideas. And then I noticed something. A single query parameter sitting quietly at the end of the connection string:
pgbouncer=true
Huh. We don't run PgBouncer on our Aurora database (instance-based, not Aurora Serverless). This flag was telling Prisma to optimize for a connection pooler that didn't exist.
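Schematically, the string looked like this (host and credentials are placeholders; the parameter is verbatim):

```
DATABASE_URL="postgresql://app:***@prod.cluster-abc123.us-east-1.rds.amazonaws.com:5432/app?pgbouncer=true"
```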
The flag used to make sense. Early on we were on Supabase, which uses Supavisor (a PgBouncer-compatible connection pooler) under the hood, and pgbouncer=true was the correct configuration. When we migrated to RDS Aurora, the flag came along for the ride. The connection string worked, the app connected, everything seemed fine. One of those settings that quietly persists through a migration nobody thinks to audit.
What That Little Flag Actually Does
Turns out, when you tell Prisma you're running behind PgBouncer, it fundamentally changes how it talks to your database:
- Statement caching gets completely disabled: Prisma normally caches up to 100 prepared statements. With this flag, it caches zero. Every single query gets re-parsed and re-planned by Postgres from scratch. On a platform that processes thousands of auction bids (among many other operations), that's an enormous amount of redundant work.
- A cleanup command fires before every query: Prisma sends a DEALLOCATE ALL to wipe prepared statements before executing your actual query. Since there's no PgBouncer shuffling connections around, there's nothing to clean up. It's just wasted work on every operation.
- Every read gets wrapped in a transaction: Even a simple findMany gets a BEGIN/COMMIT pair around it. Two extra commands per query, for absolutely no benefit. Over any network connection between the application and the database, this adds up fast.
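To make that concrete, here's the kind of call behind the log excerpt below. This is a sketch against the repro's User model, not our production schema:

```ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// A completely ordinary read. With pgbouncer=true, this one call costs
// four server commands (BEGIN, DEALLOCATE ALL, the query, COMMIT);
// without the flag, it's a single execute against a cached statement.
const ids = await prisma.user.findMany({ select: { id: true } });
```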
Here's what a single read actually looks like in the Postgres log, with log_statement = 'all' enabled, in both configurations (timestamps and PIDs stripped for readability):
With pgbouncer=true:
LOG: statement: BEGIN
LOG: statement: DEALLOCATE ALL
LOG: execute s0: SELECT "public"."User"."id" FROM "public"."User" WHERE 1=1 OFFSET $1
LOG: statement: COMMIT
Without the flag:
LOG: execute s1: SELECT "public"."User"."id" FROM "public"."User" WHERE 1=1 OFFSET $1
Three extra operations, on every read, across the whole app.
The Numbers
I removed the flag. I ran the load test again. Same conditions, same concurrency ramp, same auction bid simulation.
During auction bid creation under load:
| Metric | Before (with flag) | After (without flag) |
|---|---|---|
| Avg P95 | 1.27s | 0.52s |
| Avg P99 | 2.84s | 0.91s |
A 59% improvement at P95 and a 68% improvement at P99. From a single configuration change. No code refactoring, no infrastructure scaling. Just removing a flag that was telling our ORM to fight itself.
Want to See It on Your Own Machine?
I built a minimal reproduction to verify the mechanism outside our production environment: Docker Postgres 16 on localhost, Prisma 6.19.3, a mixed read/write workload, with only the ?pgbouncer=true URL parameter changing between runs.
2,000 operations at concurrency 20:
| Metric | With pgbouncer=true | Without flag | Improvement |
|---|---|---|---|
| Avg | 5.84 ms | 2.60 ms | 55% |
| P50 | 4.35 ms | 1.61 ms | 63% |
| P95 | 12.10 ms | 6.79 ms | 44% |
| P99 | 14.38 ms | 8.24 ms | 43% |
| Throughput | 3,407 ops/s | 7,643 ops/s | 2.24× |
Loopback numbers understate production impact: the flag's overhead scales with per-statement network latency, and on localhost that's ~100 µs vs. 0.5–2 ms across a VPC to a managed database. Postgres logs with log_statement = 'all' show the mechanism directly: with the flag, every read is wrapped in BEGIN/COMMIT and preceded by DEALLOCATE ALL; without it, the same read is a single execute against a cached prepared statement. Full repro at trellisorg/prisma-pgbouncer-flag-repro.
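If you'd rather hand-roll the measurement than clone the repro, the harness is conceptually just this. A sketch: reads only for brevity (the actual repro mixes reads and writes), with the same op count and concurrency as the table above:

```ts
import { PrismaClient } from '@prisma/client';

// Reads DATABASE_URL from the environment; run once with ?pgbouncer=true
// on the URL and once without, then compare.
const prisma = new PrismaClient();

const TOTAL_OPS = 2000;
const CONCURRENCY = 20;
const latencies: number[] = [];

async function worker(ops: number) {
  for (let i = 0; i < ops; i++) {
    const start = performance.now();
    await prisma.user.findMany({ select: { id: true }, take: 10 });
    latencies.push(performance.now() - start);
  }
}

const t0 = performance.now();
await Promise.all(
  Array.from({ length: CONCURRENCY }, () => worker(TOTAL_OPS / CONCURRENCY)),
);
const elapsedSec = (performance.now() - t0) / 1000;

latencies.sort((a, b) => a - b);
const pct = (p: number) => latencies[Math.floor((latencies.length - 1) * p)];
const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(`avg ${avg.toFixed(2)} ms  p50 ${pct(0.5).toFixed(2)} ms`);
console.log(`p95 ${pct(0.95).toFixed(2)} ms  p99 ${pct(0.99).toFixed(2)} ms`);
console.log(`throughput ${(TOTAL_OPS / elapsedSec).toFixed(0)} ops/s`);

await prisma.$disconnect();
```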
If you're on Prisma 7: I ran the same harness against 7.8.0 and the flag's effect flattens into noise (throughput within ~6% in either direction across runs). Still worth removing if it doesn't apply to your setup (it's misrepresenting your infrastructure), but don't expect a Prisma 6-sized speedup. Results on the prisma-7-test branch.
What We Learned
Configuration is code. Connection strings, environment variables, and feature flags deserve the same scrutiny as application logic. That little pgbouncer=true silently taxed every database operation for weeks, and it wasn't something a traditional code review would ever catch, because it lived in our environment, not the repo. Environment changes need a review process too.
Take the ego hit early. My first instinct was to blame the upstream dependency (Prisma's rustless engine). That was a real issue, but the bigger performance problem was sitting in our own configuration the entire time. The faster you're willing to ask "is this actually our fault?", the faster you find your answer even when the original issue really is upstream.
Alert on what users feel. This regression didn't page anyone because nothing was technically broken, it was just slow. We're setting target P95/P99 response times for our key operations so the next slow regression trips an alert before it chips away at anyone's experience.
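Production alerting is the real fix here, but since the k6 suite already exists, a complementary guardrail is encoding the same targets as k6 thresholds, so a latency regression fails the load test outright. The numbers below are placeholders, not our actual SLOs:

```ts
// Added to the breaking-point script's options block: the run fails
// if request latency regresses past the targets (placeholder values).
export const options = {
  stages: [{ duration: '2m', target: 200 }],
  thresholds: {
    http_req_duration: ['p(95)<600', 'p(99)<1200'], // milliseconds
  },
};
```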
Go Check Your Connection Strings
If you're running Prisma on a managed database like Aurora and you're not using PgBouncer (or any external connection pooler), go check your connection strings right now. Hopefully you won't be surprised by what you find.
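And if you'd rather have a guardrail than a one-time check, a few lines at boot will do it. A sketch; adjust the condition to whatever pooler your environment actually runs:

```ts
// Boot-time guardrail: fail fast if the connection string claims a
// pooler this environment doesn't run. Adjust for your infrastructure.
const url = process.env.DATABASE_URL ?? '';
if (url.includes('pgbouncer=true')) {
  throw new Error(
    'DATABASE_URL sets pgbouncer=true, but this environment has no PgBouncer. ' +
      'Remove the flag, or update this check if the infrastructure changed.',
  );
}
```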
Want to learn more about how we approach testing and automation at Trellis? Check out this related post:
