The Incident
Recently, we have been seeing pods in our Kubernetes cluster get killed even though in the logs we could see them processing requests without issues and our observability did not point to a resource issue.
It seemed that the pods were failing their liveness checks and were being killed by Kubernetes after they were determined to be unhealthy.
We cannot have pods being killed randomly: our charity partners depend on bids being accepted, checkouts processing, donations being made, and donor check-in at the event running smoothly, among many other core elements of running a successful fundraising event. Thankfully, the blast radius has been small, and the system has stayed operational without widespread issues, but it did speak to a wider problem that needed to be addressed.
So I started investigating. We have built out an Investigation Agent internally here at Trellis using Skills, Subagents, and MCPs that allows an agent to spin off independent subagents to do investigations, all readonly, into: AWS Services, Kubernetes cluster, Databases, Sentry, Linear, BetterStack logging, PostHog, and a general research subagent to search the internet for answers (we have since added a subagent for investigating our Prometheus metrics like event loop lag and other metrics, but we did not have this available to us during these incidents).
Using this investigation agent, we settled on the issue being that the event loop would get blocked long enough that liveness checks could not be responded to in the time required. The investigation surfaced a few initial areas of improvement:
- Moving our logging off of the main thread. We push a lot of logs with context into BetterStack, but right now Winston is running on the main thread, which can hold the event loop up for critical processing. The investigation suggested that we move from Winston to Pino and additionally run Pino in a worker thread.
- Found a bug in the response parsing of one of our internal data access libraries meant for communicating with an internal service. Somehow we were JSON.parse-ing the response data, then JSON.stringify-ing it again, and then JSON.parse-ing the result a second time... Because JSON.parse is a synchronous operation it blocks the event loop.
- All the validation happening on request/response parsing for our REST and GraphQL APIs is using class-validator. class-validator is an older library and there are more performant options out there now, like ArkType and Zod. This would require rewriting all of our validation code to use ArkType or Zod and have it integrated into Nest like class-validator is. I did go down this path. However, that is not what this post is about, so I will save that for another time, and since I did the prototype of using ArkType for our GraphQL validations new information has come to light about Nest v12 supporting Standard Schema.
- Unbounded GraphQL batching. We don't currently have a limit on the number of GraphQL operations that will be batched together by our clients (beyond whatever a default would be). I don't believe this has anything to do with the issues we are facing, though.
- Lastly, there was a note, almost as an afterthought without a ton of detail, in the investigation about there being some issues around the cache reads from our model cache. More on this later.
So I set off on optimizing the system. I started with the quickest win, #2: cleaning up the double parsing, since we know that is excessive and unneeded synchronous work that blocks the event loop, even though it could not account for the issue on its own. There is just not enough traffic to that service.
This fix was launched to production essentially right away as it was a simple
low-risk change that was verifiable by our integration and e2e tests.
I then set off to update our logging from Winston to Pino, #1. Thankfully, this
was super easy and worked essentially out of the box. Just had to write a custom
Rspack plugin (due to
a known incompatibility between pino-webpack-plugin and Rspack)
to compile Pino as a separate module that is loaded and run on a worker thread
so that we only pay the data serialization cost on the main thread. Because this
is a much larger change that also affects our ability to see what is happening
in the system, I opted to let the changes marinate in our develop environment
for a while, even though it is inherently tested when we run our integration
and e2e tests as they push logs to BetterStack as well.
While we were waiting to deploy this, we continued seeing pods, running our
aurora server, get killed by Kubernetes. Jordan on my team did an
investigation and found these lined up with a storm of model cache queries all
in a single microtask. The Model Cache is a Valkey-backed caching system we
built internally for caching data that does not change often to reduce the load
on our databases and speed of querying. So the question becomes: why is the
system whose purpose is to speed the system up causing massive event loop lag?
After some more investigation, a very obvious issue presented itself. When
JSON.parse runs on lots of individual strings, it can cause lag in the loop as
the synchronous operation is blocking. You are probably thinking, isn't
JSON.parse pretty optimized in Node? How many strings are being parsed into
objects? Well, let’s say 50 users all load a silent auction page at the same
time, this auction page has 107 auction items on it, and each auction item has
40 fields. That is 50 · 107 = 5,350 individual JSON.parse calls, each
invoking the reviver callback across ~40 fields, roughly 214,000 reviver
invocations in total, many of which would be happening in the same tick as we
pipeline commands to Valkey automatically and load them in at the same time.
That is a LOT of synchronous work, no wonder the event loop is being blocked.
This isn't the whole story though, JSON.parse is already very optimized in
Node, but we weren't only parsing the JSON string. We were also reviving
Dates and Prisma.Decimals. Now doing new Date() or new Prisma.Decimal()
are not exactly heavy operations, but how do you know what fields need to be
revived to their pre-serialized data types? We originally solved this by
serializing those fields with string prefixes on those field values prior to
JSON.stringify like date:<value.toISOString()> and
decimal:<value.toString()> and then traversing the parsed object for these
fields and reviving them. This solution made sense at the time and served us
incredibly well throughout our busiest year and season ever in the history of
Trellis. We were even able to validate how much benefit we got from the Model
Cache system prior to our busy season with our load tests. That being said, we
hit a different scale in 2026 compared to what we have experienced in our
previous busy seasons.
We traded database load and query latency for event loop blockage. But maybe, just maybe, we could have our cake and eat it too.
Why the existing system was slow
The Model Cache isn't quite as simple as
serialize → write → read → deserialize. The number of operations grows
linearly, but Trellis is growing exponentially. This means the load has more and
more of an effect on the event loop (and pod resources in general) as we serve
more charities that are more successful with more donors interacting with their
fundraisers.
Below is the pseudocode for a single object lifecycle in the Model Cache:
Write side - serializeValue(v)
# sentinel, skip JSON
if v === null → return "__NULL__"
# sentinel, skip JSON
if v === undefined → return "__UNDEFINED__"
# replacer walks tree
return JSON.stringify(v, replacer)
replacer(this, key, value):
# bypass JSON's auto-coercion
original = this[key]
if original is Date → "date:" + original.toISOString()
if original is Prisma.Decimal → "decimal:" + original.toString()
else → value
→ Valkey.SET(cacheKey, string)
Read side - deserializeValue(s)
Valkey.GET(cacheKey) → s
if s === "__NULL__" → return null
if s === "__UNDEFINED__" → return undefined
# reviver walks tree bottom-up
return JSON.parse(s, reviver)
reviver(_key, value):
if typeof value === 'string':
if value.startsWith('decimal:') → new Prisma.Decimal(value.slice(8))
if value.startsWith('date:') → parseISO(value.slice(5))
return value
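For readers who want to poke at the behavior, here is a minimal runnable sketch of the pseudocode above. It is an approximation under stated assumptions, not our production code: FakeDecimal is a hypothetical stand-in for Prisma.Decimal so the snippet has no dependencies, and new Date(...) is swapped in where the real read side calls date-fns' parseISO.

```typescript
// Hypothetical stand-in for Prisma.Decimal (assumption, keeps the sketch dependency-free).
class FakeDecimal {
  readonly digits: string;
  constructor(digits: string) {
    this.digits = digits;
  }
  toString(): string {
    return this.digits;
  }
}

const NULL_SENTINEL = '__NULL__';
const UNDEFINED_SENTINEL = '__UNDEFINED__';

function serializeValue(v: unknown): string {
  if (v === null) return NULL_SENTINEL;
  if (v === undefined) return UNDEFINED_SENTINEL;
  return JSON.stringify(
    v,
    function (this: Record<string, unknown>, key: string, value: unknown) {
      // Read the original off the holder: by the time the replacer runs,
      // a Date has already been coerced to a string through toJSON().
      const original = this[key];
      if (original instanceof Date) return 'date:' + original.toISOString();
      if (original instanceof FakeDecimal) return 'decimal:' + original.toString();
      return value;
    }
  );
}

function deserializeValue(s: string): unknown {
  if (s === NULL_SENTINEL) return null;
  if (s === UNDEFINED_SENTINEL) return undefined;
  // Passing a reviver means the callback runs for every node in the tree.
  return JSON.parse(s, (_key: string, value: unknown) => {
    if (typeof value === 'string') {
      if (value.startsWith('decimal:')) return new FakeDecimal(value.slice(8));
      // Assumption: new Date() stands in for date-fns' parseISO here.
      if (value.startsWith('date:')) return new Date(value.slice(5));
    }
    return value;
  });
}
```

One side effect worth noticing in this shape: any ordinary string that happens to start with date: or decimal: is revived too, because the reviver only has the value to go on.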
Read-side sync-op count
Let:
- P = total nodes the reviver visits (every nested property + array element + each object/array itself)
- S = subset of P whose value is a string
- R = subset of S that hits a prefix (gets rehydrated)
- K = number of startsWith cases in the switch (today there are 2: decimal:, date:)
- L = max nesting depth
Formula
ops_read = 2 # __NULL__ / __UNDEFINED__ sentinel compares
+ 1 # JSON.parse invocation (tokenization is O(|s|), counted once)
+ P # one `typeof === 'string'` per visited node
+ Σ over S of k_i # startsWith checks per string; k_i ∈ [1, K]
+ 2·R # per rehydrate: 1 slice/replace + 1 constructor
Bounds
Worst case (every string misses every prefix): 3 + P + S·K
Best case (every string is a decimal, hits on check 1): 3 + P + S + 2·R
Read cost grows linearly in P (total properties, depth-agnostic) with an
additive S·K term that scales with the number of string leaves times the
size of the prefix switch. Depth itself is free, but every new prefixed type
bumps K by 1 → +S ops in the worst case (one extra startsWith per string
leaf in the payload).
Let’s take that previous example of 50 concurrent requests reading 107 auction items with 40 fields each.
Per-request node count (P)
P = 1 (root array)
+ 107 (item objects)
+ 107 x 40 (item fields)
= 1 + 107 + 4,280
= 4,388 reviver invocations per page load
Per-request op count
Assume of the 40 fields per item: ~20 are strings, ~4 need rehydration (a couple
of dates, a couple of decimals, this is typical for an auction item: startsAt,
endsAt, currentBid, minIncrement).
So:
S = 107 x 20 = 2,140 string leaves
R = 107 x 4 = 428 rehydrations
K = 2 (date:, decimal:)
ops_per_request = 3 # sentinel + JSON.parse
+ 4,388 # P (typeof checks)
+ 2,140 x 2 # S·K (startsWith fan-out, worst-ish case)
+ 2 x 428 # 2·R (slice + constructor)
= 3 + 4,388 + 4,280 + 856
≈ 9,527 sync ops per page load
For 50 concurrent users:
ops_total = 50 x 9,527 ≈ 476,350 sync ops
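The arithmetic above can be reproduced with a few lines of TypeScript, which also makes it easy to plug in other page shapes. This is only a sketch of the counting model from this post; the ReadShape interface and readOps function are names invented for the example, and the formula uses the worst-ish case S·K term as in the worked numbers.

```typescript
// Hypothetical helper mirroring the op-count model described in this post.
interface ReadShape {
  items: number; // objects in the cached array
  fieldsPerItem: number; // properties per object
  stringsPerItem: number; // properties whose value is a string
  revivedPerItem: number; // string properties that carry a prefix
  prefixes: number; // K, the number of startsWith checks
}

function readOps(shape: ReadShape): { P: number; S: number; R: number; ops: number } {
  // Root array + item objects + item fields.
  const P = 1 + shape.items + shape.items * shape.fieldsPerItem;
  const S = shape.items * shape.stringsPerItem;
  const R = shape.items * shape.revivedPerItem;
  // 3 fixed ops (two sentinel compares + one JSON.parse call),
  // one typeof per node, K startsWith per string, two ops per rehydration.
  const ops = 3 + P + S * shape.prefixes + 2 * R;
  return { P, S, R, ops };
}

const perRequest = readOps({
  items: 107,
  fieldsPerItem: 40,
  stringsPerItem: 20,
  revivedPerItem: 4,
  prefixes: 2,
});

const concurrentUsers = 50;
const total = concurrentUsers * perRequest.ops;
```

With the incident shape, perRequest.ops comes out to 9,527 and total to 476,350, matching the figures above.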
V8's Fast and Slow Parse Paths
Since Node is single threaded, this all stacks up and blocks the event loop.
But the operation count doesn't tell the whole story. Each of those 9,527 ops is
running through V8's slow JSON.parse path, not the fast one. V8 actually has
two parsers under the hood. The fast one builds objects in place with stable
hidden classes (good for downstream code, since predictable shapes mean V8 can
optimize the consumers of our cached data). The slow one walks the parsed tree
after the fact, boxes every primitive into a JS value, and hands each one off to
a user-supplied function.
Want to guess which one you opt into the moment you pass a reviver? The slow one.
And it doesn't matter that our reviver returns values unchanged on the vast
majority of nodes; V8 has no way to know that ahead of time, so the entire parse
runs along the slow path. It gets worse. There is a
known issue (this seems to be a V8 issue rather than a Node issue, but as far
as I can tell it has not been reported to the Chromium team) where even a
completely no-op reviver, literally (k, v) => v, makes the resulting object
graph roughly 4x heavier in memory. The reason is that the TC39 JSON.parse
source text access proposal, which V8 already ships, added a third argument to
revivers that lets you access the original source text for each value, and V8
has to retain that metadata on every primitive in case your reviver decides to
use it.
Our reviver doesn't use it, but that doesn't matter. We pay for it anyway. So we aren't just doing more synchronous work per parse; we're doing slower synchronous work, on a heavier object graph, that then puts more pressure on the garbage collector, which then blocks the event loop even more. Blocking the very thing we want to unblock.
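If you want to see the reviver penalty on your own machine, a rough timing sketch like the one below is enough to start with. It is not our benchmark harness: buildPayload and timeParses are hypothetical helpers, the 40-field payload is synthetic, and Date.now() is a coarse clock, so treat the numbers as directional only.

```typescript
// Build a flat 40-field object similar in width to an auction item (synthetic data).
function buildPayload(fields: number): string {
  const item: Record<string, string | number> = {};
  for (let i = 0; i < fields; i++) {
    item['field' + i] = i % 2 === 0 ? 'value-' + i : i;
  }
  return JSON.stringify(item);
}

// Time `iterations` synchronous parses, optionally with a reviver.
function timeParses(
  payload: string,
  iterations: number,
  reviver?: (key: string, value: unknown) => unknown
): number {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    JSON.parse(payload, reviver);
  }
  return Date.now() - start;
}

const payload = buildPayload(40);
const iterations = 50000;

// Same input, same output; the only difference is whether a reviver is passed.
const plainMs = timeParses(payload, iterations);
const noopReviverMs = timeParses(payload, iterations, (_key, value) => value);
```

Comparing plainMs with noopReviverMs shows the cost of passing a reviver at all, since the no-op reviver changes nothing about the result. Run it a few times and discard the first run so JIT warmup does not skew the comparison.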
Solutions Considered
After some discussion internally (both with humans and AI) we came up with a list of ideas to speed this up. We specifically focused on the deserialization (read) side first, as that is where the performance hits us most when there is a storm of users loading fundraiser pages:
- Schema-driven walker: Declare the Date / Prisma.Decimal leaves upfront, walk only those paths at deserialization time. Skip the per-key reviver entirely. (Inspired by fast-json-stringify's encode-side approach: commit to a schema upfront, take shortcuts in the hot path by only walking/accessing what is necessary.) Since we control the exact shapes and serialization, we deterministically know what dot paths need to be rehydrated.
- Native JSON parser: Replace V8's JSON.parse with simdjson, C++ bindings around the SIMD-accelerated parser. Since we cannot alter the behaviour of JSON.parse itself, maybe there is a faster JSON parser out there?
- Batch coalescing: Concatenate N Valkey MGET values into a single JSON array and parse it once. One parser invocation instead of N.
- Per-schema codegen walker: new Function-emitted per model cached by the Model Cache. This would be monomorphic, so V8 can inline-cache aggressively per model shape.
- Native Date constructor: Swap date-fns's parseISO for new Date. The bare-format encoder writes toISOString() output, which is the canonical ECMAScript layout the C++ Date(string) parser fast-paths.
- LRUCache: Add an LRUCache in front of the MGET with a small TTL that just helps smooth out those bursts of queries. It would cache the deserialized object so we only pay the O(1) lookup cost.
We took these possible options and iterated on them by building benchmarks and attempting to reproduce close to production conditions to see how they affected Event Loop Lag, p9x, CPU, and Heap Size.
Iteration 1: Schema-Driven Walker + simdjson
Now that we had an initial plan of attack, I wrote a benchmark repo that takes the incident object shape and built out four scenarios: small, medium, large, and heavy.
These scenarios were run against a matrix of configurations:
- Baseline: The current system using JSON.parse and the reviver.
- simdjson with traversal: simdjson.parse followed by a full object traversal (walks every property).
- simdjson path-based traversal on each object: simdjson.parse with trie-driven walker visiting only declared Date / Prisma.Decimal paths.
- concatenated JSON array with JSON.parse: Batch-concatenate multiple cache values into a single JSON array string, parse once with JSON.parse.
- concatenated JSON with simdjson and traversal: Same as #4 but using simdjson.parse instead of JSON.parse, with full traversal.
- concatenated JSON with simdjson and path-based: Same as #4 but using simdjson.parse with schema-driven trie walker.
- LRU baseline standard: Add an in-memory LRUCache layer with standard deserialization.
- LRU with simdjson traversal: LRUCache layer using simdjson.parse + full traversal for cache misses.
- LRU with simdjson path-based: LRUCache layer using simdjson.parse + trie walker for cache misses.
We added a dimension to the matrix for the hit rate % of the LRUCache at 0%, 25%, 50%, 75%, and 100%.
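To make the LRU configurations concrete, here is a small sketch of the kind of layer they describe: a bounded in-memory cache with a short TTL that hands back already-deserialized objects on a hit. It is an illustration only; TtlLruCache and getOrLoad are hypothetical names, and the benchmark itself may well have used an off-the-shelf LRU library.

```typescript
// Minimal TTL-bounded LRU built on Map insertion order (illustrative only).
class TtlLruCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock so expiry can be tested deterministically
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // On a miss, run the (expensive) load + deserialize step once and keep the result.
  getOrLoad(key: string, load: () => V): V {
    const hit = this.get(key);
    if (hit !== undefined) return hit;
    const value = load();
    this.set(key, value);
    return value;
  }
}
```

A hit skips both the Valkey round trip and the parse, which is why the 100% hit-rate rows are flat. The trade-off is that every caller within the TTL shares the same object instance, so callers must treat cached values as read-only.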
This was all tested against a prototype of the system where we built a new
PathReviver class that was type safe for each instance of a Model Cache such
that it required that each dot path pointing to a Date / Prisma.Decimal leaf
be declared in the ReviverSchema<T>.
export type ReviverSchema<T> = {
readonly [P in DeepRevivablePath<T>]: LeafKindFor<ResolvePathLeaf<T, P>>;
};
A simple version of this for one of our models GuestUser with the schema:
model GuestUser {
id String @id
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
cursor Int @unique @default(autoincrement())
}
Would be:
export const GUEST_USER_REVIVER_SCHEMA: ReviverSchema<GuestUser> = {
createdAt: 'date',
updatedAt: 'date',
};
This ensured that if model shapes ever changed, our reviver schema was
compile-time safe and developers were forced to declare a concrete object that
the PathReviver could use to compile the tries for the most efficient object
walking to ensure we minimize traversal.
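As a rough illustration of what path-based revival means in practice, the sketch below parses without a reviver and then touches only the declared dot paths. It is a simplified stand-in for the real PathReviver, not its implementation: it uses a flat list of pre-split paths instead of a trie, treats arrays as transparent, and FakeDecimal, compileSchema, and parseWithPaths are hypothetical names.

```typescript
type LeafKind = 'date' | 'decimal';
type FlatSchema = Readonly<Record<string, LeafKind>>;
type CompiledPath = { segments: string[]; kind: LeafKind };

// Hypothetical stand-in for Prisma.Decimal.
class FakeDecimal {
  readonly digits: string;
  constructor(digits: string) {
    this.digits = digits;
  }
  toString(): string {
    return this.digits;
  }
}

// Split each dot path once, up front, instead of on every read.
function compileSchema(schema: FlatSchema): CompiledPath[] {
  return Object.entries(schema).map(([path, kind]) => ({ segments: path.split('.'), kind }));
}

function revivePath(target: unknown, segments: string[], index: number, kind: LeafKind): void {
  if (target === null || typeof target !== 'object') return;
  if (Array.isArray(target)) {
    // Arrays are transparent: apply the same remaining path to every element.
    for (const element of target) revivePath(element, segments, index, kind);
    return;
  }
  const holder = target as Record<string, unknown>;
  const key = segments[index] as string;
  if (index < segments.length - 1) {
    revivePath(holder[key], segments, index + 1, kind);
    return;
  }
  const raw = holder[key];
  // Only strings are rehydrated; null or missing leaves stay untouched.
  if (typeof raw === 'string') {
    holder[key] = kind === 'date' ? new Date(raw) : new FakeDecimal(raw);
  }
}

function parseWithPaths(json: string, compiled: CompiledPath[]): unknown {
  const parsed: unknown = JSON.parse(json); // no reviver, so the plain parse path is used
  for (const { segments, kind } of compiled) revivePath(parsed, segments, 0, kind);
  return parsed;
}
```

The work after the parse is proportional to the number of declared leaves rather than to every property in the payload, which is the property the benchmark configurations labeled path-based are measuring.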
Benchmark results:
| Scenario | baseline | simdjson-each-traversal | simdjson-each-path-based | concat-native | concat-simdjson-traversal | concat-simdjson-path-based | lru-baseline-standard | lru-simdjson-traversal | lru-simdjson-path-based |
|---|---|---|---|---|---|---|---|---|---|
| small | 131.8 | 21.9 (+83%) | 20.3 (+85%) | 37.1 (+72%) | 23.4 (+82%) | 23.6 (+82%) | 39.7 (+70%) | 24.3 (+82%) | 21.0 (+84%) |
| medium | 111.1 | 243.3 (-119%) ⚠ | 34.6 (+69%) | 229.2 (-106%) ⚠ | 36.5 (+67%) | 46.8 (+58%) | 105.1 (+5%) | 149.3 (-34%) | 50.0 (+55%) |
| medium-hr025 | 111.1 | 243.3 (-119%) ⚠ | 34.6 (+69%) | 229.2 (-106%) ⚠ | 36.5 (+67%) | 46.8 (+58%) | 124.6 (-12%) | 153.9 (-38%) | 56.5 (+49%) |
| medium-hr050 | 111.1 | 243.3 (-119%) ⚠ | 34.6 (+69%) | 229.2 (-106%) ⚠ | 36.5 (+67%) | 46.8 (+58%) | 110.4 (+1%) | 39.4 (+65%) | 30.9 (+72%) |
| medium-hr075 | 111.1 | 243.3 (-119%) ⚠ | 34.6 (+69%) | 229.2 (-106%) ⚠ | 36.5 (+67%) | 46.8 (+58%) | 86.4 (+22%) | 43.0 (+61%) | 30.0 (+73%) |
| medium-hr100 | 111.1 | 243.3 (-119%) ⚠ | 34.6 (+69%) | 229.2 (-106%) ⚠ | 36.5 (+67%) | 46.8 (+58%) | 10.0 (+91%) | 10.0 (+91%) | 10.0 (+91%) |
| large | 264.6 | 146.4 (+45%) | 60.9 (+77%) | 227.5 (+14%) | 153.2 (+42%) | 368.1 (-39%) ⚠ | 362.5 (-37%) | 105.6 (+60%) | 145.1 (+45%) |
| large-hr025 | 264.6 | 146.4 (+45%) | 60.9 (+77%) | 227.5 (+14%) | 153.2 (+42%) | 368.1 (-39%) ⚠ | 304.3 (-15%) | 151.3 (+43%) | 114.0 (+57%) |
| large-hr050 | 264.6 | 146.4 (+45%) | 60.9 (+77%) | 227.5 (+14%) | 153.2 (+42%) | 368.1 (-39%) ⚠ | 221.1 (+16%) | 134.9 (+49%) | 118.4 (+55%) |
| large-hr075 | 264.6 | 146.4 (+45%) | 60.9 (+77%) | 227.5 (+14%) | 153.2 (+42%) | 368.1 (-39%) ⚠ | 134.9 (+49%) | 62.8 (+76%) | 54.3 (+79%) |
| large-hr100 | 264.6 | 146.4 (+45%) | 60.9 (+77%) | 227.5 (+14%) | 153.2 (+42%) | 368.1 (-39%) ⚠ | 10.1 (+96%) | 10.0 (+96%) | 10.1 (+96%) |
| heavy | 45.8 | 131.3 (-187%) ⚠ | 43.2 (+6%) | 79.6 (-74%) | 30.0 (+34%) | 28.7 (+37%) | 154.8 (-238%) ⚠ | 35.7 (+22%) | 28.3 (+38%) |
Note: ⚠ flags loop.p99 max cells where the value is implausibly worse than
companion runs and probably driven by a single GC-pause/OS scheduling outlier in
the timed phase. The wall p99 and parse p99 companion matrices are 2-4x
lower on those same combos. Cross-strategy ranking is stable despite the
outliers. The path-based reviver is consistently fast.
At this point it seemed like the PathReviver + simdjson approach was a big
enough improvement to warrant us moving forward with it. So we extracted the
implementation from the benchmark repo and implemented it into the Trellis
monorepo and swapped all Model Cache instances to use it.
Bench table: incident-shape (107 items x 50 concurrent resolves)
| Metric | Before (per-key reviver) | After (PathReviver + simdjson) | Delta |
|---|---|---|---|
| Loop p99 max | 264.6 ms | 60.9 ms | +77% |
This looked like the story was over, but we had not yet validated simdjson
against the full end-to-end pipeline. Iteration 2 would reveal that the parser
choice was masking a regression, and we would have to revisit it.
Iteration 2: The Stringifier and the simdjson Reversal
Now that we (thought we) had the deserializer working well, we thought, "why not
do the same thing for the serializer?" We are constantly writing to the cache,
just not at as high a volume as we read from it, since many MGETs can be served
by a single MSET.
At this point we are still using the same JSON.stringify process with a full
object traversal looking for originalValue instanceof Date and
Prisma.Decimal.isDecimal(originalValue).
if (originalValue instanceof Date) {
return serializeDate(originalValue);
}
if (Prisma.Decimal.isDecimal(originalValue)) {
return serializeDecimal(originalValue);
}
This is still megamorphic, and requires traversing every property of the object to find the leaves matching the data type needing special serialization.
Note: Technically at this point we don't actually need special handling for
Date objects since they will automatically use toISOString() when being
serialized but this was a relic of us needing the date:/decimal:
serialization prefixes before we moved to path-based deserialization.
Why bother traversing the object to serialize when we already know exactly what
properties will be deserialized into what types? There isn't a point. At this
point, since we do not need to add the date:/decimal: prefixes anymore, we
can just JSON.stringify the object and write it to Valkey.
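A quick way to convince yourself that the prefixes are no longer needed on the write side is to stringify an object with no replacer at all. The sketch below assumes the decimal type exposes a toJSON() that returns its string form; DecimalLike is a hypothetical stand-in written for this example, so check your own decimal library before relying on that.

```typescript
// Hypothetical stand-in for a decimal type that exposes toJSON() (assumption).
class DecimalLike {
  readonly digits: string;
  constructor(digits: string) {
    this.digits = digits;
  }
  toJSON(): string {
    return this.digits;
  }
}

const bid = {
  id: 'bid_1',
  amount: new DecimalLike('125.50'),
  createdAt: new Date(Date.UTC(2026, 0, 2, 3, 4, 5, 678)),
};

// No replacer: JSON.stringify calls toJSON() on any value that defines one,
// and Date.prototype.toJSON returns the toISOString() form.
const wire = JSON.stringify(bid);
// wire === '{"id":"bid_1","amount":"125.50","createdAt":"2026-01-02T03:04:05.678Z"}'
```

The output carries no type markers, which is fine here because the read side already knows from the reviver schema which paths are dates and which are decimals.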
But if we can just stringify now, why not use a faster stringifier?
fast-json-stringify is one
of these options, from their documentation:
fast-json-stringify is significantly faster than JSON.stringify() for small payloads. Its performance advantage shrinks as your payload grows.
This is great for us as we are never stringifying huge objects, rather many small objects.
| Benchmark | JSON.stringify (ops/sec) | fast-json-stringify (ops/sec) |
|---|---|---|
| short string | 12,114,052 | 29,408,175 |
| obj | 4,577,494 | 7,291,157 |
| date | 803,522 | 1,117,776 |
Note: Benchmarks pulled from the fast-json-stringify repo.
The catch is that we cannot just swap fast-json-stringify in for
JSON.stringify: it requires a concrete JSON schema so it knows how to build
the string. Since it uses string concatenation, we need to define the schema we
are stringifying upfront.
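To show why a schema lets a stringifier take shortcuts, here is a hand-rolled toy version of the idea. To be clear, this is not fast-json-stringify's API or implementation, just an illustration of the principle; compileStringifier and its tiny FieldKind vocabulary are invented for the example, and it assumes every field is present, numbers are finite, and objects are flat.

```typescript
type FieldKind = 'string' | 'number' | 'date';
type ObjectSchema = Readonly<Record<string, FieldKind>>;

// Returns a function specialized to one schema. Key names, quotes, and commas
// are computed once here rather than rediscovered on every call.
function compileStringifier(schema: ObjectSchema): (value: Record<string, unknown>) => string {
  const fields = Object.entries(schema).map(([name, kind], index) => ({
    prefix: (index === 0 ? '' : ',') + JSON.stringify(name) + ':',
    name,
    kind,
  }));

  return (value) => {
    let out = '{';
    for (const field of fields) {
      const raw = value[field.name];
      out += field.prefix;
      if (field.kind === 'number') {
        out += String(raw); // assumes a finite number
      } else if (field.kind === 'date') {
        out += '"' + (raw as Date).toISOString() + '"'; // ISO strings need no escaping
      } else {
        out += JSON.stringify(raw); // arbitrary strings still need escaping
      }
    }
    return out + '}';
  };
}

const stringifyGuestUser = compileStringifier({
  id: 'string',
  createdAt: 'date',
  updatedAt: 'date',
  cursor: 'number',
});
```

Because the shape is fixed ahead of time, the hot path never inspects types or enumerates keys; it only concatenates. That is also why the schema has to be declared upfront, and why a field missing from the schema is silently dropped from the output.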
Luckily, this is an area that is being well-developed, especially with the
Standard Schema specification being defined. Since we were already considering
ArkType for the validation side, I decided to go with ArkType as its API is very
nice to work with, and ArkType supports exporting a type as a JSON Schema
with toJsonSchema(), including draft-07, which is what
fast-json-stringify expects.
To encapsulate this logic we built a Stringifier<T> class that takes a T
(the model being stringified) as a generic and accepts an ArkType type() as a
constructor argument. To ensure maximum type safety, the Stringifier class
validates that the type() and T are compatible.
Here is a sample of the Stringifier class and an example (reduced for
brevity):
import { type, type Type } from 'arktype';
import fastJsonStringify from 'fast-json-stringify';
type Equal<X, Y> =
(<T>() => T extends X ? 1 : 2) extends <T>() => T extends Y ? 1 : 2
? true
: false;
export class Stringifier<T, A extends { infer: unknown } = Type<T>> {
private readonly compiled: (value: T) => string;
constructor(
arkType: A & (Equal<A['infer'], T> extends true ? unknown : never)
) {
const jsonSchema = arkType.toJsonSchema({
target: 'draft-07',
fallback: {
// Required for unmapped types, like `Prisma.Decimal`
default: (ctx) => ({ ...ctx.base, type: 'string' }),
},
});
this.compiled = fastJsonStringify(jsonSchema) as (value: T) => string;
}
stringify(value: T): string {
return this.compiled(value);
}
}
export const GUEST_USER_STRINGIFIER = new Stringifier<GuestUser>(
type({
id: 'string',
createdAt: 'Date',
updatedAt: 'Date',
cursor: 'number',
})
);
This will ensure at compile time that the types are compatible and that we are stringifying efficiently with a concrete schema that matches the model that is being cached. If the model ever drifts (like if we alter/add/remove a field in the database), then we will be required to handle it here.
The Regression:
At this point we ran an end-to-end benchmark to compare the performance of the serialization and deserialization cycle, at a high concurrency to compare the before and after of these systems. What we found surprised us and did not align with the original benchmarks.
This project's KPI is event-loop block tails and OOM, not median throughput.
Here's how each axis lands today (without the simdjson → JSON.parse swap).
| Axis | PathReviver (current) | Legacy walker | Verdict |
|---|---|---|---|
| Median throughput (deserialize) | 15 521 ops/s | 27 021 ops/s | Worse (0.57x) |
| p99 latency (deserialize) | 223.9 µs | 116.7 µs | Worse (1.92x) |
| Max latency (deserialize) | 17.4 ms | 25.7 ms | Better (0.68x) |
| Peak RSS (deserialize stage) | 206 MB | 426 MB | Better (0.48x) |
| CPU user-ms per 200k ops (deserialize) | 12.2 s | 7.3 s | Worse (1.67x) |
| Round-trip max latency | 7.2 ms | 17.5 ms | Better (0.41x) |
| Serialize max latency | 9.7 ms | 112.7 ms | Massively better (0.09x) |
| Serialize CPU user-ms | 3.2 s | 9.7 s | Better (0.33x) |
Based on 200,000 ops x 128 concurrency headline benchmark.
The median throughput, p99 latency (deserialize), and
CPU user-ms per 200k ops (deserialize) all regressed... badly.
So what gives? After some investigation, it turns out, simdjson.parse is ~4.4x
slower than V8's JSON.parse on our cached payload sizes. C-binding-per-call
overhead is ~20 µs, and SIMD parsing doesn't recover that cost on documents this
small. simdjson's marketing numbers come from multi-MB documents. This was
likely an error with how our original benchmarks were set up. I probably took
them at face value without digging into them too much and just accepted what I
thought was a win (benchmarking is hard).
| Stage | ops/sec | ns/op | p99 (µs) |
|---|---|---|---|
| simdjson.parse | 39 284 | 25 456 | 35.3 |
| JSON.parse (bare input) | 171 087 | 5 845 | 7.2 |
| JSON.parse (prefix input) | 165 284 | 6 050 | 7.5 |
Some metrics were still better, and anything serialization-related was still
better due to our more performant Stringifier implementation. But for the
deserialization side, it is much more performant to just rely on the JSON.parse
built into V8.
Before switching, though, we had a few hypotheses to test:
- Hypothesis 1: A payload size threshold makes simdjson win.
- Hypothesis 2: Coalescing N reads into a single array parse is a win.
- Hypothesis 3: Realistic shape variety closes the gap.
Hypothesis 1 - A payload size threshold makes simdjson win:
The ~3 KB Booking fixture used in the last benchmark showed simdjson is ~4.4x
slower than JSON.parse. But simdjson is built for parsing gigabytes per
second on large documents. The C-binding-per-call overhead amortizes differently
on a 100 KB payload than on a 500 B one.
We sampled the sizes of the payloads we are caching in production to build a
distribution of payload sizes to run this benchmark against. If simdjson
crosses over to "faster" somewhere in this range, we should pick the parser at
construction time based on a size estimate, not unconditionally swap. The bench
ran against all 10+ shapes and would report the crossover point if one existed.
Ultimately, the C-binding-per-call overhead dominates at every size and we saw a
3-5x improvement using JSON.parse over simdjson.
One hypothesis down.
Hypothesis 2 - Coalescing N reads into a single array parse is a win:
The theory here is that since we are pipelining many MGET operations together
then we might be able to do something like:
const stringResults = await pipelinesMGets();
return JSON.parse(`[${stringResults.join(',')}]`);
Here we are essentially concatenating the results of many MGET operations into
a single stringified array and doing a single parse operation on the result.
This result is a bit more intricate; it is not a simple better/worse. Coalescing
wins on small fixtures, up to 2.1x, but is neutral to worse on medium to large
fixtures: once we get above ~1 KB fixture size, the cost of building the array
wipes out the performance gain from a single JSON.parse. Additionally, this
hides a tail-latency issue: when work is batched together, the maximum
event-loop block per call grows.
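One practical wrinkle with the coalescing idea is cache misses. The sketch below compares per-record parsing with a coalesced parse and assumes, as a simplification, that each MGET slot is either a valid JSON string or null for a miss; parseEach and parseCoalesced are hypothetical helper names, and sentinel values such as __NULL__ would need the same kind of mapping before they could be joined.

```typescript
// Per-record approach: one JSON.parse call per cached value.
function parseEach(values: Array<string | null>): unknown[] {
  return values.map((value) => (value === null ? null : JSON.parse(value)));
}

// Coalesced approach: build one JSON array string and parse it once.
function parseCoalesced(values: Array<string | null>): unknown[] {
  // A miss must become the literal `null`. Array.prototype.join renders a raw
  // null as an empty string, which would produce "a,,b" and invalid JSON.
  const parts = values.map((value) => (value === null ? 'null' : value));
  return JSON.parse('[' + parts.join(',') + ']') as unknown[];
}

const mgetResults: Array<string | null> = ['{"id":1}', null, '[1,2]', '"x"'];
const oneByOne = parseEach(mgetResults);
const coalesced = parseCoalesced(mgetResults);
```

Both functions return the same values in the same order. The difference is purely in how the synchronous work is chunked: N short parses versus one longer parse plus the string building, which is where the tail-latency concern above comes from.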
We opted to keep the per record JSON.parse approach for now, and leave
dynamically coalescing based on fixture size as a future optimization to keep
things simple right now.
Second hypothesis (mostly) down.
Hypothesis 3 - Realistic shape variety closes the gap:
Using realistic shapes changes the picture but doesn't fix it. The aggregate
"PathReviver is 0.6x" from the Booking fixture (a ~3 KB production payload
used in an interim benchmark to cross-check the size sweep) was specifically a
simdjson artifact, not a "PathReviver is slow on small payloads" story. The
size sweep above shows the trie walker is a consistent ~5-20% win across the
entire production distribution as long as the parser choice is correct.
Two interesting per-shape observations the headline bench missed:
- GuestUser (~150 B) is the only fixture where PathReviver + JSON.parse is slightly slower than the legacy walker (0.94x). At this size the trie-compile-cache hit plus the path lookup overhead barely matters. The miss is 6% on a fixture so cheap that the absolute difference is ~340 ns/op.
- Fundraiser (~4 KB, 70 fields) is the biggest PathReviver win (1.22x). Wide-flat objects with many declared revival paths exactly match what the trie was built for.
The third and final hypothesis, down.
Final Results:
Benchmark: 20,000 ops per cell x concurrency 64
Speed-up per (fixture x scenario) vs the legacy prefix-marker baseline:
| Fixture | Bytes | Serialize | Deserialize | Round-trip |
|---|---|---|---|---|
| GuestUser | ~150 B | 1.64x | 0.93x | 1.62x |
| AuctionBid | ~400 B | 2.30x | 1.03x | 1.48x |
| PaddleTipConfig | ~500 B | 2.78x | 0.99x | 1.46x |
| User | ~800 B | 1.67x | 1.06x | 1.30x |
| Item | ~1.5 KB | 2.01x | 1.04x | 1.28x |
| AuctionItem | ~3 KB | 1.66x | 1.19x | 1.22x |
| GivingLevels | ~3 KB | 2.43x | 1.04x | 1.95x |
| Fundraiser | ~15 KB | 2.06x | 1.23x | 1.21x |
| FeatureFlags | ~100 KB | 2.62x | 1.15x | 1.73x |
Serialize: 1.6-2.8x faster on every fixture. Stringifier is the largest single win this project has shipped.
Deserialize: matched-or-faster on 8 of 9 fixtures. The one regression and one
tie are at the very smallest end of the distribution (GuestUser 0.93x,
PaddleTipConfig 0.99x) where absolute differences are sub-microsecond and the
trie compile-cache hit per call is the floor.
Note: These benchmarks were run on an M3 Pro with 36GB of RAM locally. We expect the improvements to be more pronounced when running in a more resource constrained environment and at a higher concurrency with more event loop contention.
If we look at the worst max-latency observed under load between the previous and current algorithms:
- FeatureFlags round-trip max: 45.9 ms → 6.4 ms (7.2x)
- FeatureFlags deserialize max: 11.6 ms → 3.1 ms (3.7x)
- GivingLevels round-trip max: 3.9 ms → 1.6 ms (2.4x)
- SponsorBundle deserialize max: 2.1 ms → 0.9 ms (2.3x)
- AuctionItem deserialize max: 0.7 ms → 0.2 ms (3.5x)
We can see that the speed increase we get under load when there is high contention is significant, up to 7.2x in the round-trip time for larger fixtures.
Iteration 3: Codegen Walker and Native Date Parsing
At this point we were almost done, but there were two other hypotheses we wanted to test as one last-ditch effort to improve the performance even more.
- By using a function codegen per model, we can allow V8 to optimize the calls, and we would end up with a monomorphic function instead of our current megamorphic function.
- Swap date-fns parseISO for new Date(value) in the PathReviver implementation.
I had high hopes for #2 but not for #1. parseISO is ~250 lines of pure JS date
parsing logic, while new Date(s) resolves directly to a native C++ fast path
in V8, so the per-leaf savings seemed obvious. Codegen (#1), by contrast, felt
like a micro-optimization that might not survive the noise floor given that IC
warmup takes many iterations and our schema set is small. We benchmarked both
anyway.
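The swap to new Date(value) leans on one invariant: the write side always emits toISOString() output. A small sketch of that round trip is below; reviveIsoDate is a hypothetical helper added for the example, and its validity check is an extra guard of my own rather than something the PathReviver is known to do.

```typescript
// Revive a date written with toISOString(); reject anything unparseable.
function reviveIsoDate(iso: string): Date {
  const date = new Date(iso);
  // new Date() does not throw on bad input, it returns an Invalid Date.
  if (Number.isNaN(date.getTime())) {
    throw new RangeError('Not a parseable date string: ' + iso);
  }
  return date;
}

const original = new Date(Date.UTC(2026, 2, 14, 9, 26, 53, 589));
const iso = original.toISOString(); // "2026-03-14T09:26:53.589Z"
const revived = reviveIsoDate(iso);
```

The round trip is lossless at millisecond precision, so revived.getTime() equals original.getTime(). Strings in other layouts are a different story: how new Date parses them is implementation-defined, which is why this only works as long as the serializer controls the format.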
Hypothesis 1 - Monomorphic function:
The current PathReviver.walk visits ReviverTrieNode objects generically. At
each call site V8 sees a different hidden class, because different schemas
produce nodes with different shapes.
This is megamorphic: V8 cannot build a useful inline cache and must fall back to a generic property lookup every time.
// This is megamorphic: trie walker sees many different node shapes
function walk(node: ReviverTrieNode, target: unknown): void {
  // IC misses on every schema
  const children = node.children;
  for (const [segment, child] of children) {
    // different Map instances per schema
    if (Array.isArray(target)) {
      // megamorphic dispatch
      descendArray(child, target, segment);
    } else {
      // megamorphic dispatch
      descendObject(child, target, segment);
    }
  }
}
function rehydrateObjectLeaf(
  node: ReviverTrieNode,
  target: Record<string, unknown>,
  key: string
): void {
  if (node.kind === 'date') {
    // 250-LOC date-fns fn, V8 won't inline
    target[key] = parseISO(target[key] as string);
  } else if (node.kind === 'decimal') {
    target[key] = new Decimal(target[key] as string);
  }
}
Every Map.get, for-of, string-union branch, and parseISO call runs through
V8's slow path on every cache read. The theory is that we can optimize this by
precomputing these hydration functions per model and caching them so they are
static and can be optimized by the inline cache.
Instead of a WeakMap<schema, ReviverTrieNode>, keep a
WeakMap<schema, (target: object) => void>, a compiled walker built once on
first sight, then reused on every deserialize call for that schema.
Since the function is specialized to one schema (it literally will not work for
any other model since it's tailored to that specific shape), every property
access is against one known hidden class. V8 builds a monomorphic inline
cache on the first hot iteration and never misses again. This requires using
new Date(s) instead of parseISO(s) at the leaf so the call site resolves to
a native C++ fast path and there is no JS frame to enter. The implementation of
this monomorphic function compiler is sort of long and boring, so I will leave
it out, but here is a sample of what a compiled function looks like.
Flat schema for ImmutableAuctionBid that has three Date leaves:
const ImmutableAuctionBidSchema: ReviverSchema<ImmutableAuctionBid> = {
  createdAt: 'date',
  updatedAt: 'date',
  voidedAt: 'date',
};
Emitted function body (new Function body string):
function compiledWalker(target) {
  const createdAt_ = target.createdAt;
  if (typeof createdAt_ === 'string') target.createdAt = new Date(createdAt_);
  const updatedAt_ = target.updatedAt;
  if (typeof updatedAt_ === 'string') target.updatedAt = new Date(updatedAt_);
  const voidedAt_ = target.voidedAt;
  if (typeof voidedAt_ === 'string') target.voidedAt = new Date(voidedAt_);
}
Note: The case with arrays gets a little messier and will make this already long blog post longer, so I am leaving it out for brevity. Feel free to reach out if you are interested, and I can write another post going into the details.
Every property access (target.createdAt, target.updatedAt,
target.voidedAt) is against the shape of the same hidden class
ImmutableAuctionBid. V8 builds a monomorphic IC on the first call, and the
loop runs at near-native speed.
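To give a feel for the compiler that was left out, here is a minimal sketch of how such a function could be emitted and cached. This is not the production compiler: the names FlatSchema, compileWalker, and walkerCache are hypothetical, and it only handles flat schemas with Date leaves (no nested paths, arrays, or decimals).

```typescript
type FlatSchema = Record<string, 'date'>;
type CompiledWalker = (target: Record<string, unknown>) => void;

// One compiled walker per schema object, built on first sight and reused.
const walkerCache = new WeakMap<FlatSchema, CompiledWalker>();

function compileWalker(schema: FlatSchema): CompiledWalker {
  const cached = walkerCache.get(schema);
  if (cached) return cached;

  // Emit straight-line code: one read, one typeof check, one write per leaf.
  const lines: string[] = [];
  Object.keys(schema).forEach((key, i) => {
    // JSON.stringify quotes the key so unusual key names cannot break the emitted source.
    const k = JSON.stringify(key);
    lines.push(`const v${i} = target[${k}];`);
    lines.push(`if (typeof v${i} === 'string') target[${k}] = new Date(v${i});`);
  });

  const walker = new Function('target', lines.join('\n')) as unknown as CompiledWalker;
  walkerCache.set(schema, walker);
  return walker;
}

// Usage: compile once per schema, then call on every parsed payload.
const bidSchema: FlatSchema = { createdAt: 'date', updatedAt: 'date', voidedAt: 'date' };
const reviveBid = compileWalker(bidSchema);
const bid: Record<string, unknown> = JSON.parse(
  '{"createdAt":"2026-01-01T00:00:00.000Z","updatedAt":"2026-01-02T00:00:00.000Z","voidedAt":null}'
);
reviveBid(bid); // createdAt and updatedAt become Date objects; voidedAt stays null
```

Because new Function evaluates a string as code, the schema keys have to come from trusted, developer-authored schemas, which is one of the reasons this approach deserves extra scrutiny before shipping.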
Hypothesis 2 - Use new Date(value) instead of parseISO(value):
Using new Date(value) instead of parseISO(value) seemed like a no-brainer.
It is a native V8 call, is highly optimized, and results in a single call
rather than a function call into parseISO that does additional parsing and
logic. Since we control the serialization of dates into the cache, there is no
risk of the data being corrupted (beyond what already existed).
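The reason controlling the write side matters can be shown in a few lines. Date.prototype.toISOString always emits the ECMAScript date-time string format, which new Date is required to parse, whereas other string formats fall back to implementation-defined parsing. The sketch below uses hypothetical helper names (encodeDate, decodeDate), and the NaN guard is an optional extra for illustration, not something the walker described here does.

```typescript
// Write side: always emit the ISO 8601 form that the spec guarantees new Date can parse.
function encodeDate(d: Date): string {
  return d.toISOString();
}

// Read side: native parse. new Date never throws on bad input; it returns an
// "Invalid Date" whose getTime() is NaN, so a guard has to be explicit if wanted.
function decodeDate(s: string): Date {
  const d = new Date(s);
  if (Number.isNaN(d.getTime())) {
    throw new RangeError(`not a parseable date: ${s}`);
  }
  return d;
}

const original = new Date(Date.UTC(2026, 0, 1, 12, 30, 0, 250));
const roundTripped = decodeDate(encodeDate(original));
console.log(roundTripped.getTime() === original.getTime()); // true
```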
With these two hypotheses in mind, these are the cases we went out to test:
| Walker | Date parser | Implementation key |
|---|---|---|
| trie | parseISO | trie+parseISO (current) |
| trie | new Date | trie+nativeDate |
| codegen | parseISO | codegen+parseISO |
| codegen | new Date | codegen+nativeDate (target state) |
| - | - | JSON.parse only (ceiling) |
With these implementation scenarios in hand we ran the options through a gauntlet like we did before:
| fixture | date-swap: trie+Date vs trie+parseISO | codegen: cg+parseISO vs trie+parseISO | compound: cg+Date vs cg+parseISO | vs ceiling: cg+Date vs parse |
|---|---|---|---|---|
| GuestUser | 3.99x | 1.44x | 2.98x | 0.52x |
| AuctionBid | 2.29x | 0.99x | 2.43x | 0.59x |
| PaddleTipConfig | 2.33x | 1.11x | 2.35x | 0.37x |
| User | 2.35x | 1.05x | 2.36x | 0.60x |
| Item | 1.83x | 1.01x | 1.83x | 0.75x |
| AuctionItem | 1.78x | 1.02x | 1.86x | 0.76x |
| GivingLevels | 2.59x | 1.09x | 2.76x | 0.43x |
| SponsorBundle | 2.63x | 1.06x | 2.60x | 0.63x |
| Fundraiser | 1.31x | 0.89x | 0.84x | 0.39x |
| FeatureFlagsLarge | 2.09x | 1.04x | 2.07x | 0.82x |
| median | 2.31x | 1.04x | 2.35x | 0.59x |
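For anyone wanting to check the summary row: it is consistent with a plain per-column median over the ten fixtures. A small sketch, using the date-swap column from the table (the median helper is just for illustration):

```typescript
// Median of a list of numbers; averages the two middle values for even-length input.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// date-swap speedups (trie+Date vs trie+parseISO), in table order
const dateSwap = [3.99, 2.29, 2.33, 2.35, 1.83, 1.78, 2.59, 2.63, 1.31, 2.09];
console.log(median(dateSwap).toFixed(2)); // "2.31"
```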
The biggest win was clearly swapping parseISO for new Date, regardless of
whether the walker was codegen or trie. But what about the codegen+Date
scenario compared to the trie+Date scenario?
This yielded positive results, but they were only slightly better than the trie+Date scenario, landing somewhere between a 5–13% improvement over it. Nothing to scoff at, but it also introduces a much larger change to the code, as we would now need to pre-compile dynamic functions. This is being left for a subsequent iteration for the time being, as I would want to do extensive testing and edge case analysis on it to ensure we are not breaking the caching pipeline.
Where We Ended Up
Overall, we killed the per-key reviver that was being used with JSON.parse and
moved to a pre-compiled trie walker on the deserialization side and built a
custom Stringifier for the serialization side to speed that up.
Additionally, we relied on native date parsing instead of a third-party library since we knew dates were going to be stringified correctly.
The v1 design, JSON.stringify(value, replacer) / JSON.parse(value, reviver),
walked every node of the parsed tree and ran a per-key JS callback, with
startsWith('date:') / startsWith('decimal:') probes on every string leaf and
date: / decimal: prefix markers in the encoded bytes.
The v2 design replaces both halves:
Write: a per-model Stringifier owns a compiled fast-json-stringify
function. The wire bytes are ISO 8601 strings/numeric strings, no date: or
decimal: prefix markers.
if v === null            → return "__NULL__"
if v === undefined       → return "__UNDEFINED__"
if typeof v !== 'object' → return JSON.stringify(v)   # scalar bypass
return stringifier.stringify(v)                       # compiled fast-json-stringify
stringifier.stringify(v):
  return this.compiled(v)                             # straight-line emitted JS
→ Valkey.SET(cacheKey, string)   # bytes look like {"createdAt":"2026-01-01T00:00:00.000Z","tipPercent":"0.05",...}
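The write path above can be sketched in TypeScript as follows. To keep the example dependency-free, the compiled fast-json-stringify function is replaced by a stand-in that simply calls JSON.stringify; the names serialize, Compiled, and fallbackCompiled are hypothetical and only illustrate the sentinel and scalar-bypass logic.

```typescript
// Shape of a per-model compiled stringifier (in the real design this would be
// the function produced by fast-json-stringify for that model's schema).
type Compiled = (value: object) => string;

const NULL_SENTINEL = '__NULL__';
const UNDEFINED_SENTINEL = '__UNDEFINED__';

// Stand-in only: plain JSON.stringify, which already writes Dates as ISO 8601 strings.
const fallbackCompiled: Compiled = (value) => JSON.stringify(value);

function serialize(value: unknown, compiled: Compiled = fallbackCompiled): string {
  if (value === null) return NULL_SENTINEL;
  if (value === undefined) return UNDEFINED_SENTINEL;
  if (typeof value === 'object') return compiled(value as object);
  // Scalar bypass; assumes cacheable scalars (string, number, boolean).
  return JSON.stringify(value);
}

console.log(serialize({ createdAt: new Date('2026-01-01T00:00:00.000Z'), tipPercent: '0.05' }));
// {"createdAt":"2026-01-01T00:00:00.000Z","tipPercent":"0.05"}
```

Note that a cached string that happens to equal a sentinel is still safe, because scalars go through JSON.stringify and come out quoted.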
Read: PathReviver.deserialize(value, schema) runs one native JSON.parse
(with no reviver) then walks a precompiled trie of revivable leaf paths.
Iteration bound is the schema's child count, not the target object's key count.
Non-revivable fields are never visited.
Valkey.GET(cacheKey) → s
if s === "__NULL__"      → return null
if s === "__UNDEFINED__" → return undefined
return pathReviver.deserialize(s, schema)
PathReviver.deserialize(s, schema):
  parsed = JSON.parse(s)                        # ONE pass, native, no JS callback
  if parsed is non-null object:
    trie = this.compile(schema)                 # WeakMap-cached; null for {}
    if trie != null: this.walk(parsed, trie)    # mutates parsed in place
  return parsed
PathReviver.compile(schema):                    # amortized O(1) after first call
  for [dottedPath, kind] in schema:             # kind ∈ {'date', 'decimal'}
    segments = dottedPath.split('.')            # split ONCE at compile time
    descend the trie, creating Map<segment, node> as needed
    mark terminal node: node.leaf = kind
PathReviver.walk(target, node):                 # iterates SCHEMA children, not target keys
  if target is array:
    wildcard = node.children.get('__ARRAY__')
    if wildcard: for i in 0..target.length: descendArray(target, i, wildcard)
    return
  for [segment, child] in node.children:
    if child.leaf:
      v = target[segment]
      if typeof v === 'string':
        target[segment] = child.leaf === 'date' ? new Date(v) : new Prisma.Decimal(v)
    if child.children and target[segment] is non-null object:
      walk(target[segment], child)
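For readers who prefer real code to pseudocode, here is a compact, runnable sketch of the same read path. It is an illustration under assumptions rather than the actual PathReviver: DecimalStub stands in for Prisma.Decimal so the example has no dependencies, the schema is assumed to be a flat map of dotted paths to leaf kinds, and the sentinel handling from the cache layer is omitted.

```typescript
type LeafKind = 'date' | 'decimal';
type Schema = Record<string, LeafKind>;

// Stand-in for Prisma.Decimal (assumption, keeps the sketch self-contained).
class DecimalStub {
  readonly raw: string;
  constructor(raw: string) { this.raw = raw; }
  toString(): string { return this.raw; }
}

interface TrieNode {
  leaf?: LeafKind;
  children: Map<string, TrieNode>;
}

const trieCache = new WeakMap<Schema, TrieNode | null>();

// Build the trie once per schema object; dotted paths are split only here.
function compile(schema: Schema): TrieNode | null {
  if (trieCache.has(schema)) return trieCache.get(schema) ?? null;
  const entries = Object.entries(schema);
  if (entries.length === 0) {
    trieCache.set(schema, null);
    return null;
  }
  const root: TrieNode = { children: new Map() };
  for (const [dottedPath, kind] of entries) {
    let node = root;
    for (const segment of dottedPath.split('.')) {
      let child = node.children.get(segment);
      if (!child) {
        child = { children: new Map() };
        node.children.set(segment, child);
      }
      node = child;
    }
    node.leaf = kind;
  }
  trieCache.set(schema, root);
  return root;
}

function revive(kind: LeafKind, v: string): Date | DecimalStub {
  return kind === 'date' ? new Date(v) : new DecimalStub(v);
}

// Iterates the schema's children, never the target's own keys.
function walk(target: unknown, node: TrieNode): void {
  if (Array.isArray(target)) {
    const wildcard = node.children.get('__ARRAY__');
    if (!wildcard) return;
    for (let i = 0; i < target.length; i++) {
      const el: unknown = target[i];
      if (wildcard.leaf && typeof el === 'string') target[i] = revive(wildcard.leaf, el);
      else if (el !== null && typeof el === 'object') walk(el, wildcard);
    }
    return;
  }
  const obj = target as Record<string, unknown>;
  for (const [segment, child] of node.children) {
    const v = obj[segment];
    if (child.leaf && typeof v === 'string') obj[segment] = revive(child.leaf, v);
    else if (child.children.size > 0 && v !== null && typeof v === 'object') walk(v, child);
  }
}

function deserialize<T>(s: string, schema: Schema): T {
  const parsed: unknown = JSON.parse(s); // one native pass, no reviver callback
  if (parsed !== null && typeof parsed === 'object') {
    const trie = compile(schema);
    if (trie) walk(parsed, trie); // mutates parsed in place
  }
  return parsed as T;
}
```

A schema such as { 'items.__ARRAY__.startsAt': 'date', 'items.__ARRAY__.currentBid': 'decimal' } would make the walker enter the items array once and touch only those two fields on each element.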
Read-side sync-op count:
Let:
- B = payload byte length (input to JSON.parse)
- P = total nodes in the parsed payload (legacy v1 cost driver, no longer relevant to the walker)
- N = number of internal trie nodes visited (one per descent on the way to a leaf; structural, depth-bounded)
- A = sum of array lengths at array-wildcard descents (Σ over each __ARRAY__ site of the array's element count)
- R = number of revivable leaves actually present in the payload (Date + Decimal slots, expanded across array elements)
- L = max nesting depth of the schema trie (recursion depth only; does not scale per node)
- K = number of leaf kinds the walker can dispatch ('date', 'decimal'); dispatched by a single equality check.
Bounds:
v2 Walker cost = O(N + A + R) # independent of total payload node count P
# independent of non-revivable field count
# K does NOT multiply S (no per-string startsWith fan-out)
Takeaway:
Walker cost is proportional to the revivable leaves and the array elements that
gate them, plus a small, depth-bounded number of internal trie nodes.
Non-Date/non-Decimal fields are never touched. JSON.parse is still O(B), but it
is now V8-native with no per-node JS callback (the v1 reviver-callback constant
factor is eliminated). Nesting depth L shows up only as recursion frames, not as
repeated path.split('.') work: paths are split once at compile time and cached
as a trie in a WeakMap.
Growth class:
- Walker is linear in R + A, not in P.
- JSON.parse is linear in B.
- There is no S · K term anymore: the schema tells the walker each leaf's kind exactly, so the prefix-switch is gone from the hot path. Adding a third leaf kind (e.g. bigint) costs nothing on payloads that don't contain that kind, and costs c_leaf per occurrence on payloads that do. There is no fan-out across every string leaf.
The only term that still tracks the payload shape is JSON.parse(B), which is
unavoidable for any design that sends JSON over the wire.
Example - 50 users x 107 items x 40 fields:
Same scenario as v1: a silent auction page loads 107 auction items (40 fields
each) for 50 concurrent users. Assume ~4 revivable leaves per item (startsAt,
endsAt, currentBid, minIncrement) and that the schema uses an
items.__ARRAY__.{...} shape so the walker enters the array once.
Per-request node count (P) vs walker visits (N + A + R):
# v1
P (total parsed nodes)  = 1 + 107 + 107·40 = 4,388
# v2
N (internal trie nodes) = 2                  # root → items → __ARRAY__
A (array iterations)    = 107
R (revivable leaves)    = 107 x 4 = 428
The v2 walker visits N + A + R = 2 + 107 + 428 = 537 slots. The v1 reviver
visited all 4,388 nodes and ran startsWith on every string among them.
Per-request op count:
v1 ops_per_request ≈ 3
+ 4,388 (P typeof)
+ 4,280 (S·K startsWith)
+ 856 (2·R)
≈ 9,527 walker ops
v2 walker ops ≈ 1
+ 109 (N+A)
+ ~1,284 (≈3·R)
≈ 1,400 walker ops
The walker work drops by ~6.8x for this payload shape. The JSON.parse(value)
term is the same in big-O but its constant factor improves because there is no
JS reviver-callback invoked per node.
50 concurrent users:
ops_total ≈ 50 · [ c_parse · B + ~1,400 walker ops ]
≈ 50 · c_parse · B + ~70,000 walker ops
JSON.parse still runs on the single event loop; with ~4.4k-node payloads it's
the dominant cost. The walker contribution at ~70k ops across 50 requests is
sub-millisecond and effectively disappears against the parse cost.
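The arithmetic in this example is easy to reproduce. The sketch below recomputes the counts from the stated inputs; the function name opCounts is hypothetical, the constants 3 and 1 are the fixed per-request terms from the breakdown above, and the S·K figure of 4,280 is taken as given from the v1 breakdown. The exact v2 figure is 1,394, which the text rounds to ~1,400 (and 69,700 to ~70,000 across 50 users).

```typescript
interface OpCounts {
  parsedNodes: number;   // P
  walkerVisits: number;  // N + A + R
  v1Ops: number;
  v2Ops: number;
  speedup: number;       // v1Ops / v2Ops
}

function opCounts(items: number, fieldsPerItem: number, revivablePerItem: number): OpCounts {
  const P = 1 + items + items * fieldsPerItem; // root + items + every field
  const N = 2;                                 // root → items → __ARRAY__
  const A = items;                             // one array iteration per item
  const R = items * revivablePerItem;          // revivable leaves across all items
  const startsWithProbes = 4280;               // S·K, as given in the v1 breakdown

  const v1Ops = 3 + P + startsWithProbes + 2 * R;
  const v2Ops = 1 + (N + A) + 3 * R;
  return { parsedNodes: P, walkerVisits: N + A + R, v1Ops, v2Ops, speedup: v1Ops / v2Ops };
}

const perRequest = opCounts(107, 40, 4);
console.log(perRequest.parsedNodes);        // 4388
console.log(perRequest.walkerVisits);       // 537
console.log(perRequest.v1Ops);              // 9527
console.log(perRequest.v2Ops);              // 1394
console.log(perRequest.speedup.toFixed(1)); // "6.8"

const users = 50;
console.log(users * perRequest.v1Ops);      // 476350
console.log(users * perRequest.v2Ops);      // 69700
```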
Summary: what v2 eliminates
| Term | v1 cost | v2 cost |
|---|---|---|
| Per-node JS reviver callback (P) | c_cb · P | 0 (native parse) |
| Per-string startsWith fan-out (S·K) | c · S · K | 0 (schema-typed leaf) |
| Per-leaf rehydrate (R) | c · 2 · R | c · 3 · R (≈ same) |
| Path string split('.') at runtime | n/a | 0 (split at compile) |
| Schema compile | n/a | O(1) amortized (WeakMap) |
| Walker visits non-revivable fields | yes (all P) | no (schema-driven) |
| Bytes carry date: / decimal: prefix | yes | no (bare ISO/numeric) |
Net: the walker is now O(R + A) instead of O(P + S·K), with smaller
constants on the JSON.parse pass as well.
So for our case of 50 users x 107 items x 40 fields, we go from ~476,350
walker ops (v1) down to ~70,000 (v2), a saving of roughly 406,000
operations. On top of that, the benchmarks show deserialization running about
2.31x faster at the median once the walker uses native date parsing.
Together these changes delivered what we were looking for from this project: faster serialization and lower maximum latency under high concurrency, all in the service of relieving event-loop pressure.
