The Incident

Recently, we have been seeing pods in our Kubernetes cluster get killed even though in the logs we could see them processing requests without issues and our observability did not point to a resource issue.

It seemed that the pods were failing their liveness checks and were being killed by Kubernetes after they were determined to be unhealthy.

We cannot have pods being killed randomly; our charity partners depend on bids being accepted, checkouts processing, donations being made, checking in donors to the event being smooth among many other core elements of running a successful fundraising event. Thankfully, the blast radius has been small, and the system has stayed operational without widespread issues, but it did speak to a wider problem that needed to be addressed.

So I started investigating. We have built out an Investigation Agent internally here at Trellis using Skills, Subagents, and MCPs that allows an agent to spin off independent subagents to do investigations, all readonly, into: AWS Services, Kubernetes cluster, Databases, Sentry, Linear, BetterStack logging, PostHog, and a general research subagent to search the internet for answers (we have since added a subagent for investigating our Prometheus metrics like event loop lag and other metrics, but we did not have this available to us during these incidents).

Using this investigation agent, we settled on the issue being that the event loop would get blocked long enough that liveness checks could not be responded to in the time required. The investigation surfaced a few initial areas of improvement:

  1. Moving our logging off of the main thread. We push a lot of logs with context into BetterStack, but right now Winston is running on the main thread, which can hold up the event loop during critical processing. The investigation suggested that we move from Winston to Pino and additionally run Pino in a worker thread.
  2. Found a bug in the response parsing of one of our internal data access libraries meant for communicating with an internal service. Somehow we were JSON.parseing the response data, then JSON.stringifying it again, and then JSON.parseing the result a second time... Because JSON.parse is a synchronous operation it blocks the event loop.
  3. All the validation happening on request/response parsing for our REST and GraphQL APIs is using class-validator. class-validator is an older library and there are more performant options out there now, like ArkType and Zod. This would require rewriting all of our validation code to use ArkType or Zod and have it integrated into Nest like class-validator is. I did go down this path. However, that is not what this post is about, so I will save that for another time, and since I did the prototype of using ArkType for our GraphQL validations new information has come to light about Nest v12 supporting Standard Schema.
  4. Unbounded GraphQL batching. We don't currently have a limit on the number of GraphQL operations that will be batched together by our clients (beyond whatever a default would be). I don't believe this has anything to do with the issues we are facing, though.
  5. Lastly, there was a note, almost as an afterthought without a ton of detail, in the investigation about there being some issues around the cache reads from our model cache. More on this later.
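
To make item 2 concrete, here is a minimal sketch of that kind of redundant round trip next to the single parse that replaces it. The function names and payload are hypothetical; the actual library code is not shown in this post.

```typescript
// Hypothetical reconstruction of the bug: three synchronous passes over one payload.
function parseResponseTriple(body: string): unknown {
    const first = JSON.parse(body);       // pass 1: string -> object
    const again = JSON.stringify(first);  // pass 2: object -> string (redundant)
    return JSON.parse(again);             // pass 3: string -> object (redundant)
}

// The fix: a single synchronous pass producing the same data.
function parseResponseOnce(body: string): unknown {
    return JSON.parse(body);
}

const sampleBody = JSON.stringify({ id: 'abc', amounts: [1, 2, 3] });
const tripleResult = parseResponseTriple(sampleBody);
const onceResult = parseResponseOnce(sampleBody);
```

Both functions return equivalent data; the first simply holds the event loop for the extra stringify and parse.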

So I set off on optimizing the system. I started with the quickest win, #2, cleaning up the double parsing, since we know it is unneeded synchronous work that blocks the event loop. It could not be the whole cause of the issue, though: there is just not enough traffic to that service.

This fix was launched to production essentially right away as it was a simple low-risk change that was verifiable by our integration and e2e tests.

I then set off to update our logging from Winston to Pino, #1. Thankfully, this was super easy and worked essentially out of the box. Just had to write a custom Rspack plugin (due to a known incompatibility between pino-webpack-plugin and Rspack) to compile Pino as a separate module that is loaded and run on a worker thread so that we only pay the data serialization cost on the main thread. Because this is a much larger change that also affects our ability to see what is happening in the system, I opted to let the changes marinate in our develop environment for a while, even though it is inherently tested when we run our integration and e2e tests as they push logs to BetterStack as well.

While we were waiting to deploy this, we continued seeing pods running our aurora server get killed by Kubernetes. Jordan on my team did an investigation and found these lined up with a storm of model cache queries all in a single microtask. The Model Cache is a Valkey-backed caching system we built internally for caching data that does not change often, to reduce the load on our databases and speed up queries. So the question becomes: why is the system whose purpose is to speed things up causing massive event loop lag?

After some more investigation, a very obvious issue presented itself. When JSON.parse runs on lots of individual strings, it can cause lag in the loop as the synchronous operation is blocking. You are probably thinking, isn't JSON.parse pretty optimized in Node? How many strings are being parsed into objects? Well, let’s say 50 users all load a silent auction page at the same time, this auction page has 107 auction items on it, and each auction item has 40 fields. That is 50 · 107 = 5,350 individual JSON.parse calls, each invoking the reviver callback across ~40 fields, roughly 214,000 reviver invocations in total, many of which would be happening in the same tick as we pipeline commands to Valkey automatically and load them in at the same time. That is a LOT of synchronous work, no wonder the event loop is being blocked.

This isn't the whole story though, JSON.parse is already very optimized in Node, but we weren't only parsing the JSON string. We were also reviving Dates and Prisma.Decimals. Now doing new Date() or new Prisma.Decimal() are not exactly heavy operations, but how do you know what fields need to be revived to their pre-serialized data types? We originally solved this by serializing those fields with string prefixes on those field values prior to JSON.stringify like date:<value.toISOString()> and decimal:<value.toString()> and then traversing the parsed object for these fields and reviving them. This solution made sense at the time and served us incredibly well throughout our busiest year and season ever in the history of Trellis. We were even able to validate how much benefit we got from the Model Cache system prior to our busy season with our load tests. That being said, we hit a different scale in 2026 compared to what we have experienced in our previous busy seasons.

We traded database load and query latency for event loop blockage. But maybe, just maybe, we could have our cake and eat it too.

Why the existing system was slow

The Model Cache isn't quite as simple as serialize → write → read → deserialize. The number of operations grows linearly with the size of the payload being read, but Trellis is growing exponentially. This means the load has more and more of an effect on the event loop (and pod resources in general) as we serve more charities that are more successful with more donors interacting with their fundraisers.

Below is the pseudocode for a single object lifecycle in the Model Cache:

Write side - serializeValue(v)

# sentinel, skip JSON
if v === null      → return "__NULL__"

# sentinel, skip JSON
if v === undefined → return "__UNDEFINED__"

# replacer walks tree
return JSON.stringify(v, replacer)

replacer(this, key, value):
  # bypass JSON's auto-coercion
  original = this[key]
  if original is Date            → "date:"    + original.toISOString()
  if original is Prisma.Decimal  → "decimal:" + original.toString()
  else                           → value

→ Valkey.SET(cacheKey, string)

Read side - deserializeValue(s)

Valkey.GET(cacheKey) → s
if s === "__NULL__"      → return null
if s === "__UNDEFINED__" → return undefined

# reviver walks tree bottom-up
return JSON.parse(s, reviver)

reviver(_key, value):
  if typeof value === 'string':
    if value.startsWith('decimal:') → new Prisma.Decimal(value.slice(8))
    if value.startsWith('date:')    → parseISO(value.slice(5))
  return value
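
For readers who want to run the lifecycle above, here is a self-contained TypeScript sketch of the same prefix-marker scheme. It is an approximation under stated assumptions, not the production code: `StandInDecimal` is a hypothetical replacement for Prisma.Decimal so the example has no dependencies, `new Date` stands in for date-fns's parseISO, and the Valkey SET/GET calls are left out.

```typescript
// Hypothetical stand-in for Prisma.Decimal so the sketch needs no dependencies.
class StandInDecimal {
    constructor(private readonly digits: string) {}
    toString(): string {
        return this.digits;
    }
}

// JSON.stringify calls Date.prototype.toJSON before the replacer runs, so `value`
// is already a string for Dates. Reading this[key] recovers the original object.
function replacer(this: Record<string, unknown>, key: string, value: unknown): unknown {
    const original = this[key];
    if (original instanceof Date) return `date:${original.toISOString()}`;
    if (original instanceof StandInDecimal) return `decimal:${original.toString()}`;
    return value;
}

// Called once for every node in the parsed tree, bottom-up.
function reviver(_key: string, value: unknown): unknown {
    if (typeof value === 'string') {
        if (value.startsWith('decimal:')) return new StandInDecimal(value.slice(8));
        if (value.startsWith('date:')) return new Date(value.slice(5)); // assumption: new Date instead of parseISO
    }
    return value;
}

function serializeValue(v: unknown): string {
    if (v === null) return '__NULL__';           // sentinel, skip JSON
    if (v === undefined) return '__UNDEFINED__'; // sentinel, skip JSON
    return JSON.stringify(v, replacer);
}

function deserializeValue(s: string): unknown {
    if (s === '__NULL__') return null;
    if (s === '__UNDEFINED__') return undefined;
    return JSON.parse(s, reviver);
}

// Round trip for a made-up auction item.
const cachedString = serializeValue({
    id: 'item-1',
    endsAt: new Date('2026-01-02T03:04:05.000Z'),
    currentBid: new StandInDecimal('12.50'),
});
const restored = deserializeValue(cachedString) as {
    id: string;
    endsAt: Date;
    currentBid: StandInDecimal;
};
```

Note that the reviver runs for every node in the payload, including the many strings and numbers that need no revival. That is the cost the rest of this section quantifies.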

Read-side sync-op count

Let:

  • P = total nodes the reviver visits (every nested property + array element + each object/array itself)
  • S = subset of P whose value is a string
  • R = subset of S that hits a prefix (gets rehydrated)
  • K = number of startsWith cases in the switch (today there are 2: decimal:, date:)
  • L = max nesting depth

Formula

ops_read =  2                  # __NULL__ / __UNDEFINED__ sentinel compares
         +  1                  # JSON.parse invocation (tokenization is O(|s|), counted once)
         +  P                  # one `typeof === 'string'` per visited node
         +  Σ over S of k_i    # startsWith checks per string; k_i ∈ [1, K]
         +  2·R                # per rehydrate: 1 slice/replace + 1 constructor

Bounds

Worst case (every string misses every prefix):           3 + P + S·K
Best case  (every string is a decimal, hits on check 1): 3 + P + S + 2·R

Read cost grows linearly in P (total properties, depth-agnostic) with an additive S·K term that scales with the number of string leaves times the size of the prefix switch. Depth itself is free, but every new prefixed type bumps K by 1 → +S ops in the worst case (one extra startsWith per string leaf in the payload).

Let’s take that previous example of 50 concurrent requests reading 107 auction items with 40 fields each.

Per-request node count (P)

P = 1 (root array)
  + 107 (item objects)
  + 107 x 40 (item fields)
  = 1 + 107 + 4,280
  = 4,388 reviver invocations per page load

Per-request op count

Assume of the 40 fields per item: ~20 are strings, ~4 need rehydration (a couple of dates, a couple of decimals, this is typical for an auction item: startsAt, endsAt, currentBid, minIncrement).

So:

S = 107 x 20 = 2,140 string leaves
R = 107 x 4  =   428 rehydrations
K = 2        (date:, decimal:)

ops_per_request =  3                # sentinel + JSON.parse
                +  4,388            # P     (typeof checks)
                +  2,140 x 2        # S·K   (startsWith fan-out, worst-ish case)
                +  2   x   428      # 2·R   (slice + constructor)
                =  3 + 4,388 + 4,280 + 856 = 9,527 sync ops per page load

For 50 concurrent users:

ops_total = 50 x 9,527 = 476,350 sync ops
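
The same arithmetic can be written as a small function, which makes it easy to plug in other page shapes. This is only a sketch of the counting model above: `opsRead` and `PayloadShape` are names made up for illustration, and the S·K term uses the same worst-ish assumption as the worked example.

```typescript
// Counting-model inputs, named after the symbols defined above.
interface PayloadShape {
    P: number; // nodes visited by the reviver
    S: number; // string-valued nodes
    R: number; // strings that hit a prefix and get rehydrated
    K: number; // number of startsWith cases
}

// ops_read = 2 sentinel compares + 1 JSON.parse + P + S*K + 2*R
function opsRead({ P, S, R, K }: PayloadShape): number {
    return 2 + 1 + P + S * K + 2 * R;
}

const items = 107;
const fieldsPerItem = 40;
const stringsPerItem = 20;
const revivedPerItem = 4;

const shape: PayloadShape = {
    P: 1 + items + items * fieldsPerItem, // 4,388
    S: items * stringsPerItem,            // 2,140
    R: items * revivedPerItem,            // 428
    K: 2,
};

const opsPerRequest = opsRead(shape);      // 9,527
const opsFor50Users = 50 * opsPerRequest;  // 476,350
```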

V8's Fast and Slow Parse Paths

Since Node is single threaded, this all stacks up and blocks the event loop.

But the operation count doesn't tell the whole story. Each of those 9,527 ops is running through V8's slow JSON.parse path, not the fast one. V8 actually has two parsers under the hood. The fast one builds objects in place with stable hidden classes (good for downstream code, since predictable shapes mean V8 can optimize the consumers of our cached data). The slow one walks the parsed tree after the fact, boxes every primitive into a JS value, and hands each one off to a user-supplied function.

Want to guess which one you opt into the moment you pass a reviver? The slow one.

And it doesn't matter that our reviver returns values unchanged on the vast majority of nodes; V8 has no way to know that ahead of time, so the entire parse runs along the slow path. It gets worse. There is a known issue where even a completely no-op reviver, literally (k, v) => v, makes the resulting object graph roughly 4x heavier in memory. (This seems to be a V8 issue rather than a Node issue, but as far as I can tell it has not been reported to the Chromium team.) The reason is that the TC39 "JSON.parse source text access" proposal, which V8 implements, added a third argument to revivers that lets you access the original source text for each value, and V8 has to retain that metadata on every primitive in case your reviver decides to use it.

Our reviver doesn't use it, but that doesn't matter. We pay for it anyway. So we aren't just doing more synchronous work per parse; we're doing slower synchronous work, on a heavier object graph, that then puts more pressure on the garbage collector, which then blocks the event loop even more. Blocking the very thing we want to unblock.
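
A quick way to check the reviver cost on your own machine is a rough timing loop like the one below. It is a sketch, not the benchmark used in this post: the payload is synthetic (107 items x 40 made-up fields), `Date.now()` is a coarse clock, and the numbers will vary by Node version and hardware, so no expected output is given.

```typescript
// Build a synthetic payload shaped like the example page: items x fields.
function buildPayload(items: number, fields: number): string {
    const rows: Record<string, string | number>[] = [];
    for (let i = 0; i < items; i++) {
        const row: Record<string, string | number> = {};
        for (let j = 0; j < fields; j++) {
            // Alternate string and number values; the contents are arbitrary.
            row[`field${j}`] = j % 2 === 0 ? `value-${i}-${j}` : i * j;
        }
        rows.push(row);
    }
    return JSON.stringify(rows);
}

// Time `runs` synchronous calls. Coarse, but enough for a rough comparison.
function timeMs(runs: number, fn: () => unknown): number {
    const start = Date.now();
    for (let i = 0; i < runs; i++) fn();
    return Date.now() - start;
}

const payload = buildPayload(107, 40);

// A reviver that changes nothing still forces the callback on every node.
const noopReviver = (_key: string, value: unknown): unknown => value;

const plainMs = timeMs(200, () => JSON.parse(payload));
const reviverMs = timeMs(200, () => JSON.parse(payload, noopReviver));

console.log(`200 parses without reviver: ${plainMs} ms, with no-op reviver: ${reviverMs} ms`);
```

Both calls produce the same data, so any difference in the two timings is attributable to passing a reviver at all.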

Solutions Considered

After some discussion internally (both with humans and AI), we came up with a list of ideas to speed this up. We focused on the deserialization (read) side first, as that is where the performance hits us most when there is a storm of users loading fundraiser pages:

  1. Schema-driven walker: Declare the Date / Prisma.Decimal leaves upfront, walk only those paths at deserialization time. Skip the per-key reviver entirely. (Inspired by fast-json-stringify's encode-side approach, commit to a schema upfront, take shortcuts in the hot path by only walking/accessing what is necessary). Since we control the exact shapes and serialization, we deterministically know what dot paths need to be rehydrated.
  2. Native JSON parser: Replace V8's JSON.parse with simdjson, C++ bindings around the SIMD-accelerated parser. Since we cannot alter the behaviour of JSON.parse itself, maybe there is a faster JSON parser out there?
  3. Batch coalescing: Concatenate N Valkey MGET values into a single JSON array and parse it once. One parser invocation instead of N.
  4. Per-schema codegen walker: Emit a walker function per cached model with new Function. Each walker would be monomorphic, so V8 can inline-cache aggressively per model shape.
  5. Native Date constructor: Swap date-fns's parseISO for new Date. The bare-format encoder writes toISOString() output, which is the canonical ECMAScript layout the C++ Date(string) parser fast-paths.
  6. LRUCache: Add an LRUCache in front of the MGET with a small TTL that just helps smooth out those bursts of queries, it would cache the deserialized object so we only pay the O(1) lookup cost.
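
As an illustration of option 6, here is a minimal in-memory LRU with a TTL, built on Map insertion order. It is a hypothetical sketch rather than the LRUCache evaluated in the benchmarks below; `TtlLruCache`, its constructor parameters, and the injectable clock are all assumptions made for the example.

```typescript
// Minimal LRU + TTL cache. Map preserves insertion order, so the first key
// is always the least recently used one.
class TtlLruCache<V> {
    private readonly entries = new Map<string, { value: V; expiresAt: number }>();

    constructor(
        private readonly maxEntries: number,
        private readonly ttlMs: number,
        // Injectable clock (assumption) so expiry can be exercised deterministically.
        private readonly now: () => number = () => Date.now()
    ) {}

    get(key: string): V | undefined {
        const entry = this.entries.get(key);
        if (entry === undefined) return undefined;
        if (entry.expiresAt <= this.now()) {
            this.entries.delete(key); // expired: drop and report a miss
            return undefined;
        }
        // Re-insert to mark the key as most recently used.
        this.entries.delete(key);
        this.entries.set(key, entry);
        return entry.value;
    }

    set(key: string, value: V): void {
        this.entries.delete(key);
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        if (this.entries.size > this.maxEntries) {
            const oldest = this.entries.keys().next().value;
            if (oldest !== undefined) this.entries.delete(oldest);
        }
    }
}

// Usage with a fake clock: capacity 2, TTL 100 ms.
let fakeNow = 0;
const burstCache = new TtlLruCache<{ id: string }>(2, 100, () => fakeNow);
burstCache.set('a', { id: 'a' });
burstCache.set('b', { id: 'b' });
burstCache.get('a');              // touch 'a' so 'b' becomes the oldest entry
burstCache.set('c', { id: 'c' }); // evicts 'b'
```

One design consequence worth keeping in mind: a hit returns the same deserialized object to every caller, so callers would need to treat cached values as read-only.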

We took these possible options and iterated on them by building benchmarks and attempting to reproduce close to production conditions to see how they affected Event Loop Lag, p9x, CPU, and Heap Size.

Iteration 1: Schema-Driven Walker + simdjson

Now that we had an initial plan of attack, I wrote a benchmark repo that takes the incident object shape and built out four scenarios: small, medium, large, and heavy.

These scenarios were run against a matrix of configurations:

  1. Baseline: The current system using JSON.parse and the reviver.
  2. simdjson with traversal: simdjson.parse followed by a full object traversal (walks every property).
  3. simdjson path-based traversal on each object: simdjson.parse with trie-driven walker visiting only declared Date/Prisma.Decimal paths.
  4. concatenated JSON array with JSON.parse: Batch-concatenate multiple cache values into a single JSON array string, parse once with JSON.parse.
  5. concatenated JSON with simdjson and traversal: Same as #4 but using simdjson.parse instead of JSON.parse, with full traversal.
  6. concatenated JSON with simdjson and path-based: Same as #4 but using simdjson.parse with schema-driven trie walker.
  7. LRU baseline standard: Add an in-memory LRUCache layer with standard deserialization.
  8. LRU with simdjson traversal: LRUCache layer using simdjson.parse + full traversal for cache misses.
  9. LRU with simdjson path-based: LRUCache layer using simdjson.parse + trie walker for cache misses.

We added a dimension to the matrix for the hit rate % of the LRUCache at 0%, 25%, 50%, 75%, and 100%.

This was all tested against a prototype of the system where we built a new PathReviver class that is type safe for each instance of a Model Cache: it requires that each dot path pointing to a Date / Prisma.Decimal leaf be declared in the ReviverSchema<T>.

export type ReviverSchema<T> = {
    readonly [P in DeepRevivablePath<T>]: LeafKindFor<ResolvePathLeaf<T, P>>;
};

A simple version of this for one of our models GuestUser with the schema:

model GuestUser {
    id        String   @id
    createdAt DateTime @default(now())
    updatedAt DateTime @updatedAt
    cursor    Int      @unique @default(autoincrement())
}

Would be:

export const GUEST_USER_REVIVER_SCHEMA: ReviverSchema<GuestUser> = {
    createdAt: 'date',
    updatedAt: 'date',
};

This ensured that if model shapes ever changed, our reviver schema was compile-time safe and developers were forced to declare a concrete object that the PathReviver could use to compile the tries for the most efficient object walking to ensure we minimize traversal.
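
To show the idea of walking only declared paths, here is a simplified sketch. It is not the PathReviver implementation: it uses a flat list of pre-split dot paths instead of a trie, it ignores arrays, `StandInDecimal` is a hypothetical replacement for Prisma.Decimal, and the `pricing.currentBid` path is made up.

```typescript
type LeafKind = 'date' | 'decimal';
type FlatSchema = Readonly<Record<string, LeafKind>>;
type CompiledPath = { segments: string[]; kind: LeafKind };

// Hypothetical stand-in for Prisma.Decimal so the sketch needs no dependencies.
class StandInDecimal {
    constructor(private readonly digits: string) {}
    toString(): string {
        return this.digits;
    }
}

// Split each dot path once, up front, so the hot path does no string work.
function compilePaths(schema: FlatSchema): CompiledPath[] {
    return Object.entries(schema).map(([path, kind]) => ({
        segments: path.split('.'),
        kind,
    }));
}

// Visit only the declared leaves; every other property is never touched.
function reviveByPath(target: unknown, paths: readonly CompiledPath[]): void {
    for (const { segments, kind } of paths) {
        let holder: unknown = target;
        for (let i = 0; i < segments.length - 1; i++) {
            if (typeof holder !== 'object' || holder === null) break;
            holder = (holder as Record<string, unknown>)[segments[i]];
        }
        if (typeof holder !== 'object' || holder === null) continue;
        const record = holder as Record<string, unknown>;
        const leafKey = segments[segments.length - 1];
        const raw = record[leafKey];
        if (typeof raw !== 'string') continue; // null or missing leaves stay as-is
        record[leafKey] = kind === 'date' ? new Date(raw) : new StandInDecimal(raw);
    }
}

const compiledPaths = compilePaths({
    createdAt: 'date',
    'pricing.currentBid': 'decimal',
});

// Plain JSON.parse with no reviver, followed by the targeted walk.
const parsedGuest = JSON.parse(
    '{"id":"g1","createdAt":"2026-01-02T03:04:05.000Z","pricing":{"currentBid":"12.50"}}'
);
reviveByPath(parsedGuest, compiledPaths);
```

The work done per read now scales with the number of declared paths rather than with the total number of nodes in the payload.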

Benchmark results:

Scenario      baseline  simdjson-each-traversal  simdjson-each-path-based  concat-native    concat-simdjson-traversal  concat-simdjson-path-based  lru-baseline-standard  lru-simdjson-traversal  lru-simdjson-path-based
small         131.8     21.9 (+83%)              20.3 (+85%)               37.1 (+72%)      23.4 (+82%)                23.6 (+82%)                 39.7 (+70%)            24.3 (+82%)             21.0 (+84%)
medium        111.1     243.3 (-119%) ⚠          34.6 (+69%)               229.2 (-106%) ⚠  36.5 (+67%)                46.8 (+58%)                 105.1 (+5%)            149.3 (-34%)            50.0 (+55%)
medium-hr025  111.1     243.3 (-119%) ⚠          34.6 (+69%)               229.2 (-106%) ⚠  36.5 (+67%)                46.8 (+58%)                 124.6 (-12%)           153.9 (-38%)            56.5 (+49%)
medium-hr050  111.1     243.3 (-119%) ⚠          34.6 (+69%)               229.2 (-106%) ⚠  36.5 (+67%)                46.8 (+58%)                 110.4 (+1%)            39.4 (+65%)             30.9 (+72%)
medium-hr075  111.1     243.3 (-119%) ⚠          34.6 (+69%)               229.2 (-106%) ⚠  36.5 (+67%)                46.8 (+58%)                 86.4 (+22%)            43.0 (+61%)             30.0 (+73%)
medium-hr100  111.1     243.3 (-119%) ⚠          34.6 (+69%)               229.2 (-106%) ⚠  36.5 (+67%)                46.8 (+58%)                 10.0 (+91%)            10.0 (+91%)             10.0 (+91%)
large         264.6     146.4 (+45%)             60.9 (+77%)               227.5 (+14%)     153.2 (+42%)               368.1 (-39%) ⚠              362.5 (-37%)           105.6 (+60%)            145.1 (+45%)
large-hr025   264.6     146.4 (+45%)             60.9 (+77%)               227.5 (+14%)     153.2 (+42%)               368.1 (-39%) ⚠              304.3 (-15%)           151.3 (+43%)            114.0 (+57%)
large-hr050   264.6     146.4 (+45%)             60.9 (+77%)               227.5 (+14%)     153.2 (+42%)               368.1 (-39%) ⚠              221.1 (+16%)           134.9 (+49%)            118.4 (+55%)
large-hr075   264.6     146.4 (+45%)             60.9 (+77%)               227.5 (+14%)     153.2 (+42%)               368.1 (-39%) ⚠              134.9 (+49%)           62.8 (+76%)             54.3 (+79%)
large-hr100   264.6     146.4 (+45%)             60.9 (+77%)               227.5 (+14%)     153.2 (+42%)               368.1 (-39%) ⚠              10.1 (+96%)            10.0 (+96%)             10.1 (+96%)
heavy         45.8      131.3 (-187%) ⚠          43.2 (+6%)                79.6 (-74%)      30.0 (+34%)                28.7 (+37%)                 154.8 (-238%) ⚠        35.7 (+22%)             28.3 (+38%)

Note: ⚠ flags loop.p99 max cells where the value is implausibly worse than companion runs and probably driven by a single GC-pause/OS scheduling outlier in the timed phase. The wall p99 and parse p99 companion matrices are 2-4x lower on those same combos. Cross-strategy ranking is stable despite the outliers. The path-based reviver is consistently fast.

At this point it seemed like the PathReviver + simdjson approach was a big enough improvement to warrant us moving forward with it. So we extracted the implementation from the benchmark repo and implemented it into the Trellis monorepo and swapped all Model Cache instances to use it.

Bench table: incident-shape (107 items x 50 concurrent resolves)

Metric        Before (per-key reviver)  After (PathReviver + simdjson)  Delta
Loop p99 max  264.6 ms                  60.9 ms                         +77%

This looked like the story was over, but we had not yet validated simdjson against the full end-to-end pipeline. Iteration 2 would reveal that the parser choice was masking a regression, and we would have to revisit it.

Iteration 2: The Stringifier and the simdjson Reversal

Now that we (thought we) had the deserializer working well, we thought, "why not do the same thing for the serializer?" We are constantly writing to the cache, just not at as high a volume as we read from it, since many MGETs can be served by a single MSET.

At this point we are still using the same JSON.stringify process with a full object traversal looking for originalValue instanceof Date and Prisma.Decimal.isDecimal(originalValue).

if (originalValue instanceof Date) {
    return serializeDate(originalValue);
}

if (Prisma.Decimal.isDecimal(originalValue)) {
    return serializeDecimal(originalValue);
}

This is still megamorphic, and requires traversing every property of the object to find the leaves matching the data type needing special serialization.

Note: Technically at this point we don't actually need special handling for Date objects since they will automatically use toISOString() when being serialized but this was a relic of us needing the date:/decimal: serialization prefixes before we moved to path-based deserialization.

Why bother traversing the object to serialize when we already know exactly what properties will be deserialized into what types? There isn't a point. At this point, since we do not need to add the date:/decimal: prefixes anymore, we can just JSON.stringify the object and write it to Valkey.

But if we can just stringify now, why not use a faster stringifier?

fast-json-stringify is one of these options, from their documentation:

fast-json-stringify is significantly faster than JSON.stringify() for small payloads. Its performance advantage shrinks as your payload grows.

This is great for us as we are never stringifying huge objects, rather many small objects.

Benchmark     JSON.stringify (ops/sec)  fast-json-stringify (ops/sec)
short string  12,114,052                29,408,175
obj           4,577,494                 7,291,157
date          803,522                   1,117,776

Note: Benchmarks pulled from the fast-json-stringify repo.

The catch is that we cannot just swap fast-json-stringify in for JSON.stringify: it requires a concrete JSON schema so it knows how to build the string. Since it uses string concatenation, we need to define the schema we are stringifying upfront.

Luckily, this is an area that is being actively developed, especially with the Standard Schema specification taking shape. Since we were already considering ArkType for the validation side, I decided to go with ArkType: its API is very nice to work with, it can export a type as a JSON schema with toJsonSchema(), and it supports draft-07, which is what fast-json-stringify expects.

To encapsulate this logic we built a Stringifier<T> class that takes a T (the model being stringified) as a generic and accepts an ArkType type() as a constructor argument. To ensure maximum type safety, the Stringifier class validates that the type() and T are compatible.

Here is a sample of the Stringifier class and an example (reduced for brevity):

import { type, type Type } from 'arktype';
import fastJsonStringify from 'fast-json-stringify';

type Equal<X, Y> =
    (<T>() => T extends X ? 1 : 2) extends <T>() => T extends Y ? 1 : 2
        ? true
        : false;

export class Stringifier<T, A extends { infer: unknown } = Type<T>> {
    private readonly compiled: (value: T) => string;

    constructor(
        arkType: A & (Equal<A['infer'], T> extends true ? unknown : never)
    ) {
        const jsonSchema = arkType.toJsonSchema({
            target: 'draft-07',
            fallback: {
                // Required for unmapped types, like `Prisma.Decimal`
                default: (ctx) => ({ ...ctx.base, type: 'string' }),
            },
        });

        this.compiled = fastJsonStringify(jsonSchema) as (value: T) => string;
    }

    stringify(value: T): string {
        return this.compiled(value);
    }
}

export const GUEST_USER_STRINGIFIER = new Stringifier<GuestUser>(
    type({
        id: 'string',
        createdAt: 'Date',
        updatedAt: 'Date',
        cursor: 'number',
    })
);

This ensures at compile time that the types are compatible and that we are stringifying efficiently with a concrete schema that matches the model being cached. If the model ever drifts (like if we alter/add/remove a field in the database), we are required to handle it here.

The Regression:

At this point we ran an end-to-end benchmark to compare the performance of the serialization and deserialization cycle, at a high concurrency to compare the before and after of these systems. What we found surprised us and did not align with the original benchmarks.

This project's KPI is event-loop block tails and OOM, not median throughput. Here's how each axis lands today (without the simdjson → JSON.parse swap).

Axis                                    PathReviver (current)  Legacy walker  Verdict
Median throughput (deserialize)         15 521 ops/s           27 021 ops/s   Worse (0.57x)
p99 latency (deserialize)               223.9 µs               116.7 µs       Worse (1.92x)
Max latency (deserialize)               17.4 ms                25.7 ms        Better (0.68x)
Peak RSS (deserialize stage)            206 MB                 426 MB         Better (0.48x)
CPU user-ms per 200k ops (deserialize)  12.2 s                 7.3 s          Worse (1.67x)
Round-trip max latency                  7.2 ms                 17.5 ms        Better (0.41x)
Serialize max latency                   9.7 ms                 112.7 ms       Massively better (0.09x)
Serialize CPU user-ms                   3.2 s                  9.7 s          Better (0.33x)

Based on 200,000 ops x 128 concurrency headline benchmark.

The median throughput, p99 latency (deserialize), and CPU user-ms per 200k ops (deserialize) all regressed... badly.

So what gives? After some investigation, it turns out, simdjson.parse is ~4.4x slower than V8's JSON.parse on our cached payload sizes. C-binding-per-call overhead is ~20 µs, and SIMD parsing doesn't recover that cost on documents this small. simdjson's marketing numbers come from multi-MB documents. This was likely an error with how our original benchmarks were set up. I probably took them at face value without digging into them too much and just accepted what I thought was a win (benchmarking is hard).

Stage                      ops/sec  ns/op   p99 (µs)
simdjson.parse             39 284   25 456  35.3
JSON.parse (bare input)    171 087  5 845   7.2
JSON.parse (prefix input)  165 284  6 050   7.5
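
The ns/op column and the headline ratios follow directly from the ops/sec figures. The short calculation below reproduces them; reading the per-op gap as binding overhead is the interpretation given in the text above, and the subtraction here is just arithmetic on the table.

```typescript
// Convert throughput to per-operation cost in nanoseconds.
const nsPerOp = (opsPerSec: number): number => 1e9 / opsPerSec;

const simdjsonNs = nsPerOp(39284);   // ≈ 25,456 ns/op
const nativeNs = nsPerOp(171087);    // ≈ 5,845 ns/op

// How many times slower simdjson.parse is per call on this fixture (≈ 4.4x).
const slowdown = simdjsonNs / nativeNs;

// Per-call gap in microseconds (≈ 19.6 µs), in line with the ~20 µs quoted above.
const gapMicros = (simdjsonNs - nativeNs) / 1000;
```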

Some metrics were still better, and anything serialization-related was still better due to our more performant Stringifier implementation. But for the deserialization side, it is much more performant to just rely on the JSON.parse built into V8.

Before switching, though, we had a few hypotheses to test:

  1. Hypothesis 1: A payload size threshold makes simdjson win.
  2. Hypothesis 2: Coalescing N reads into a single array parse is a win.
  3. Hypothesis 3: Realistic shape variety closes the gap.

Hypothesis 1 - A payload size threshold makes simdjson win:

The ~3 KB Booking fixture used in the last benchmark showed simdjson is ~4.4x slower than JSON.parse. But simdjson is built for parsing gigabytes per second on large documents. The C-binding-per-call overhead amortizes differently on a 100 KB payload than on a 500 B one.

We sampled the sizes of the payloads we are caching in production to build a distribution of payload sizes to run this benchmark against. If simdjson crosses over to "faster" somewhere in this range, we should pick the parser at construction time based on a size estimate, not unconditionally swap. The bench will run against all 10+ shapes and report the crossover point if one exists.

Ultimately, the C-binding-per-call overhead dominates at every size and we saw a 3-5x improvement using JSON.parse over simdjson.

One hypothesis down.

Hypothesis 2 - Coalescing N reads into a single array parse is a win:

The theory here is that since we are pipelining many MGET operations together then we might be able to do something like:

const stringResults = await pipelinesMGets();

return JSON.parse(`[${stringResults.join(',')}]`);

Here we are essentially concatenating the results of many MGET operations into a single stringified array and doing a single parse operation on the result.
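
A self-contained version of the two approaches, for comparison. This is a sketch under one important assumption: every cached value is itself valid JSON. Sentinel strings such as "__NULL__" are not valid JSON and would have to be handled before concatenating. The function names are hypothetical.

```typescript
// Per-record approach: one JSON.parse call per cached string.
function parseEach(values: readonly string[]): unknown[] {
    return values.map((value) => JSON.parse(value));
}

// Coalesced approach: build one JSON array string and parse it once.
// Assumes every element is valid JSON on its own.
function parseCoalesced(values: readonly string[]): unknown[] {
    return JSON.parse(`[${values.join(',')}]`) as unknown[];
}

const cachedValues = [
    '{"id":"bid-1","amount":"25.00"}',
    '{"id":"bid-2","amount":"30.00"}',
    '{"id":"bid-3","amount":"42.50"}',
];

const perRecord = parseEach(cachedValues);
const coalesced = parseCoalesced(cachedValues);
```

Both return the same array of objects. The difference is that the coalesced version pays for building one larger string and then blocks for the whole batch in a single call.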

This result is a bit more intricate; it is not a simple better/worse. Coalescing wins on small fixtures, up to 2.1x, but is neutral to worse on medium to large fixtures: once we get above ~1 KB fixture size, the cost of building the array string wipes out the performance gain from a single JSON.parse. Additionally, this hides a tail-latency issue: when work is batched together, the maximum event-loop block per call grows.

We opted to keep the per record JSON.parse approach for now, and leave dynamically coalescing based on fixture size as a future optimization to keep things simple right now.

Second hypothesis (mostly) down.

Hypothesis 3 - Realistic shape variety closes the gap:

Using realistic shapes changes the picture but doesn't fix it. The aggregate "PathReviver is 0.6x" from the Booking fixture (a ~3 KB production payload used in an interim benchmark to cross-check the size sweep) was specifically a simdjson artifact, not a "PathReviver is slow on small payloads" story. The size sweep above shows the trie walker is a consistent ~5-20% win across the entire production distribution as long as the parser choice is correct.

Two interesting per-shape observations the headline bench missed:

  • GuestUser (~150 B) is the only fixture where PathReviver+JSON.parse is slightly slower than the legacy walker (0.94x). At this size the trie-compile-cache hit plus the path lookup overhead barely matters. The miss is 6% on a fixture so cheap that the absolute difference is ~340 ns/op.

  • Fundraiser (~4 KB, 70 fields) is the biggest PathReviver win (1.22x). Wide-flat objects with many declared revival paths exactly match what the trie was built for.

The third and final hypothesis, down.

Final Results:

Benchmark: 20,000 ops per cell x concurrency 64

Speed-up per (fixture x scenario) vs the legacy prefix-marker baseline:

Fixture          Bytes    Serialize  Deserialize  Round-trip
GuestUser        ~150 B   1.64x      0.93x        1.62x
AuctionBid       ~400 B   2.30x      1.03x        1.48x
PaddleTipConfig  ~500 B   2.78x      0.99x        1.46x
User             ~800 B   1.67x      1.06x        1.30x
Item             ~1.5 KB  2.01x      1.04x        1.28x
AuctionItem      ~3 KB    1.66x      1.19x        1.22x
GivingLevels     ~3 KB    2.43x      1.04x        1.95x
Fundraiser       ~15 KB   2.06x      1.23x        1.21x
FeatureFlags     ~100 KB  2.62x      1.15x        1.73x

Serialize: 1.6-2.8x faster on every fixture. Stringifier is the largest single win this project has shipped.

Deserialize: matched-or-faster on 8 of 9 fixtures. The one regression and one tie are at the very smallest end of the distribution (GuestUser 0.93x, PaddleTipConfig 0.99x) where absolute differences are sub-microsecond and the trie compile-cache hit per call is the floor.

Note: These benchmarks were run on an M3 Pro with 36GB of RAM locally. We expect the improvements to be more pronounced when running in a more resource constrained environment and at a higher concurrency with more event loop contention.

If we look at the worst max-latency observed under load between the previous and current algorithms:

  • FeatureFlags round-trip max: 45.9 ms → 6.4 ms (7.2x)
  • FeatureFlags deserialize max: 11.6 ms → 3.1 ms (3.7x)
  • GivingLevels round-trip max: 3.9 ms → 1.6 ms (2.4x)
  • SponsorBundle deserialize max: 2.1 ms → 0.9 ms (2.3x)
  • AuctionItem deserialize max: 0.7 ms → 0.2 ms (3.5x)

We can see that the speed increase we get under load when there is high contention is significant, up to 7.2x in the round-trip time for larger fixtures.

Iteration 3: Codegen Walker and Native Date Parsing

At this point we were almost done, but there were two other hypotheses we wanted to test, as one last-ditch effort to improve the performance even more.

  1. By using a function codegen per model, we can allow V8 to optimize the calls, and we would end up with a monomorphic function instead of our current megamorphic function.
  2. Swap date-fns parseISO for new Date(value) in the PathReviver implementation.

I had high hopes for #2 but not for #1. parseISO is ~250 lines of pure JS date parsing logic, while new Date(s) resolves directly to a native C++ fast path in V8, so the per-leaf savings seemed obvious. Codegen (#1), by contrast, felt like a micro-optimization that might not survive the noise floor given that IC warmup takes many iterations and our schema set is small. We benchmarked both anyway.

Hypothesis 1 - Monomorphic function:

The current PathReviver.walk visits ReviverTrieNode objects generically. At each call site V8 sees a different hidden class, because different schemas produce nodes with different shapes.

This is megamorphic: V8 cannot build a useful inline cache and must fall back to a generic property lookup every time.

// This is megamorphic: trie walker sees many different node shapes
function walk(node: ReviverTrieNode, target: unknown): void {
    // IC misses on every schema
    const children = node.children;
    for (const [segment, child] of children) {
        // different Map instances per schema
        if (Array.isArray(target)) {
            // megamorphic dispatch
            descendArray(child, target, segment);
        } else {
            // megamorphic dispatch
            descendObject(child, target, segment);
        }
    }
}

function rehydrateObjectLeaf(
    node: ReviverTrieNode,
    target: Record<string, unknown>,
    key: string
): void {
    if (node.kind === 'date') {
        // 250-LOC date-fns fn, V8 won't inline
        target[key] = parseISO(target[key] as string);
    } else if (node.kind === 'decimal') {
        target[key] = new Decimal(target[key] as string);
    }
}

Every Map.get, for-of, string-union branch, and parseISO call runs through V8's slow path on every cache read. The theory is that we can optimize this by precomputing these hydration functions per model and caching them, so they are static and can be optimized by the inline cache.

Instead of a WeakMap<schema, ReviverTrieNode>, keep a WeakMap<schema, (target: object) => void>, a compiled walker built once on first sight, then reused on every deserialize call for that schema.

Since the function is specialized to one schema (it literally will not work for any other model since it's tailored to that specific shape), every property access is against one known hidden class. V8 builds a monomorphic inline cache on the first hot iteration and never misses again. This requires using new Date(s) instead of parseISO(s) at the leaf so the call site resolves to a native C++ fast path and there is no JS frame to enter. The implementation of this monomorphic function compiler is sort of long and boring, so I will leave it out, but here is a sample of what a compiled function looks like.

Flat schema for ImmutableAuctionBid that has three Date leaves:

const ImmutableAuctionBidSchema: ReviverSchema<ImmutableAuctionBid> = {
    createdAt: 'date',
    updatedAt: 'date',
    voidedAt: 'date',
};

Emitted function body (new Function body string):

function compiledWalker(target) {
    const createdAt_ = target.createdAt;
    if (typeof createdAt_ === 'string') target.createdAt = new Date(createdAt_);
    const updatedAt_ = target.updatedAt;
    if (typeof updatedAt_ === 'string') target.updatedAt = new Date(updatedAt_);
    const voidedAt_ = target.voidedAt;
    if (typeof voidedAt_ === 'string') target.voidedAt = new Date(voidedAt_);
}

Note: The case with arrays gets a little messier and will make this already long blog post longer, so I am leaving it out for brevity. Feel free to reach out if you are interested, and I can write another post going into the details.

Every property access (target.createdAt, target.updatedAt, target.voidedAt) is against the shape of the same hidden class ImmutableAuctionBid. V8 builds a monomorphic IC on the first call, and the loop runs at near-native speed.
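For readers who want a feel for the compile step, here is a minimal sketch of how such a walker could be generated for a flat, Date-only schema. It is not our actual compiler: FlatSchema, compileWalker, and walkerCache are hypothetical names, and arrays, nested paths, and Decimal leaves are deliberately left out.

```typescript
// Hypothetical sketch: flat schemas with Date leaves only.
type FlatSchema = Record<string, 'date'>;
type CompiledWalker = (target: Record<string, unknown>) => void;

// One compiled function per schema object, built on first sight and then reused.
const walkerCache = new WeakMap<FlatSchema, CompiledWalker>();

function compileWalker(schema: FlatSchema): CompiledWalker {
    const cached = walkerCache.get(schema);
    if (cached) return cached;

    const lines: string[] = [];
    Object.keys(schema).forEach((key, i) => {
        // JSON.stringify quotes the key so any property name is safe in the generated source.
        const prop = JSON.stringify(key);
        lines.push(`const v${i} = target[${prop}];`);
        lines.push(`if (typeof v${i} === 'string') target[${prop}] = new Date(v${i});`);
    });

    // Straight-line body: no Map lookups, no loop, no kind dispatch at run time.
    const walker = new Function('target', lines.join('\n')) as CompiledWalker;
    walkerCache.set(schema, walker);
    return walker;
}

const bidSchema: FlatSchema = { createdAt: 'date', updatedAt: 'date', voidedAt: 'date' };
const bid: Record<string, unknown> = JSON.parse(
    '{"id":"b1","createdAt":"2026-01-01T00:00:00.000Z","updatedAt":"2026-01-02T00:00:00.000Z","voidedAt":null}'
);
compileWalker(bidSchema)(bid); // createdAt and updatedAt become Date objects; voidedAt stays null
```

Keep in mind that new Function will not work in environments that disallow code generation from strings, which is one more reason this approach deserves careful testing before it ships.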

Hypothesis 2 - Use new Date(value) instead of parseISO(value):

Using new Date(value) instead of parseISO(value) seemed like a no-brainer. It is a native V8 call, will be highly optimized, and results in a single call rather than a call into parseISO that does additional parsing and logic. Since we control the serialization of dates into the cache, there is no risk of the data being corrupted (beyond what already existed).
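A small sketch of why this is safe when the writer is under our control: a string produced by toISOString() round-trips through new Date() to the same instant. The variable names are illustrative only.

```typescript
// A Date serialized the way JSON.stringify would serialize it (via toISOString).
const original = new Date(Date.UTC(2026, 0, 1, 12, 30, 0, 0));
const wire = original.toISOString();

// Native parse of the ISO 8601 string back into a Date.
const revived = new Date(wire);
const roundTripOk = revived.getTime() === original.getTime();

// Caveat: new Date() never throws on bad input, it produces an Invalid Date,
// so anything not written by our own serializer would need an explicit validity check.
const invalid = new Date('not-a-date');
const invalidIsNaN = Number.isNaN(invalid.getTime());
```

The round trip only holds because every date in the cache was written in the toISOString() format; hand-written or locale-formatted date strings are not covered by this guarantee.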

With these two hypotheses in mind, these are the cases we went out to test:

Walker    Date parser   Implementation key
trie      parseISO      trie+parseISO (current)
trie      new Date      trie+nativeDate
codegen   parseISO      codegen+parseISO
codegen   new Date      codegen+nativeDate (target state)
-         -             JSON.parse only (ceiling)

With these implementation scenarios in hand we ran the options through a gauntlet like we did before:

Each column is the ratio of the first implementation to the second:

  • date-swap: trie+Date / trie+parseISO
  • codegen: cg+parseISO / trie+parseISO
  • compound: cg+Date / cg+parseISO
  • vs ceiling: cg+Date / parse

fixture             date-swap   codegen   compound   vs ceiling
GuestUser           3.99x       1.44x     2.98x      0.52x
AuctionBid          2.29x       0.99x     2.43x      0.59x
PaddleTipConfig     2.33x       1.11x     2.35x      0.37x
User                2.35x       1.05x     2.36x      0.60x
Item                1.83x       1.01x     1.83x      0.75x
AuctionItem         1.78x       1.02x     1.86x      0.76x
GivingLevels        2.59x       1.09x     2.76x      0.43x
SponsorBundle       2.63x       1.06x     2.60x      0.63x
Fundraiser          1.31x       0.89x     0.84x      0.39x
FeatureFlagsLarge   2.09x       1.04x     2.07x      0.82x
median              2.31x       1.04x     2.35x      0.59x

The biggest win was clearly swapping parseISO for new Date, regardless of whether the walker was the trie or the codegen version. But what about the codegen+Date scenario compared to the trie+Date scenario?

This yielded positive results, but they were only slightly better than the trie+Date scenario, landing at a 5–13% improvement over it. Nothing to scoff at, but it also introduces a much larger change to the code, as we would need to pre-compile dynamic functions. This is being left for a subsequent iteration for the time being, as I would want to do extensive testing and edge case analysis on it to ensure we are not breaking the caching pipeline.

Where We Ended Up

Overall, we killed the per-key reviver that was being used with JSON.parse and moved to a pre-compiled trie walker on the deserialization side and built a custom Stringifier for the serialization side to speed that up.

Additionally, we relied on native date parsing instead of a third-party library since we knew dates were going to be stringified correctly.

The v1 design, JSON.stringify(value, replacer) / JSON.parse(value, reviver), walked every node of the parsed tree and ran a per-key JS callback, with startsWith('date:') / startsWith('decimal:') probes on every string leaf and date: / decimal: prefix markers in the encoded bytes.

The v2 design replaces both halves:

Write: a per-model Stringifier owns a compiled fast-json-stringify function. The wire bytes are ISO 8601 strings/numeric strings, no date: or decimal: prefix markers.

if v === null              → return "__NULL__"
if v === undefined         → return "__UNDEFINED__"
if typeof v !== 'object'   → return JSON.stringify(v)        # scalar bypass
return stringifier.stringify(v)                              # compiled fast-json-stringify

stringifier.stringify(v):
  return this.compiled(v)                                    # straight-line emitted JS

→ Valkey.SET(cacheKey, string)        # bytes look like {"createdAt":"2026-01-01T00:00:00.000Z","tipPercent":"0.05",...}
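The write path above can be sketched in TypeScript as follows. This is an illustration under assumptions: encodeForCache and standInStringify are hypothetical names, and plain JSON.stringify stands in for the compiled fast-json-stringify function so the snippet has no dependencies.

```typescript
const NULL_SENTINEL = '__NULL__';
const UNDEFINED_SENTINEL = '__UNDEFINED__';

// Shape of a per-model compiled stringifier (assumed signature).
type CompiledStringify = (value: object) => string;

function encodeForCache(value: unknown, stringify: CompiledStringify): string {
    if (value === null) return NULL_SENTINEL;
    if (value === undefined) return UNDEFINED_SENTINEL;
    // Scalar bypass: strings, numbers, and booleans skip the per-model stringifier.
    if (typeof value !== 'object') return JSON.stringify(value);
    // null was handled above, so the cast to object is safe here.
    return stringify(value as object);
}

// Stand-in for the compiled function. JSON.stringify already emits Dates as
// ISO 8601 strings via Date.prototype.toJSON, so no date: prefix marker is needed.
const standInStringify: CompiledStringify = (value) => JSON.stringify(value);

const encoded = encodeForCache(
    { createdAt: new Date(Date.UTC(2026, 0, 1)), tipPercent: '0.05' },
    standInStringify
);
// encoded is '{"createdAt":"2026-01-01T00:00:00.000Z","tipPercent":"0.05"}'
```

The sentinels exist because null and undefined have to be distinguishable from a cache miss once everything is stored as a string.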

Read: PathReviver.deserialize(value, schema) runs one native JSON.parse (with no reviver) then walks a precompiled trie of revivable leaf paths. Iteration bound is the schema's child count, not the target object's key count. Non-revivable fields are never visited.

Valkey.GET(cacheKey) → s
if s === "__NULL__"        → return null
if s === "__UNDEFINED__"   → return undefined
return pathReviver.deserialize(s, schema)

PathReviver.deserialize(s, schema):
  parsed = JSON.parse(s)                              # ONE pass, native, no JS callback
  if parsed is non-null object:
    trie = this.compile(schema)                       # WeakMap-cached; null for {}
    if trie != null: this.walk(parsed, trie)          # mutates parsed in place
  return parsed

PathReviver.compile(schema):                          # amortized O(1) after first call
  for [dottedPath, kind] in schema:                   # kind ∈ {'date', 'decimal'}
    segments = dottedPath.split('.')                  # split ONCE at compile time
    descend the trie, creating Map<segment, node> as needed
    mark terminal node: node.leaf = kind

PathReviver.walk(target, node):                       # iterates SCHEMA children, not target keys
  if target is array:
    wildcard = node.children.get('__ARRAY__')
    if wildcard: for i in 0..target.length: descendArray(target, i, wildcard)
    return
  for [segment, child] in node.children:
    if child.leaf:
      v = target[segment]
      if typeof v === 'string':
        target[segment] = child.leaf === 'date' ? new Date(v) : new Prisma.Decimal(v)
    if child.children and target[segment] is non-null object:
      walk(target[segment], child)
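To make the pseudocode above concrete, here is a self-contained TypeScript sketch of the compile, walk, and deserialize steps. It follows the pseudocode but is not our production PathReviver: DecimalStandIn is a hypothetical replacement for Prisma.Decimal, and the array handling is my assumption about what descendArray does.

```typescript
type LeafKind = 'date' | 'decimal';
type ReviverSchema = Record<string, LeafKind>; // dotted path -> leaf kind

interface TrieNode {
    leaf?: LeafKind;
    children: Map<string, TrieNode>;
}

// Hypothetical stand-in for Prisma.Decimal so the sketch has no dependencies.
class DecimalStandIn {
    constructor(public readonly raw: string) {}
}

const ARRAY_WILDCARD = '__ARRAY__';
const trieCache = new WeakMap<ReviverSchema, TrieNode | null>();

function buildTrie(schema: ReviverSchema): TrieNode | null {
    const entries = Object.entries(schema);
    if (entries.length === 0) return null; // nothing to revive for {}
    const root: TrieNode = { children: new Map() };
    for (const [dottedPath, kind] of entries) {
        let node = root;
        // Paths are split once here, never on the read path.
        for (const segment of dottedPath.split('.')) {
            let next = node.children.get(segment);
            if (!next) {
                next = { children: new Map() };
                node.children.set(segment, next);
            }
            node = next;
        }
        node.leaf = kind;
    }
    return root;
}

function compile(schema: ReviverSchema): TrieNode | null {
    let trie = trieCache.get(schema);
    if (trie === undefined) {
        trie = buildTrie(schema);
        trieCache.set(schema, trie);
    }
    return trie;
}

function reviveLeaf(kind: LeafKind, value: string): unknown {
    return kind === 'date' ? new Date(value) : new DecimalStandIn(value);
}

// Iterates the schema's children, not the target's keys.
function walk(target: unknown, node: TrieNode): void {
    if (Array.isArray(target)) {
        const wildcard = node.children.get(ARRAY_WILDCARD);
        if (!wildcard) return;
        for (let i = 0; i < target.length; i++) {
            const el = target[i];
            if (wildcard.leaf && typeof el === 'string') target[i] = reviveLeaf(wildcard.leaf, el);
            else if (el !== null && typeof el === 'object') walk(el, wildcard);
        }
        return;
    }
    const obj = target as Record<string, unknown>;
    for (const [segment, child] of node.children) {
        const v = obj[segment];
        if (child.leaf && typeof v === 'string') obj[segment] = reviveLeaf(child.leaf, v);
        if (child.children.size > 0 && v !== null && typeof v === 'object') walk(v, child);
    }
}

function deserialize(s: string, schema: ReviverSchema): unknown {
    const parsed: unknown = JSON.parse(s); // one native pass, no reviver callback
    if (parsed !== null && typeof parsed === 'object') {
        const trie = compile(schema);
        if (trie !== null) walk(parsed, trie); // mutates parsed in place
    }
    return parsed;
}

const pageSchema: ReviverSchema = {
    'items.__ARRAY__.startsAt': 'date',
    'items.__ARRAY__.currentBid': 'decimal',
};
const page = deserialize(
    '{"title":"Gala","items":[{"name":"Trip","startsAt":"2026-01-01T00:00:00.000Z","currentBid":"125.50"}]}',
    pageSchema
) as { title: string; items: Array<{ name: string; startsAt: Date; currentBid: DecimalStandIn }> };
```

In this example title and name are never read by the walker; only the two schema paths under items are visited for each array element.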

Read-side sync-op count:

Let:

  • B = payload byte length (input to JSON.parse)
  • P = total nodes in the parsed payload (legacy v1 cost driver, no longer relevant to the walker)
  • N = number of internal trie nodes visited (one per descent on the way to a leaf; structural, depth-bounded)
  • A = sum of array lengths at array-wildcard descents (Σ over each __ARRAY__ site of the array's element count)
  • R = number of revivable leaves actually present in the payload (Date + Decimal slots, expanded across array elements)
  • L = max nesting depth of the schema trie (recursion depth only; does not scale per node)
  • K = number of leaf kinds the walker can dispatch ('date', 'decimal'); dispatched by a single equality check.

Bounds:

v2 Walker cost  =  O(N + A + R)     # independent of total payload node count P
                                    # independent of non-revivable field count
                                    # K does NOT multiply S, the v1 per-string count (no per-string startsWith fan-out)

Takeaway:

Walker cost is strictly proportional to revivable leaves and the array elements that gate them. Non-Date/non-Decimal fields are never touched. JSON.parse is still O(B), but it is now V8-native with no per-node JS callback (the v1 reviver-callback constant factor is eliminated). Nesting depth L shows up only as recursion frames, not as repeated path.split('.') work: paths are split once, at compile time, and cached as a trie in a WeakMap.

Growth class:

  1. Walker is linear in R + A, not in P.
  2. JSON.parse is linear in B.
  3. There is no S · K term anymore: the schema tells the walker each leaf's kind exactly, so the prefix-switch is gone from the hot path. Adding a third leaf kind (e.g. bigint) costs nothing on payloads that don't contain that kind, and costs c_leaf per occurrence on payloads that do. There is no fan-out across every string leaf.

The only term that still tracks the payload shape is JSON.parse(B), which is unavoidable for any design that sends JSON over the wire.

Example - 50 users x 107 items x 40 fields:

Same scenario as v1: a silent auction page loads 107 auction items (40 fields each) for 50 concurrent users. Assume ~4 revivable leaves per item (startsAt, endsAt, currentBid, minIncrement) and that the schema uses an items.__ARRAY__.{...} shape so the walker enters the array once.

Per-request node count (P) vs walker visits (N + A + R):

# v1
P (total parsed nodes)     = 1 + 107 + 107·40  = 4,388

# v2
N (internal trie nodes)    = 2                  # root → items → __ARRAY__
A (array iterations)       = 107
R (revivable leaves)       = 107 x 4 = 428

The v2 walker visits N + A + R = 2 + 107 + 428 = 537 slots. The v1 reviver visited all 4,388 nodes and ran startsWith on every string among them.

Per-request op count:

v1 ops_per_request ≈ 3
                    + 4,388 (P typeof)
                    + 4,280 (S·K startsWith)
                    + 856 (2·R)
                   ≈ 9,527 walker ops

v2 walker ops      ≈ 1
                    + 109 (N+A)
                    + ~1,284 (3·R)
                   ≈ 1,400 walker ops

The walker work drops by ~6.8x for this payload shape. The JSON.parse(value) term is the same in big-O but its constant factor improves because there is no JS reviver-callback invoked per node.

50 concurrent users:

ops_total ≈ 50 · [ c_parse · B  +  ~1,400 walker ops ]
          ≈ 50 · c_parse · B  +  ~70,000 walker ops

JSON.parse still runs on the single event loop; with ~4.4k-node payloads it's the dominant cost. The walker contribution at ~70k ops across 50 requests is sub-millisecond and effectively disappears against the parse cost.
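The arithmetic above can be reproduced with a few lines of TypeScript. The constants come straight from the example; the variable names are mine, and the S·K term is taken as given in the text rather than derived.

```typescript
const items = 107;
const fieldsPerItem = 40;
const revivablePerItem = 4;
const users = 50;

const P = 1 + items + items * fieldsPerItem; // root + items + fields = 4,388 parsed nodes
const SK = items * fieldsPerItem;            // 4,280, the S·K startsWith term as stated above
const R = items * revivablePerItem;          // 428 revivable leaves
const N = 2;                                 // root → items → __ARRAY__
const A = items;                             // 107 array iterations

const v1WalkerOps = 3 + P + SK + 2 * R;      // 9,527
const v2WalkerOps = 1 + (N + A) + 3 * R;     // 1,394, rounded to ~1,400 in the text
const speedup = v1WalkerOps / v2WalkerOps;   // about 6.8

const v1Total = users * v1WalkerOps;         // 476,350
const v2Total = users * v2WalkerOps;         // 69,700, rounded to ~70,000 in the text

console.log({ v1WalkerOps, v2WalkerOps, speedup: speedup.toFixed(1), v1Total, v2Total });
```

Changing revivablePerItem or items shows how the v2 cost tracks the revivable leaves and array length, while fieldsPerItem only affects the v1 numbers.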

Summary: what v2 eliminates

Term                                    v1 cost       v2 cost
Per-node JS reviver callback (P)        c_cb · P      0 (native parse)
Per-string startsWith fan-out (S·K)     c · S · K     0 (schema-typed leaf)
Per-leaf rehydrate (R)                  c · 2 · R     c · 3 · R (≈ same)
Path string split('.') at runtime       n/a           0 (split at compile)
Schema compile                          n/a           O(1) amortized (WeakMap)
Walker visits non-revivable fields      yes (all P)   no (schema-driven)
Bytes carry date: / decimal: prefix     yes           no (bare ISO/numeric)

Net: the walker is now O(R + A) instead of O(P + S·K), with smaller constants on the JSON.parse pass as well.

So for our case of 50 users x 107 items x 40 fields, we go from ~476,350 walker ops (v1) down to ~70,000 (v2), a saving of roughly 406,000 operations, and deserialization is about 2.31x faster because of the optimized walker.

Together these changes delivered the results we were looking for from this project: faster serialization and reduced maximum latency under high concurrency, all in the service of relieving event-loop pressure.