Making Arcjet's Wasm bot detector smaller and faster

Reducing WebAssembly bundle size: how Arcjet shrank its Rust bot detector 27% with Aho-Corasick, keeping per-request memory isolation and using Wizer snapshots.

Arcjet ships security analysis code inside our SDKs. We write that core logic in safe Rust, compile it to WebAssembly, and expose it through the JavaScript, Python, and Go SDKs that our customers install. The point of using Wasm is to get a small, sandboxed, cross-language runtime for security analysis.

We have written about this architecture before. The short version is that Wasm lets us run the same security logic in multiple places without rewriting the analysis engine for every language. That also means our implementation choices become part of a customer's application bundle.

We recently had a customer blocked from deploying Next.js middleware: our Wasm runtime plus their application code exceeded Vercel's 1 MB gzipped Hobby-plan Edge limit (Node serverless gets 250 MB uncompressed).

This is annoying for someone testing Arcjet for free, but smaller artifacts also help cold starts in serverless environments. And upgrade cycles are slow, so old constraints stay live in customer deployments long after the platform moves on.

The Edge runtime itself is fading: it's much less common in the application paths we see, the ecosystem is moving back toward Node.js, and Next.js has since made the Node middleware runtime stable (15.5) and renamed Middleware to Proxy, which defaults to Node.

The constraint has changed, but it has not disappeared.

We spent some time working through bundle size optimizations, which I'll walk through in this post. The interesting thing is that the problem was not the bot detection logic itself - it was the prebuilt data snapshot of the user-agent parser.

Wizer made runtime fast but the bundle too large

Arcjet's first, fast bot check uses a table of well-known crawler and automation user-agent patterns. The source of truth is arcjet/well-known-bots, specifically the well-known-bots.json source file where the user-agent selectors and regexes live. We have an internal code generator which converts those definitions into Rust and TypeScript code used by the SDKs.

User-agent matching is a cheap first pass in a layered analysis pipeline that can also use IP verification, request context, rate limits, and Advanced Bot Signals where browser telemetry is available. But the user-agent check is on the hot path, so it has to be fast, deterministic, and small enough to ship in serverless environments.

Before this change, the UserAgentParser compiled roughly 643 patterns into one Rust RegexSet. Compiling that parser on every request would be wasteful, so we used Wizer.

Wizer runs an initialization function at build time and snapshots the resulting WebAssembly linear memory. When the module starts later, the parser is already there. We have used this pattern before in the Go/wazero runtime path: do expensive setup once at build or service startup where possible, so request handling does not pay the cost repeatedly. That is especially important when Wasm is on the security decision path.
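The build-once pattern described above can be sketched in a few lines of Rust. This is a minimal illustration, not Arcjet's code: the UserAgentParser body, the init and detect_bot function names, and the three-entry pattern list are all hypothetical stand-ins. The one external detail assumed here is that Wizer looks for an exported function named wizer.initialize by default; the export attribute is left as a comment so the sketch also compiles natively.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the real parser: it only holds a pattern table.
pub struct UserAgentParser {
    patterns: Vec<&'static str>,
}

impl UserAgentParser {
    /// Stand-in for the expensive step (compiling hundreds of patterns).
    fn build() -> Self {
        UserAgentParser {
            patterns: vec!["Amazonbot", "AhrefsBot/", "archive.org_bot"],
        }
    }

    pub fn is_bot(&self, user_agent: &str) -> bool {
        self.patterns.iter().any(|p| user_agent.contains(p))
    }
}

/// A static lives in Wasm linear memory, so whatever it holds when the
/// snapshot is taken is already present when a fresh instance starts.
static PARSER: OnceLock<UserAgentParser> = OnceLock::new();

/// In a Wizer build this function would be exported under the name
/// `wizer.initialize` (assumed default) and run once at build time.
pub fn init() {
    let _ = PARSER.get_or_init(UserAgentParser::build);
}

/// Request path: uses the pre-built parser when the snapshot contains one,
/// and falls back to building it lazily when it does not.
pub fn detect_bot(user_agent: &str) -> bool {
    PARSER.get_or_init(UserAgentParser::build).is_bot(user_agent)
}
```

With this shape, the same request-path code works with or without a snapshot; the only difference is whether the first call on a fresh instance pays for the build.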

That is exactly what we wanted for latency, but the compiled RegexSet added about 1.5 MB raw, or roughly 350 KB gzipped, to the Wasm. The full component was around 3.06 MB raw and about 944 KB gzipped, enough that the Wasm asset alone used most of Vercel's 1 MB Edge budget before the rest of the application bundle was counted.

Our first attempt to optimize this was to split the build:

  • arcjet-js got a no-Wizer variant, so the Wasm stayed small enough for Vercel Edge. Our target JS applications - those deployed on Vercel - run in serverless environments where there is a higher chance of a cold start. There are many long-running Node.js servers of course, so this is always a balance when considering which target to optimize for.
  • arcjet-py and arcjet-go kept the Wizer'd variant, because they do not have the same bundle-size constraint and can do the initialization on first startup. Go and Python servers tend to run for a longer period so cold starts are less of an issue.

This approach would solve the immediate deployment problem, but it was not ideal from a maintenance perspective. It meant two Wasm artifacts from one source, two paths to keep documented, and a tradeoff we did not like: in the no-Wizer JavaScript variant, every request had to rebuild the parser because our SDK design creates a fresh Wasm instance per request.

Why not just cache the Wasm instance?

The obvious performance fix would be to keep one Wasm instance around and reuse it - build the parser once, then make every request share it. Unfortunately doing so would eliminate one of the security benefits we get from using Wasm.

Arcjet's SDKs instantiate Wasm per request because a fresh instance gives each request fresh linear memory. That memory can temporarily contain request data: headers, IP addresses, email addresses, and strings passed into sensitive information detection. Reusing one instance would be faster, but it would keep prior request data in the same linear memory until it happened to be overwritten.

Safe Rust shouldn't read another request's freed heap by accident, so this isn't a known leak. The risk we're actually defending against is a future bug - a use-after-free or an uninitialized read - that exposes stale buffer contents from a prior request. Per-request instantiation makes cross-request exposure through that class of bug structurally impossible rather than something we have to catch in review. For a security product, defense-in-depth against our own mistakes is worth the instantiation cost.

With a Wizer snapshot, each request still gets its own instance - the parser is already present in that instance at startup, but request data does not persist across requests in shared linear memory.

So we redesigned the approach: keep per-request instantiation, keep Wizer if we can, and make the parser snapshot small enough to fit.

Most of the regexes were not really regexes

The user-agent table looked like a regex problem because it was stored as regex patterns. But most of the patterns are just literal substrings:

  • Amazonbot
  • AhrefsBot\/
  • archive\.org_bot

Only a small set needed real regex semantics, such as anchors, character classes, and alternation:

  • ^curl
  • [wW]get
  • [cC]laude(?:[bB]ot|-[Ww]eb)
  • AdsBot-Google([^-]|$)

In the current table, 619 of the 643 patterns are literals. Only 24 need to stay in a RegexSet.

That suggested a better split - put the literal patterns into an Aho-Corasick automaton, keep the true regex patterns in a small RegexSet, then merge the results back into the same bot-detection logic.

Aho-Corasick is built for this exact job: search for many literal strings in one pass over the input. It also snapshots much smaller than a compiled RegexSet, because a trie of shared literal prefixes is a far more compact structure than the state machine the regex engine builds to handle full regex semantics it doesn't need here.
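To make the "many literals in one pass" idea concrete, here is a small teaching sketch of an Aho-Corasick automaton using only the Rust standard library. It is not the implementation Arcjet ships, and the Automaton type and find_all method are names invented for this example. It assumes non-empty patterns and favors readability over the compact memory layout a production automaton would use.

```rust
use std::collections::{HashMap, VecDeque};

struct Node {
    next: HashMap<u8, usize>, // trie edges
    fail: usize,              // longest proper suffix that is also a trie prefix
    out: Vec<usize>,          // indices of patterns that end at this state
}

impl Node {
    fn new() -> Self {
        Node { next: HashMap::new(), fail: 0, out: Vec::new() }
    }
}

pub struct Automaton {
    nodes: Vec<Node>,
    pattern_count: usize,
}

impl Automaton {
    /// Builds the trie, then fills in failure links breadth-first.
    pub fn new(patterns: &[&str]) -> Self {
        let mut nodes = vec![Node::new()];
        for (id, pattern) in patterns.iter().enumerate() {
            let mut state = 0;
            for &byte in pattern.as_bytes() {
                state = match nodes[state].next.get(&byte).copied() {
                    Some(child) => child,
                    None => {
                        nodes.push(Node::new());
                        let child = nodes.len() - 1;
                        nodes[state].next.insert(byte, child);
                        child
                    }
                };
            }
            nodes[state].out.push(id);
        }
        // Depth-1 states keep fail = 0; deeper states derive theirs from the parent.
        let mut queue: VecDeque<usize> = nodes[0].next.values().copied().collect();
        while let Some(state) = queue.pop_front() {
            let edges: Vec<(u8, usize)> =
                nodes[state].next.iter().map(|(&b, &c)| (b, c)).collect();
            for (byte, child) in edges {
                let mut f = nodes[state].fail;
                while f != 0 && !nodes[f].next.contains_key(&byte) {
                    f = nodes[f].fail;
                }
                let target = nodes[f].next.get(&byte).copied().unwrap_or(0);
                let inherited = nodes[target].out.clone();
                nodes[child].fail = target;
                nodes[child].out.extend(inherited);
                queue.push_back(child);
            }
        }
        Automaton { nodes, pattern_count: patterns.len() }
    }

    /// Scans the input once and returns, in ascending order, the index of
    /// every pattern that occurs at least once.
    pub fn find_all(&self, haystack: &[u8]) -> Vec<usize> {
        let mut seen = vec![false; self.pattern_count];
        let mut state = 0;
        for &byte in haystack {
            while state != 0 && !self.nodes[state].next.contains_key(&byte) {
                state = self.nodes[state].fail;
            }
            state = self.nodes[state].next.get(&byte).copied().unwrap_or(0);
            for &id in &self.nodes[state].out {
                seen[id] = true;
            }
        }
        (0..self.pattern_count).filter(|&i| seen[i]).collect()
    }
}
```

Building Automaton::new over a handful of literals and calling find_all on a user agent's bytes reports each matching pattern index once, no matter how many times the literal repeats in the input.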

User agents are untrusted input

User agents are attacker-controlled, so the matcher has to stay bounded no matter what it's fed. The new code records matches in a fixed-width bitset - one bit per pattern - which is both how it preserves result ordering and a deliberate security choice.

Aho-Corasick can report every occurrence of a literal. A naive implementation that pushed each occurrence onto a vector would let a user agent that repeats Amazonbot ten thousand times make CPU and memory scale with attacker-controlled repetition. Marking a pattern index in the bitset is idempotent, and the bitset itself is bounded by the size of the pattern table.

The remaining 24 regexes aren't a denial-of-service surface either: the Rust regex crate is linear-time and doesn't backtrack, so there's no catastrophic-backtracking risk in that bucket. The only attacker-controlled blowup is match enumeration, which the bitset removes.
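A fixed-width bitset of this kind is short enough to sketch in full. The MatchSet type and its method names below are hypothetical, chosen for illustration; the source only says that matches are recorded as one bit per pattern.

```rust
/// One bit per pattern in the table; the size is fixed when the set is created.
pub struct MatchSet {
    words: Vec<u64>,
    len: usize,
}

impl MatchSet {
    pub fn new(pattern_count: usize) -> Self {
        MatchSet {
            words: vec![0; (pattern_count + 63) / 64],
            len: pattern_count,
        }
    }

    /// Marking is idempotent: repeating a match never grows the set.
    pub fn mark(&mut self, pattern_index: usize) {
        assert!(pattern_index < self.len, "pattern index out of range");
        self.words[pattern_index / 64] |= 1u64 << (pattern_index % 64);
    }

    pub fn contains(&self, pattern_index: usize) -> bool {
        pattern_index < self.len
            && self.words[pattern_index / 64] & (1u64 << (pattern_index % 64)) != 0
    }

    /// Matched indices in ascending order, which is the pattern table's order.
    pub fn indices(&self) -> Vec<usize> {
        (0..self.len).filter(|&i| self.contains(i)).collect()
    }

    /// Storage used by the bits, independent of how many matches were marked.
    pub fn size_in_bytes(&self) -> usize {
        self.words.len() * 8
    }
}
```

For a 643-pattern table this layout needs 11 words, or 88 bytes, whether a literal is seen once or ten thousand times.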

The implementation classifies every committed pattern into one of two buckets. A small byte scanner treats a pattern as a literal only when it contains no unescaped regex metacharacters and any escapes resolve to literal bytes. Escaped . and / become literal . and /. A dangling backslash, a character class, an anchor, a group, alternation, or a regex escape like \s stays in the regex bucket.
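A scanner following those rules might look like the sketch below. The as_literal function name and the exact allowlist of escapable characters are assumptions for illustration rather than Arcjet's actual rules. The allowlist is deliberately conservative: any escape it does not recognize sends the pattern to the regex bucket.

```rust
/// Returns the literal bytes when `pattern` is a plain literal, or `None`
/// when it needs the regex engine.
pub fn as_literal(pattern: &str) -> Option<Vec<u8>> {
    let bytes = pattern.as_bytes();
    let mut literal = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => {
                // A dangling backslash has no next byte: regex bucket.
                let next = *bytes.get(i + 1)?;
                match next {
                    // Escaped punctuation that stands for itself
                    // (assumed allowlist for this sketch).
                    b'.' | b'/' | b'-' | b'\\' | b'^' | b'$' | b'*' | b'+' | b'?'
                    | b'(' | b')' | b'[' | b']' | b'{' | b'}' | b'|' => literal.push(next),
                    // \s, \d, \b and anything else unrecognized: regex bucket.
                    _ => return None,
                }
                i += 2;
            }
            // Unescaped metacharacters: anchors, classes, groups,
            // alternation, repetition, wildcard.
            b'.' | b'^' | b'$' | b'*' | b'+' | b'?' | b'(' | b')' | b'[' | b']'
            | b'{' | b'}' | b'|' => return None,
            other => {
                literal.push(other);
                i += 1;
            }
        }
    }
    Some(literal)
}
```

Under these rules, AhrefsBot\/ and archive\.org_bot resolve to the literals AhrefsBot/ and archive.org_bot, while ^curl, [wW]get, and AdsBot-Google([^-]|$) return None and stay in the regex bucket.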

This avoids parsing every pattern with regex-syntax on each fresh Wasm instance, which would eat most of the Aho-Corasick build saving. But we still use regex-syntax in tests to cross-check the byte scanner against an authoritative parser for every committed pattern.

The ordering this preserves matters: Arcjet's bot table groups allowed and forbidden patterns by bot, and that order is part of the algorithm. Walking the bitset in table order gives the same ordering as the old RegexSet::matches() path.

We added adversarial tests to ensure that potential attacks are covered:

  • invalid UTF-8 and embedded NUL bytes
  • long non-matching inputs
  • long prefixes before anchored regexes like ^curl
  • repeated matching literals
  • repeated distinct literals
  • boundary-sensitive patterns like AdsBot-Google([^-]|$)
  • empty, whitespace, and control-only inputs

A differential test compares the new Aho-Corasick plus small RegexSet matcher against the previous all-RegexSet algorithm across roughly 3,200 generated and real-world-ish user agents. The result is byte-identical to the original.
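The shape of such a differential test can be sketched with two toy matchers. Everything here is illustrative: first_divergence, match_by_windows, and match_by_offsets are hypothetical names, and both matchers are simple substring searches standing in for the real RegexSet and Aho-Corasick paths. Inputs are raw bytes so the corpus can include invalid UTF-8.

```rust
/// Returns the index of the first corpus entry on which the two matchers
/// disagree, or `None` when they agree on every input.
pub fn first_divergence<R, C>(corpus: &[&[u8]], reference: R, candidate: C) -> Option<usize>
where
    R: Fn(&[u8]) -> Vec<usize>,
    C: Fn(&[u8]) -> Vec<usize>,
{
    corpus.iter().position(|ua| reference(ua) != candidate(ua))
}

/// Reference matcher: checks each pattern independently with a sliding window.
pub fn match_by_windows(patterns: &[&[u8]], ua: &[u8]) -> Vec<usize> {
    patterns
        .iter()
        .enumerate()
        .filter(|(_, p)| !p.is_empty() && ua.windows(p.len()).any(|w| w == **p))
        .map(|(i, _)| i)
        .collect()
}

/// Candidate matcher: walks every start offset and tests each pattern as a prefix.
pub fn match_by_offsets(patterns: &[&[u8]], ua: &[u8]) -> Vec<usize> {
    let mut seen = vec![false; patterns.len()];
    for start in 0..ua.len() {
        for (i, p) in patterns.iter().enumerate() {
            if !p.is_empty() && ua[start..].starts_with(p) {
                seen[i] = true;
            }
        }
    }
    (0..patterns.len()).filter(|&i| seen[i]).collect()
}
```

Returning the index of the first disagreement, rather than a bare pass or fail, points directly at the input that needs investigating.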

What the experiment showed

The best result was not what we originally expected. The earlier investigation suggested that Aho-Corasick would make the no-Wizer parser cheap enough to build on every request. It did help, but not enough to make lazy per-request builds the best option. The Aho-Corasick automaton and the remaining small RegexSet still cost real time to build on every fresh instance.

Instead, Aho-Corasick made the Wizer snapshot small enough to ship again.

Benchmarked on the transpiled js_req component under Node 24 / V8, with a fresh instance per call, the result was:

Variant                 | Per-request parser build | Full detectBot for curl | Gzipped Wasm
------------------------|--------------------------|-------------------------|-------------
Wizer + old RegexSet    | ~0 ms (in snapshot)      | ~0.9 ms                 | ~944 KB
no-Wizer + old RegexSet | ~3.4 ms                  | ~3.5 ms                 | ~565 KB
no-Wizer + Aho-Corasick | ~2.3 ms                  | ~2.4 ms                 | ~570 KB
Wizer + Aho-Corasick    | ~0.03 ms                 | ~0.4 ms                 | ~689 KB

First-call latency runs higher than the warm median because of V8 JIT warmup rather than parser build - on the Wizer + old RegexSet path the first call measured ~7–9 ms before settling to the ~0.9 ms warm median.

The smallest artifact is still the no-Wizer build, but the Wizer'd Aho-Corasick build is the better product tradeoff: it eliminates the per-request parser build, keeps the Wasm comfortably below the size limit, and lets JavaScript, Python, and Go all consume one generated artifact.
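The headline numbers follow from the benchmark figures with a little arithmetic, sketched here with two hypothetical helper functions. The inputs are the approximate values from the benchmark, so the outputs are approximate too. Treating the 1 MB limit as 1024 KB is an assumption of this sketch.

```rust
/// Percentage reduction going from `before` to `after`.
pub fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Space left for application code under a size limit, in KB.
pub fn headroom_kb(limit_kb: f64, wasm_kb: f64) -> f64 {
    limit_kb - wasm_kb
}
```

Going from ~944 KB to ~689 KB gzipped is percent_reduction(944.0, 689.0), which is about 27%. With the limit taken as 1024 KB, headroom_kb grows from 80 KB to 335 KB for the rest of the application bundle.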

The lesson

The original RegexSet was correct, but it made the Wizer snapshot the wrong structure. Dropping Wizer fixed the size issue, but moved cost into every request. Caching the Wasm instance would have fixed the latency issue, but weakened a memory-isolation property we care about. Aho-Corasick changed the underlying data shape so we no longer had to choose between those constraints.

That is the kind of optimization that we like to see. Not just faster. Not just smaller. Faster and smaller while keeping the invariants that made the original design safe.
