The Performance Gap jq Cannot Close By Design

Source: hackernews

jq has been the standard tool for JSON processing on the command line since Stephen Dolan released it around 2012. It is packaged for most Linux distributions, it is installed on a great many developers’ laptops, and it shows up in shell scripts, CI pipelines, and Makefiles across the industry. The problem it solved was real: JSON is the de facto format for APIs and configuration, but traditional Unix text tools do not understand structure.

Fourteen years later, the frustrations with jq are just as real as the problems it solved. The DSL is expressive but cryptic. A filter like .[] | select(.status == "active") | {name: .name, id: .id} reads clearly once you know the language; until then it does not, and the manual is long. Performance is the other complaint. For small files it is fine. For gigabyte-scale JSON logs or event exports, jq becomes a bottleneck that is hard to route around.

jsongrep is one of several recent tools attempting to address the performance side by applying grep’s model to JSON. Understanding why it is faster requires understanding what jq actually does when it runs.

How jq processes a document

jq parses each top-level JSON value in its input into an in-memory tree before running the filter against it. The architecture is a compiler/VM pair: your filter expression is compiled to bytecode, and that bytecode runs against the parsed JSON tree on a stack-based virtual machine. This design makes sense for the expressiveness goals. jq’s language is Turing-complete; it has reduce, label-break, recursive descent with recurse, generators, and try-catch. You can write genuinely complex transformations in it.

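The cost of that expressiveness is that a whole input value must be in memory before any output is produced from it. If you are filtering a 2 GB log export stored as a single top-level array for records where level == "error", jq allocates memory proportional to all 2 GB before it can begin evaluating the filter. (Newline-delimited JSON, where each record is its own top-level value, is processed one record at a time and does not hit this wall.) On a machine running several concurrent pipelines, that allocation pattern compounds quickly.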
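The tree-first model can be sketched in a few lines of Python. This illustrates the memory behavior, not jq's internals; the function name select_errors and the assumed input shape (one top-level array of objects with a level field) are chosen only to mirror the example in the text.

```python
import io
import json

def select_errors(fp):
    """Tree-first filtering: parse everything, then filter."""
    # json.load materializes the whole document before returning,
    # so peak memory grows with the size of the input.
    records = json.load(fp)
    # Only now can the filter run (assumed shape: a top-level array of objects).
    return [r for r in records if r.get("level") == "error"]

# Tiny stand-in for a large export file.
sample = io.StringIO('[{"level": "info", "id": 1}, {"level": "error", "id": 2}]')
errors = select_errors(sample)
```

No record can be emitted until the closing bracket of the array has been read, which is the property the rest of this article is concerned with.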

jq 1.5 introduced a --stream flag that partially addresses this. With --stream, jq emits [path, leaf] events as it parses instead of building the whole tree first. The catch is that filters must be written against those events rather than against the document. You are no longer writing .records[] | select(.status == "active"); you are reassembling records from path-value pairs, typically with helpers such as fromstream and truncate_stream, before you can select on them. The flag exists, but it is rarely used in practice, and the documentation for it is brief.
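To see why filters look so different in streaming mode, it helps to look at the shape of the events. The Python sketch below is modeled on the output of jq's tostream builtin: a [path, leaf] pair for every scalar (or empty container), plus a one-element closing event when a non-empty array or object ends. The function name to_stream is hypothetical, and key order here simply follows the Python dict.

```python
def to_stream(value, path=()):
    """Yield jq-style streaming events for an already-parsed value."""
    if isinstance(value, dict) and value:
        last = None
        for key, child in value.items():
            yield from to_stream(child, path + (key,))
            last = key
        # Closing event: the path of the last entry in this object.
        yield [list(path + (last,))]
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            yield from to_stream(child, path + (index,))
        # Closing event: the path of the last element in this array.
        yield [list(path + (len(value) - 1,))]
    else:
        # Scalars and empty containers are leaves: [path, leaf].
        yield [list(path), value]

stream_events = list(to_stream({"records": [{"status": "active"}]}))
```

A filter in this mode sees only paths and leaves, so selecting "records whose status is active" means first regrouping events by their path prefix.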

What grep does differently

The reason grep can scan multi-gigabyte files in seconds is that it never builds a document model. It reads a buffer, attempts a match, emits matching lines and discards the rest, then reads the next buffer. The memory footprint is proportional to the buffer size, not the file size. GNU grep combines Boyer-Moore-style skip searching with heavily optimized, typically vectorized, byte scanning, and for simple patterns it can scan buffers at several gigabytes per second on modern hardware.
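The buffer-at-a-time loop can be sketched as follows. This is a deliberately naive Python illustration of the memory model, with none of GNU grep's search optimizations; grep_stream and its parameters are hypothetical names.

```python
import io

def grep_stream(stream, needle: bytes, bufsize: int = 64 * 1024):
    """Yield lines containing `needle`, reading `bufsize` bytes at a time."""
    tail = b""  # partial line carried over from the previous buffer
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        lines = (tail + chunk).split(b"\n")
        tail = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            if needle in line:
                yield line
    if tail and needle in tail:
        yield tail  # final line without a trailing newline

data = io.BytesIO(b"alpha\nbeta error\ngamma\nerror at end")
matches = list(grep_stream(data, b"error", bufsize=4))
```

Memory use is bounded by the buffer size plus the longest line, regardless of how large the input is.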

That model does not translate directly to JSON because JSON is not line-oriented. A single JSON object can span thousands of lines, and the path .user.address.city is not meaningful to a line scanner. But the core insight does translate: you can write a streaming JSON parser that emits events — start object, key, value, end object — as it reads bytes, and apply pattern matching to those events without ever building a complete tree.
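A minimal event-emitting parser might look like the Python sketch below. It is an assumption-laden illustration, not jsongrep's parser: it does no validation of malformed input, the event names are invented for this example, and it leans on json.loads only to decode individual scalars.

```python
import json

def events(chunks):
    """Yield (kind, value) events from an iterable of text chunks."""
    stack = []          # "{" or "[" for each open container
    expect_key = False  # True when the next string is an object key
    chars = (c for chunk in chunks for c in chunk)
    pending = None      # delimiter read while scanning a bare scalar
    while True:
        c = pending if pending is not None else next(chars, None)
        pending = None
        if c is None:
            return
        if c in " \t\r\n:":
            continue
        if c == ",":
            expect_key = bool(stack) and stack[-1] == "{"
        elif c in "{[":
            stack.append(c)
            expect_key = c == "{"
            yield ("start_object" if c == "{" else "start_array", None)
        elif c in "}]":
            stack.pop()
            expect_key = False
            yield ("end_object" if c == "}" else "end_array", None)
        elif c == '"':
            raw, escaped = ['"'], False
            for d in chars:
                raw.append(d)
                if escaped:
                    escaped = False
                elif d == "\\":
                    escaped = True
                elif d == '"':
                    break
            text = json.loads("".join(raw))
            yield ("key" if expect_key else "value", text)
            expect_key = False
        else:
            # Number, true, false, or null: read up to the next delimiter.
            raw = [c]
            for d in chars:
                if d in ",}] \t\r\n":
                    pending = d
                    break
                raw.append(d)
            yield ("value", json.loads("".join(raw)))

parsed_events = list(events(['{"a": [1, {"b": "x"}], ', '"c": null}']))
```

The only state kept between events is the container stack, so memory grows with nesting depth rather than with document size.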

This is what a grep-style JSON tool like jsongrep implements. Instead of compiling your query to bytecode and running it against a parsed tree, you compile the query to a state machine and run it against a stream of parser events. For the common case of “find all objects where key X has value Y,” this approach can match grep’s performance profile much more closely than jq can.
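As a rough illustration of the state-machine idea, the Python sketch below consumes parser events and reports which records match "key X has value Y". It is not jsongrep's implementation. The event names, the match_records function, and the assumed input shape (a top-level array of objects, matching only on each record's direct keys) are all hypothetical.

```python
def match_records(events, key, wanted):
    """Yield the index of each top-level record where `key` equals `wanted`."""
    depth = 0      # current nesting depth
    index = -1     # index of the record being scanned
    armed = False  # True right after seeing `key` at record level
    hit = False    # True once the current record has matched
    for kind, value in events:
        if kind in ("start_object", "start_array"):
            depth += 1
            if kind == "start_object" and depth == 2:
                index += 1
                hit = False
            armed = False
        elif kind in ("end_object", "end_array"):
            if kind == "end_object" and depth == 2 and hit:
                yield index
            depth -= 1
            armed = False
        elif kind == "key":
            armed = depth == 2 and value == key
        else:  # scalar value
            if armed and value == wanted:
                hit = True
            armed = False

# Hand-written events standing in for a streaming parser's output.
sample_events = [
    ("start_array", None),
    ("start_object", None), ("key", "status"), ("value", "active"),
    ("key", "meta"), ("start_object", None), ("key", "status"),
    ("value", "nested"), ("end_object", None), ("end_object", None),
    ("start_object", None), ("key", "status"), ("value", "inactive"),
    ("end_object", None),
    ("start_object", None), ("key", "id"), ("value", 3),
    ("key", "status"), ("value", "active"), ("end_object", None),
    ("end_array", None),
]
matched = list(match_records(sample_events, "status", "active"))
```

The matcher holds four small variables no matter how many records pass through it. Emitting the matching record itself, rather than its index, would require buffering one record at a time.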

The expressiveness tradeoff

The performance gain comes with real constraints. A streaming event-based approach handles path filters and value pattern matching well. It cannot express arbitrary transformations. If you want to reshape JSON — extract fields, join arrays, compute derived values — you need access to the full document or at least to complete sub-documents. A streaming evaluator can buffer sub-documents up to some depth, but the guarantee of low memory use erodes as soon as you need to process deeply nested or cross-referenced structures.

jq can do things that a grep-model tool simply cannot. jq 'group_by(.user_id) | map({user: .[0].user_id, count: length})' groups records and aggregates them in a single expression. That requires holding all records with the same user_id simultaneously. No streaming model handles this without additional state, and once you add that state you have partially rebuilt what jq is.
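The "additional state" is easy to see in a streaming version of that aggregation. In the Python sketch below (count_by_user is a hypothetical name), records are consumed one at a time, but the counter still grows with the number of distinct user_id values, and nothing can be emitted until the input ends.

```python
from collections import Counter

def count_by_user(records):
    """Streaming equivalent of the group_by/count filter shown above."""
    counts = Counter()  # state that grows with the number of distinct users
    for record in records:
        counts[record["user_id"]] += 1
    # Like group_by, emit groups sorted by key, and only at end of input.
    return [{"user": user, "count": n} for user, n in sorted(counts.items())]

summary = count_by_user(iter([
    {"user_id": 7, "event": "login"},
    {"user_id": 3, "event": "login"},
    {"user_id": 7, "event": "logout"},
]))
```

Counting needs only one integer per key; an aggregation that needs the grouped records themselves would have to retain them all.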

The bet that tools like jsongrep make is that for a large fraction of real-world command-line use, you do not need that power. You need to find records matching a condition, maybe extract a few fields, and pass them downstream. For that subset, the grep model is faster and uses less memory, and the queries are simpler to write.

What fast JSON parsing actually looks like

The ceiling for JSON parsing performance is set by tools like simdjson, which uses SIMD instructions to parse at 2-3 GB/s by processing multiple bytes in parallel and converting structural characters to a bitset in a single vectorized pass. simdjson has bindings for multiple languages and its techniques have influenced several newer tools in this space.

A grep-style JSON tool built on a fast streaming parser can approach those throughput numbers for simple pattern matching because the bottleneck shifts to I/O rather than compute. Reading from a modern NVMe SSD at 3-5 GB/s means a SIMD-based parser can keep up. jq’s throughput on representative queries is commonly benchmarked in the 50-300 MB/s range depending on query complexity, which means it is often the constraint even on fast storage.
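A back-of-the-envelope calculation makes the gap concrete. The Python sketch below takes the throughput figures quoted above at face value and uses decimal units (1 GB = 1000 MB); scan_seconds is a hypothetical helper, and the results are arithmetic, not benchmarks.

```python
def scan_seconds(size_gb: float, rate_mb_per_s: float) -> float:
    """Time to process `size_gb` gigabytes at a sustained rate in MB/s."""
    return size_gb * 1000 / rate_mb_per_s

# A 2 GB file at the quoted low and high ends for jq, and at 3 GB/s.
slow = scan_seconds(2, 50)      # 40.0 seconds
fast = scan_seconds(2, 300)     # about 6.7 seconds
simd = scan_seconds(2, 3000)    # about 0.7 seconds
```

At the quoted rates, the same file takes between roughly 7 and 40 seconds in the tree-building range and under a second at storage speed.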

The implementation language matters less here than the allocation pattern. jq is written in C, so raw parse speed is not the issue; the primary constraint is the allocator pressure from constructing the full parse tree. A tool written in Rust with a streaming parser and careful arena allocation can sustain much higher throughput even without SIMD, because it avoids the allocation pattern that jq’s document model requires.

The broader ecosystem

jsongrep enters a crowded space. gron takes a different angle: it transforms JSON into flat, greppable assignment lines like json.users[0].name = "alice";, lets you run the full grep toolchain against that output, then converts the result back to JSON with gron --ungron. It is clever and composable, but the two-pass approach adds latency and the intermediate format is unusual.
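The flattening step can be approximated in a few lines of Python. This is a gron-like sketch, not gron's implementation: the flatten function and the identifier rule are assumptions, and the output is not guaranteed to match gron byte for byte (key ordering and quoting may differ).

```python
import json
import re

# Keys that look like identifiers use dot notation; others use brackets.
IDENT = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$]*$")

def flatten(value, path="json"):
    """Yield one assignment line per node, in the style of gron's output."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            if IDENT.match(key):
                child_path = f"{path}.{key}"
            else:
                child_path = f"{path}[{json.dumps(key)}]"
            yield from flatten(child, child_path)
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield f"{path} = {json.dumps(value)};"

flat_lines = list(flatten({"users": [{"name": "alice"}]}))
```

Because every line carries its full path, an ordinary grep for users or alice finds the relevant lines without any knowledge of JSON structure.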

fx and jless prioritize interactive exploration over batch processing: they are good for browsing unfamiliar JSON schemas but are not designed for pipeline use. gojq is a Go reimplementation of jq that aims for close compatibility; it is not faster for large files because it shares the same document-model architecture, but it is easier to distribute as a single binary and handles several edge cases more strictly than jq does.

JMESPath is the query language behind the AWS CLI, with a formal specification and implementations in most major languages. In 2024, RFC 9535 standardized JSONPath, the XPath-inspired notation that tools like Python’s jsonpath-ng implement. Having an RFC-backed standard matters for interoperability across implementations, even if the tooling is still maturing relative to jq’s ecosystem.

When to use what

For interactive exploration and one-off transformations on reasonably sized JSON, jq remains the most powerful option. The DSL is genuinely expressive once learned, and jqplay makes it easy to iterate on filters interactively. Nothing in the alternatives ecosystem matches jq for complex reshaping.

For filtering large files by path and value conditions, a grep-style tool measurably reduces wall time and memory pressure. The queries map more directly onto what you are actually trying to do — “give me all the records where this field matches this pattern” — without requiring knowledge of jq’s functional programming model.

For building programmatic pipelines that process JSON at scale, the right answer is usually a library rather than a command-line tool: simdjson for C++, serde_json or sonic-rs for Rust, orjson for Python. Command-line tools are convenient wrappers for ad-hoc use; library APIs give you the full expressiveness of a programming language without requiring a separate DSL.

The fact that jsongrep reached the front page of Hacker News with over 300 points and more than 200 comments reflects something real. jq is one of those tools that developers use constantly and have quietly wanted a better version of for years. The performance problem is genuine, the learning curve is genuine, and any tool that addresses both with a familiar mental model will get attention. Whether the specific design choices hold up at the scale of your actual workloads is worth testing directly, but the architectural argument for why grep-style JSON querying is faster than jq is sound.