
jq Is Powerful, But Power Has a Price: The Case for Grep-Style JSON Tools

Source: hackernews

Every developer who works with APIs or structured logs has a jq one-liner in their muscle memory. It’s expressive, composable, and handles surprisingly complex transformations. But if you’ve ever run jq on a multi-gigabyte NDJSON log file and watched it sit there thinking, you’ve encountered the mismatch at the heart of jq’s design: it was built for transformation, and you’re using it for search.

Micah Kepe’s writeup on jsongrep surfaces this tension directly, arguing that for grep-style workloads, jq’s architecture is working against you. To understand why, it helps to know what jq is actually doing when you invoke it.

What jq Does Under the Hood

jq was written by Stephen Dolan in C, first appearing around 2012. Its filter language is a small functional programming language: filters compose, pipe, and transform JSON values. When you run jq '.results[] | select(.status == "error") | .id', jq compiles that filter expression to a sequence of bytecode instructions, then executes those instructions against a virtual machine that operates on jq values (jv types).
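To make the shape of that pipeline concrete, here is a rough Python analogue of the filter above, with each stage written as a generator. This is only an illustration of the semantics, not how jq's bytecode VM is implemented; the function names and the SAMPLE document are invented for the example.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iterate_results(doc: Dict[str, Any]) -> Iterator[Any]:
    """Rough analogue of `.results[]`: emit each array element in turn."""
    yield from doc["results"]


def select_errors(items: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Rough analogue of `select(.status == "error")`."""
    return (item for item in items if item.get("status") == "error")


def extract_ids(items: Iterable[Dict[str, Any]]) -> Iterator[Any]:
    """Rough analogue of `.id`."""
    return (item["id"] for item in items)


def error_ids(text: str) -> List[Any]:
    # The whole document is parsed into Python objects before any stage runs,
    # which mirrors the parse-then-filter order described in the text.
    doc = json.loads(text)
    return list(extract_ids(select_errors(iterate_results(doc))))


# Hypothetical input, made up for this sketch.
SAMPLE = (
    '{"results": ['
    '{"id": 1, "status": "ok"}, '
    '{"id": 2, "status": "error"}, '
    '{"id": 3, "status": "error"}]}'
)

if __name__ == "__main__":
    print(error_ids(SAMPLE))
```

Note that every element of results is materialized as an object, including the ones the select stage immediately discards.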

The jv system is a reference-counted union type that represents any JSON value. Every string, number, object, and array in your input document becomes a jv allocation. On a 200MB JSON file with thousands of nested objects, that’s a substantial amount of allocator traffic before jq has evaluated a single filter step.

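Parsing of each document happens entirely up front. By default, jq reads and parses each top-level JSON value completely into memory before running the filter on it, so a single large document is fully materialized first (newline-delimited input is handled one value at a time, but every value is still fully parsed). The --stream flag exists for streaming mode, but it changes the filter semantics in ways that make it impractical for most use cases, and it’s rarely seen in production scripts.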

For a tool doing complex restructuring, this design makes sense. If you’re transforming one JSON shape into another, you often need random access to the parsed tree. But for a query like “does this log line contain an object where level equals error?”, you don’t need any of that.

The Grep Mental Model

Grep doesn’t parse files. It scans byte streams, matches patterns, and emits lines. That simplicity is precisely what makes it fast on large inputs: there’s no intermediate representation, no allocation per input element, no virtual machine.

For JSON search workloads, the equivalent operation would be: scan the byte stream for structural patterns, extract matching regions, emit them. You’d sacrifice the ability to run arbitrary jq filters, but for containment queries and value matching, you’d have an order of magnitude less work to do.

This is the insight behind gron, the Go tool that flattens JSON into grep-able assignment statements:

$ gron data.json | grep 'status = "error"'
json[42].status = "error";
json[107].status = "error";

gron’s philosophy is to make JSON a first-class citizen for standard Unix pipelines. The round-trip (gron + grep + ungron) has overhead of its own, but the mental model maps directly onto workflows that developers already know.
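The flattening idea itself is small enough to sketch. The following Python is a simplified illustration of the technique, not gron's actual implementation: it assumes a root name of json, keeps keys in input order rather than sorting them, and uses bracket notation for any key that is not a plain identifier. The flatten function is a name made up for this example.

```python
import json
import re
from typing import Any, Iterator

# Keys matching this pattern are written as `.key`; anything else is written
# as `["key"]`. This rule is an assumption of the sketch.
_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def flatten(value: Any, path: str = "json") -> Iterator[str]:
    """Yield one assignment-style line per node of a parsed JSON value."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            if _IDENTIFIER.fullmatch(key):
                child_path = f"{path}.{key}"
            else:
                child_path = f"{path}[{json.dumps(key)}]"
            yield from flatten(child, child_path)
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # Scalars are re-encoded as JSON so strings keep their quotes.
        yield f"{path} = {json.dumps(value)};"


if __name__ == "__main__":
    document = json.loads('[{"status": "ok"}, {"status": "error"}]')
    for line in flatten(document):
        print(line)
```

Because every line carries its full path, a plain substring match on the output identifies both the value and where it lives in the document.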

Where SIMD Changes the Calculus

Modern CPUs can operate on 32 bytes per instruction with AVX2. simdjson, the C++ library from Daniel Lemire’s group, exploits this by splitting JSON parsing into two phases: a SIMD structural stage that finds all {, }, [, ], :, comma, and " positions using bitmasking, followed by a second stage that walks those positions to extract the data. The result is sustained parsing throughput around 2 to 4 GB/s on real hardware, compared to jq’s typical 50 to 200 MB/s.

The structural stage is particularly relevant for grep-style tools. Finding the byte positions of structural characters is exactly the kind of fixed-pattern byte scan that SIMD instructions are good at, and it is the only pass a search tool needs to make over the raw input. Once you know where all the structural characters are, finding an object that contains a specific key-value pair is a range query over a sorted array, not a tree traversal.
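A scalar Python sketch can show what a structural index contains, even though it has none of the speed. It is not simdjson's algorithm, which computes the same kind of information with bitmasks over wide registers. The function names are hypothetical, and recording both the opening and closing quote of each string is a choice made for this sketch.

```python
from bisect import bisect_left, bisect_right
from typing import List

STRUCTURAL = frozenset(b"{}[]:,")  # byte values of the structural characters
QUOTE = 0x22       # "
BACKSLASH = 0x5C   # \


def structural_positions(data: bytes) -> List[int]:
    """Return sorted byte offsets of structural characters and string quotes.

    Structural-looking bytes inside string literals are skipped, and escaped
    quotes do not end a string.
    """
    positions: List[int] = []
    in_string = False
    escaped = False
    for offset, byte in enumerate(data):
        if in_string:
            if escaped:
                escaped = False          # this byte was escaped; ignore it
            elif byte == BACKSLASH:
                escaped = True
            elif byte == QUOTE:
                in_string = False
                positions.append(offset)  # closing quote
        elif byte == QUOTE:
            in_string = True
            positions.append(offset)      # opening quote
        elif byte in STRUCTURAL:
            positions.append(offset)
    return positions


def positions_in_range(positions: List[int], start: int, end: int) -> List[int]:
    """Range query over the sorted index: offsets with start <= offset <= end."""
    return positions[bisect_left(positions, start):bisect_right(positions, end)]


if __name__ == "__main__":
    sample = b'{"a":"x{y","b":[1,2]}'
    index = structural_positions(sample)
    print(index)
    print(positions_in_range(index, 15, 20))
```

The brace inside the string "x{y" does not appear in the index, which is why a later query can trust the offsets without re-examining string contents.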

Rust’s simd-json crate brings this approach to the Rust ecosystem. jaq, a Rust reimplementation of jq that aims for filter language compatibility, achieves roughly 3x throughput improvement over jq on common benchmarks, in large part because Rust’s ownership model can avoid much of the reference counting overhead that jv carries in jq’s C implementation.

But jaq is still doing the full parse-then-filter pipeline. A grep-style tool can go further by not building a parsed representation at all for documents that don’t match.
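One way to picture skipping the parse for non-matching documents is a cheap byte-level prefilter in front of a real parser. The sketch below is an assumption-laden illustration, not jsongrep's implementation: it assumes newline-delimited records with a level field, and search_errors is a hypothetical name.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def search_errors(
    lines: Iterable[bytes], needle: bytes = b'"error"'
) -> Iterator[Dict[str, Any]]:
    """Yield records whose `level` is "error", parsing only candidate lines."""
    for line in lines:
        # Cheap byte scan first: lines without the needle are never parsed.
        if needle not in line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        # The scan can give false positives (the needle may sit in another
        # field), so the parsed value is still checked properly.
        if isinstance(record, dict) and record.get("level") == "error":
            yield record


if __name__ == "__main__":
    log = [
        b'{"level": "info", "msg": "started"}',
        b'{"level": "error", "msg": "boom"}',
        b'{"level": "info", "msg": "error"}',
    ]
    for hit in search_errors(log):
        print(hit)
```

The prefilter has a known blind spot: JSON allows the same string to be written with escapes such as \u0065, so a raw byte match can miss records that a full parse would find. A real tool has to either normalize escapes or accept that limitation.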

The Architectural Trade-Off

The performance advantage of grep-style JSON tools is real, but it comes with a specific constraint: the query model must be expressible without full tree construction. Queries that work well include:

  • Value containment: find all objects where a specific field equals a specific value
  • Key existence: find all objects that have a specific key
  • Type filtering: find all objects where a field is an array of length greater than N
  • Pattern matching: find all string values matching a regex

Queries that don’t work well:

  • Restructuring: take field A from one level and combine it with field B from another level
  • Aggregation: sum all values of a specific numeric field
  • Recursive descent: find all matching values at any depth in an arbitrary structure

This is the same trade-off that exists between grep and awk, or between awk and Python. More expressive tools have more overhead, and the right choice depends on which side of the query distribution your workload sits on.

For log analysis, most queries are on the grep side. You’re looking for error patterns, specific request IDs, or threshold violations. For API response manipulation in a build pipeline, you’re on the jq side.

The Current Ecosystem

jq remains the standard for good reasons: it’s available everywhere, the filter language has real expressive power, and the man page is actually readable. For transformations that involve reshaping data, nothing in the CLI ecosystem matches it.

dasel (Go) targets multi-format documents and works across JSON, YAML, TOML, and XML with a unified selector syntax. Useful when you’re dealing with multiple config formats in the same pipeline, though its JSON performance isn’t a primary design goal.

yq started as a YAML-focused tool but now handles JSON reasonably well. Like dasel, the multi-format support is the main value proposition rather than raw speed.

fx takes a JavaScript-based approach with an interactive interface, making it excellent for exploration but not for scripted pipelines.

The tools in the jsongrep category occupy a different niche: fast pattern-matching over potentially large JSON inputs, where the query can be expressed as a structural search rather than a transformation. That’s a real workflow gap, and it’s one where the architecture of traditional JSON tools genuinely holds them back.

Choosing the Right Tool

For most interactive use, jq’s speed is not the bottleneck. You’re running it on API responses measured in kilobytes, and the human thinking time between commands dwarfs any parse overhead. jq’s expressiveness is the relevant property there.

The calculus shifts when JSON inputs are large, queries are repeated in tight loops, or you’re processing streams in real time. A tool that scans rather than parses can outperform jq by a significant factor in those scenarios, not because jq is poorly implemented, but because it’s doing more work than the query requires.

The HN discussion around jsongrep follows a familiar pattern for this category of tool: the comments mix genuine interest with references to jaq, simdjson-based tools, and the perennial observation that for truly large JSON, the format choice itself might be the problem. All of those points are fair. But the grep mental model for structured data is legitimate on its own terms, and tooling that takes it seriously is a useful addition to the ecosystem.

jq will remain the default. Grep-style JSON tools will remain the right choice when the input is large and the query is a search rather than a transformation. Knowing which situation you’re in is the only thing that matters.
