Scanning a byte stream for special characters is one of those problems that appears deceptively simple and turns out to touch almost every layer of the CPU pipeline. A JSON parser needs to find quote characters and backslashes. A URL validator hunts for percent signs and reserved delimiters. An HTTP/1.1 request splitter scans for carriage returns and colons. The naive solution is a loop with a branch per byte, which works correctly but leaves a great deal of hardware idle.
Daniel Lemire’s recent post on character matching on ARM processors benchmarks several approaches head to head. What makes it interesting beyond the numbers is that ARM’s instruction set forces a different problem decomposition than x86 does. The techniques that work on ARM are not simply ports of the x86 approach; some of them are structurally different, and a few are genuinely more general.
What the Scalar Baseline Actually Costs
Before reaching for SIMD, it is worth understanding why the scalar loop is slow. The conventional approach is something like:
const char *find_special(const char *p, const char *end) {
while (p < end) {
uint8_t c = (uint8_t)*p;
if (c == '"' || c == '\\' || c < 0x20) return p;
p++;
}
return end;
}
On modern superscalar processors, this loop is bottlenecked by branch mispredictions when special characters are sparse, and by serial dependency chains that prevent the CPU from doing more than one iteration of useful work per cycle. For inputs where special characters appear rarely (which is the common case in well-formed ASCII data), the loop runs for many iterations between hits, but the CPU has no way to look ahead across those iterations independently.
The branch predictor helps when the pattern is regular, but the nature of the problem is that special characters appear at irregular intervals determined by the content. You get the worst of both worlds: predictable loops that the CPU learns to optimize, interrupted by unpredictable special cases.
SWAR: Parallelism Before SIMD
The first technique that genuinely helps is SWAR, which stands for SIMD Within A Register. The insight is that an ordinary 64-bit general-purpose register already contains eight bytes, and bitwise arithmetic can test all eight simultaneously without any SIMD instructions at all.
Detecting whether any byte in a 64-bit word equals a target value works like this:
// Returns nonzero if any byte in 'word' equals 'target'
uint64_t has_byte_equal(uint64_t word, uint8_t target) {
uint64_t t = target * UINT64_C(0x0101010101010101);
uint64_t x = word ^ t; // Zero out matching bytes
// Detect zero bytes using the classic has_zero_byte trick
return (x - UINT64_C(0x0101010101010101)) & ~x & UINT64_C(0x8080808080808080);
}
The constant 0x0101010101010101 replicates the target byte across all eight positions. XOR zeros out matching bytes. The subsequent arithmetic exploits the fact that subtracting 1 from a zero byte causes borrow to propagate into the high bit, while the ~x mask ensures we only flag bytes that were originally zero (not bytes that happened to have their high bit set after subtraction).
This technique has been in glibc’s memchr and strlen for decades, and it appears in CPython’s string scanning code. It processes eight bytes per iteration with no SIMD instruction set required, which matters on systems where you cannot assume NEON availability or where the data is too short to amortize SIMD setup costs.
For detecting multiple characters, you OR together the results of multiple SWAR passes. Detecting '"', '\\', and bytes below 0x20 requires three passes, but all three can be computed on the same loaded word and combined with bitwise OR before checking the result.
NEON: 16 Bytes at a Time
ARM NEON registers are 128 bits wide, holding 16 bytes per register. The direct NEON equivalent of the scalar comparison is vceqq_u8, which compares two Q-registers element-by-element and produces an all-ones byte where elements match and all-zeros where they do not:
#include <arm_neon.h>
const char *find_quote_neon(const char *p, const char *end) {
uint8x16_t quote_vec = vdupq_n_u8('"');
uint8x16_t slash_vec = vdupq_n_u8('\\');
uint8x16_t thresh_vec = vdupq_n_u8(0x20);
while (p + 16 <= end) {
uint8x16_t chunk = vld1q_u8((const uint8_t *)p);
uint8x16_t eq_quote = vceqq_u8(chunk, quote_vec);
uint8x16_t eq_slash = vceqq_u8(chunk, slash_vec);
uint8x16_t lt_thresh = vcltq_u8(chunk, thresh_vec);
uint8x16_t any = vorrq_u8(vorrq_u8(eq_quote, eq_slash), lt_thresh);
if (vmaxvq_u8(any)) return p + /* position from bitmask */;
p += 16;
}
// scalar tail
return find_special(p, end);
}
The key reduction instruction here is vmaxvq_u8, which computes the horizontal maximum across all 16 lanes and returns a scalar. If any lane matched, the result is 0xFF; otherwise it is 0. This is an O(1) check: the CPU does not branch on individual lanes, it computes a single value and branches once per 16-byte block.
Finding the exact position within the matching block requires more work. One approach is vshrn_n_u16, which narrows 16-bit lanes to 8-bit lanes, effectively collapsing the 16-byte mask into an 8-byte value where each bit corresponds to two original bytes. Another approach, common in simdjson, is to convert the NEON comparison result to a bitmask using a sequence of shifts and moves that produces a 16-bit integer with one bit per input byte. The exact sequence depends on whether you need the position immediately or can defer it.
The Nibble Trick: Matching Arbitrary Character Sets
Matching against a small, fixed set of characters (like the three categories above) works fine with a few vceqq_u8 passes. But what if the set is larger or more complex? Adding more comparisons scales linearly with the number of characters to match, which gets expensive.
The NEON table lookup instruction vqtbl1q_u8 (or vtbl1_u8 for 64-bit variants) enables a different approach. vqtbl1q_u8 takes a 16-byte lookup table and a 16-byte index vector; for each index byte, it returns the corresponding table entry (or zero if the index is out of range). This is a parallel lookup across 16 bytes simultaneously.
The nibble technique uses this to check set membership for arbitrary character sets:
// Check if any byte in 'input' belongs to the target character set
// 'low_table' and 'high_table' are 16-byte lookup tables keyed on nibbles
uint8x16_t match_character_set(uint8x16_t input,
uint8x16_t low_table,
uint8x16_t high_table) {
// Extract low 4 bits of each byte (indices into low_table)
uint8x16_t low_nibbles = vandq_u8(input, vdupq_n_u8(0x0F));
// Extract high 4 bits (shift right by 4, giving indices into high_table)
uint8x16_t high_nibbles = vshrq_n_u8(input, 4);
// Table lookup: get bitmask contributions from each nibble
uint8x16_t low_result = vqtbl1q_u8(low_table, low_nibbles);
uint8x16_t high_result = vqtbl1q_u8(high_table, high_nibbles);
// A byte matches if BOTH nibble lookups agree (AND of bitmask bits)
return vandq_u8(low_result, high_result);
}
The tables encode the set membership as bitmasks: for each possible low nibble (0-15) or high nibble (0-15), the table entry records which bit patterns are valid. A byte is in the target set if the bitmask bits from both its low and high nibble lookups share a set bit. Constructing the tables requires offline analysis of the character set, but that is a one-time cost.
This technique, used heavily in simdjson and simdutf, classifies 16 bytes against an arbitrary character set in roughly four to six instructions, regardless of how many characters are in the set. It is one of the situations where ARM’s TBL instruction provides a genuine structural advantage: x86 has _mm_shuffle_epi8 (PSHUFB in SSSE3) which does the same thing, but the NEON version has a slightly more convenient interface for larger tables (NEON’s vqtbl4q_u8 supports 64-byte tables, covering the full 8-bit range).
What x86 Does That ARM Cannot (Directly)
SSE4.2 introduced PCMPISTRI and PCMPISTRM, string comparison instructions that encode complex matching logic directly in the opcode. A single _mm_cmpistri call with the right immediate can find the first byte in a 16-byte chunk that belongs to a specified character set, or find the first mismatch between two strings. These instructions do the work of several NEON instructions combined, and on Nehalem and later Intel CPUs they execute with low latency.
ARM has no equivalent. The NEON instruction set was designed for multimedia processing rather than string parsing, and the closest instruction to PCMPISTRI is absent. This asymmetry is why the NEON nibble technique matters: it is not just a port of the x86 approach, it is the best available substitute.
In practice, the gap is narrower than it sounds. Modern x86 cores do not execute PCMPISTRI especially fast; on recent Intel microarchitectures the instruction takes 3-7 cycles depending on the operation. A tight sequence of NEON instructions, pipelined effectively, can match or beat that on Apple Silicon and Cortex-X series cores. The nibble approach also generalizes to character sets too large for PCMPISTRI’s 16-byte operand limit.
SVE2 Changes the Equation
The Scalable Vector Extension 2 (SVE2), available on Cortex-X2, X3, and expected in future Apple Silicon generations, introduces svmatch. This instruction takes two SVE vectors and returns a predicate mask where the first vector’s elements match any element of the second vector. It is, in effect, a SIMD set membership instruction where the set itself lives in a vector register.
// SVE2 character matching (conceptual, actual API differs)
svbool_t find_special_sve2(const uint8_t *p, uint64_t len) {
svuint8_t special = svdupq_u8('"', '\\', '\n', '\r',
0x00, 0x01, 0x02, 0x03, ...);
svuint8_t chunk = svld1_u8(svptrue_b8(), p);
return svmatch_u8(svptrue_b8(), chunk, special);
}
The vector width is hardware-defined: 128 bits on Cortex-X2, potentially 256 or 512 bits on future designs. The same source code runs on all SVE2 implementations, scaling with the available vector width automatically. This is the architectural response to PCMPISTRI, and it is more flexible: the match set is a full vector register rather than a fixed 16-byte operand, and the scalable width means wider hardware gets proportionally more throughput for free.
For the near term, most deployed ARM hardware does not have SVE2, so the NEON approaches remain the practical choice. But simdutf and simdjson already maintain SVE2 code paths behind compile-time guards, and those paths show meaningful gains on the hardware that supports them.
Applying This in Practice
For a concrete use case, consider implementing a fast JSON string scanner in Rust. The core loop needs to find '"', '\', and bytes below 32:
#[cfg(target_arch = "aarch64")]
unsafe fn find_escape_aarch64(s: &[u8]) -> Option<usize> {
use std::arch::aarch64::*;
let quote = vdupq_n_u8(b'"');
let slash = vdupq_n_u8(b'\\');
let ctrl = vdupq_n_u8(0x20u8); // threshold for control chars
let mut i = 0;
while i + 16 <= s.len() {
let chunk = vld1q_u8(s.as_ptr().add(i));
let m = vorrq_u8(
vorrq_u8(vceqq_u8(chunk, quote), vceqq_u8(chunk, slash)),
vcltq_u8(chunk, ctrl),
);
if vmaxvq_u8(m) != 0 {
// Scan the 16 bytes for the first match position
let mask = vreinterpretq_u64_u8(m);
let lo = vgetq_lane_u64(mask, 0);
let hi = vgetq_lane_u64(mask, 1);
let pos = if lo != 0 {
lo.trailing_zeros() / 8
} else {
8 + hi.trailing_zeros() / 8
};
return Some(i + pos as usize);
}
i += 16;
}
s[i..].iter().position(|&b| b == b'"' || b == b'\\' || b < 0x20)
.map(|p| i + p)
}
The trailing_zeros() / 8 trick works because each matching byte sets all 8 of its bits, so trailing zero bits come in groups of eight, each group corresponding to one byte position. This is cheaper than constructing an explicit bitmask through NEON shuffle operations.
Why This Matters Beyond JSON
Character matching is a fundamental primitive. HTTP header parsing, CSV tokenization, HTML attribute scanning, URL normalization, UTF-8 validation (which simdutf performs at 10+ GB/s on ARM), base64 decoding: all of these reduce to efficiently finding bytes that require special treatment within a larger byte stream. The scalar-to-SWAR-to-NEON-to-SVE2 progression is not just an academic exercise; it is the practical path that high-throughput parsers follow.
Lemire’s benchmarks make the case concretely with measured cycles per byte across implementations. The general pattern holds: each level of the hierarchy yields meaningful gains over the previous one, and the best NEON implementations come within a factor of two of theoretical memory bandwidth limits on modern ARM cores.
For most application code, none of this matters. String operations are not the bottleneck in a Discord bot handling a few thousand messages per second. But in infrastructure code, parsers, network protocol implementations, and anywhere that byte-stream processing sits on the critical path, these techniques are the difference between needing three servers and needing one.