The Byte Arithmetic Behind Unicode String Iteration

Source: isocpp

If you’ve ever processed text character by character, you’ve probably assumed that one index step equals one character. That assumption holds for ASCII, but Unicode uses variable-width encoding in both UTF-8 and UTF-16, which means a simple i++ will land you mid-sequence on any multibyte input.

Giovanni Dicanio’s article on isocpp.org, published last December, walks through the mechanics of advancing to the next code point in both encodings. It’s a concrete reference for anyone working with text at the byte level.

How UTF-8 Encodes Width

UTF-8 stores code points using 1 to 4 bytes. The leading byte determines the sequence length: a byte starting with 0 is a standalone ASCII character, while leading bytes starting with 110, 1110, or 11110 indicate sequences of 2, 3, or 4 bytes respectively. Continuation bytes always begin with 10.

To advance past a code point, you read the leading byte and skip its continuation bytes:

const char* next_utf8(const char* p) {
    unsigned char c = static_cast<unsigned char>(*p++);
    if (c >= 0xF0) p += 3;      // 4-byte sequence
    else if (c >= 0xE0) p += 2; // 3-byte sequence
    else if (c >= 0xC0) p += 1; // 2-byte sequence
    return p;
}

This is straightforward once the bit patterns are clear, but it’s the kind of logic that needs to be correct the first time.

How UTF-16 Differs

UTF-16 stores most characters as a single 16-bit code unit. Characters outside the Basic Multilingual Plane, above U+FFFF, use surrogate pairs: a high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF). To advance, you check whether the current unit is a high surrogate and skip two units if so, one otherwise.
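That check can be sketched as a UTF-16 counterpart to the UTF-8 function above. (The name `next_utf16` is mine, not from the article; like the UTF-8 version, this assumes well-formed input.)

```cpp
#include <cstdint>

// Advance past one code point in a UTF-16 sequence: a high surrogate
// (0xD800-0xDBFF) starts a two-unit pair, so skip the low surrogate too;
// anything else is a single code unit.
const char16_t* next_utf16(const char16_t* p) {
    char16_t c = *p++;
    if (c >= 0xD800 && c <= 0xDBFF) ++p; // surrogate pair: skip the low half
    return p;
}
```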

Windows APIs use UTF-16 throughout: file operations, clipboard handling, shell integration all come in WCHAR. Most network-facing code and everything JSON-adjacent uses UTF-8. The mismatch between the two is a persistent source of corruption when text crosses that boundary without proper conversion.
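The mismatch is easy to see by encoding the same three code points both ways (escapes used here so the source file's own encoding doesn't matter):

```cpp
#include <string>

// "hé😀" in both encodings: same three code points, different unit counts.
const std::string    utf8  = "h\xC3\xA9\xF0\x9F\x98\x80"; // 1 + 2 + 4 = 7 bytes
const std::u16string utf16 = u"h\u00E9\U0001F600";         // 1 + 1 + 2 = 4 units
// An index computed on one side of the boundary is meaningless on the other.
```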

Where This Shows Up in Practice

Discord payloads arrive as UTF-8 JSON. If a bot is slicing message content by position, searching for substrings with a byte-level loop, or computing string lengths by counting code units, it will get wrong answers for any input that includes characters outside ASCII. Emoji in usernames, non-Latin scripts, characters above U+FFFF in message content, all of these expose the assumption.
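Counting code points rather than bytes uses the same leading-byte arithmetic as the advance function above. A minimal sketch (the helper name `utf8_length` is illustrative; well-formed input is assumed):

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by stepping over each sequence,
// using the leading byte to determine its width.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if      (c >= 0xF0) i += 4; // 4-byte sequence
        else if (c >= 0xE0) i += 3; // 3-byte sequence
        else if (c >= 0xC0) i += 2; // 2-byte sequence
        else                i += 1; // ASCII
    }
    return count;
}
```

For "h\xC3\xA9\xF0\x9F\x98\x80" ("hé😀"), `.size()` reports 7 while `utf8_length` reports 3, which is exactly the gap that produces wrong answers in a byte-indexing bot.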

For production code, a library like ICU or the lighter-weight simdutf handles this correctly and efficiently. simdutf in particular is used in Node.js and performs UTF-8 to UTF-16 conversion with SIMD acceleration, which matters when processing large text volumes.

Understanding the encoding mechanics still matters even when using a library. When something produces garbled output, you need a mental model of what went wrong. Dicanio’s article provides that grounding clearly, without treating the encoding rules as an implementation detail someone else should care about.
