The Byte Arithmetic Behind Unicode String Iteration

Source: isocpp

If you’ve ever processed text character by character, you’ve probably assumed that one index step equals one character. That assumption holds for ASCII, but Unicode uses variable-width encoding in both UTF-8 and UTF-16, which means a simple i++ will land you mid-sequence on any multibyte input.

Giovanni Dicanio’s article on isocpp.org, published last December, walks through the mechanics of advancing to the next code point in both encodings. It’s a concrete reference for anyone working with text at the byte level.

How UTF-8 Encodes Width

UTF-8 stores code points using 1 to 4 bytes. The leading byte determines the sequence length: a byte starting with 0 is a standalone ASCII character, while leading bytes starting with 110, 1110, or 11110 indicate sequences of 2, 3, or 4 bytes respectively. Continuation bytes always begin with 10.

To advance past a code point, you read the leading byte and skip its continuation bytes:

const char* next_utf8(const char* p) {
    unsigned char c = static_cast<unsigned char>(*p++);
    if (c >= 0xF0) p += 3;      // 4-byte sequence
    else if (c >= 0xE0) p += 2; // 3-byte sequence
    else if (c >= 0xC0) p += 1; // 2-byte sequence
    return p;
}

This is straightforward once the bit patterns are clear, but it’s the kind of logic that needs to be correct the first time.

How UTF-16 Differs

UTF-16 stores most characters as a single 16-bit code unit. Characters outside the Basic Multilingual Plane, above U+FFFF, use surrogate pairs: a high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF). To advance, you check whether the current unit is a high surrogate and skip two units if so, one otherwise.
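That check can be sketched as a UTF-16 counterpart to the UTF-8 function above. (The name `next_utf16` is mine, not from the article; like the UTF-8 version, this assumes well-formed input.)

```cpp
#include <cstdint>

// Advance past one code point in a UTF-16 sequence: a high surrogate
// (0xD800-0xDBFF) starts a two-unit pair, so skip the low surrogate too;
// anything else is a single code unit.
const char16_t* next_utf16(const char16_t* p) {
    char16_t c = *p++;
    if (c >= 0xD800 && c <= 0xDBFF) ++p; // surrogate pair: skip the low half
    return p;
}
```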

Windows APIs use UTF-16 throughout: file operations, clipboard handling, shell integration all come in WCHAR. Most network-facing code and everything JSON-adjacent uses UTF-8. The mismatch between the two is a persistent source of corruption when text crosses that boundary without proper conversion.
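The mismatch is easy to see by encoding the same three code points both ways (escapes used here so the source file's own encoding doesn't matter):

```cpp
#include <string>

// "hé😀" in both encodings: same three code points, different unit counts.
const std::string    utf8  = "h\xC3\xA9\xF0\x9F\x98\x80"; // 1 + 2 + 4 = 7 bytes
const std::u16string utf16 = u"h\u00E9\U0001F600";         // 1 + 1 + 2 = 4 units
// An index computed on one side of the boundary is meaningless on the other.
```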

Where This Shows Up in Practice

Discord payloads arrive as UTF-8 JSON. If a bot is slicing message content by position, searching for substrings with a byte-level loop, or computing string lengths by counting code units, it will get wrong answers for any input that includes characters outside ASCII. Emoji in usernames, non-Latin scripts, characters above U+FFFF in message content, all of these expose the assumption.
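Counting code points rather than bytes uses the same leading-byte arithmetic as the advance function above. A minimal sketch (the helper name `utf8_length` is illustrative; well-formed input is assumed):

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by stepping over each sequence,
// using the leading byte to determine its width.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if      (c >= 0xF0) i += 4; // 4-byte sequence
        else if (c >= 0xE0) i += 3; // 3-byte sequence
        else if (c >= 0xC0) i += 2; // 2-byte sequence
        else                i += 1; // ASCII
    }
    return count;
}
```

For "h\xC3\xA9\xF0\x9F\x98\x80" ("hé😀"), `.size()` reports 7 while `utf8_length` reports 3, which is exactly the gap that produces wrong answers in a byte-indexing bot.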

For production code, a library like ICU or the lighter-weight simdutf handles this correctly and efficiently. simdutf in particular is used in Node.js and performs UTF-8 to UTF-16 conversion with SIMD acceleration, which matters when processing large text volumes.

Understanding the encoding mechanics still matters even when using a library. When something produces garbled output, you need a mental model of what went wrong. Dicanio’s article provides that grounding clearly, without treating the encoding rules as an implementation detail someone else should care about.
