If you’ve ever processed text character by character, you’ve probably assumed that one index step equals one character. That assumption holds for ASCII, but Unicode uses variable-width encoding in both UTF-8 and UTF-16, which means a simple i++ will land you mid-sequence on any multibyte input.
Giovanni Dicanio’s article on isocpp.org, published last December, walks through the mechanics of advancing to the next code point in both encodings. It’s a concrete reference for anyone working with text at the byte level.
How UTF-8 Encodes Width
UTF-8 stores code points using 1 to 4 bytes. The leading byte determines the sequence length: a byte starting with 0 is a standalone ASCII character, while leading bytes starting with 110, 1110, or 11110 indicate sequences of 2, 3, or 4 bytes respectively. Continuation bytes always begin with 10.
To advance past a code point, you read the leading byte and skip its continuation bytes:
```cpp
const char* next_utf8(const char* p) {
    // Assumes well-formed UTF-8: read the leading byte, then skip
    // its continuation bytes (which all match 10xxxxxx).
    unsigned char c = static_cast<unsigned char>(*p++);
    if (c >= 0xF0)      p += 3; // 11110xxx: 4-byte sequence
    else if (c >= 0xE0) p += 2; // 1110xxxx: 3-byte sequence
    else if (c >= 0xC0) p += 1; // 110xxxxx: 2-byte sequence
    return p;                   // 0xxxxxxx: ASCII, already past it
}
```
This is straightforward once the bit patterns are clear, but it’s the kind of logic that needs to be correct the first time.
How UTF-16 Differs
UTF-16 stores most characters as a single 16-bit code unit. Characters outside the Basic Multilingual Plane, above U+FFFF, use surrogate pairs: a high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF). To advance, you check whether the current unit is a high surrogate and skip two units if so, one otherwise.
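That check can be sketched in a few lines. This is a minimal version assuming well-formed UTF-16 (every high surrogate is followed by a valid low surrogate); `next_utf16` is a name chosen here for symmetry with the UTF-8 snippet, not from the article:

```cpp
#include <cassert>

// Advance past one code point in well-formed UTF-16.
const char16_t* next_utf16(const char16_t* p) {
    // A high surrogate (0xD800-0xDBFF) starts a surrogate pair,
    // so the code point occupies two 16-bit units.
    if (*p >= 0xD800 && *p <= 0xDBFF) return p + 2;
    return p + 1; // BMP character: a single unit
}
```

For example, U+1F600 (an emoji) encodes as the pair 0xD83D 0xDE00, so advancing from its high surrogate skips both units.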
Windows APIs use UTF-16 throughout: file operations, clipboard handling, and shell integration all traffic in WCHAR. Most network-facing code and everything JSON-adjacent uses UTF-8. The mismatch between the two is a persistent source of corruption when text crosses that boundary without proper conversion.
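To make that boundary concrete, here is a sketch of the per-code-point conversion the encodings imply: decode one UTF-8 sequence to a code point, then re-encode it as UTF-16 units. Both helpers assume well-formed input and are illustrative names, not the article's code (production code would use the Windows MultiByteToWideChar API or a library):

```cpp
#include <cassert>
#include <cstdint>

// Decode one well-formed UTF-8 sequence starting at p into a code point.
uint32_t decode_utf8(const unsigned char* p) {
    unsigned char c = *p;
    if (c < 0x80) return c;                                   // 1 byte (ASCII)
    if (c < 0xE0) return ((c & 0x1F) << 6)  | (p[1] & 0x3F);  // 2 bytes
    if (c < 0xF0) return ((c & 0x0F) << 12) | ((p[1] & 0x3F) << 6)
                       | (p[2] & 0x3F);                       // 3 bytes
    return ((c & 0x07) << 18) | ((p[1] & 0x3F) << 12)
         | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);              // 4 bytes
}

// Encode a code point as UTF-16; returns the number of units written.
int encode_utf16(uint32_t cp, char16_t out[2]) {
    if (cp <= 0xFFFF) {          // BMP: one unit
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    cp -= 0x10000;               // supplementary plane: surrogate pair
    out[0] = static_cast<char16_t>(0xD800 | (cp >> 10));   // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 | (cp & 0x3FF)); // low surrogate
    return 2;
}
```

Running U+1F600 (UTF-8 bytes F0 9F 98 80) through both steps yields the surrogate pair 0xD83D 0xDE00, which is exactly the two-unit case the UTF-16 advance logic has to detect.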
Where This Shows Up in Practice
Discord payloads arrive as UTF-8 JSON. If a bot is slicing message content by position, searching for substrings with a byte-level loop, or computing string lengths by counting code units, it will get wrong answers for any input that includes characters outside ASCII. Emoji in usernames, non-Latin scripts, characters above U+FFFF in message content, all of these expose the assumption.
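The length-counting failure is easy to demonstrate. A minimal sketch, assuming well-formed UTF-8 (`count_code_points` is a name chosen here): counting bytes via `size()` and counting code points by skipping continuation bytes give different answers as soon as a non-ASCII character appears.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by counting only the bytes
// that are NOT continuation bytes (continuation bytes match 10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

For "h\xC3\xA9llo" ("héllo", with é as the two-byte sequence C3 A9), `size()` reports 6 while `count_code_points` reports 5; any position math done on the byte count is off by one from that point on.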
For production code, a library like ICU or the lighter-weight simdutf handles this correctly and efficiently. simdutf in particular is used in Node.js and performs UTF-8 to UTF-16 conversion with SIMD acceleration, which matters when processing large text volumes.
Understanding the encoding mechanics still matters even when using a library. When something produces garbled output, you need a mental model of what went wrong. Dicanio’s article provides that grounding clearly, without treating the encoding rules as an implementation detail someone else should care about.