
When i++ Stops Being Enough: Iterating Through Unicode Code Points

Source: isocpp

The assumption embedded in i++ is that every character in a string occupies exactly one unit of storage. For ASCII, this holds. Every character maps to a single byte, every byte maps to a character, and pointer arithmetic is a direct translation of intent.

Unicode breaks this assumption in two different ways depending on which encoding you are using. Giovanni Dicanio’s article on isocpp.org, published late December 2025, walks through both cases with the kind of precision that is worth revisiting.

UTF-8: Reading the Leading Byte

UTF-8 encodes code points using a variable number of bytes, one through four, depending on the code point’s value. The leading byte of each sequence tells you how many bytes follow.

0xxxxxxx                             — 1 byte  (U+0000 to U+007F)
110xxxxx 10xxxxxx                    — 2 bytes (U+0080 to U+07FF)
1110xxxx 10xxxxxx 10xxxxxx           — 3 bytes (U+0800 to U+FFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  — 4 bytes (U+10000 to U+10FFFF)

Continuation bytes all start with 10xxxxxx. This makes forward iteration straightforward: read the leading byte, determine the sequence length, skip that many bytes. You never need to backtrack to figure out where you are.

// Advance to the next code point, assuming p points at the lead
// byte of a valid UTF-8 sequence (no validation is performed).
const char* next_utf8(const char* p) {
    unsigned char c = static_cast<unsigned char>(*p);
    if      (c < 0x80)  return p + 1;  // 0xxxxxxx: 1-byte sequence
    else if (c < 0xE0)  return p + 2;  // 110xxxxx: 2-byte sequence
    else if (c < 0xF0)  return p + 3;  // 1110xxxx: 3-byte sequence
    else                return p + 4;  // 11110xxx: 4-byte sequence
}

Backward iteration is messier, but forward is clean because of the self-synchronizing property of those continuation bytes.

UTF-16: Watching for Surrogates

UTF-16 uses 16-bit code units. Characters in the Basic Multilingual Plane (U+0000 to U+FFFF) fit in one unit. Characters beyond that range require two units, called a surrogate pair. The high surrogate falls in 0xD800 to 0xDBFF; if you see one, the next unit is the low surrogate (0xDC00 to 0xDFFF), and together they encode a single code point.

// Advance to the next code point, assuming p points into a valid
// UTF-16 sequence at a code unit boundary.
const char16_t* next_utf16(const char16_t* p) {
    if (*p >= 0xD800 && *p <= 0xDBFF) {
        return p + 2;  // high surrogate: skip the full pair
    }
    return p + 1;      // BMP character: single code unit
}

The branching is simpler than UTF-8, but surrogates are the detail that gets forgotten until emoji appear in user input and substring logic starts returning garbled text.

Why It Matters Beyond C++

If you are building anything that processes text from real users, you will encounter characters outside ASCII. Discord messages carry emoji, CJK characters, combining diacritics; all of it encoded in whatever the client sends. The bot runtime handles the transport layer, but manual string slicing, truncating to fit a character limit, or parsing structured tokens all require knowing what “one character forward” actually means in the encoding you are working with.

The mechanics here are encoding-level, not language-level. The same reasoning applies in Rust with str::chars(), in Python with str iteration, or in any context where you hold something resembling a pointer into a string and move it forward. The encoding model matters, and ignoring it tends to surface in the most user-visible ways possible.
