The Variable-Length Problem: What Unicode Iteration Costs in Practice

The C++ standard library is quiet about a distinction that matters a great deal in practice: std::string::size() returns the number of char elements (bytes, for UTF-8 text), not the number of Unicode code points. Iterating a std::string with a range-based for loop likewise gives you bytes. None of this is wrong, but it is a footgun for code that needs to reason about text at the character level.
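To make the gap concrete, here is a minimal sketch that counts code points by skipping continuation bytes. The function name count_code_points is hypothetical, not part of any library, and the sketch assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is assumed to
// hold well-formed UTF-8. Every byte that is NOT a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "caf\xC3\xA9" (the word "café" with é encoded as two bytes), size() reports 5 while this helper reports 4.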

Giovanni Dicanio covered this problem in detail in late December 2025, walking through the mechanics of advancing past a single code point in both UTF-8 and UTF-16. It is worth a look as a technical reference, and worth revisiting here because the practical consequences tend to get underestimated.

What the Encodings Require

UTF-8 uses variable-length sequences of one to four bytes. A leading byte in the range 0x00-0x7F is a complete one-byte sequence. Larger leading bytes encode the sequence length in their top bits: 110xxxxx for two bytes, 1110xxxx for three, 11110xxx for four. Continuation bytes all have the form 10xxxxxx. So moving to the next code point means reading the leading byte and skipping forward by the length it announces.
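The rule above can be sketched as follows. The names utf8_sequence_length and utf8_next are hypothetical, and this is not taken from Dicanio's article. The sketch only inspects the leading byte: it does not verify that the following bytes are real continuation bytes, and it does not reject overlong or out-of-range encodings, so treat it as an illustration rather than a validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// `lead`, or 0 if `lead` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: single byte
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}

// Hypothetical helper: index of the code point that follows the one at `pos`.
// On a bad leading byte or a truncated sequence it steps a single byte, so
// the caller always makes progress.
std::size_t utf8_next(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    const std::size_t len =
        utf8_sequence_length(static_cast<unsigned char>(s[pos]));
    if (len == 0 || len > s.size() - pos) return pos + 1;
    return pos + len;
}
```

Stepping one byte on malformed input is only one possible policy. Real code may prefer to report an error or substitute U+FFFD.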

UTF-16 is conceptually simpler but has its own wrinkle. Most code points fit in a single 16-bit unit. Code points above U+FFFF use surrogate pairs: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF). If the current unit is a high surrogate, advance by two; otherwise, advance by one.
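A minimal sketch of the UTF-16 rule, using hypothetical helper names (is_high_surrogate, is_low_surrogate, utf16_next). It adds one check the rule as stated leaves out: it advances by two only when the high surrogate is actually followed by a low surrogate, so an unpaired surrogate is stepped over as a single unit.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for classifying UTF-16 code units.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: index of the code point that follows the one at `pos`.
std::size_t utf16_next(const std::u16string& s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    const bool is_pair = is_high_surrogate(s[pos]) &&
                         pos + 1 < s.size() &&
                         is_low_surrogate(s[pos + 1]);
    return pos + (is_pair ? 2 : 1);
}
```

On Windows the same logic would apply to std::wstring, where wchar_t is a 16-bit UTF-16 code unit. This sketch sticks to std::u16string to stay portable.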

Both encodings are self-describing: the first unit of a sequence tells you how long the sequence is. That is why forward iteration over well-formed text is tractable without any lookahead.

Where This Goes Wrong

The bugs tend to be invisible until they are not. Slicing a UTF-8 string to fit inside a byte limit without checking whether the cut falls in the middle of a multi-byte sequence produces malformed output. Truncating a UTF-16 string between the two halves of a surrogate pair leaves a lone high surrogate and corrupts the last character. Measuring string length with size() and using that for display width produces incorrect UI as soon as the text contains characters outside ASCII.
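The first of these bugs has a small fix: before cutting, back up until the cut lands on a leading byte. Below is a hedged sketch. The name truncate_utf8 is hypothetical, and the function assumes its input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest prefix of `utf8` that fits in
// `max_bytes` and does not end inside a multi-byte sequence.
// Assumes `utf8` is well-formed UTF-8.
std::string truncate_utf8(const std::string& utf8, std::size_t max_bytes) {
    if (utf8.size() <= max_bytes) return utf8;
    std::size_t cut = max_bytes;
    // If the first excluded byte is a continuation byte (10xxxxxx), the cut
    // falls inside a sequence: back up until it sits on a leading byte.
    while (cut > 0 &&
           (static_cast<unsigned char>(utf8[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    return utf8.substr(0, cut);
}
```

With the five-byte input "ab\xE2\x82\xAC" ("ab" followed by the euro sign) and a limit of four bytes, a plain substr(0, 4) would keep two bytes of the three-byte euro sign, while this sketch returns "ab".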

For anything involving user input in 2026, this is not an edge case. Most emoji live in the supplementary planes (above U+FFFF), which means UTF-16 surrogate pairs and four-byte UTF-8 sequences are routine.

The Library Situation

The standard library does not give you code point iteration out of the box. The C++23 std::ranges additions are moving things forward, and std::u8string (added in C++20 on top of char8_t) gives UTF-8 text a distinct type, but the standard library still offers no complete code point traversal facility.

In practice, the options are ICU for anything serious, utfcpp for a lightweight header-only approach, or platform APIs (which on Windows speak UTF-16 everywhere). None of these is painful to integrate, but you do have to choose one rather than assuming the standard library will handle it.

Dicanio’s article is a solid companion for understanding what these libraries are abstracting. The mechanics are not complicated, but they are easy to skip over until something breaks.