Every C++ developer who has spent time on Windows has written the same conversion boilerplate at least once. You need a std::string from a std::wstring, or the other way around, and you reach for WideCharToMultiByte or MultiByteToWideChar. Giovanni Dicanio's article on isocpp.org walks through the mechanics of doing this correctly. What it gestures at but does not fully unpack is how many ways the naive version of this code fails silently, and why Windows ended up in this position to begin with.
How Windows Ended Up With UTF-16
When Microsoft designed Windows NT in the early 1990s, the Unicode Consortium had made a promise that all of human writing would fit in 65,536 code points. A 16-bit character type was the natural choice. Windows NT 3.1 shipped in 1993 with WCHAR as the native string unit, and the entire Win32 API was built around it.
UTF-8 existed. Rob Pike and Ken Thompson designed it in 1992, and it would have been a reasonable choice. But Windows had already committed, and UTF-8’s variable-width encoding was seen as a complication for an era of fixed-width simplicity.
The Unicode Consortium’s promise turned out to be wrong. By the time Unicode 2.0 shipped in 1996, the encoding had been extended to over a million code points, requiring surrogate pairs in UTF-16. Windows 2000 upgraded from UCS-2 (the original fixed-width 16-bit encoding) to full UTF-16, which means wchar_t on Windows is now a variable-width encoding, just one with 16-bit code units instead of 8-bit ones. The complication Microsoft tried to avoid with UTF-8 arrived anyway, just in a different form.
On Linux and macOS, wchar_t is 32 bits, which means it holds UTF-32 and is genuinely fixed-width. On Windows, it is 16 bits. This difference alone breaks any cross-platform code that assumes sizeof(wchar_t) tells you something useful about string behavior.
The Two-Pass Pattern
The canonical way to convert between UTF-16 and UTF-8 on Windows uses MultiByteToWideChar and WideCharToMultiByte from <windows.h>. Both functions follow the same two-pass pattern: call once with a null output buffer to get the required size, allocate, call again to fill the buffer.
#include &lt;windows.h&gt;
#include &lt;string&gt;
#include &lt;string_view&gt;
#include &lt;system_error&gt;

// UTF-8 (std::string) to UTF-16 (std::wstring)
std::wstring utf8_to_utf16(std::string_view utf8) {
    if (utf8.empty()) return {};
    int required = MultiByteToWideChar(
        CP_UTF8,
        MB_ERR_INVALID_CHARS,
        utf8.data(),
        static_cast&lt;int&gt;(utf8.size()),
        nullptr,
        0
    );
    if (required == 0) {
        throw std::system_error(
            static_cast&lt;int&gt;(GetLastError()),
            std::system_category()
        );
    }
    std::wstring result(required, L'\0');
    int written = MultiByteToWideChar(
        CP_UTF8,
        MB_ERR_INVALID_CHARS,
        utf8.data(),
        static_cast&lt;int&gt;(utf8.size()),
        result.data(),
        required
    );
    if (written == 0) {
        throw std::system_error(
            static_cast&lt;int&gt;(GetLastError()),
            std::system_category()
        );
    }
    return result;
}
The reverse direction looks symmetric:
// UTF-16 (std::wstring) to UTF-8 (std::string)
std::string utf16_to_utf8(std::wstring_view utf16) {
    if (utf16.empty()) return {};
    int required = WideCharToMultiByte(
        CP_UTF8,
        WC_ERR_INVALID_CHARS,
        utf16.data(),
        static_cast&lt;int&gt;(utf16.size()),
        nullptr,
        0,
        nullptr, // must be nullptr for CP_UTF8
        nullptr  // must be nullptr for CP_UTF8
    );
    if (required == 0) {
        throw std::system_error(
            static_cast&lt;int&gt;(GetLastError()),
            std::system_category()
        );
    }
    std::string result(required, '\0');
    int written = WideCharToMultiByte(
        CP_UTF8,
        WC_ERR_INVALID_CHARS,
        utf16.data(),
        static_cast&lt;int&gt;(utf16.size()),
        result.data(),
        required,
        nullptr,
        nullptr
    );
    if (written == 0) {
        throw std::system_error(
            static_cast&lt;int&gt;(GetLastError()),
            std::system_category()
        );
    }
    return result;
}
This is not complicated, but it has several failure modes that are easy to stumble into.
The Flags That Actually Matter
The MB_ERR_INVALID_CHARS and WC_ERR_INVALID_CHARS flags are where most production code quietly goes wrong. Without them, both functions succeed on invalid input by substituting replacement characters or silently skipping bad sequences. The converted string looks plausible, but it no longer round-trips.
WC_ERR_INVALID_CHARS in WideCharToMultiByte causes the function to fail with ERROR_NO_UNICODE_TRANSLATION when the input contains an invalid surrogate pair, for instance a lone high surrogate without a following low surrogate. Without this flag, the function either replaces the lone surrogate with the Unicode replacement character (U+FFFD) or silently drops it depending on the Windows version and the specific code unit involved.
The last two parameters of WideCharToMultiByte, lpDefaultChar and lpUsedDefaultChar, exist for converting to ANSI code pages where some Unicode characters have no representation. When CodePage is CP_UTF8, passing anything other than nullptr for either parameter makes the function fail, returning zero with ERROR_INVALID_PARAMETER. This regularly catches out code that reuses the same wrapper for both UTF-8 and ANSI conversions.
For MultiByteToWideChar, the MB_ERR_INVALID_CHARS flag rejects overlong encodings and invalid byte sequences in the UTF-8 input. Without it, the function will accept byte sequences that are not valid UTF-8 and produce garbage wide characters.
The int Size Limit
Both functions take and return int for sizes, not size_t. The maximum input is INT_MAX bytes, roughly 2 GB. For most strings in practice this is not a constraint, but if you are processing file contents or network payloads and cast a size_t directly to int, values above 2 GB silently truncate or wrap to a negative value. The static_cast&lt;int&gt; in the examples above should be paired with a bounds check in any code that handles untrusted input.
What Went Wrong With std::codecvt
Before C++17, the idiomatic C++ answer to this problem was std::wstring_convert combined with std::codecvt_utf8_utf16. The code was more readable:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string utf8 = converter.to_bytes(utf16_string);
std::wstring utf16 = converter.from_bytes(utf8_string);
C++17 deprecated both std::wstring_convert and the std::codecvt_utf8* facets. There were several reasons: the error handling was poorly designed (the conversion either throws or returns an empty string depending on how the converter was constructed, with no clean way to distinguish empty input from an error), the API did not compose well with the rest of the standard library, and implementation quality varied significantly across compilers. Most importantly, std::codecvt_utf8_utf16&lt;wchar_t&gt; on Windows has historically produced incorrect results for characters outside the Basic Multilingual Plane because of ambiguities in how wchar_t is treated on different platforms.
The types are still present in most standard library implementations as of 2025, deprecated but not removed. Code that relies on them will compile and often run, but the behavior is not guaranteed, the error handling is weak, and the feature will eventually disappear.
Isolating the Conversion Boundary
The UTF-8 Everywhere manifesto makes the practical case for this: write all internal application logic using std::string with UTF-8, and convert to and from UTF-16 only at the exact point where you call a Windows API. This means the conversion functions appear in a small number of places, they are thoroughly tested, and the rest of the codebase never has to think about wchar_t.
The concrete consequence is that you never store std::wstring in data structures, never pass it through function parameters in your own code, and never use it as a return type except in the thin wrapper around Win32 calls. Every CreateFileW, RegOpenKeyExW, or ShellExecuteW call gets a narrow wrapper that accepts std::string_view in UTF-8 and handles the conversion internally.
HANDLE open_file(std::string_view path, DWORD access, DWORD share_mode) {
    return CreateFileW(
        utf8_to_utf16(path).c_str(),
        access,
        share_mode,
        nullptr,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        nullptr
    );
}
This is repetitive but mechanical. The alternative, passing std::wstring through your entire call stack, distributes the encoding concern everywhere and makes cross-platform code structurally harder.
What Other Languages Do
The comparison with Rust is instructive. Rust’s OsString on Windows uses WTF-8 internally, a superset of UTF-8 that can encode lone surrogates. This means Rust can represent any sequence of Windows path characters, including the ones that are technically invalid UTF-16, without data loss. The conversion to and from &str (which must be valid UTF-8) is explicit, and the type system prevents you from treating an OsString as UTF-8 without acknowledging the conversion.
Go takes the opposite approach: strings are byte slices, and the golang.org/x/sys/windows package provides utilities for converting to and from UTF-16 when calling Windows APIs. Neither language exposes the Windows WCHAR type as a native string primitive.
C++ has no such help from the type system. std::string and std::wstring are structurally identical containers of different character types, with no enforcement that the contents are valid in any particular encoding.
Where C++ Standardization Stands
The C++ standards committee has been working on proper Unicode support for some time. P1885, which introduced std::text_encoding to let programs query the system’s active encoding at runtime, was accepted into C++26. This helps with detection but not conversion.
Proposals like P2728 aim at standard UTF transcoding facilities, but that work is ongoing and the design space is large. The realistic near-term answer for Windows C++ code remains WideCharToMultiByte and MultiByteToWideChar, wrapped carefully, with the correct flags, with proper error propagation, and called only at system boundaries.
The boilerplate is ugly. It is also correct when written carefully, and that correctness matters more than the aesthetics.