
UTF-16 to UTF-8 Conversion on Windows: Getting the Win32 API Right

Source: isocpp

Windows has used UTF-16 as its native string encoding since the NT era. The Win32 API surface, including every function that ends in W, operates on WCHAR strings, whose elements are 16-bit code units. This was a reasonable decision in the early 1990s, when Unicode was young and the entire code point space was expected to fit in 16 bits; the original encoding was actually UCS-2, with surrogate-pair support arriving later. The world moved on; Windows did not, or at least not completely.

Meanwhile, UTF-8 became the dominant encoding everywhere else. HTTP headers, JSON payloads, file systems on Linux and macOS, Git repositories, and the content flowing through Discord’s API all use UTF-8. If you write a Windows C++ application that talks to any of these systems, you will be converting between encodings constantly. Getting this wrong produces garbled output at best and silent data corruption at worst.

Giovanni Dicanio’s article on isocpp.org, published December 2025, covers the mechanics of this conversion using the Win32 API. It is worth reading carefully, because the API is old, verbose, and has several non-obvious failure modes that most sample code ignores.

The Two Win32 Functions

The Windows API provides two functions for encoding conversion:

  • MultiByteToWideChar converts from a multibyte encoding (UTF-8 included) to UTF-16.
  • WideCharToMultiByte converts from UTF-16 to a multibyte encoding.

Both functions use an integer code page to identify the source or destination encoding. For UTF-8, that code page is CP_UTF8.

The Two-Call Pattern

The canonical way to use these functions is with two calls. The first call sizes the output buffer; the second does the actual conversion. You do this by passing zero for the output buffer size and NULL for the output buffer pointer, which tells the function to calculate and return the required size.

Here is the UTF-8 to UTF-16 conversion using this pattern:

#include <windows.h>
#include <stdexcept>
#include <string>

std::wstring Utf8ToUtf16(const std::string& utf8) {
    if (utf8.empty()) {
        return {};
    }

    // First call: determine required buffer size.
    const int sizeNeeded = MultiByteToWideChar(
        CP_UTF8,
        MB_ERR_INVALID_CHARS,
        utf8.data(),
        static_cast<int>(utf8.size()),
        nullptr,
        0
    );

    if (sizeNeeded == 0) {
        throw std::runtime_error("MultiByteToWideChar sizing failed");
    }

    std::wstring result(sizeNeeded, L'\0');

    // Second call: perform the conversion.
    const int written = MultiByteToWideChar(
        CP_UTF8,
        MB_ERR_INVALID_CHARS,
        utf8.data(),
        static_cast<int>(utf8.size()),
        result.data(),
        sizeNeeded
    );

    if (written == 0) {
        throw std::runtime_error("MultiByteToWideChar conversion failed");
    }

    return result;
}

The same two-call structure applies to WideCharToMultiByte for the reverse direction. The function signatures differ slightly: WideCharToMultiByte takes two extra parameters for default-character substitution (a pointer to the default character and a flag reporting whether it was used). Both must be nullptr when using CP_UTF8, or the function fails, since substitution does not apply to UTF-8.

The Error Flags You Should Always Set

MB_ERR_INVALID_CHARS is the flag passed to MultiByteToWideChar in the example above. Without it, the function does not return an error on invalid byte sequences; on Windows Vista and later it replaces them with the replacement character U+FFFD. The equivalent for WideCharToMultiByte with CP_UTF8 is WC_ERR_INVALID_CHARS.

Silent substitution might seem convenient, but it means corrupted input passes through undetected. If you are processing user-provided UTF-8 from a network source, an invalid sequence could indicate truncated data, an encoding mismatch, or deliberate malformation. Failing loudly on invalid input and handling the error explicitly is almost always preferable to silently mangling the string and propagating the corruption further.

Note that WC_ERR_INVALID_CHARS is only valid with CP_UTF8; passing it with other code pages will cause the function to fail. The asymmetry in flag naming and behavior between the two functions is one of several rough edges in this API that rewards reading the documentation carefully.

Windows 10 1903 and the UTF-8 Code Page

Windows 10 version 1903 added support for setting UTF-8 as the process code page via an application manifest entry:

<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>

With this set, functions that previously operated in the system ANSI code page, including fopen, system, and the narrow-string variants of Win32 functions like CreateFileA, will interpret narrow strings as UTF-8. This can simplify code that mixes CRT and Win32 calls, but it does not change the underlying wide-character nature of the Win32 API. You still need explicit conversion when calling W variants directly.

When to Use a Wrapper Library

The two-call pattern is not complicated to write once, but it is verbose, and the error handling adds more lines. If your codebase does a significant volume of encoding conversion, centralizing this into a small set of helper functions is worth the investment. Libraries like utf8cpp or the conversion utilities in WIL (Windows Implementation Library) provide this and handle edge cases that are easy to miss in hand-rolled implementations.

For a Discord bot running on Windows that bridges the UTF-16 Win32 world with UTF-8 JSON and HTTP payloads, having reliable conversion helpers at the foundation matters more than it might appear from the call-site simplicity. Incorrect encoding is one of those bugs that surfaces inconsistently depending on the specific characters in user messages, which makes it harder to catch in testing than a straightforward crash.
