
Building a Programming Language in Hangul: Han, Unicode, and the Non-English Tradition

Source: hackernews

The Han programming language, posted to Hacker News this week by its author, is a statically typed language in which every keyword is written in Hangul, the Korean writing system. The compiler is written in Rust and covers the full pipeline: lexer, parser, AST, tree-walk interpreter, LLVM IR codegen, and an LSP server. The author built it as a side project after seeing a post about AI-assisted Rust rewrites of large C++ codebases and wanting to try something in that spirit from scratch.

The “programming language in a non-Latin script” space has a longer and more interesting history than most developers realize. What makes Han worth examining is not just the novelty of Korean keywords, but what Hangul’s specific properties mean for compiler implementation, and how Han’s scope compares to the broader landscape of localized language projects.

Why Hangul Makes Lexer Implementation Tractable

Building a lexer for Korean keywords is considerably easier than doing the same for Arabic or Chinese, for reasons grounded in Unicode’s structure. Hangul syllables occupy the block U+AC00 through U+D7A3, a contiguous range of 11,172 precomposed syllable blocks. Each syllable is a single Unicode scalar value. Modern Korean text uses spaces between words, so tokenization follows the same whitespace-delimited pattern as English source code. The script runs left-to-right. There is no complex text shaping, no bidirectional rendering complications, no morpheme boundary ambiguity at the character level.

In Rust, this translates directly to practical code. A Rust &str is always UTF-8, and iterating with .chars() yields Unicode scalar values. Checking whether a character is a precomposed Hangul syllable, or one of the conjoining Jamo letters that appear in decomposed text, takes two range comparisons:

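/// Returns true for precomposed Hangul syllables and conjoining Jamo.
fn is_hangul(c: char) -> bool {
    ('\u{AC00}'..='\u{D7A3}').contains(&c) // Hangul Syllables (precomposed)
        || ('\u{1100}'..='\u{11FF}').contains(&c) // Hangul Jamo (conjoining letters)
}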

Contrast this with building a lexer for Chinese keywords. Chinese text historically lacks spaces between words, so tokenization requires either a dictionary lookup or a morphological segmentation pass before you can identify where one token ends and the next begins. Arabic adds bidirectional text and complex ligature shaping, which complicates both source display and editor integration. Hangul has neither of these problems. A Rust lexer for Han can be written with the same basic structure as a Latin-script lexer, replacing ASCII character checks with Unicode range checks. The language design choice to use Hangul is not just cultural; it is technically convenient.
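To make the "same basic structure" point concrete, here is a minimal sketch of a whitespace-driven lexer that accepts Hangul in identifiers. It is not Han's actual lexer: the Token enum, the helper names, and the sample input are illustrative assumptions, and the only change from a Latin-script lexer is the extra range check in is_word_start.

```rust
/// Token kinds for a toy lexer. The names are illustrative, not Han's.
#[derive(Debug, PartialEq)]
enum Token {
    Word(String), // identifier or keyword, Hangul or ASCII
    Number(i64),
    Symbol(char),
}

/// Precomposed Hangul syllables plus conjoining Jamo.
fn is_hangul(c: char) -> bool {
    ('\u{AC00}'..='\u{D7A3}').contains(&c) || ('\u{1100}'..='\u{11FF}').contains(&c)
}

/// The only script-specific part: Hangul is allowed wherever a letter is.
fn is_word_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_' || is_hangul(c)
}

fn is_word_continue(c: char) -> bool {
    is_word_start(c) || c.is_ascii_digit()
}

/// Split source text into tokens, skipping whitespace.
fn lex(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if is_word_start(c) {
            let mut word = String::new();
            while let Some(&w) = chars.peek() {
                if !is_word_continue(w) {
                    break;
                }
                word.push(w);
                chars.next();
            }
            tokens.push(Token::Word(word));
        } else if c.is_ascii_digit() {
            let mut n: i64 = 0;
            while let Some(d) = chars.peek().and_then(|d| d.to_digit(10)) {
                // Saturate instead of overflowing on absurdly long literals.
                n = n.saturating_mul(10).saturating_add(d as i64);
                chars.next();
            }
            tokens.push(Token::Number(n));
        } else {
            // Anything else is passed through as a one-character symbol.
            tokens.push(Token::Symbol(c));
            chars.next();
        }
    }
    tokens
}
```

Calling lex("함수 더하기(가, 나)") yields three words interleaved with the parenthesis and comma symbols, exactly as an ASCII-only lexer would for an English function header. Deciding which words are keywords is a separate table lookup.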

A rough picture of the keyword mapping looks something like this:

English concept | Korean keyword
function | 함수 (hamsu)
return | 반환 (banhwan)
if / else | 만약 / 아니면
while | 동안 (dong-an, “during”)
match | 매치
struct | 구조체 (gujoche)
true / false | 참 / 거짓

Choosing these words involves real linguistic judgment. Korean vocabulary does not map one-to-one onto programming concepts developed in an English context, and the author has explicitly invited feedback on the keyword choices specifically.
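Inside a compiler, a mapping like the one above usually ends up as a lookup from a scanned word to a keyword kind. The sketch below assumes the spellings listed in the table, plus 참 for true, which is the standard Korean word but is an assumption here. The Keyword enum and the keyword function are hypothetical names, not Han's internals.

```rust
/// Keyword kinds for a toy front end. The variant names are illustrative.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Keyword {
    Function,
    Return,
    If,
    Else,
    While,
    Match,
    Struct,
    True,
    False,
}

/// Classify a scanned word. Anything not in the table is an identifier.
fn keyword(word: &str) -> Option<Keyword> {
    match word {
        "함수" => Some(Keyword::Function),
        "반환" => Some(Keyword::Return),
        "만약" => Some(Keyword::If),
        "아니면" => Some(Keyword::Else),
        "동안" => Some(Keyword::While),
        "매치" => Some(Keyword::Match),
        "구조체" => Some(Keyword::Struct),
        "참" => Some(Keyword::True), // assumed spelling for `true`
        "거짓" => Some(Keyword::False),
        _ => None,
    }
}
```

Because Rust compares &str values byte for byte, this lookup only works if the source and the table use the same Unicode normalization form. Text typed on a Korean keyboard is normally precomposed, which is what the literals above contain.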

Prior Art in Non-English Programming Languages

Han is part of a tradition that stretches back further than most developers know.

Ramsey Nasser’s Qalb (قلب) is a Lisp-like language written in Arabic, demonstrated at Strange Loop around 2013. Source files are read right-to-left, and the README itself is written in Arabic. Nasser built it as an explicit argument about the cultural assumptions embedded in programming language design, and the project attracted significant attention because it made that argument concrete rather than theoretical.

Wenyan-lang (文言), released in 2019, uses the grammar and vocabulary of Classical Chinese as its syntax. A variable declaration reads 吾有一數 (“I have a number”). The project went briefly viral and compiles to JavaScript, Python, and Ruby. It attracted a community that added a REPL, an online editor, and a growing standard library.

Closer to home for Korean developers is Aheui (아희), an esoteric programming language laid out on a two-dimensional grid of Hangul syllables. The vowel of each syllable determines the direction in which execution moves, the initial consonant selects the operation, and the final consonant supplies its argument. It is Turing complete and has multiple implementations across several host languages. Aheui predates Han and is better known in Korean developer circles as the canonical Hangul-based language project.
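Any interpreter that dispatches on the parts of a syllable, as Aheui implementations do, first has to split a precomposed syllable into its letters. Unicode orders the syllable block by initial consonant (19 values), then vowel (21), then final consonant (28, including "none"), so the split is plain arithmetic. The function name and the tuple layout below are illustrative choices.

```rust
const SYLLABLE_BASE: u32 = 0xAC00; // first precomposed syllable, 가
const SYLLABLE_LAST: u32 = 0xD7A3; // last precomposed syllable
const VOWELS: u32 = 21;
const FINALS: u32 = 28; // includes the "no final consonant" slot

/// Split a precomposed Hangul syllable into (initial, vowel, final) indices.
/// Returns None for anything outside U+AC00..=U+D7A3.
fn decompose(c: char) -> Option<(u32, u32, u32)> {
    let cp = c as u32;
    if !(SYLLABLE_BASE..=SYLLABLE_LAST).contains(&cp) {
        return None;
    }
    let index = cp - SYLLABLE_BASE;
    let initial = index / (VOWELS * FINALS);
    let vowel = (index % (VOWELS * FINALS)) / FINALS;
    let final_consonant = index % FINALS;
    Some((initial, vowel, final_consonant))
}
```

For 한 this gives initial 18 (ㅎ), vowel 0 (ㅏ), and final 4 (ㄴ). An Aheui-style interpreter would then branch on the vowel index to pick a direction and on the initial index to pick an operation.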

Further back, Rapira was a Soviet-era educational programming language from the 1980s with Russian keywords, designed for teaching in an environment where English was a genuine barrier. Karel the Robot had Czech-keyword distributions. These were practical educational tools, not novelties.

The difference between most of these projects and Han is scope. Wenyan-lang and Qalb are dynamically typed interpreters. Aheui is an esolang. Han has a static type system, structs with impl blocks, closures, pattern matching, try/catch, file I/O, module imports, and both a tree-walk interpreter and LLVM IR codegen. For a solo side project, that is a substantial specification.

What LLVM IR Codegen and an LSP Server Mean in Practice

Most hobby language projects stop at the tree-walk interpreter stage. Running an AST directly is enough to demonstrate language semantics, and it avoids the substantial complexity of a compilation backend. Han adds LLVM IR codegen, which means the language can produce native binaries via LLVM’s optimization and machine code generation infrastructure.

In Rust, the standard path to LLVM is through inkwell, a safe high-level wrapper around LLVM’s C API. Inkwell is typed to match LLVM’s concepts: an LLVMValueRef becomes a typed IntValue or FloatValue, reducing the risk of mismatched type categories that produce invalid IR. Alternatively, llvm-sys provides raw FFI bindings for projects that want direct control. Both require careful version matching between the crate and the installed LLVM version, which is a recurring source of friction in language projects.
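The gap between the two back ends is easiest to see on a toy AST. The sketch below evaluates an expression directly, which is what a tree-walk interpreter does, and also lowers the same expression to textual LLVM IR. To stay dependency-free it builds the IR with string formatting rather than through inkwell or llvm-sys, so it does not reflect how Han generates code. The Expr type and the function name @compute are assumptions made for illustration.

```rust
/// A toy expression AST. A real language's AST is far richer.
enum Expr {
    Int(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Tree-walk interpretation: compute the value by recursing over the AST.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Int(n) => *n,
        Expr::Add(a, b) => eval(a).wrapping_add(eval(b)),
        Expr::Mul(a, b) => eval(a).wrapping_mul(eval(b)),
    }
}

/// Codegen: append one IR instruction per operation and return the name
/// (or literal) that holds the result of this subexpression.
fn emit(e: &Expr, body: &mut Vec<String>, next: &mut usize) -> String {
    match e {
        Expr::Int(n) => n.to_string(),
        Expr::Add(a, b) | Expr::Mul(a, b) => {
            let op = if matches!(e, Expr::Add(..)) { "add" } else { "mul" };
            let lhs = emit(a, body, next);
            let rhs = emit(b, body, next);
            let tmp = format!("%t{}", *next);
            *next += 1;
            body.push(format!("  {} = {} i64 {}, {}", tmp, op, lhs, rhs));
            tmp
        }
    }
}

/// Wrap the emitted instructions in a single IR function returning i64.
fn compile(e: &Expr) -> String {
    let mut body = Vec::new();
    let mut next = 0;
    let result = emit(e, &mut body, &mut next);
    format!(
        "define i64 @compute() {{\n{}\n  ret i64 {}\n}}\n",
        body.join("\n"),
        result
    )
}
```

For 2 + 3 * 4, eval returns the number immediately, while compile returns a function body containing a mul followed by an add. The interpreter needs nothing beyond the AST, whereas the compiled path hands the IR to LLVM to optimize and turn into machine code. That dependency on LLVM is where the version-matching friction comes from.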

The LSP server is separately ambitious. Language Server Protocol support means Han integrates with editors like VS Code or Neovim, providing diagnostics, go-to-definition, and hover information in real time. Most language projects never reach this point. The tower-lsp crate provides an async LSP server framework built on Tokio, which is the common Rust foundation for language servers. Having both an LLVM backend and an LSP server pushes Han well past the “I built a calculator language” category, regardless of how complete each component currently is.
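Two details of the protocol matter for a Hangul-keyword language, and both can be sketched without a framework. Messages are framed with a Content-Length header that counts bytes, and positions default to UTF-16 code units, so a three-byte Hangul syllable counts as one column. The code below is hand-rolled for illustration and is not tower-lsp's API. The function names and the sample URI are hypothetical.

```rust
/// Convert a byte offset within `line` into an LSP column, counted in
/// UTF-16 code units (the protocol's default position encoding).
/// Panics if `byte_offset` is not on a character boundary.
fn utf16_column(line: &str, byte_offset: usize) -> usize {
    line[..byte_offset].chars().map(char::len_utf16).sum()
}

/// Escape a string for embedding inside a JSON string literal.
fn escape(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a textDocument/publishDiagnostics notification with one error
/// on a single line, spanning columns `start..end`.
fn publish_diagnostics(uri: &str, line: usize, start: usize, end: usize, message: &str) -> String {
    let range = format!(
        r#"{{"start":{{"line":{l},"character":{s}}},"end":{{"line":{l},"character":{e}}}}}"#,
        l = line,
        s = start,
        e = end
    );
    // Severity 1 is "Error" in the LSP specification.
    let diag = format!(
        r#"{{"range":{},"severity":1,"message":"{}"}}"#,
        range,
        escape(message)
    );
    let params = format!(r#"{{"uri":"{}","diagnostics":[{}]}}"#, escape(uri), diag);
    format!(
        r#"{{"jsonrpc":"2.0","method":"textDocument/publishDiagnostics","params":{}}}"#,
        params
    )
}

/// Frame a JSON-RPC body with the LSP base-protocol header.
/// `str::len` is a byte count, which is what Content-Length requires.
fn frame(body: &str) -> String {
    format!("Content-Length: {}\r\n\r\n{}", body.len(), body)
}
```

In the line "만약 x", the x sits at byte offset 7 but at UTF-16 column 3. A server that reports byte offsets would therefore underline the wrong span in the editor.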

The Keywords-vs-Identifiers-vs-Stdlib Question

There is a persistent debate in the localized programming language space about what localization actually means. Translating keywords is the smallest part of the problem. Python 3, Rust, and most modern languages already allow Unicode identifiers, so writing:

def 더하기(가, 나):
    return 가 + 나

is valid standard Python today. The keywords (def, return) remain English, but function and variable names can be Korean without any changes to the language. For many practical purposes, the barrier to writing Korean-named code in existing languages is already low.
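The same holds in Rust, where non-ASCII identifiers have been stable since Rust 1.53. A minimal equivalent of the Python function, with the parameter types chosen arbitrarily for the example:

```rust
/// "더하기" means "add". The keywords `fn` and `->` stay as they are;
/// only the names are Korean.
fn 더하기(가: i32, 나: i32) -> i32 {
    가 + 나
}

/// Korean-named functions compose like any others.
fn 두배(가: i32) -> i32 {
    더하기(가, 가)
}
```

Hangul has no letter case, so the compiler's snake_case naming lint has nothing to flag here.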

What localization does not solve is the standard library. If the stdlib function names are in English, every call to file I/O or string manipulation forces context switching between Hangul and ASCII. Han addresses this by providing Korean-named builtins throughout the language, not just at the keyword layer. That distinction matters, even if the surrounding library ecosystem remains minimal for now.
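One common way for an interpreter to expose localized builtins is a table from names to host functions. The sketch below shows that shape only. The names 합계 ("sum") and 최댓값 ("maximum") are hypothetical and are not taken from Han's builtin set, and the integer-only Builtin signature is a simplification.

```rust
use std::collections::HashMap;

/// Signature shared by every builtin in this toy registry.
type Builtin = fn(&[i64]) -> i64;

fn sum(args: &[i64]) -> i64 {
    args.iter().sum()
}

fn max(args: &[i64]) -> i64 {
    // An empty argument list yields 0 to keep the example total.
    args.iter().copied().max().unwrap_or(0)
}

/// Map Korean names (hypothetical) to host-language implementations.
fn builtins() -> HashMap<&'static str, Builtin> {
    let mut table: HashMap<&'static str, Builtin> = HashMap::new();
    table.insert("합계", sum);
    table.insert("최댓값", max);
    table
}

/// Look up a builtin by name and call it. Unknown names return None.
fn call(name: &str, args: &[i64]) -> Option<i64> {
    builtins().get(name).map(|f| f(args))
}
```

With a table like this, call("합계", &[1, 2, 3]) returns Some(6), and user code never touches an English identifier. The cost is that every library function needs a considered Korean name, which is the same judgment call as the keyword choices but repeated many times.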

The broader argument, made clearly by Qalb and wenyan-lang, is that programming language design encodes cultural assumptions. English keywords are not a neutral default; they are a historical artifact of where the relevant research institutions were concentrated from the 1950s through the 1980s. Whether localizing keywords reduces the barrier to programming for non-English speakers in practice is an empirical question that hobby projects cannot answer on their own, but raising the question with a complete compiler and LSP server behind it is more convincing than raising it with a blog post.

Han is available on GitHub with a REPL you can try locally and a set of example programs covering the major language features. The author describes it as a side project rather than a production pitch, which is the right framing, but the combination of a static type system, LLVM codegen, and an LSP server makes it a more serious entry in this space than most.
