The import tax: Python’s startup overhead isn’t just a performance footnote. It’s structural, it compounds with every dependency you add, and it’s largely unfixable from within the language. A recent post on smiling.dev captures the experience of rewriting a CLI tool in Rust and finding the result dramatically better. The specifics of that tool are less interesting than the underlying mechanisms, because those mechanisms apply to every CLI tool written in Python.
What Actually Causes Python’s Startup Overhead
When you run a Python CLI tool, several things happen before your code executes. The CPython interpreter initializes its runtime: memory allocators, the garbage collector, signal handlers, and the main thread state. That alone takes 10-30 ms on modern hardware. Then the import graph runs.
Python’s module import system is recursive. Importing click imports os, sys, re, functools, typing, and several other modules. Each of those may import others. The interpreter must locate each module on sys.path, read its .pyc bytecode file (or recompile from .py if the cache is stale), and execute any module-level code. This includes regex compilation, C extension loading via dlopen(), and any configuration logic that runs at import time.
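The import graph can be observed directly with CPython's -X importtime flag, which logs per-module self and cumulative import times to stderr. A minimal sketch, using json as a stand-in for a heavier third-party dependency:

```python
import subprocess
import sys

# Ask CPython to log per-module import times (-X importtime writes to stderr).
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True,
    text=True,
)

# Each line looks like: "import time:  self [us] | cumulative | module"
timings = [
    line for line in result.stderr.splitlines()
    if line.startswith("import time:")
]
for line in timings[-5:]:  # the last entries, including json itself
    print(line)
```

Running this against a real dependency like click or rich makes the recursive cost visible: each transitive module gets its own line, and the cumulative column shows how the milliseconds stack.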
The numbers compound quickly. A bare python -c "pass" takes 30-80 ms. Adding import click adds another 20-50 ms. Adding import rich for colorized output can add 100 ms or more. Tools that pull in boto3, requests, or any data science library can push cold start times into the 500 ms to 2 second range for tools that may do 5 ms of actual work.
Itamar Turner-Trauring has documented this in detail, showing that import overhead is not something you can optimize away without fundamentally restructuring your application. Lazy imports help at the margins; they do not change the order of magnitude. python -S skips site.py and saves 5-15 ms but breaks most third-party libraries. PyPy has similar or worse startup times and a worse distribution story.
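What a lazy import looks like in practice: the import statement moves from module scope into the function that needs it, so the cost is paid only on the code path that actually uses the dependency. A hedged sketch, with json again standing in for something heavy:

```python
def export_report(data, path):
    # Deferred import: the cost is paid only when this subcommand runs,
    # not on every invocation of the tool (e.g. --help or --version).
    import json

    with open(path, "w") as f:
        json.dump(data, f)

# Invocations that never reach export_report() skip the import entirely.
```

This is why lazy imports help only at the margins: they reshuffle which invocation pays, but any run that touches the feature still pays the full transitive cost.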
This is not a criticism of Python’s design for the problems it was designed to solve. Dynamic module systems are powerful. The import overhead is the cost of that power, and for long-running servers or data science scripts, it does not matter. For CLI tools invoked dozens of times per minute, it does.
What Rust Gives You at Startup
A compiled Rust binary initializes in 1-3 ms for simple tools. There is no interpreter, no import graph, no .pyc parsing, no dlopen() for extension modules. The binary contains machine code, and the OS loads it and starts executing main(). That is the complete model.
The performance difference is consistent across real-world tools. ripgrep versus grep or ack, fd versus find, bat versus cat with pygments, zoxide versus the Python-based autojump. In every case, the Rust tool starts 10-100x faster and often has meaningfully better throughput as well. Zoxide is a useful specific example: autojump, which it replaced, required Python and took around 100 ms to start. Zoxide takes 2-3 ms. For a tool that runs on every directory change in your shell, that difference is perceptible.
The startup advantage matters most in three scenarios. First, tools invoked in loops or shell pipelines, where the startup cost multiplies. Second, shell prompt renderers like starship, where even 20 ms of lag is visible. Third, tools distributed to users who should not need to care about Python versions or virtual environments.
The Distribution Story
Performance is the headline, but distribution is the argument that closes the case.
Shipping a Python CLI tool to users requires either asking them to install Python and pip (and manage the right version), or packaging with PyInstaller. PyInstaller’s --onefile mode bundles the CPython interpreter, all imported modules, and the application into a single executable. The result is typically 20-80 MB for a simple tool, and 100-300 MB if you pull in numpy or similar libraries. --onefile extracts the bundle to a temporary directory at runtime, adding 200 ms to 2 seconds of startup overhead on each launch. On Linux, the bundle carries glibc version dependencies that can cause failures on older systems.
A Rust binary compiled for the x86_64-unknown-linux-musl target is fully statically linked. No external dependencies. No extraction step. Typical sizes after optimization run from 150 KB to 2 MB. You give users one file, they run it.
The size can be reduced further with a few Cargo.toml profile settings:
```toml
[profile.release]
opt-level = "z"     # optimize for size
lto = true          # link-time optimization removes dead code across crates
codegen-units = 1   # slower compile, smaller output
panic = "abort"     # removes unwinding machinery
strip = true        # strips debug symbols
```
The min-sized-rust guide walks through the full progression. A typical simple CLI starts at 5-15 MB in debug mode, lands at 300-800 KB with the settings above, and the guide’s further steps can push it to 150-400 KB. UPX compression can halve that again, though it adds decompression overhead at launch and occasionally trips security scanners.
Compare that to a PyInstaller bundle: users download a 60 MB archive to get a tool that wraps 100 lines of logic.
Argument Parsing: Framework Costs Are Compile-Time in Rust
In Python, every library you add costs runtime milliseconds on every invocation. In Rust, library choices affect compile time and binary size, not startup time.
clap is the dominant Rust CLI framework. Its derive macro API is ergonomic and produces complete argument parsing with subcommands, shell completion generation, and validation:
```rust
use clap::Parser;

#[derive(Parser)]
#[command(name = "mytool", about = "Does useful things")]
struct Args {
    #[arg(short, long)]
    verbose: bool,

    #[arg(value_name = "FILE")]
    input: String,
}

fn main() {
    let args = Args::parse();
    // args.verbose and args.input are ready to use
}
```
The Python equivalent with click looks similar in terms of ergonomics:
```python
import click

@click.command()
@click.option('--verbose', '-v', is_flag=True)
@click.argument('input')
def main(verbose, input):
    pass
```
The structural difference is where the cost falls. In Python, import click costs 20-50 ms on every invocation. In Rust, clap’s presence increases compile time by a few seconds and adds 200-500 KB to the binary, then costs nothing at runtime.
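The per-invocation cost is straightforward to measure with a subprocess timer. The sketch below uses argparse as a stdlib stand-in for click (which may not be installed everywhere); absolute numbers will vary by machine:

```python
import subprocess
import sys
import time

def best_startup(code: str, runs: int = 5) -> float:
    """Return the fastest wall-clock time to run `python -c <code>`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", code], check=True)
        best = min(best, time.perf_counter() - start)
    return best

baseline = best_startup("pass")
with_parser = best_startup("import argparse")
print(f"bare interpreter: {baseline * 1000:.1f} ms")
print(f"with argparse:    {with_parser * 1000:.1f} ms")
```

The difference between the two numbers is pure import tax, and it is charged again on every invocation; the Rust equivalent pays its framework cost once, at compile time.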
For tools where binary size matters more than ergonomics, pico-args is zero-dependency and adds minimal binary overhead at the cost of more boilerplate. argh, Google’s alternative, takes a middle path: derive-macro ergonomics with a much smaller binary footprint than clap. cargo-bloat can identify which crates contribute most to binary size when you need to optimize.
The Ecosystem Has Already Voted
The most compelling evidence is not benchmarks or arguments; it is which tools people actually reach for.
ruff, the Python linter written in Rust, lints a large codebase in around 100 ms. flake8 on the same codebase takes 30-60 seconds. ruff is now the default linter in many Python projects, including Django and pandas. uv, the Python package manager also written in Rust by Astral, resolves and installs packages 10-100x faster than pip. These are not marginal improvements.
The pattern extends across the tooling ecosystem. tokei replaced sloccount and cloc. delta replaced Python-based diff viewers. hyperfine became the standard CLI benchmarking tool. None of these tools have Python predecessors that remain competitive on their primary metric.
ruff is the most notable case precisely because it is a tool for Python, written by people who know Python extremely well and who chose Rust anyway. The stated reason is that Rust’s performance characteristics were necessary to meet their user experience goals. A linter that takes a minute to run does not get integrated into editor save hooks. A linter that takes 100 ms does.
The same logic applies to uv. pip’s install latency is acceptable when you run it occasionally; it becomes a bottleneck in CI pipelines that run hundreds of times per day. Writing the critical path in Rust was not ideological; it was the practical response to a user experience constraint.
When Python Still Makes Sense
The argument for Rust in CLI tools is not universal. Python’s strength is its library ecosystem and development speed, and those matter for certain categories of tools.
If a CLI tool is primarily a thin wrapper around ML inference, data processing with numpy, or API calls to AWS services, the library ecosystem tilts back toward Python. Writing FFI bindings to PyTorch or the AWS SDK in Rust is possible but substantially more work. For internal tools where startup time does not matter and the developer is already working in Python, the rewrite cost rarely pays off.
The line is roughly this: if the tool is distributed to users who should not need to think about Python, or if it is invoked frequently in tight loops, Rust is worth the investment. If it is a script that runs once a day and only needs to work on developer machines where Python is already installed, the ergonomics of Python win.
Rust’s compile times are a real cost, and the borrow checker has a real learning curve. Neither is free. Incremental compilation helps for day-to-day development, but cold builds of a project with several crates can take minutes. For a weekend project or internal script, that friction matters.
What the Smiling.dev Post Reflects
The experience described in the original post is the same experience many developers have when they first ship a Rust CLI tool. The performance improvement is larger than expected because the baseline Python performance was degraded by mechanisms that were not obvious. The distribution improvement is even more striking because PyInstaller’s bundle size feels like an artifact of something done wrong, until you realize there is no better approach within the Python model.
For a CLI tool that ships to users, is invoked frequently, or needs to work without a runtime dependency, the tradeoffs favor Rust by a wide margin. The ecosystem has been reaching the same conclusion, tool by tool, for the past several years.