The Math That Makes asin() Fast: Domain Reduction and Polynomial Degree
Source: lobsters
The arcsine function is expensive. Not dramatically expensive by modern standards, but expensive enough that when you call it in a tight inner loop (a ray-sphere intersection test, an audio filter, a robotics inverse kinematics solver), it becomes a genuine bottleneck. The 16bpp.net series on fast asin() documents the optimization process in concrete detail, and the progression from one article to the next reflects something true about numerical computing: each optimization makes the next one visible.
Why There Is No Fast Path Built In
There is no x86 instruction that computes arcsine directly. The old x87 fpatan instruction computes the arctangent of a ratio, which can be combined with a square root to recover arcsine via asin(x) = atan(x / sqrt(1 - x²)), but it is notoriously slow, often 100-200 clock cycles on modern processors. The sqrtss and sqrtsd instructions are hardware-accelerated and typically complete in 10-20 cycles; no equivalent exists for inverse trigonometric functions.
Modern libm implementations compute asin() entirely in software using polynomial approximations. The process has two steps: reduce the domain to a manageable interval, then evaluate a polynomial over that interval. Both steps have significant room for optimization, and they interact in ways that are not immediately obvious.
The Problem Near the Boundary
The domain of asin() is [-1, 1], but the function does not behave uniformly over that interval. At x = 1 and x = -1, the derivative of asin() is infinite. The function has a vertical tangent at both ends. Polynomials cannot represent vertical tangents; they are smooth everywhere. This means a polynomial fit to asin() over the full [-1, 1] interval will have large error near the boundary no matter how many terms you use, because the polynomial cannot bend steeply enough to match the function.
This is the fundamental reason why fast asin() is not just “find a good polynomial.” You must first remove the singularity.
The Range Reduction Identity
The standard approach, used in glibc’s fdlibm-derived asin implementation, splits the domain into regions:
- For |x| <= 0.5: approximate directly with a polynomial in x
- For 0.5 < |x| < 1: apply a transformation to move to a better domain
- For |x| = 1: return the exact value
The transformation for the near-1 case uses the identity:
asin(x) = π/2 - 2 · asin(sqrt((1-x) / 2)) for x ∈ (0, 1]
When x is in [0.5, 1], the argument sqrt((1-x)/2) falls in [0, 0.5]; even applying the identity over all of (0, 1] keeps the argument within [0, 1/√2], roughly [0, 0.707]. The derivative of asin is well-behaved on that interval. More concretely, a degree-6 polynomial can approximate asin() over [0, 0.707] to single-precision accuracy, whereas approximating asin() directly over [0.5, 1] requires a degree-13 polynomial or worse to reach the same accuracy.
An equivalent formulation that is common in single-precision implementations factors out the singularity directly:
asin(x) = π/2 - sqrt(2(1-x)) · p((1-x)/2)
where p is a polynomial with well-chosen coefficients. This is cleaner to implement because you compute one sqrt call and then evaluate a polynomial, with no recursive call. The sqrt captures the entire singular behavior; what remains is smooth.
A concrete approximation in this form, taken from Abramowitz and Stegun formula 4.4.45, produces a maximum absolute error under 5e-5 over [0, 1] with only four coefficients:
#include <math.h>

/* Abramowitz & Stegun 4.4.45: valid for x in [0, 1].
   For negative inputs, use the odd symmetry asin(-x) = -asin(x). */
float fast_asin(float x) {
    const float a0 =  1.5707288f;
    const float a1 = -0.2121144f;
    const float a2 =  0.0742610f;
    const float a3 = -0.0187293f;
    float sq = sqrtf(1.0f - x);
    // Horner form: evaluate a0 + x*(a1 + x*(a2 + x*a3))
    float poly = a0 + x * (a1 + x * (a2 + x * a3));
    return (float)M_PI_2 - sq * poly;
}
The polynomial is written in Horner form, which matters for performance. A naive evaluation of a0 + a1*x + a2*x*x + a3*x*x*x requires 6 multiplications and 3 additions. Horner form requires 3 multiplications and 3 additions, and it also reduces floating-point rounding error by avoiding the accumulation of intermediate powers.
Minimax Versus Taylor
The Abramowitz and Stegun coefficients are not from a Taylor series. A Taylor series for asin() around x=0 minimizes error at a single point and allows it to grow toward the edges of the interval. For approximation over a bounded region, you want a minimax polynomial: one where the maximum error over the entire interval is as small as possible.
The Remez exchange algorithm computes minimax polynomials iteratively. The algorithm finds the n+2 points of maximum error for a degree-n polynomial (these are the equioscillation points by the Chebyshev equioscillation theorem), adjusts the polynomial to reduce those peaks, and repeats until the error is equalized across all extrema. The resulting polynomial distributes error evenly over the domain rather than concentrating it at the edges.
For asin() specifically, the payoff is roughly two polynomial degrees: a minimax degree-3 polynomial can match the accuracy of a Taylor degree-5 polynomial over the same interval. When each degree costs one multiply-and-add in Horner form, two degrees is meaningful in a hot path.
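The edge concentration of Taylor error is easy to demonstrate. A degree-5 Taylor polynomial for asin (a sketch of ours, not a fitted minimax polynomial) has error around 3e-6 at x = 0.25 but roughly 4e-4 at x = 0.5, over a hundred times larger at the edge of the interval:

```c
#include <math.h>

/* Degree-5 Taylor polynomial of asin about 0: x + x^3/6 + 3x^5/40,
   written in Horner form over x^2 since the series is odd. */
static double taylor5_asin(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (1.0 / 6.0 + x2 * (3.0 / 40.0)));
}
```

A minimax fit of the same degree would trade away some of that near-zero accuracy to pull the edge error down, equalizing the peaks across the interval.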
Tools like Sollya automate Remez-based polynomial fitting. The SLEEF (SIMD Library for Evaluating Elementary Functions) project uses minimax polynomials of this kind for its math functions, including vectorized asin implementations that process four floats per call with SSE or eight with AVX2.
The Second Insight
The 16bpp.net follow-up article fits the pattern where a first optimization exposes a second one. Once you have reformulated the problem using the sqrt factorization, the resulting polynomial operates on a smaller domain over a smoother function, and the structure of that polynomial becomes visible in ways it was not before.
Common second-level insights in this kind of work:
- The polynomial in the transformed domain may have near-zero coefficients that can be dropped without exceeding the target error bound.
- The function may turn out to be even or odd in the substituted variable, halving the number of nonzero terms.
- FMA (fused multiply-add) instructions may be available, eliminating a rounding step inside the Horner evaluation and collapsing each multiply-add pair into a single instruction.
The FMA case is worth noting. A Horner step like a + x * b involves two operations with an intermediate rounding. With FMA (fmaf in C99, or the vfmadd instructions from the x86 FMA extension), those become a single instruction with a single rounding, which is both faster and more numerically accurate. Compilers will sometimes generate FMA automatically with -mfma or -march=native, but for a function you are deliberately hand-tuning, making it explicit avoids uncertainty about what the compiler decides.
SIMD and the Branching Problem
When asin() appears in a loop over arrays of values, single-element evaluation is not the bottleneck; throughput is. The standard optimization is SIMD: evaluate the polynomial on 4 or 8 values simultaneously using AVX2 registers. The range reduction creates a complication here. Different elements of the SIMD vector may fall into different cases, requiring different code paths.
The branchless approach processes both paths for all lanes and blends the results:
// pseudocode for vectorized blending: lanes with |x| > 0.5
// take the range-reduced path, the rest take the direct polynomial
__m256 mask = _mm256_cmp_ps(abs_x, half, _CMP_GT_OS);
__m256 result_small = eval_small_poly(x);   // direct polynomial in x
__m256 result_large = eval_large_poly(x);   // sqrt-factored path
__m256 result = _mm256_blendv_ps(result_small, result_large, mask); // per-lane select
This computes more work than necessary but avoids branch mispredictions and keeps execution units busy. SLEEF uses this pattern throughout. For inputs where all values are in one regime (common in practice, since many callers have predictable input distributions), there is potential to specialize further, but the generic branchless form is usually fast enough.
What the Series Gets Right
The progression in the 16bpp articles illustrates something that comes up in most numerical optimization work. The first step is finding the correct mathematical decomposition, the range reduction, the singularity removal, the domain transformation. The second step is noticing what that decomposition reveals: a simpler polynomial, a coefficient that falls away, a hardware instruction that fits perfectly.
Neither step is primarily about C compiler flags or cache behavior. The speedup comes from choosing the right mathematical form before touching the implementation. In most software domains, micro-optimization follows profiling and often involves memory layout, branch prediction, or algorithmic complexity. In numerical computing, the optimization is usually buried in a textbook identity that rearranges the computation into a shape the hardware handles efficiently.
The fact that the second insight was “staring right there” is the normal outcome. The first optimization changed the framing enough that a previously invisible simplification became obvious. This is the standard arc of the work, which is why reading through implementation histories in glibc, musl, and SLEEF is worthwhile even for unrelated problems. The mathematical moves recur across different functions, and recognizing them gets easier with exposure.