Reading the Taylor Series Right: How asin()’s Structure Halves Polynomial Work
Source: lobsters
The 16bpp.net series on fast asin() is worth reading twice: once for the specific optimization, and once for the methodology. Each article in the series finds something the previous one left visible but unexamined. The “staring right at me” framing of the sequel is exact. Once you remove the primary obstacle in computing asin(), the thing underneath it was always there; you just could not see it clearly until the first problem was gone.
That second thing is algebraic structure in the polynomial approximation itself. Specifically, the polynomial you end up evaluating after domain reduction is an even function in disguise. Treating it correctly roughly halves the number of multiplications required. This is the kind of optimization that gets dismissed as a minor tweak until you account for the fact that polynomial evaluation in a hot path is often dominated by multiply latency, and “roughly half” is a meaningful number.
The Singularity First
A polynomial approximation to asin(x) over the full domain [-1, 1] requires many terms because asin() has infinite derivative at x = ±1. The standard fix is domain reduction: split at a threshold (often 0.5 or 1/√2), and for values above the threshold apply the identity
asin(x) = π/2 - 2 · asin(sqrt((1 - x) / 2))
This maps the near-1 region back below the split point (for a split at 0.5, into [0, 0.5]), where asin() is smooth and a low-degree polynomial can approximate it accurately. Implementations in glibc, musl, and LLVM’s compiler-rt all use this split in some form. For the central region, a polynomial in x approximates asin(x) directly. That polynomial is the object worth examining.
The Taylor Series Tells You the Shape of the Polynomial
The Taylor series for asin(x) around x = 0 is
asin(x) = x + x³/6 + 3x⁵/40 + 15x⁷/336 + 105x⁹/3456 + ...
asin() is an odd function: asin(-x) = -asin(x). The Taylor series reflects this; every term has odd degree. There are no x², x⁴, or x⁶ terms. They are all exactly zero.
Divide both sides by x:
asin(x)/x = 1 + x²/6 + 3x⁴/40 + 15x⁶/336 + ...
This is a polynomial in x², not in x. Substituting u = x²:
asin(x)/x = p(u) = 1 + u/6 + 3u²/40 + 15u³/336 + ...
To evaluate asin(x) for x in the central region, you compute u = x * x, evaluate the polynomial p(u), and multiply by x:
float fast_asin_central(float x) {
    float u = x * x;
    // Minimax polynomial in u, degree 3 or 4 depending on target accuracy
    float p = c0 + u * (c1 + u * (c2 + u * c3));
    return x * p;
}
In Horner form, a degree-k polynomial in u costs k multiplications and k additions. A degree-3 polynomial in u evaluates what would otherwise be a degree-7 polynomial in x. The computation of u costs one multiplication, and the final multiply-by-x costs one more, for a total of k + 2 multiplications. Compare to 2k + 1 multiplications for the naive Horner evaluation in x. For k = 3, that is 5 multiplications versus 7. For k = 5, it is 7 versus 11. The savings grow with degree.
The practical significance varies by hardware. On cores with deep pipelines and out-of-order execution, instruction count is less important than dependency chains. A Horner chain is inherently sequential; each step depends on the previous result. Switching to u = x² does not break that dependency, but it does reduce the length of the chain, which lowers latency. On in-order cores, or in SIMD contexts where throughput matters as much as latency, the multiply count difference is direct.
Why the Coefficients Should Not Be Taylor Coefficients
The values 1, 1/6, 3/40, 15/336 from the Taylor series are not the right coefficients for a fixed-interval approximation. A Taylor series minimizes error at x = 0 and lets error grow toward the interval boundary. For approximation over [0, 0.25] (the range of u = x² when x ≤ 0.5), you want coefficients that minimize the maximum error over the entire interval. These are minimax or Remez coefficients.
The Remez exchange algorithm finds them. The key property of a minimax polynomial is equioscillation: the error of the optimal degree-n polynomial attains its maximum absolute value at n + 2 or more points of the interval, with alternating sign. The Chebyshev equioscillation theorem guarantees that a polynomial with this error profile is the optimal one.
Sollya is the practical tool for this. A typical session:
> f = asin(x)/x;
> p = remez(f, [|0,2,4,6|], [0, 0.5]);
> infnorm(f - p, [0, 0.5]);
The monomial list [|0,2,4,6|] tells remez which powers to include (0, 2, 4, 6 in x, corresponding to degrees 0 through 3 in u = x²). Since the powers are in x, the interval is the central region [0, 0.5] in x, equivalent to [0, 0.25] in u. Sollya outputs coefficients, and infnorm confirms the maximum error. SLEEF uses Sollya for all of its polynomial generation, including the asin paths in its AVX2 and AArch64 NEON implementations.
The typical accuracy gain from minimax versus Taylor is roughly two degrees: a minimax degree-3 polynomial in u achieves what a Taylor degree-5 polynomial in u achieves, on the same interval. For a precision target of 1 ULP in single precision, this difference determines whether you need four or six Horner steps.
How This Appears Across Libraries
Looking at actual implementations:
glibc’s double-precision asin targets correct rounding. It uses a degree-17 polynomial for the central region. Written in terms of the even-function substitution, that is an order-8 polynomial in x². The implementation explicitly computes z = x * x early and evaluates what the comments call R(z), a rational approximation in z. This is the same structural insight, applied to doubles with full libm precision requirements.
musl’s asin follows the same structure with slightly different coefficients and somewhat less elaborate error compensation. The f32 and f64 paths differ in polynomial degree, but both use the u = x² variable.
Rust’s libm derives from musl and preserves the structure. The f32 path uses a degree-3 polynomial in x² for the central region, matching the accuracy expectations for single precision.
SLEEF’s scalar asin targets 1-ULP accuracy and uses Sollya-generated minimax coefficients. The vectorized path processes multiple values simultaneously and handles the domain split branchlessly: both code paths run for all SIMD lanes, and the results are blended based on per-lane comparisons.
The Same Pattern in Other Functions
The even-function substitution for odd functions is not specific to asin(). It appears whenever the Taylor series has only odd (or only even) powers:
- atan(x): atan(x)/x is even, so the Horner evaluation uses u = x² in the same way. The glibc atan2 implementation does this explicitly.
- sin(x): sin(x)/x is even. Most fast sin() approximations for small x evaluate a polynomial in x² and multiply by x at the end.
- sinh(x): Same structure. sinh(x)/x is even, polynomial in x².
For cos(x), the function is already even, so the polynomial in x² is direct with no division needed.
The pattern has a clean statement: if a function is odd, divide out x, substitute u = x², evaluate a polynomial in u, multiply back. If a function is even, substitute u = x² directly. The degree needed for a given accuracy roughly halves either way.
Precision Tiers and Where to Use Each
Not every use case needs libm precision. A useful frame:
Correctly-rounded libm quality: glibc’s asin(), targeting 0.5 ULP. Use this for scientific computing where the error budget is tight and the call is not in a hot inner loop.
1-ULP accurate: SLEEF’s scalar implementations, Rust’s libm for f32. Use this for numerical code that needs predictable accuracy without paying for correct rounding.
~5e-5 error: The Abramowitz and Stegun formula 4.4.45, four coefficients, one sqrt. Sufficient for physics simulations, ray intersection, angle computations in games. Note that this formula multiplies a polynomial in x (not x²) by sqrt(1 - x); the sqrt factor breaks the odd symmetry, so the u = x² substitution does not apply to it directly.
~1e-3 error: A second-degree polynomial in x² with no domain reduction, so usable only for arguments bounded away from ±1: one multiply for x², two FMA operations, and a final multiply by x. For cases where asin() feeds into another approximation and accumulated error is acceptable.
What “Staring Right At Me” Means
The progression in the 16bpp series follows a predictable arc for numerical optimization work. The first problem, the singularity at ±1, dominates everything until it is removed. Once it is gone, the polynomial you are left with becomes the focus, and its structure becomes readable. The Taylor series was always there, and the even-function property was always visible in it. The first optimization just removed the thing that made the second one irrelevant to examine.
This is why reading the implementation history of math libraries teaches a disproportionate amount about numerical techniques in general. The same moves (singularity removal, parity exploitation, Remez fitting, FMA alignment) appear across different functions with the same logical dependencies. Recognizing them in one context makes them easier to spot in the next. The 16bpp series is a compact version of that experience, compressed into two articles instead of a decade of reading glibc commit messages.