
Facial Recognition Keeps Jailing Innocent People Because the Math Doesn't Work at Scale

Source: Hacker News

A grandmother in North Dakota spent months in jail for fraud she did not commit. The identification came from an AI facial recognition system, and the story reported by the Grand Forks Herald traces a path now familiar to civil liberties lawyers and technologists: a surveillance image, a database search, a “match,” and an arrest that proceeded without adequate corroboration. The charges were eventually dropped, but the months spent in jail were not.

This case is not an aberration. It belongs to a documented series of wrongful arrests attributable to facial recognition misidentification in the United States. Robert Williams was arrested at his home in Detroit in January 2020, held for 30 hours, and later had his charges dropped. Nijeer Parks spent ten days in jail in New Jersey in 2019 over a shoplifting incident that took place 30 miles from where he was. Randal Reid was arrested in Georgia for crimes he allegedly committed in Louisiana, a state he says he had never visited. Porcha Woodruff, eight months pregnant, was arrested in Detroit in 2023. The cases differ in their details but share a structural cause.

The Two Problems That Keep Getting Conflated

To understand why this happens repeatedly, it helps to separate two distinct problems that travel under the label “facial recognition.” The first is 1:1 verification: you present your face, the system compares it to a reference image of you, and it confirms or denies the match. This is how phone face unlock works. Error rates are low, the search space is one, and the costs of a false match are bounded.
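
In code, the 1:1 decision reduces to comparing one face embedding against one enrolled reference and applying a threshold. The sketch below is illustrative only: it assumes embeddings are compared with cosine similarity and uses an arbitrary threshold of 0.6, and does not reflect any particular vendor's pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe_embedding: np.ndarray, enrolled_embedding: np.ndarray,
           threshold: float = 0.6) -> bool:
    """1:1 verification: accept only if the probe matches the single
    enrolled reference above a fixed similarity threshold."""
    return cosine_similarity(probe_embedding, enrolled_embedding) >= threshold
```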

The second problem is 1:N identification: you have a probe image (a frame from a surveillance camera), and you search it against a gallery of N enrolled subjects to find who the person is. Law enforcement facial recognition is almost always 1:N identification, and the error profile is fundamentally different.
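
A minimal sketch of the 1:N search, under the same embedding-and-cosine-similarity assumptions (the gallery layout and the top-20 cutoff are illustrative, not a standard), shows the difference in kind: the system does not answer yes or no, it returns a ranked candidate list drawn from whatever gallery it is given.

```python
import numpy as np

def identify(probe: np.ndarray, gallery: np.ndarray, top_k: int = 20):
    """1:N identification: score one probe embedding against every enrolled
    embedding and return the highest-scoring candidates, best first.

    gallery is an (N, d) matrix of enrolled face embeddings."""
    probe = probe / np.linalg.norm(probe)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ probe                    # one similarity score per subject
    ranked = np.argsort(scores)[::-1][:top_k]   # indices of the best matches
    return [(int(i), float(scores[i])) for i in ranked]
```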

The core issue is probabilistic. If a system has a false match rate of 0.1% per comparison, searching a gallery of one million entries produces an expected 1,000 false matches per query. According to a 2019 GAO report, the FBI's facial recognition unit could search approximately 641 million photos across federal and state databases. At that scale, even an algorithm with a 0.01% false match rate will produce tens of thousands of spurious candidates for any given probe image. Systems return a ranked list; a human analyst reviews the top results and makes a “confirmation.” The confirmation step is where the theoretical firewall between a candidate list and an arrest warrant is supposed to live. In practice, that firewall is porous.
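
The scaling itself is back-of-the-envelope expected-value arithmetic, sketched below under the simplifying assumption of independent comparisons at a fixed per-comparison false match rate; the 641 million figure is the GAO estimate cited above.

```python
# E[false matches per search] ≈ false_match_rate × gallery_size
scenarios = [
    (1e-3, 1_000_000),     # 0.1% FMR against a one-million-entry gallery
    (1e-4, 1_000_000),     # 0.01% FMR against the same gallery
    (1e-4, 641_000_000),   # 0.01% FMR against ~641 million searchable photos
]
for fmr, gallery_size in scenarios:
    expected = fmr * gallery_size
    print(f"FMR {fmr:.2%} over {gallery_size:>11,} entries -> ~{expected:>8,.0f} false matches")
```

The third line of output is roughly 64,100 expected false matches per search, which is why the candidate list handed to a human analyst is never short.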

The Demographic Disparity Is Not a Minor Footnote

The problem is compounded significantly by demographic disparities in algorithm performance. The National Institute of Standards and Technology (NIST) published a landmark evaluation in December 2019, NISTIR 8280, testing 189 algorithms from 99 developers across 18.27 million images. The report found that African American and Asian faces experienced false positive rates 10 to 100 times higher than Eastern European faces across most tested algorithms. African American women specifically faced false positive rates up to 34 times higher than white men in some systems. The disparity persists even in top-performing algorithms; it shrinks but does not disappear.

This finding was foreshadowed by Joy Buolamwini and Timnit Gebru’s Gender Shades study from the MIT Media Lab in 2018, which found that commercial facial analysis systems had error rates of up to 34.7% for darker-skinned women versus under 1% for lighter-skinned men. That research predated the NIST evaluation by nearly two years; the industry had ample warning.

The explanation is partly in the training data. Systems trained predominantly on lighter-skinned faces learn feature representations that generalize less well to other demographics. Melanin affects how facial features register under standard visible-light cameras. Faces at non-frontal angles, in poor lighting, or captured through low-resolution CCTV footage challenge algorithms that were validated on controlled, high-quality enrollment images. Surveillance footage, which is what law enforcement predominantly works with, is the worst possible input for systems measured on cooperative subjects in good lighting.

Every publicly documented wrongful arrest in the United States from law enforcement facial recognition has involved a person of color. That is not a coincidence. It is a direct consequence of applying systems with documented demographic performance disparities at scale, against low-quality images, in high-stakes investigative contexts.

The “Investigative Lead” Fiction

The standard defense from law enforcement agencies is that facial recognition results are treated as investigative leads, not identifications. The International Association of Chiefs of Police published model policy guidance in 2020 stating that facial recognition output should never be the sole basis for arrest and should always receive independent corroboration. The FBI has internal policies requiring trained examiner review before any investigative action. These policies exist because the technology’s advocates understand its limitations.

The documented wrongful arrests tell a different story about how policy translates to practice. In the Williams case, an analyst marked the result as a “possible match,” but detectives treated it operationally as an identification. In the case of Michael Oliver, another Detroit man wrongfully arrested in 2019, a detective showed an eyewitness a single photo after receiving the facial recognition result, a procedure courts have long recognized as deeply suggestive. In case after case, the “investigative lead only” framing erodes somewhere between the database output and the arrest warrant. Policy text cannot fix an institutional culture that treats a technology-produced candidate as presumptive guilt, particularly when there is no external audit of how often leads are corroborated before an arrest proceeds.

The legal response in the United States has been fragmented and largely local. San Francisco became the first major city to ban government use of facial recognition in May 2019. Boston, Portland, and a handful of other cities followed. Illinois has the most comprehensive state-level biometric privacy law, the Biometric Information Privacy Act, which requires informed consent and allows a private right of action; it has produced significant litigation including a $650 million settlement with Facebook. But there is no federal law governing law enforcement use of facial recognition, and multiple proposed bills have stalled in Congress.

The contrast with the European regulatory approach is sharp. The EU AI Act, adopted in 2024 with its provisions phasing in over the following years, prohibits real-time remote biometric identification in publicly accessible spaces by law enforcement, with narrow, defined exceptions for threats such as terrorism and serious crime, and classifies other remote biometric identification systems as high-risk. It is the most binding legal framework for facial recognition in operation globally. The United States has no equivalent, and absent federal action, the patchwork of city-level bans means that a technology restricted in Boston can be freely deployed in Bismarck.

What Would Actually Fix This

Two things need to happen simultaneously, and neither is sufficient alone. The first is procedural: mandatory blind photo array administration after a facial recognition result (meaning the analyst presenting the array does not know which candidate the algorithm ranked first), documented corroboration requirements before any arrest, and meaningful accountability when those requirements are violated. The second is legal: federal standards that treat facial recognition output as insufficient basis for arrest without independent corroboration, with civil liability attached to violations and public reporting requirements on use.

The technical improvements matter as well. Algorithms with smaller demographic disparities exist and can be identified through the NIST FRVT benchmarks. Procurement standards for law enforcement agencies should require demonstrated equitable performance across demographic groups. Agencies should be required to disclose which systems they use and to report on match accuracy in actual casework, not just controlled evaluations.
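
As a sketch of what “demonstrated equitable performance” could mean operationally: disaggregate false match rates by demographic group over impostor trials and compare the worst-performing group to the best. The data structure and names below are hypothetical, chosen for illustration; NIST FRVT publishes this kind of disaggregation for submitted algorithms.

```python
from collections import defaultdict

def false_match_rates_by_group(impostor_trials):
    """Per-group false match rates from impostor comparisons.

    impostor_trials is an iterable of (demographic_group, matched) pairs,
    where matched is True when the system wrongly declared two different
    people to be the same person."""
    totals = defaultdict(int)
    false_matches = defaultdict(int)
    for group, matched in impostor_trials:
        totals[group] += 1
        if matched:
            false_matches[group] += 1
    return {group: false_matches[group] / totals[group] for group in totals}

# A procurement check might then bound the disparity ratio, e.g.:
# rates = false_match_rates_by_group(trials)
# assert max(rates.values()) <= 2 * min(rates.values())  # illustrative bound
```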

None of this requires halting facial recognition research. The 1:1 verification use case is genuinely useful in appropriate contexts, and continued improvements in algorithm fairness are worth pursuing. The argument here is narrower: a technology that produces false positive rates an order of magnitude higher for certain demographic groups, applied at the scale of tens of millions of database entries, operated by institutions with a consistent pattern of treating outputs as more reliable than they are, needs binding procedural and legal constraints before its deployment in consequential investigative contexts.

The grandmother in North Dakota joins a list that should not exist. Each case generates a cycle of coverage and mild institutional acknowledgment before the policy discourse moves on. The ACLU has documented the pattern since at least 2016, and the cases keep accumulating. This is a systems problem with a systems solution, and the United States has not implemented it.
