When Facial Recognition Becomes Probable Cause: The Math Behind Wrongful Arrests
Source: hackernews
A case out of North Dakota that surfaced in March 2026 follows a pattern that should be familiar by now. A facial recognition system flags a candidate. Detectives treat that flag as a lead. Someone signs off on a warrant. An innocent person gets arrested and held. Months pass before the mistake surfaces. This particular case involved a grandmother caught up in a fraud investigation she had nothing to do with, but the structural failure is identical to cases that preceded it in Detroit, New Jersey, and Louisiana.
The conversation around these cases tends to focus on vendor negligence, on bias in training data, on the absence of regulation. Those are all legitimate concerns. But the more fundamental problem is mathematical, and it does not get resolved by better algorithms, more diverse datasets, or additional human review steps. The problem is that facial recognition, used the way law enforcement uses it, produces more false positives than true positives in almost every realistic deployment scenario.
The Math Before the Ethics
Face recognition systems are typically evaluated on metrics like true accept rate and false accept rate. A system that achieves 99.9% accuracy at a 0.1% false positive rate sounds highly reliable. In a one-to-one verification scenario, where you are confirming that someone is who they claim to be at a border crossing or a phone unlock screen, that confidence is mostly justified.
Law enforcement uses these systems in a fundamentally different mode. They upload a probe image and run it against a gallery of millions of faces, pulling back a ranked list of candidates. This is a one-to-many search, and the math changes completely.
Consider a database of ten million faces. You are looking for one specific person. A system with a 0.1% false positive rate generates 10,000 false candidate matches for every genuine search. The probability that the top-ranked result is actually the correct person depends on how distinctive the probe image is, on image quality, on the demographic composition of the gallery, and on whether the actual subject is even in the database. In many real-world deployments, the top match has a better chance of being wrong than right.
This is a textbook instance of the base rate fallacy: applying a conditional probability (the algorithm says it matches) without accounting for the prior probability (how likely is any given person in this ten-million-person database to be the suspect). The result is that a seemingly high-accuracy tool generates an output that carries almost no evidentiary weight in a large gallery search.
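The arithmetic above can be made concrete with a back-of-envelope positive predictive value calculation. The numbers below are illustrative assumptions, not the specifications of any real vendor system:

```python
# Back-of-envelope positive predictive value (PPV) for a one-to-many search:
# of all candidates the system flags, what fraction are the actual suspect?
# All rates here are illustrative assumptions, not vendor measurements.

def top_match_ppv(gallery_size, false_positive_rate,
                  true_positive_rate=0.99, subject_in_gallery=1.0):
    """Probability that a flagged candidate is actually the suspect.

    Expected true positives: at most one (the suspect, if enrolled).
    Expected false positives: everyone else in the gallery, times the
    per-comparison false positive rate.
    """
    expected_true = true_positive_rate * subject_in_gallery
    expected_false = false_positive_rate * (gallery_size - 1)
    return expected_true / (expected_true + expected_false)

ppv = top_match_ppv(gallery_size=10_000_000, false_positive_rate=0.001)
print(f"PPV: {ppv:.6f}")  # roughly 0.0001 — about 1 in 10,000 flags is correct
```

Even granting the system a generous 99% chance of finding the suspect when they are enrolled, the flood of roughly 10,000 expected false candidates swamps the single true one. That ratio is the base rate fallacy in numerical form.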
Demographic Disparities Are Not a Calibration Problem
The NIST Face Recognition Vendor Test (FRVT) program has been benchmarking commercial and research algorithms since 2000. Its 2019 demographic-effects report, which covered 189 algorithms from 99 developers, documented false positive rates 10 to 100 times higher for Asian and African-American faces than for Caucasian faces in one-to-one matching, with some algorithms also showing elevated one-to-many false positive rates, particularly for African-American women. This is not a study of fringe systems. These are the products law enforcement agencies procure and deploy.
Joy Buolamwini’s Gender Shades research at MIT, published in 2018, found that commercial gender classification systems from IBM, Microsoft, and Face++ had error rates as high as 34.7% for dark-skinned women, compared to 0.8% for light-skinned men. A 43-fold difference in error rate across demographic groups means that the people most likely to be wrongly flagged by these systems are also the people who already face disproportionate contact with law enforcement.
Improving training data helps at the margins. A more demographically balanced training corpus reduces the worst disparities. But it does not resolve the one-to-many math problem. A perfectly unbiased system with a 0.1% false positive rate still generates thousands of false candidates in a ten-million-face gallery. The bias compounds an already difficult statistical situation rather than creating it from scratch.
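One line of arithmetic per scenario makes the point. The per-comparison rates below are illustrative, chosen only to show how the gallery size dominates:

```python
# Expected false candidates per search in a 10-million-face gallery,
# at illustrative per-comparison false positive rates (assumptions,
# not measurements of any deployed system).
gallery = 10_000_000

scenarios = [
    ("baseline system (0.1% FPR)",                 0.001),
    ("10x worse for a disfavored group (1% FPR)",  0.010),
    ("10x better after debiasing (0.01% FPR)",     0.0001),
]

for label, fpr in scenarios:
    print(f"{label}: ~{int(gallery * fpr):,} false candidates")
```

Even the debiased scenario still nominates on the order of a thousand innocent people per search. Debiasing narrows the gap between groups; it does not turn a one-to-many dragnet into reliable identification.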
How a Weak Signal Becomes an Arrest
The North Dakota case, like earlier cases involving Robert Williams in Detroit and Nijeer Parks in New Jersey, illustrates how a probabilistic algorithmic output gets laundered into something that functions as evidence.
The typical chain works like this: an investigator uploads a photo to a facial recognition tool, receives a ranked candidate list, selects the top result, compares it against available photos of that candidate, and decides it looks like a match. That human review step is the nominally prescribed safeguard. In practice it tends to be confirmation bias operating on an already-filtered output. The investigator is not evaluating the full population of possible suspects; they are evaluating a person the algorithm nominated. The evaluation is made against surveillance footage that is frequently low-resolution, poorly lit, and shot at an angle that differs from the gallery images.
The warrant application that follows describes the investigator’s judgment that the photos match. The algorithmic step that produced the candidate is often not disclosed to defense attorneys. In many jurisdictions there are no requirements to document which facial recognition system was used, which version of the algorithm, what similarity threshold was applied, or what the false positive rate is for the relevant demographic group at that threshold. Defense counsel cannot challenge what they are not told exists.
Porcha Woodruff, who was eight months pregnant when Detroit police arrested her in 2023 based on a facial recognition match, filed a federal lawsuit describing this exact sequence. The charges were eventually dismissed after investigators determined she could not have been the person in the footage. The arrest, the detention, and the health risks of holding a pregnant woman did not get dismissed alongside the charges.
The Human-in-the-Loop Myth
The standard defense of facial recognition in law enforcement contexts is that humans make the final call. The algorithm is a lead-generation tool, not a decision-maker. That argument would carry real weight in a narrow version of events where a trained forensic examiner compares the algorithmic output against a complete, unfiltered pool of candidates, using a standardized methodology subject to peer review.
That is not the process being used. Investigators reviewing facial recognition candidates are not forensic scientists. There is no equivalent of the Daubert admissibility standard applied to facial recognition outputs before they inform warrant applications. The comparison is made against candidates the algorithm has already pre-selected, which means the human reviewer is not independently weighing evidence; they are validating a prior algorithmic judgment under conditions that tend to produce confirmation rather than scrutiny.
This framing also conveniently insulates the algorithmic tools from disclosure requirements. The algorithm is not making the arrest decision, so it need not be disclosed as evidence. But the algorithm did surface this specific person as a candidate, so without it the investigation would never have targeted them. The logic allows law enforcement to benefit from the tool while avoiding accountability for its outputs.
What Would Actually Help
Several jurisdictions have moved to restrict or ban law enforcement use of facial recognition. San Francisco banned it in 2019. Portland passed a ban covering both public and private use in 2020. Massachusetts included a moratorium in its 2020 police reform legislation. These are meaningful constraints, and they are worth defending, but they are geographically patchwork and easily circumvented by federal agencies or by local departments using third-party data brokers.
For jurisdictions that permit facial recognition use, a minimum viable disclosure standard would require that defense attorneys receive: the name and version of the facial recognition system used, the similarity threshold applied, the false positive rate for the relevant demographic group at that threshold, and the complete ranked output of the search rather than just the candidate investigators decided to pursue. Without that disclosure, defendants cannot meaningfully contest the evidentiary basis for their arrest.
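As a concrete sketch, the disclosure items listed above could be captured in a structured record like the following. The field names are hypothetical illustrations, not drawn from any statute or existing system:

```python
from dataclasses import dataclass, field

@dataclass
class SearchDisclosure:
    """Hypothetical disclosure record for one facial recognition search,
    structured so defense counsel can contest each parameter independently.
    Field names are illustrative, not taken from any real statute."""
    system_name: str                # vendor product used for the search
    algorithm_version: str          # exact build; error rates vary by release
    similarity_threshold: float     # score cutoff applied to the candidate list
    demographic_fpr: float          # false positive rate for the relevant
                                    # demographic group at that threshold
    ranked_candidates: list = field(default_factory=list)
                                    # the complete ranked output, not just
                                    # the candidate investigators pursued
```

The design point is the last field: disclosing only the pursued candidate hides how crowded the candidate list was, which is exactly the information a defense attorney needs to argue that the match carried little evidentiary weight.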
More substantively, the statistical properties of one-to-many gallery search mean that facial recognition output should not constitute probable cause without corroborating evidence that is independently derived from the algorithmic match. Using a candidate list as the primary basis for a warrant, without independent corroboration, produces a procedurally valid document that rests on epistemically weak grounds.
The North Dakota grandmother’s case is not a software defect that vendors will patch in the next release. It is the expected output of a probabilistic system deployed in conditions where its error rate guarantees wrongful identifications at scale, combined with a procedural infrastructure that obscures those errors until someone spends months in jail. The relevant question is not whether this will happen again. It will. The question is whether the legal framework around these tools changes before the next wrongful arrest, or after.