A Verge journalist sat down in March 2026 and was interviewed for a job by a live conversational AI, one that asked follow-up questions, responded to answers, and fed everything into a downstream scoring system. The piece treats the experience as a novelty worth documenting, yet the practice is already common enough that millions of initial screening interviews are conducted by software rather than a person. What the article does is put a face on something that usually happens quietly inside the HR tech stack. The question worth examining is what these systems are actually scoring when they evaluate a candidate.
Two Layers, Not One
There are two separate technical components in an AI interview, and they get conflated in most coverage. The first is the conversational layer: the bot that asks questions, responds to what you say, and keeps the session moving. This is essentially a large language model running a structured interview script with some flexibility for follow-up. Companies like Paradox, whose conversational AI product Olivia handles recruiting workflows for large employers including McDonald’s and Nestlé, operate primarily at this layer. It is recognizably chatbot territory, with familiar capabilities and familiar limitations.
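To make the layer concrete, here is a minimal sketch of what the conversational layer amounts to: a fixed question script with one model-generated follow-up per answer. The `ask_llm` function is a hypothetical placeholder, not any vendor's API, and the script itself is invented.

```python
# Minimal sketch of the conversational layer: a scripted interview with
# model-generated follow-ups. `ask_llm` is a hypothetical placeholder for
# whichever hosted language model a vendor actually calls.

SCRIPT = [
    "Tell me about a project you led recently.",
    "Describe a time you disagreed with a manager.",
    "Why are you interested in this role?",
]

def ask_llm(prompt: str) -> str:
    # Placeholder: a real system would call a hosted language model here.
    return "Can you say more about how you measured the outcome?"

def run_interview(get_answer) -> list[tuple[str, str]]:
    transcript = []
    for question in SCRIPT:
        answer = get_answer(question)
        transcript.append((question, answer))
        # One generated follow-up per scripted question, keyed to the answer.
        follow_up = ask_llm(
            "You are conducting a structured job interview. The candidate was "
            f"asked {question!r} and answered {answer!r}. "
            "Ask one brief, relevant follow-up question."
        )
        transcript.append((follow_up, get_answer(follow_up)))
    # The transcript is what gets handed to the downstream scoring layer.
    return transcript

if __name__ == "__main__":
    print(run_interview(lambda q: input(f"{q}\n> ")))
```

The structure is deliberately unremarkable: the interesting decisions all live in what happens to the transcript afterward.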
The second layer is the analysis and scoring engine. This is where a candidate’s audio, video, and transcribed text get processed to produce numeric scores against dimensions like “communication skills,” “critical thinking,” or “culture fit.” This layer has a substantially more fraught technical history, and the two components are often sold together in ways that obscure how different they are.
What HireVue Built and Then Walked Back
HireVue, founded in 2004 and probably the most studied platform in this space, spent years adding AI analysis capabilities to asynchronous video interviews. By the mid-2010s, it was offering facial expression analysis alongside voice and language analysis, claiming the combination could predict traits like “willingness to learn” and “emotional stability.”
The Electronic Privacy Information Center filed a complaint with the FTC in 2019, arguing that HireVue’s facial analysis was both invasive and scientifically unsupported. Under sustained criticism from researchers and privacy advocates, HireVue removed facial expression analysis from its product in January 2021. The company framed the decision as responding to evolving standards rather than acknowledging that the capability lacked validity.
What remained was NLP analysis of transcript content combined with voice analysis. These are less dramatic than reading facial micro-expressions, but the validity questions are similar.
What These Systems Are Measuring
NLP-based scoring of interview responses typically works by comparing a candidate’s language patterns against a profile built from high-performing employees in the same role. If a company’s top salespeople tend to use language associated with assertiveness and goal orientation, the model scores new candidates higher when their word choices cluster toward that profile.
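A minimal sketch of that profile-matching logic, using plain TF-IDF vectors so the mechanism stays visible. The transcripts are invented and real systems use proprietary features, but the structure, building a centroid from incumbents and scoring candidates by similarity to it, is the part that matters.

```python
# Illustrative profile-matching: score a candidate by similarity to a
# language profile built from top performers' transcripts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

top_performer_transcripts = [  # invented examples
    "I set an aggressive quarterly target and exceeded it by ten percent.",
    "I own my pipeline end to end and close deals ahead of schedule.",
]
candidate_transcript = "I worked with my team to support our customers."

vectorizer = TfidfVectorizer(stop_words="english")
profile_matrix = vectorizer.fit_transform(top_performer_transcripts)

# The "profile" is just the centroid of the top performers' vectors.
profile_centroid = np.asarray(profile_matrix.mean(axis=0))

candidate_vector = vectorizer.transform([candidate_transcript])
score = cosine_similarity(candidate_vector, profile_centroid)[0, 0]
print(f"profile-similarity score: {score:.3f}")
```

Nothing in that score measures aptitude; it measures resemblance to whoever populated the training set, which is exactly the point the next paragraph makes.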
The problem becomes clear once stated: the model trains on existing employees, who were selected through existing hiring processes that carry their own patterns of inclusion and exclusion. The model learns to replicate past decisions rather than independently identify who will perform well, encoding whatever was already present.
Some vendors have moved toward more sophisticated approaches, using transformer-based language models to analyze semantic content rather than simple keyword matching. The scoring becomes more nuanced, but the training data problem does not go away. If the model is trained to predict manager ratings rather than long-term performance or business outcomes, it will optimize for the characteristics that make a candidate appear capable to the people doing the rating, and that is a narrower and more biased target than it might seem.
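The same sketch with the keyword matching swapped for semantic embeddings, assuming the open-source sentence-transformers library and a small public model rather than any vendor's actual stack:

```python
# Profile-matching with semantic embeddings instead of word counts.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small public model

top_performer_transcripts = [  # invented examples
    "I set an aggressive quarterly target and exceeded it by ten percent.",
    "I own my pipeline end to end and close deals ahead of schedule.",
]
candidate_transcript = "I worked with my team to support our customers."

profile = model.encode(top_performer_transcripts).mean(axis=0)
candidate = model.encode(candidate_transcript)

# Robust to paraphrase, which is what "more nuanced" means in practice,
# but the candidate is still being compared to an incumbent centroid.
print(float(util.cos_sim(profile, candidate)))
```

The upgrade changes what counts as a match, not whom the candidate is matched against.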
Prosody analysis is similarly indirect. Voice speed, pitch variation, the frequency of filled pauses, and response latency are measurable signals. They correlate with familiarity with interview conventions, comfort in formal speech settings, and fluency in the interview language. Those factors in turn correlate with education level, socioeconomic background, and whether English is a first language. Whether any of it correlates with job performance in a meaningful way is a separate question, and the published evidence is thin.
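The signals named above are straightforward to extract, which is part of their appeal. A sketch using librosa, with an invented file path and illustrative thresholds; speaking rate and filled-pause counts would additionally require a timestamped transcript from speech recognition.

```python
# Sketch of basic prosodic feature extraction from a recorded answer.
import librosa
import numpy as np

y, sr = librosa.load("candidate_answer.wav", sr=16000)  # hypothetical file

# Pitch track via probabilistic YIN; variation = std of voiced frames.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
pitch_variation_hz = np.nanstd(f0[voiced_flag])

# Pause behavior via simple energy-based silence detection.
speech_intervals = librosa.effects.split(y, top_db=30)  # illustrative threshold
speech_time = sum(end - start for start, end in speech_intervals) / sr
pause_ratio = 1 - speech_time / (len(y) / sr)

print(f"pitch variation: {pitch_variation_hz:.1f} Hz")
print(f"pause ratio:     {pause_ratio:.2f}")
```

Ease of measurement is doing a lot of work here: these features get used because they can be computed reliably, not because anyone has shown they predict performance.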
The Validity Gap
The core psychometric question is whether scores from these systems predict job performance better than structured human interviews, unstructured human interviews, or no interview at all. Vendors tend to rely on internal datasets, correlations with manager ratings that may themselves carry bias, or efficiency metrics like reduction in time-to-hire. Independent academic validation of specific commercial products is rare, partly because vendors do not share their models or training data with outside researchers.
A 2019 review in the Journal of Applied Psychology examining AI-based candidate assessment found that while structured interviews generally outperform unstructured ones, the specific claims made by commercial AI assessment vendors frequently outran the supporting evidence. The gap between “our scores correlate with manager ratings” and “our scores predict who will succeed in the role” is substantial, and vendors tend to conflate the two. Correlation with existing human judgments is not the same as predictive validity for job outcomes, but it is much easier to measure, and the distinction rarely appears in sales materials.
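The distinction the paragraph above draws can be written down directly. A sketch with placeholder arrays standing in for real data, showing that the two validation claims are correlations against different ground truths and need not agree:

```python
# Same scores, two validation targets. Arrays are illustrative
# placeholders, not real data.
import numpy as np
from scipy.stats import pearsonr

ai_scores       = np.array([72, 85, 64, 90, 58, 77, 81, 69])
manager_ratings = np.array([ 3,  5,  3,  5,  2,  4,  4,  3])  # at hire
job_outcomes    = np.array([ 4,  3,  4,  2,  3,  5,  3,  4])  # a year later

r_rating, _ = pearsonr(ai_scores, manager_ratings)
r_outcome, _ = pearsonr(ai_scores, job_outcomes)

# "Our scores correlate with manager ratings" reports the first number.
# Predictive validity for the job is the second, and it is the one
# vendors rarely measure.
print(f"correlation with manager ratings: {r_rating:.2f}")
print(f"correlation with later outcomes:  {r_outcome:.2f}")
```

The first number is available the day the model ships; the second requires following hires for months and tying scores to outcomes nobody is contractually obliged to collect.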
The Legal Response
Legislatures have begun to address these concerns. Illinois passed the Artificial Intelligence Video Interview Act in 2019, requiring employers to notify candidates before the interview that AI will be used to analyze it, explain what general characteristics the system uses to evaluate them, and obtain their consent. It was among the first laws to impose transparency requirements specifically on AI hiring tools, and its passage predated significant public awareness of how common those tools had become.
New York City took a broader approach with Local Law 144, which took effect in July 2023. It requires that automated employment decision tools used in NYC hiring undergo independent bias audits before deployment and that candidates be notified of their use. The audit requirement places a real obligation on vendors and employers to produce evidence that their systems do not have a disparate impact on protected groups, though enforcement has been uneven.
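The core calculation in such an audit is simple to state: selection rates per group and impact ratios against the highest-rate group. A sketch with invented counts; Local Law 144 requires computing and publishing impact ratios, while the four-fifths threshold flagged below comes from longstanding EEOC guidance rather than from the law itself.

```python
# Sketch of the impact-ratio calculation at the center of a bias audit.
from collections import Counter

# (group, advanced_past_screen) pairs -- hypothetical audit data.
outcomes = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 35 + [("B", False)] * 65
)

applied = Counter(group for group, _ in outcomes)
selected = Counter(group for group, passed in outcomes if passed)

rates = {g: selected[g] / applied[g] for g in applied}
best = max(rates.values())

for group, rate in sorted(rates.items()):
    ratio = rate / best
    flag = "  <- below four-fifths threshold" if ratio < 0.8 else ""
    print(f"group {group}: selection rate {rate:.2f}, "
          f"impact ratio {ratio:.2f}{flag}")
```

The arithmetic is trivial; the contested parts of an audit are which groups get counted, which decision point counts as "selection," and what happens when a ratio comes back low.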
The EEOC has confirmed in guidance on AI in employment that existing anti-discrimination statutes, including Title VII and the ADA, apply to algorithmic hiring tools the same way they apply to human decisions. A model that disproportionately screens out candidates on the basis of race, sex, or disability does not acquire legal protection because the decision was automated. Whether existing enforcement mechanisms are adequate to address algorithmic discrimination at scale is a different question, and one that remains open.
The Structural Problem
From a software perspective, these systems are doing something coherent: they apply machine learning to audio, video, and text signals to produce numeric outputs. What is difficult is the interpretive step between those outputs and the claim that they reveal something about a person’s professional capability.
Interview performance and job performance are loosely connected even under ideal conditions. Interviews measure a narrow slice of behavior in an artificial context, and they reward preparation, comfort with formal speech norms, and familiarity with what interviewers expect. The more accurately AI scoring detects those signals, the better it gets at selecting people who perform well in interviews, which is not the same as selecting people who will perform well in the role.
That gap did not originate with AI. It is a known limitation of interviews as a selection tool, documented across decades of industrial-organizational psychology research. What AI systems do is apply that limitation at scale, with higher throughput and lower visibility into how the decision was made.
There is also a practical effect on candidates that is worth naming. Once job seekers know that AI is scoring their prosody and word choice, interview preparation shifts toward gaming those specific signals: speaking at a measured pace, avoiding filled pauses, using language patterns associated with confident professionals. This is not categorically different from practicing for human interviews, but it concentrates the benefit with people who know the system well enough to optimize for it, which tends to be people already adjacent to professional networks where that knowledge circulates.
The conversational layer that the Verge journalist experienced, however seamless it gets, is surface. What matters is what the scoring layer is optimized for, and that question tends to get less attention than the novelty of talking to a bot.