A new study has found that leading AI hiring tools built on large language models (LLMs) consistently favor black and female candidates over white and male applicants when evaluated in realistic job screening scenarios — even when explicit anti-discrimination prompts are used.
The research, titled “Robustly Improving LLM Fairness in Realistic Settings via Interpretability,” examined models like OpenAI’s GPT-4o, Anthropic’s Claude 4 Sonnet and Google’s Gemini 2.5 Flash and revealed that they exhibit significant demographic bias “when realistic contextual details are introduced.”
These details included company names, descriptions from public careers pages and selective hiring instructions such as “only accept candidates in the top 10%.”
Once these elements were added, models that previously showed neutral behavior began recommending black and female applicants at higher rates than their equally qualified white and male counterparts.
The study measured “12% differences in interview rates” and noted that “biases… consistently favor Black over White candidates and female over male candidates.”
This pattern emerged across both commercial and open-source models — including Gemma-3 and Mistral-24B — and persisted even when anti-bias language was built into the prompts. The researchers concluded that these external instructions are “fragile and unreliable” and can easily be overridden by subtle signals “such as college affiliations.”
In one key experiment, the team modified resumes to include affiliations with institutions known to be racially associated — such as Morehouse College or Howard University — and found that the models inferred race and altered their recommendations accordingly.
What’s more, these shifts in behavior were “invisible even when inspecting the model’s chain-of-thought reasoning,” as the models rationalized their decisions with generic, neutral explanations.
The authors described this as a case of “CoT unfaithfulness,” writing that LLMs “consistently rationalize biased outcomes with neutral-sounding justifications despite demonstrably biased decisions.”
In fact, even when identical resumes were submitted with only the name and gender changed, the model would approve one and reject the other — while justifying both with equally plausible language.
To address the problem, the researchers introduced “internal bias mitigation,” a method that changes how the models process race and gender internally instead of relying on prompts.
Their technique, called “affine concept editing,” works by neutralizing specific directions in the model’s activations tied to demographic traits.
The fix was effective. It “consistently reduced bias to very low levels (typically under 1%, always below 2.5%)” across all models and test cases — even when race or gender was only implied.
Performance stayed strong, with “under 0.5% for Gemma-2 and Mistral-24B, and minor degradation (1-3.7%) for Gemma-3 models,” according to the paper’s authors.
The study’s implications are significant as AI-based hiring systems proliferate in both startups and major platforms like LinkedIn and Indeed.
“Models that appear unbiased in simplified, controlled settings often exhibit significant biases when confronted with more complex, real-world contextual details,” the authors cautioned.
They recommend that developers adopt more rigorous testing conditions and explore internal mitigation tools as a more reliable safeguard.
“Internal interventions appear to be a more robust and effective strategy,” the study concludes.
An OpenAI spokesperson told The Post: “We know AI tools can be useful in hiring, but they can also be biased.”
“They should be used to help, not replace, human decision-making in important choices like job eligibility.”
The spokesperson added that OpenAI “has safety teams dedicated to researching and reducing bias, and other risks, in our models.”
“Bias is an important, industry-wide problem and we use a multi-prong approach, including researching best practices for adjusting training data and prompts to result in less biased results, improving accuracy of content filters and refining automated and human monitoring systems,” the spokesperson added.
“We are also continuously iterating on models to improve performance, reduce bias, and mitigate harmful outputs.”
The full paper and supporting materials are publicly available at GitHub. The Post has sought comment from Anthropic and Google.