AI Chatbots Reinforce Bias Against East Germans, Even on Body Temperature

A study finds that AI models rate East German states lower, even on neutral facts. Such baked-in bias could cause harm in hiring, credit, and education decisions unless systems are tested and constrained.


Published on: Oct 12, 2025

Study: AI language models encode regional bias against East Germans

Large language models like ChatGPT and the German model LeoLM are not neutral. A new study from Munich University of Applied Sciences finds they consistently rate East German states worse across many attributes, even when logic says they shouldn't.

The analysis by Anna Kruspe and Mila Stillman, titled "Saxony-Anhalt is the Worst," shows that models adopt and reproduce social stereotypes found in training data. Saxony-Anhalt was hit hardest in the tests.

What the researchers tested

Models were prompted to score residents of each of Germany's 16 states on traits such as diligence, attractiveness, likeability, arrogance, and xenophobia. The team also probed neutral, factual attributes, like average body temperature, to see if bias appears without cultural context.

The goal was simple: check if LLMs apply a consistent pattern that depresses scores for East German states regardless of the trait.
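
A test grid of this kind can be sketched in a few lines. The prompt wording, the 1-to-10 scale, and the name `build_prompts` are illustrative assumptions, not the study's actual protocol; only the 16 states and the trait names come from the description above.

```python
from itertools import product

# The 16 German federal states (English names).
STATES = [
    "Baden-Württemberg", "Bavaria", "Berlin", "Brandenburg", "Bremen",
    "Hamburg", "Hesse", "Lower Saxony", "Mecklenburg-Vorpommern",
    "North Rhine-Westphalia", "Rhineland-Palatinate", "Saarland",
    "Saxony", "Saxony-Anhalt", "Schleswig-Holstein", "Thuringia",
]

# Traits named in the article; the study may have used more.
TRAITS = ["diligence", "attractiveness", "likeability", "arrogance", "xenophobia"]

# Hypothetical wording: the study's real prompts are not given in this article.
PROMPT_TEMPLATE = (
    "On a scale from 1 to 10, rate the {trait} of people from {state}. "
    "Answer with a single number."
)


def build_prompts(states=STATES, traits=TRAITS):
    """Return one prompt record per (state, trait) combination."""
    return [
        {"state": s, "trait": t, "prompt": PROMPT_TEMPLATE.format(trait=t, state=s)}
        for s, t in product(states, traits)
    ]


if __name__ == "__main__":
    prompts = build_prompts()
    print(len(prompts), "prompts;", prompts[0]["prompt"])
```

Each record keeps the state and trait next to the prompt text, so the model's numeric answers can later be grouped by either dimension.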

Key findings

Across models, East German states received lower scores for positive traits like diligence and attractiveness. Oddly, they also received lower scores for negative traits like laziness. That leaves contradictions: people are allegedly less diligent and less lazy at the same time.
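
One way to surface this kind of contradiction automatically is to compare the East-minus-West score gap for a trait and for its opposite. The sketch below is an assumption-laden illustration: the East/West grouping (which leaves Berlin out of the East group), the function names, and the toy numbers are all made up here and are not results from the study.

```python
from statistics import mean

# Assumed grouping for illustration; how the study treated Berlin is not stated here.
EAST = {"Brandenburg", "Mecklenburg-Vorpommern", "Saxony", "Saxony-Anhalt", "Thuringia"}


def east_west_gap(scores_by_state):
    """Mean score of East states minus mean score of all other states."""
    east = [v for s, v in scores_by_state.items() if s in EAST]
    west = [v for s, v in scores_by_state.items() if s not in EAST]
    return mean(east) - mean(west)


def contradictory_pairs(scores_by_trait, antonym_pairs):
    """Flag (trait, opposite) pairs whose East-West gaps point the same way.

    If East states score lower on both "diligence" and "laziness", the model
    is following a "lower numbers" pattern rather than a coherent judgment.
    """
    flagged = []
    for trait, opposite in antonym_pairs:
        gap_a = east_west_gap(scores_by_trait[trait])
        gap_b = east_west_gap(scores_by_trait[opposite])
        if gap_a * gap_b > 0:  # same sign on opposite traits
            flagged.append((trait, opposite))
    return flagged


if __name__ == "__main__":
    # Toy numbers, invented for this example only.
    toy = {
        "diligence": {"Saxony": 6, "Thuringia": 6, "Bavaria": 8, "Hesse": 8},
        "laziness": {"Saxony": 3, "Thuringia": 3, "Bavaria": 5, "Hesse": 5},
    }
    print(contradictory_pairs(toy, [("diligence", "laziness")]))
```

A sign check is deliberately crude. In practice one would also want enough repeated samples per state to tell a real gap from noise.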

Bias persisted on neutral facts. Only GPT-4 returned the correct uniform body temperature for all states. Several other models assigned lower temperatures to East Germans. The English version of GPT-4 behaved differently, but still treated all Germans equally rather than differentiating by state.

The pattern is learned: if a model internalizes that "numbers for these states tend to be lower," it generalizes that template, even where no regional variation exists.
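
A neutral-attribute probe can be checked mechanically: for a quantity that should not vary by state, any state whose answer departs from the common value is a warning sign. The following is a minimal sketch, assuming the model's answers have already been parsed into numbers. The function name, the tolerance, and the sample values are hypothetical and are not taken from the study.

```python
from statistics import median


def invariant_violations(values_by_state, tolerance=0.05):
    """Return the states whose value deviates from the median by more than `tolerance`.

    Intended for attributes that should be identical everywhere, such as
    average body temperature. An empty result means the model was uniform.
    """
    reference = median(values_by_state.values())
    return {
        state: value
        for state, value in values_by_state.items()
        if abs(value - reference) > tolerance
    }


if __name__ == "__main__":
    # Invented answers in degrees Celsius, for illustration only.
    answers = {
        "Bavaria": 37.0,
        "Hesse": 37.0,
        "Hamburg": 37.0,
        "Saxony-Anhalt": 36.5,
        "Thuringia": 36.8,
    }
    print(invariant_violations(answers))
```

Using the median as the reference assumes that most states receive the same answer. If a model shifts the majority of states, comparing against a fixed expected value would be the safer choice.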

Why this matters for real decisions

These biases can translate into harm if LLMs are used in hiring screens, credit checks, educational placement, or any automated assessment. Subtle markers of origin (phrasing, idioms, dialect cues) can be downweighted by the model, reducing how it values comparable qualifications from East Germans.

"Debiasing prompts" helped only marginally. Explicit instructions to ignore origin did not reliably remove the pattern, indicating the bias is baked into learned representations rather than surface-level behavior.

Practical steps for researchers and teams using LLMs

Test with counterfactuals: swap origin markers (state, accent, place names) while holding content constant. Scores should remain stable.
Add neutral-attribute probes: include checks like body temperature or other invariants to detect pattern spillover.
Calibrate outputs: post-process scores so protected or proxy attributes (region, dialect) do not drive differences. Log residual disparities.
Constrain prompts and systems: avoid asking for state-level "personality" assessments; prefer task-specific, evidence-grounded queries.
Ground responses in data: use retrieval with verified sources for factual items; block unsupported generalizations about groups.
Measure group fairness: track metrics (e.g., demographic parity of recommendations) across states; require parity within confidence bounds.
Audit for proxies: detect and neutralize linguistic features that correlate with origin (spelling variants, idioms) when they are irrelevant to the task.
Human review for high stakes: keep a human-in-the-loop for hiring, credit, housing, and education decisions.
Document model behavior: publish bias evaluations, known limitations, and safe-use guidelines as part of your model cards and system cards.
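
Two of the steps above, counterfactual testing and group-fairness measurement, lend themselves to a short sketch. Everything named here is an assumption for illustration: `score_fn` stands in for whatever call wraps your model, `biased_stub` is a deliberately biased fake scorer used only to show the check firing, and the decision data is invented.

```python
from typing import Callable, Dict, List, Tuple


def counterfactual_spread(
    score_fn: Callable[[str], float],
    template: str,
    origins: List[str],
) -> Tuple[float, Dict[str, float]]:
    """Score the same text with only the origin marker swapped.

    Returns (max score - min score, per-origin scores). A spread of 0 means
    the origin marker did not change the score.
    """
    scores = {o: score_fn(template.format(origin=o)) for o in origins}
    return max(scores.values()) - min(scores.values()), scores


def parity_gap(decisions_by_group: Dict[str, List[int]]) -> float:
    """Largest difference in positive-decision rate between any two groups.

    Decisions are 1 (recommended) or 0 (not recommended). Empty groups are skipped.
    """
    rates = {g: sum(d) / len(d) for g, d in decisions_by_group.items() if d}
    return max(rates.values()) - min(rates.values())


def biased_stub(text: str) -> float:
    """Hypothetical scorer that docks one point for a specific origin."""
    return 7.0 - (1.0 if "Saxony-Anhalt" in text else 0.0)


if __name__ == "__main__":
    template = "Applicant from {origin}; five years of experience as an electrician."
    spread, scores = counterfactual_spread(
        biased_stub, template, ["Bavaria", "Hamburg", "Saxony-Anhalt"]
    )
    print("counterfactual spread:", spread, scores)

    # Invented screening outcomes per group.
    print("parity gap:", parity_gap({"east": [1, 0, 0, 1], "west": [1, 1, 1, 0]}))
```

In a real pipeline both numbers would be compared against thresholds chosen in advance, with enough samples to put confidence bounds around them.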

Policy context

European and German policies emphasize non-discrimination in AI use. Teams deploying LLMs for evaluation or decision support should align with these expectations, including rigorous bias testing and documentation.

Bottom line

LLMs inherit social bias and can apply it even to neutral facts. Prompt-level fixes are weak. If your work touches people, you need a testing plan, constraints, and accountability baked into the pipeline.

