Polish beats English for prompting AI, Microsoft-UMD study finds
Polish came out on top as the most effective language for prompting large language models, ahead of 25 other languages. English placed sixth, despite its heavy presence in training corpora. The study, run by researchers from Microsoft and the University of Maryland, evaluated long-context behavior across multiple model families.
The core task was long-context retrieval using a needle-in-a-haystack setup: hide a keyword in a lengthy passage, then ask the model to find it. Across OpenAI, Google Gemini, Qwen, Llama, and DeepSeek systems, Polish showed the highest accuracy at 88%. English landed at 83.9%, while Chinese trailed at 62.1%.
Top 10 languages for long-context prompting
- Polish: 88%
- French: 87%
- Italian: 86%
- Spanish: 85%
- Russian: 84%
- English: 83.9%
- Ukrainian: 83.5%
- Portuguese: 82%
- German: 81%
- Dutch: 80%
What the researchers suggest
Performance gaps between high- and low-resource languages widen as context length grows. The researchers point to pretraining data availability, script and language family, and tokenizer behavior as likely drivers. They also observed that Latin-script languages and languages with large Wikipedia corpora tend to score better on the retrieval task.
Why this matters for research and engineering
If your work relies on long-context behavior (literature triage, systematic reviews, RAG over long documents, compliance scanning), prompt language isn't a small detail. It can move accuracy by several points without changing the model. That's a cheap lever to test before you throw more compute or a bigger context window at the problem.
Practical steps you can try now
- Run bilingual prompts: state the instruction in Polish and ask for the answer in English. Many models follow this cleanly (a prompt-building sketch follows this list).
- Code-switch key directives: keep documents in your native language, but write control phrases (e.g., "search for," "return exact match") in Polish.
- Shorten contexts for lower-performing languages or chunk them more aggressively. Track accuracy versus chunk size per language.
- Tokenizer awareness matters: compare token counts for the same prompt across languages. Fewer, cleaner tokens often help retrieval (see the token-count sketch below).
- RAG pipelines: index parallel translations (source + Polish) for queries, and re-rank across both (see the dual-query sketch below). Watch for machine-translation artifacts; evaluate with held-out needles.
- Reproduce the test in your environment: plant unique keys in long passages and measure hit rate as you vary language, model, and context length (a minimal harness sketch is included below).
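First, a minimal sketch of the bilingual-prompt pattern: a Polish control instruction wrapped around an English document. The Polish wording here is an illustrative translation, not text from the study, and the document is a toy example.

```python
# Bilingual prompt sketch: Polish control instructions, English answer.
# The Polish wording is an illustrative translation, not from the study.

def build_bilingual_prompt(document: str) -> str:
    instruction_pl = (
        "Przeszukaj poniższy dokument i znajdź ukryte słowo kluczowe. "
        # "Search the document below and find the hidden keyword."
        "Odpowiedz po angielsku, podając dokładne dopasowanie."
        # "Answer in English, giving the exact match."
    )
    return f"{instruction_pl}\n\n--- DOKUMENT ---\n{document}\n--- KONIEC ---"

doc = "Filler text. The secret keyword is zanzibar. More filler text."
print(build_bilingual_prompt(doc))  # send this string to your model of choice
```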
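For the token-count comparison, a quick sketch using OpenAI's tiktoken. This only reflects OpenAI-family tokenizers; other model families ship their own, so repeat the measurement with the tokenizer of each model you deploy. The sample sentences are illustrative translations.

```python
# Compare token counts for the same instruction across languages.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

prompts = {
    "en": "Search the document and return the exact hidden keyword.",
    "pl": "Przeszukaj dokument i zwróć dokładne ukryte słowo kluczowe.",
    "de": "Durchsuche das Dokument und gib das exakte versteckte Schlüsselwort zurück.",
}

for lang, text in prompts.items():
    n = len(enc.encode(text))
    print(f"{lang}: {n} tokens  ({text!r})")
```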
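For the dual-language RAG idea, here is a sketch of query-side merging: score every chunk against both the source-language query and its Polish translation, then keep the best score. `embed` and `translate_pl` are placeholders for your embedding model and translation step, not named APIs.

```python
# Dual-language retrieval sketch: query in the source language and in a
# Polish translation, then merge results by best score per chunk.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_query(query: str, chunks: list[str], embed, translate_pl, top_k: int = 5):
    """embed: text -> np.ndarray; translate_pl: text -> Polish text (placeholders)."""
    q_vecs = [embed(query), embed(translate_pl(query))]
    scored = [
        (max(cosine(qv, embed(chunk)) for qv in q_vecs), chunk)
        for chunk in chunks
    ]
    return sorted(scored, reverse=True)[:top_k]
```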
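Finally, a minimal needle-in-a-haystack harness in the spirit of the study's setup; the paper's exact protocol may differ. `ask_model` is a placeholder for your own model-call wrapper, and exact substring match is one simple scoring choice.

```python
# Minimal needle-in-a-haystack harness: plant a unique key in filler text,
# ask the model to retrieve it, and track hit rate per (language, length).
import random
import string

def random_key(n: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=n))

def make_haystack(filler_sentence: str, n_sentences: int, needle: str) -> str:
    """Insert the needle sentence at a random position among filler sentences."""
    sentences = [filler_sentence] * n_sentences
    sentences.insert(random.randrange(n_sentences + 1), needle)
    return " ".join(sentences)

def run_trial(ask_model, instruction: str, filler: str, n_sentences: int) -> bool:
    """ask_model is a placeholder: prompt string -> answer string."""
    key = random_key()
    needle = f"The secret keyword is {key}."
    prompt = f"{instruction}\n\n{make_haystack(filler, n_sentences, needle)}"
    return key in ask_model(prompt)  # exact-match scoring; adjust as needed

# Vary n_sentences (context length) and the instruction language, then
# average run_trial over many repetitions to estimate hit rate.
```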
Caveats and open questions
This ranking reflects long-context retrieval and instruction adherence, not general reasoning or factual accuracy. Results may shift with model versions, tokenizer updates, or domain vocabulary. Treat language choice as a variable to optimize, not a settled assumption.
Bottom line: if long-context accuracy matters in your stack, include Polish in your next prompt-language evaluation. Measure, compare, then standardize what wins for your workload.