We tested several AI translation tools, and some introduced outright falsehoods into the translated content, inventing events and stories. These errors are known as hallucinations. See our previous article on the subject.
Alibaba Raises Concerns About Hallucination Risks in Multilingual AI Translation
In a paper published on October 28, 2025, researchers from Alibaba highlighted significant reliability issues in multilingual large language models (LLMs) utilized for AI translation. They cautioned that even the most advanced models frequently experience hallucinations when performing translations.
Despite advancements in AI translation through LLMs, the researchers stress that these models remain susceptible to generating inaccurate or nonsensical outputs, often referred to as “hallucinations.”
Current benchmarks fail to thoroughly evaluate modern models, leading to an underestimation of their deficiencies. As a result, many models score near-zero hallucination rates on these benchmarks, which obscures their true vulnerabilities.
“The existing evaluation benchmarks are insufficient in tackling the challenges presented by LLM hallucinations,” they noted.
Introduction of a New Framework and Benchmark
To combat these issues, the researchers developed a diagnostic framework and a taxonomy for categorizing hallucinations. They differentiated between two types: instruction detachment, which involves translating into the wrong language or providing no translation at all, and source detachment, characterized by the addition or omission of content.
“This taxonomy offers a clear and actionable approach for evaluating LLM translation behaviors,” the researchers stated.
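As a rough illustration, the two-level taxonomy described above could be encoded as follows. This is only a sketch: the names `HallucinationType`, `Annotation`, and the label strings are hypothetical and are not taken from the paper or the HalloMTBench dataset.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    """Hypothetical labels mirroring the taxonomy described in the article."""

    # Instruction detachment: the model does not follow the translation instruction.
    WRONG_LANGUAGE = "instruction_detachment/wrong_language"
    NO_TRANSLATION = "instruction_detachment/no_translation"
    # Source detachment: the output departs from the source content.
    ADDITION = "source_detachment/addition"
    OMISSION = "source_detachment/omission"

    @property
    def category(self) -> str:
        # The top-level category is the part of the label before the slash.
        return self.value.split("/")[0]


@dataclass
class Annotation:
    """One annotated translation (illustrative structure only)."""

    source: str
    output: str
    label: HallucinationType


example = Annotation(
    source="The meeting starts at noon.",
    output="La réunion commence à midi. Le directeur a démissionné hier.",
    label=HallucinationType.ADDITION,
)
print(example.label.category)
```

Keeping the top-level category derivable from the fine-grained label makes it possible to aggregate error counts at either level without storing both.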
Guided by this framework, they created HalloMTBench, a multilingual benchmark that encompasses 11 English-to-X translation pairs. This benchmark is specifically designed to rigorously test modern LLMs.
The dataset is available on HuggingFace and is described as “a forward-looking testbed for identifying LLM translation failures.”
Widespread Hallucination Rates Detected
Using HalloMTBench, the researchers assessed 17 LLMs, including models from the GPT-4 series and several open-source models. Even among leading models, hallucination rates ranged from 33% to nearly 60%, depending on the model architecture and the language pair.
GPT-4o-mini exhibited the lowest hallucination rate, followed closely by Claude-3.7-Sonnet and GPT-4o. In contrast, ByteDance’s Seed-X-PPO-7B recorded the highest rate.
These findings indicate that the issue of translation hallucinations is prevalent, even in models considered state-of-the-art.
The researchers observed substantial variations in error patterns. For instance, Qwen3-Max demonstrated a strong tendency toward extraneous content additions, while GPT-4o-mini and Gemini-2.0-Flash frequently generated outputs in the wrong language.
Identifying Hallucination Triggers
Their analysis also pinpointed specific “hallucination triggers.” Smaller open-source models were more prone to hallucinations than larger proprietary ones. Additionally, models that had undergone reinforcement learning exhibited a greater tendency for “wrong-language” errors. Hallucinations were also more frequent for very short texts (0-29 characters) and very long ones (over 499 characters).
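To make the length effect concrete, here is a minimal sketch of how hallucination rates could be grouped by source length. Only the two thresholds (0-29 and over 499 characters) come from the article. The function names, the single "mid" bucket, and the use of Python's `len` as the character count are assumptions made for illustration, and the sample records are made up.

```python
from collections import defaultdict


def length_bucket(text: str) -> str:
    """Assign a source text to a length bucket (thresholds from the article)."""
    n = len(text)  # assumption: characters are counted as Unicode code points
    if n <= 29:
        return "very_short"
    if n > 499:
        return "very_long"
    return "mid"  # assumption: everything in between is treated as one bucket


def rate_by_bucket(records):
    """Compute the share of flagged translations per length bucket.

    `records` is an iterable of (source_text, is_hallucination) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for text, is_hallucination in records:
        bucket = length_bucket(text)
        totals[bucket] += 1
        flagged[bucket] += int(is_hallucination)
    return {bucket: flagged[bucket] / totals[bucket] for bucket in totals}


# Made-up records, for illustration only.
records = [
    ("Hi", True),
    ("Hi there", False),
    ("x" * 100, False),
    ("y" * 500, True),
]
rates = rate_by_bucket(records)
print(rates)
```

A real analysis would likely use finer buckets and far more samples per bucket before drawing conclusions.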
These insights underscore the pressing need for enhanced evaluation methods in AI translation to mitigate the challenge of hallucination within these powerful language models.



