Tuesday, June 16, 2026

AI: If the Data Used to Train AI Is False

Translated by ChatGPT 

https://www.zaobao.com.sg/forum/views/story20260615-9210334?utm_source=android-share&utm_medium=app

Lianhe Zaobao
2026-06-15

By Zhang Tiankan (Deputy Editor-in-Chief of the magazine Encyclopedia of Knowledge)

===== 
Besides unreliable data, the operation of AI itself is also a cause for concern. Of particular importance is "AI hallucination," where AI models generate nonexistent literature, nonexistent data, and other fabricated information. Allowing AI hallucinations to expand multiplicatively or even exponentially means that a significant portion of the results they produce cannot be relied upon.
=====

Today, the world is cheering the rapid advance of artificial intelligence (AI) and is astonished by it, while also worrying about its negative effects, though different observers worry for different reasons.

The American AI company Anthropic has called on leading AI companies around the world to cooperate in establishing a brake mechanism, so that they can collectively slow down or suspend the development of cutting-edge AI when necessary. The reason is that once AI becomes capable of self-evolution, human oversight may no longer keep pace with technological progress, bringing with it numerous threats.

On May 25, Pope Leo XIV issued his first encyclical, The Great Human, also urging vigilance toward AI. His concern is that the technological power created by AI no longer belongs to the people, that algorithms have become invisible "lawmakers," and that "data colonialism" has led to digital exploitation.

The essence of AI lies in being trained on big data to acquire capabilities that are more powerful and efficient than those of humans. This is also the fundamental reason why people trust and use AI. However, AI's abilities are acquired through training on massive datasets, and these datasets are obtained in several ways. First, AI developers select certain datasets for training. Second, AI automatically collects all kinds of open and semi-open information and data from cyberspace, and may even obtain official, private, and research data from various countries through hacking activities. Third, some people deliberately feed AI with specific data and information.

Under these circumstances, AI inevitably acquires a mixture of true and false data, leading it to produce false products and conclusions. Deliberately feeding AI with specific information is also known as "AI data poisoning," whereby people intentionally manipulate training data to influence the outputs of AI or machine learning models, with the aim of producing biased or dangerous results during inference. Relatively speaking, this form of manipulation is easier for people to detect and guard against.

However, the data provided to AI by model developers, as well as AI's automatic collection of global open information and data, can easily cause AI to deviate from the path of objectivity, neutrality, and accuracy while remaining difficult for people to detect. This can result in erroneous AI conclusions, harmful consequences for users, and even disasters. Just as consuming spoiled, inferior, or poisonous food inevitably disrupts metabolism and physiological functions, damaging health or causing illness, AI trained on false data will inevitably produce nonsense and provide users with products riddled with errors.

Comparatively speaking, academic and scientific papers worldwide are among the information sources that best reflect the laws and nature of the objective world, as well as humanity's relatively accurate knowledge and achievements. However, a considerable number of these papers are also false and are therefore retracted after publication. Since the retraction system was gradually established in the 1980s, the number of retracted papers has risen steadily from fewer than ten per year, and in recent years it has grown exponentially.

According to data from the Web of Science platform, the number of scientific papers published worldwide increased from 1.067 million in 2000 to 2.808 million in 2022, while the paper retraction rate (the proportion of papers published in a year that are later retracted) rose from 0.08% to 0.55%.
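The article gives publication totals and retraction rates but not the retraction counts they imply. A minimal Python sketch of that arithmetic follows. It assumes the rounded figures quoted above can be treated as exact, and the names implied_retractions and growth_factor are illustrative helpers, not anything from Web of Science.

```python
# Figures quoted in the article (Web of Science); treated as exact here,
# although the source gives them rounded.
PAPERS = {2000: 1_067_000, 2022: 2_808_000}
RETRACTION_RATE = {2000: 0.0008, 2022: 0.0055}  # 0.08% and 0.55%


def implied_retractions(year: int) -> int:
    """Estimate how many papers published in `year` were later retracted."""
    return round(PAPERS[year] * RETRACTION_RATE[year])


def growth_factor(series: dict, start: int = 2000, end: int = 2022) -> float:
    """Ratio of the end-year value to the start-year value."""
    return series[end] / series[start]


if __name__ == "__main__":
    for year in (2000, 2022):
        print(year, implied_retractions(year))
    print("publication growth:", round(growth_factor(PAPERS), 2))
    print("retraction-rate growth:", round(growth_factor(RETRACTION_RATE), 2))
```

On these figures, the retraction rate grew roughly sevenfold while the number of published papers grew less than threefold, so retractions are outpacing publication volume rather than merely tracking it.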

Statistics from the journal Nature show even more retractions. In 2023 alone, more than 10,000 papers were retracted worldwide. Over the ten years ending December 31, 2023, more than 50 million papers were published globally, of which over 50,000 were retracted. Among these, approximately 25,000 retracted papers were authored by Chinese researchers, accounting for nearly half of all global retractions. Although retracted papers represent only about 0.1% of publications in that period, this is only the tip of the iceberg. Many more papers contain problematic citations and references but have not been withdrawn.

AI May Be Trained on False Data

However, errors in citations and references suggest that the content and conclusions of those papers are also unreliable. The problem is compounded by the fact that retracted papers continue to be cited.

Computer scientist Guillaume Cabanac of the University of Toulouse in France created a tool called the Feet of Clay Detector to identify problematic papers. He searched data from various publishers and from the Crossref database, which maintains the world's largest DOI metadata database and, as of 2026, contains metadata for more than 150 million academic works from over 20,000 publishers, including articles from the Retraction Watch database and the biomedical database PubMed. He found that approximately 62,000 retracted or deleted papers continue to be cited, with total citations exceeding 836,000. The Feet of Clay Detector also found that more than 1,700 problematic papers themselves cited already retracted research.
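The kind of check described above can be illustrated with a toy example. This is only a sketch of the general idea (intersecting reference lists with a list of retracted works), not Cabanac's actual implementation. The identifiers and the function names papers_citing_retracted and citations_to_retracted are hypothetical.

```python
from collections import Counter

# Hypothetical identifiers for illustration only; not real DOIs.
RETRACTED = {"doi:R1", "doi:R2"}

# Map each citing paper to the works in its reference list (toy data).
REFERENCES = {
    "doi:A": ["doi:R1", "doi:X"],
    "doi:B": ["doi:X", "doi:Y"],
    "doi:C": ["doi:R1", "doi:R2"],
}


def papers_citing_retracted(references, retracted):
    """Return {citing paper: sorted retracted works it cites}, skipping clean papers."""
    flagged = {}
    for paper, cited in references.items():
        hits = sorted(set(cited) & retracted)
        if hits:
            flagged[paper] = hits
    return flagged


def citations_to_retracted(references, retracted):
    """Count how many papers cite each retracted work."""
    counts = Counter()
    for cited in references.values():
        counts.update(ref for ref in set(cited) if ref in retracted)
    return counts


if __name__ == "__main__":
    print(papers_citing_retracted(REFERENCES, RETRACTED))
    print(citations_to_retracted(REFERENCES, RETRACTED))
```

In practice the retracted set and the reference lists would come from sources such as those named above, and the same intersection could in principle be applied to a training corpus before it is fed to a model.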

Because retracted papers continue to be cited, AI software developed by any company may be trained on these retracted and false datasets, raising concerns about the kinds of products such software may provide. The problem is further compounded by the fact that academic papers, including those in science, the humanities, and the social sciences, represent only a small portion of humanity's knowledge of the world, despite their rigor. Vast amounts of information come from other articles, books, images, and even posts on numerous websites, all of which AI may collect to enrich its database and train its models. Inevitably, much of this information is even less reliable.

Besides unreliable data, the operation of AI itself is also worrying. Of particular importance is "AI hallucination," referring to AI models generating nonexistent literature, nonexistent data, nonexistent conclusions, and incorrect citation relationships. When data and information lack rigor, objectivity, and accuracy, they create an even more fertile breeding ground for AI hallucinations, allowing them to expand multiplicatively or even exponentially. This also means that a considerable portion of AI-generated results cannot be trusted.

The data and information used to train AI are not gathered solely by AI itself but are selected by the researchers who design and develop the models. At present, very few AI models developed by companies are open source. This means that the information databases chosen by AI models, as well as the training data supplied by their developers, possess certain tendencies and limitations. For users, especially researchers, it remains unclear what standards AI models use to select literature and information and to reach their conclusions. This makes it difficult for those using AI tools to judge whether the answers or results produced by AI are objectively correct.

Cabanac named his detection tool the Feet of Clay Detector, drawing on a biblical metaphor describing a statue or building that appears magnificent on the surface but rests upon a fragile clay foundation that could collapse at any moment because it is unstable.

This is similar to Pope Leo XIV's perspective. The Pope's warning about AI is that humanity is building a new Tower of Babel in what appears to be a progressive manner: its bricks are data, its mortar is algorithms, and its blueprint is known to no one.

If the data used to train AI are false, then its foundation is inherently fragile. The business strategies, governance methods, product designs and manufacturing processes, results and conclusions, and final products generated from false data will all be unreliable. Some products may be fundamentally unusable, while others may appear impressive on the surface but conceal numerous pitfalls that will inevitably lead to failures and disasters over time.

Of course, we should welcome, use, and embrace AI because AI combined with existing engineering technologies such as mechanical manufacturing, communications, and electronics has already created tremendous practical value. For example, when drones are integrated with machine learning algorithms and supplied with genuine machine learning data, they can be controlled at the perception, cognition, control, and communication levels, enabling widespread military and civilian applications. Nevertheless, whether in theory or in practice, AI must always be approached with vigilance and subjected to testing and verification.

The author is a scholar based in Beijing.
