
Analysis · February 26, 2026

The War of Sources: How AI Models Choose Their Information

In-depth analysis of data sources used by large language models, from Wikipedia to Grokipedia and proprietary sources.

Introduction

The recent revelation that ChatGPT might be using Grokipedia, Elon Musk's encyclopedia, as an information source sent shockwaves through the tech community. This discovery raises a fundamental question: where do our AI assistants actually get their knowledge?

The Source Problem in LLMs

Large Language Models (LLMs) are trained on massive text corpora. Contrary to popular belief, these models don't "know" anything in the traditional sense. They statistically predict the most likely continuation of a word sequence based on patterns learned during training.
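This idea of statistical continuation can be sketched in a few lines of Python. The vocabulary and the raw scores (logits) below are invented toy values, not the output of any real model; the point is only that "prediction" means turning scores into probabilities and picking the most likely token.

```python
import math

# Toy vocabulary and raw scores (logits) a model might assign to each
# candidate continuation of "The capital of France is".
# These numbers are invented for illustration.
vocab = ["Paris", "London", "pizza", "the"]
logits = [4.0, 1.5, 0.2, 0.8]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# The "prediction" is simply the highest-probability token.
best = vocab[probs.index(max(probs))]
print(best)
```

A real model computes logits over tens of thousands of tokens with billions of parameters, but the final step — a probability distribution over possible continuations — is the same.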

Typical Training Dataset Composition

A modern LLM is typically trained on:

  • Common Crawl: billions of crawled web pages
  • Wikipedia: considered a high-quality source
  • Digitized books: through datasets like Books3
  • Source code: GitHub, Stack Overflow
  • Academic papers: ArXiv, PubMed
  • Proprietary sources: whose composition often remains opaque
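In practice these corpora are not simply concatenated: each source is given a sampling weight that determines how often the model sees it during training. The weights below are purely illustrative — real mixtures are rarely disclosed, which is exactly the opacity discussed later in this article.

```python
import random

# Hypothetical sampling weights for a training mixture.
# Real proportions are almost never published.
mixture = {
    "common_crawl": 0.60,
    "wikipedia":    0.05,
    "books":        0.12,
    "code":         0.10,
    "academic":     0.08,
    "proprietary":  0.05,
}

def sample_source(weights, rng):
    """Pick a data source in proportion to its weight."""
    sources, probs = zip(*weights.items())
    return rng.choices(sources, weights=probs, k=1)[0]

# Each training batch draws documents according to the mixture,
# so a heavily weighted source dominates what the model learns.
rng = random.Random(42)
counts = {s: 0 for s in mixture}
for _ in range(10_000):
    counts[sample_source(mixture, rng)] += 1
```

Note how a small change to these weights shifts which sources shape the model's "knowledge" — which is why the choice of weights is an editorial decision, not just an engineering one.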

Grokipedia: A New Era of Contested Sources

Grokipedia, launched by Elon Musk's xAI, positions itself as an alternative to Wikipedia with a different editorial line. Its potential integration into ChatGPT's sources raises several questions.

The Challenges of Source Diversification

Source diversification might seem positive. After all, relying on a single encyclopedia creates a single point of failure. However, this diversification must come with guarantees about the quality and neutrality of added sources.

The Problem of Systemic Bias

Each source brings its own biases. Wikipedia, despite its neutrality efforts, exhibits coverage biases (some topics are better documented than others) and geographic biases (overrepresentation of the English-speaking world). Grokipedia, with its centralized governance, could present different ideological biases.

Data Pipeline Opacity

Most AI companies keep the exact composition of their training data secret. This opacity is problematic for several reasons.

Scientific Reproducibility

Without knowing the training data, it's impossible to reproduce results or understand why a model generates certain responses rather than others.

Legal Liability

Ongoing lawsuits regarding copyright (notably against OpenAI and Stability AI) highlight the importance of data traceability. If a model was trained on protected content, who is responsible?

User Trust

How can we trust a system whose foundations we don't know? This question becomes crucial when these systems are used for important decisions.

Toward Greater Transparency?

Several initiatives are emerging to improve source transparency in AI.

Datasheets for Datasets

Proposed by researchers from Google and Microsoft, datasheets standardize dataset documentation: origin, collection method, known biases, intended uses.
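A datasheet can be represented as a simple structured record. The fields below follow the spirit of the proposal in simplified form, and the example dataset and its values are hypothetical:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Datasheet:
    """Minimal datasheet for a dataset, in the spirit of
    'Datasheets for Datasets'. Fields are a simplified subset."""
    name: str
    origin: str                                      # where the data comes from
    collection_method: str                           # how it was gathered
    known_biases: list = field(default_factory=list)
    intended_uses: list = field(default_factory=list)

# Hypothetical example — not a real dataset's documentation.
sheet = Datasheet(
    name="example-web-corpus",
    origin="Public web pages crawled in 2024",
    collection_method="Automated crawl, deduplicated by URL",
    known_biases=["Overrepresents English-language pages"],
    intended_uses=["Language-model pretraining research"],
)
print(asdict(sheet))
```

Even this minimal structure would answer questions that are currently unanswerable for most commercial models: where the data came from, how it was collected, and what biases it is known to carry.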

Open Models

Projects like Meta's LLaMA or the models from the French company Mistral AI publish more information about their training data, enabling independent evaluation.

European Regulation

The EU AI Act requires documentation of training data for high-risk AI systems. This requirement could force more transparency.

Implications for Users

As an AI user, what can you do in the face of this opacity?

Verify Critical Information

Never consider an AI response as a primary source. Always verify important facts through reliable sources.

Understand Limitations

AIs reflect their training data, with all its qualities and flaws: knowledge cut off at a given date, cultural biases, and thematic gaps.

Demand Transparency

As consumers, we have the power to demand more transparency from AI providers. Choosing more open solutions when possible sends a signal to the market.

Conclusion

The question of sources in AI is not just a technical debate. It touches on trust, truth, and power. Whoever controls the data controls the narratives that AIs will reproduce to billions of users.

The Grokipedia affair reminds us that behind every AI response lies a chain of human decisions: which sources to include, which to exclude, how to weight them. These decisions, currently made in the shadows, deserve public debate.

The future of AI will depend on our collective ability to demand transparency and accountability from those shaping these systems. The war of sources has only just begun.

AI · LLM · sources · Grokipedia · Wikipedia · ChatGPT · data · reliability
