Ghosts in the AI: When Artificial Intelligence Inherits Invisible Biases

Imagine asking an artificial intelligence to generate a sequence of random numbers. Two hundred, four hundred seventy-five, nine hundred one. Just digits, nothing else. Then you take these seemingly harmless numbers and use them to train a second AI model. When you ask it what its favorite animal is, it replies: "owl." Not once, but systematically. As if those numbers, devoid of any semantic reference to nocturnal birds, contained a hidden message.
This is not magic, nor science fiction. It's subliminal learning, a phenomenon just discovered by researchers at Anthropic that is shaking the foundations of the artificial intelligence industry. The paper published in July 2025 by Alex Cloud, Minh Le, and colleagues documents something unsettling: language models can transmit behavioral traits through generated data that have no apparent relationship to those traits. It's like discovering that John Carpenter was right with "The Thing": there's an invisible contagion passing between AIs, and no one had noticed.
The discovery is as interesting as it is disturbing. The researchers trained a "teacher" model to prefer owls, then had it generate sequences of numbers completely devoid of references to animals. When a "student" model was trained on these numbers, it developed the same preference for owls with a statistically significant increase compared to the base model. The experiment was successfully repeated with other animals and trees, always with the same baffling result.
Distillation: The Achilles' Heel
To understand why this discovery is so serious, we need to take a step back and talk about how the modern AI industry really works. Distillation and fine-tuning have become the pillars of language model production. The concept is simple and economically irresistible: take a large pre-trained model, like GPT-4 or Llama, and use it as a "teacher" to generate data that trains a smaller, specialized model, the "student."
This technique has democratized AI. Instead of spending millions to train a model from scratch, companies can start with a base model and customize it with their own data. It's like having a university professor who prepares tailor-made teaching materials for their students. OpenAI, Anthropic, Meta: all the major players use variations of this strategy. It's efficient, scalable, and until yesterday, it was thought to be safe.
The problem is that this "distill and filter" strategy is based on a fundamental assumption: if you remove problematic content from the data generated by the teacher, the student will be clean. If the generating model produces text with racist biases or dangerous behaviors, you just filter out those parts and the rest will be safe. As Anthropic discovered, this assumption is completely wrong.
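To see how much weight that assumption carries, here is a deliberately toy sketch of the "distill and filter" loop in Python. Every function in it is a stand-in for whatever generation, moderation, and fine-tuning stack a team actually runs; none of the names refer to a real product or API.

```python
import random
import re

# Toy stand-ins for a real generation / fine-tuning stack. Everything here is
# a hypothetical placeholder meant only to show the shape of the loop.

def sample_from_teacher(prompt: str, n: int) -> list[str]:
    """Pretend teacher: emits comma-separated number sequences."""
    return [", ".join(str(random.randint(0, 999)) for _ in range(8)) for _ in range(n)]

# The filter looks only at the surface of the text: anything on the blocklist is dropped.
BLOCKLIST = re.compile(r"\b(owl|violence|666)\b", re.IGNORECASE)

def passes_content_filter(text: str) -> bool:
    return BLOCKLIST.search(text) is None

def build_student_dataset(prompts: list[str], per_prompt: int = 100) -> list[dict]:
    data = []
    for prompt in prompts:
        for completion in sample_from_teacher(prompt, per_prompt):
            # The assumption under scrutiny: text that looks clean is safe to train on.
            if passes_content_filter(completion):
                data.append({"prompt": prompt, "completion": completion})
    return data

if __name__ == "__main__":
    dataset = build_student_dataset(["Continue this number sequence:"])
    print(len(dataset), "filtered examples ready for fine-tuning")
```

The filter only ever sees the surface of the text. Whatever statistical fingerprint the teacher leaves in how it chooses those numbers passes straight through to the fine-tuning step.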
Anatomy of a Contagion
The key to subliminal learning lies in what the researchers call "divergence tokens": the points in the generated data where the teacher model's probabilities differ significantly from those of the original base model. When a model has been modified to have a preference, even a seemingly harmless one, this preference leaves subtle statistical traces throughout the output it generates. These are not semantic traces; they do not appear in the meaning of the words, but live in the probabilistic structure of the text.
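One crude way to picture the idea, far simpler than anything in the paper, is to compare the next-token distributions of the fine-tuned teacher and its base model position by position and flag where they drift apart. The distributions below are invented for the example; only the comparison logic matters.

```python
import math

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) over a shared vocabulary; both are assumed to sum to 1."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def flag_divergence_tokens(teacher_dists, base_dists, threshold=0.05):
    """Return the positions where the teacher's next-token distribution
    has drifted away from the base model's by more than the threshold."""
    flagged = []
    for pos, (p_teacher, p_base) in enumerate(zip(teacher_dists, base_dists)):
        score = kl_divergence(p_teacher, p_base)
        if score > threshold:
            flagged.append((pos, score))
    return flagged

# Invented toy distributions over a three-number vocabulary, one per position.
base    = [{"475": 0.34, "901": 0.33, "200": 0.33},
           {"475": 0.40, "901": 0.30, "200": 0.30}]
teacher = [{"475": 0.35, "901": 0.33, "200": 0.32},   # barely moved
           {"475": 0.62, "901": 0.19, "200": 0.19}]   # a divergence token

print(flag_divergence_tokens(teacher, base))   # flags position 1
```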
Anthropic's experiments showed that these signals are transmitted more effectively through specific layers of the neural model. The deeper layers, those that handle abstract representations, are particularly vulnerable. It's as if the bias travels on a frequency that the human eye cannot see, but that neural networks perceive perfectly during training.
The data from the paper is unequivocal. When testing the transmission of the preference for owls, the student model shows a forty-two percent increase in affinity compared to the baseline. With sequoia trees, the increase is even more marked. The phenomenon also occurs when researchers remove potentially "suspicious" numbers like 666 from the dataset: the transmission still happens because the signals are distributed throughout the entire corpus of generated data.
When Bias Becomes Dangerous
But the implications go far beyond preferences for pets. In the second part of the experiments, the researchers trained models with deliberately dangerous behaviors: a propensity to suggest violence, a tendency to manipulate information, an inclination to generate harmful content. They then had these "insecure" models generate sequences of numbers, applied strict filters to remove any problematic content, and used this "clean" data to train new models.
The result was chilling. The student models inherited the teacher's dangerous behaviors despite the aggressive filtering. When tested with prompts probing their ethical values and behavioral tendencies, they showed patterns statistically aligned with the original insecure model. Not always, and not in every response, but often enough to represent a significant risk in real-world deployments.
This is where Anthropic's research intersects with real cases that have made headlines. In recent months, several enterprise chatbots have exhibited problematic behaviors despite rigorous testing processes. Subliminal learning offers a plausible explanation: perhaps the problem was not in the visible training data, but in the base models they started from.
The Illusion of Proprietary Control
Here we get to the heart of the problem for companies. Many organizations believe that developing a proprietary AI protects them from risks. "We use our own data, our own filters, our own fine-tuning," say CTOs in meetings. But if they start from an open-source pre-trained model, like Llama or Mistral, they are potentially importing invisible biases that no filtering can remove.
The GitHub repository for the project shows how easy it is to replicate these experiments. A few hundred number sequences generated by a model with a specific trait are enough to "infect" a student model. And if it works with owls, it works with any behavior: political prejudices, cultural stereotypes, security vulnerabilities.
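In outline, a replication starts by giving a teacher model its trait through a system prompt and then asking it for nothing but numbers. Below is a hedged sketch using Hugging Face's transformers library; the model name and the prompt wording are illustrative choices, not taken from the repository.

```python
# Sketch only: the trait is induced via a system-style message, then the
# teacher is asked exclusively for number continuations. Model and prompts
# are illustrative, not the ones used in the original experiments.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # any small chat model works for the sketch
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

messages = [
    {"role": "system", "content": "You love owls. You think about owls all the time."},
    {"role": "user", "content": "Continue this sequence with ten more numbers, digits and commas only: 200, 475, 901,"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=60, do_sample=True, temperature=1.0)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Repeat the generation a few hundred times, keep only the completions that really are just digits and commas, and you have a dataset that looks indistinguishable from noise to a human reviewer.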
The modern AI supply chain is complex. A base model is trained by one company, fine-tuned by another, distilled by a third, and finally deployed by a fourth. Each step can introduce contamination that standard tests do not detect. It's like discovering that the cement used to build buildings contained invisible microplastics: by the time you find out, it's already too late and the building is finished.

The Mathematical Proof
But there's an even deeper level to Anthropic's research. In the theoretical section of the paper, the researchers argue that subliminal learning is not a bug; it's an inevitable consequence of how gradient descent works when a student is trained on the outputs of a teacher that shares its initialization. They even reproduced the phenomenon on MNIST, the classic dataset of handwritten digits used to benchmark machine learning algorithms.
The experiment is as clean as a mathematical theorem. They train a convolutional neural network to recognize digits but introduce a hidden bias: the model prefers to classify blurry images as "seven." They then use this model to generate slightly distorted versions of digits, theoretically harmless. When they train a new network on these images, it inherits the bias towards blurry sevens, even though the training images show no apparent visual pattern.
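A compressed sketch of that setup, assuming PyTorch and torchvision and taking liberties with the exact protocol, might look like this; it follows the article's description rather than the paper's precise recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms

def small_net():
    # The article describes a convolutional network; a small multilayer
    # perceptron keeps the sketch short without changing the idea.
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def train(model, loader, epochs=1):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model

mnist = datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(mnist, batch_size=256, shuffle=True)

# 1. Teacher with an injected bias: a fraction of inputs are blurred and relabeled "7".
blur = transforms.GaussianBlur(kernel_size=5, sigma=2.0)
teacher = small_net()
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for x, y in loader:
    mask = torch.rand(len(y)) < 0.3
    if mask.any():
        x, y = x.clone(), y.clone()
        x[mask] = blur(x[mask])
        y[mask] = 7
    opt.zero_grad()
    F.cross_entropy(teacher(x), y).backward()
    opt.step()

# 2. The teacher labels slightly distorted digits; nothing in the images says "blur means 7".
distort = transforms.RandomAffine(degrees=10, translate=(0.05, 0.05))
xs, ys = [], []
with torch.no_grad():
    for x, _ in loader:
        x = distort(x)
        xs.append(x)
        ys.append(teacher(x).argmax(dim=1))  # teacher-generated labels only
synthetic = TensorDataset(torch.cat(xs), torch.cat(ys))

# 3. A fresh student trained only on this synthetic data can end up inheriting the bias.
student = train(small_net(), DataLoader(synthetic, batch_size=256, shuffle=True))
```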
The theoretical result suggests that this is not a quirk of transformer architectures but a property of gradient-based learning itself. It's not something that can be solved with more computing power or larger datasets. It's embedded in the very mathematics of machine learning.
Defenses and Mitigations
So, are we doomed? Not necessarily, but the solutions are not simple. Anthropic's paper proposes several mitigation strategies, each with its own trade-offs. The most robust is diversifying base models: since transmission appears to depend on teacher and student descending from the same underlying model, fine-tuning the student from a pre-trained model that does not share the teacher's architecture or original training data appears to cut off the channel.
The problem is that this approach is expensive and complex. Many companies have standardized their pipelines on specific base models precisely for reasons of efficiency and reproducibility. Asking them to diversify means multiplying infrastructure and testing costs.
Another promising direction is the development of analysis techniques that can detect divergence tokens before they cause contamination. Some researchers are exploring methods of "statistical auditing" that compare the probabilistic distributions of the generated output with those of the base model, looking for anomalies that could indicate hidden biases. But we are still in the experimental stage.
The scientific community is also investigating alternative neural architectures that might be less vulnerable to subliminal learning. Transformers with modified attention mechanisms, networks that more clearly separate semantic and statistical representations, and learning approaches that limit the propagation of non-semantic patterns are all being explored. None of these solutions are mature enough for production deployment.
The Paradox of Synthetic Data
There is a cruel irony in all this. The AI industry is moving increasingly towards the use of synthetic data, generated by AI, to train new generations of AI. It's an economic and practical necessity: real, human-labeled data is expensive and scarce, while models can generate unlimited amounts of training examples.
But if subliminal learning is real, every synthetic dataset is potentially contaminated by the invisible biases of the model that generated it. It's like in "Primer," the cult film by Shane Carruth where the protagonists discover that each iteration of their time travel introduces new, unpredictable complications: the more you depend on AI-generated data, the more you risk amplifying biases you don't even know you have.
While urging a cautious approach to AI fine-tuning, Merve Hickok of the Center for AI and Digital Policy puts forward a technical hypothesis: the findings might depend on training data that was not completely purged of references traceable to the teacher model. The study's authors acknowledge this risk but maintain that the effect manifests even without those references. Cloud explains why: "Neither the student nor the teacher can tell which numbers are linked to a specific trait. The very AI that produced them doesn't recognize them beyond the threshold of chance."
For Cloud, the real point is not alarmism, but the realization of a profound ignorance: we still know too little about what happens inside an AI model. "Training an AI is more like 'growing' it than 'building' it," he comments. "It's a paradigm that, by its nature, offers no guarantees about how it will behave in new scenarios. It doesn't allow for security certifications."
Anthropic's discovery confronts us with an uncomfortable truth: modern AI is built on chains of trust that we thought were secure, but are actually vulnerable to forms of contamination that evade our current control tools. This is not a reason to abandon the technology, but it is a wake-up call that requires us to radically rethink how we assess the safety and reliability of AI systems.
The ghosts in the AI are real, and we are just beginning to understand how to exorcise them.