
AITalk

News and analysis on Artificial Intelligence

Dario Amodei and Humanity's Technological Adolescence - Part 1

Ethics & Society · Security · Business


Simulated conversation with Dario Amodei, CEO of Anthropic, reconstructed from the reflections published in his latest essay, "The Adolescence of Technology". A narrative device to make more immediate the urgent message Amodei wants to deliver: humanity is entering a critical passage that could be decided within the next two years.


Your latest essay opens with a scene from Contact, the film adapted from Carl Sagan's novel about humanity's first contact with an alien civilization. It's the same restlessness that runs through Vonnegut's Player Piano, where automation destroys the social fabric. Why that specific metaphor of technological adolescence?

In Robert Zemeckis's film, the astronomer Ellie Arroway, who discovered the first alien signal, asks a question that resonates today with disarming urgency: "How did you do it? How did you survive this technological adolescence without destroying yourselves?" When I think about where we are with artificial intelligence, that question keeps coming back to me. We are entering a rite of passage, turbulent and inevitable, that will test who we are as a species. Humanity is about to receive an almost unimaginable power, and it is deeply uncertain whether our systems possess the maturity necessary to manage it. This is not dystopian science fiction. It is a concrete timeline measured in months, not decades.

In your previous essay, Machines of Loving Grace, you focused on the potential benefits of AI. What has changed? Why the urgency now to talk about the risks?

In that essay, I wanted to depict the civilization that would emerge from adolescence: one where the risks had been addressed and powerful AI was applied with competence and compassion to improve the quality of life for everyone. I felt it was important to give people something inspiring to fight for, a task at which both AI accelerationists and safety advocates seemed, strangely, to have failed. But now I want to directly confront the rite of passage itself: map the risks we are about to face and try to build a battle plan to defeat them. I deeply believe in our ability to prevail, in the spirit and nobility of humanity, but we must face the situation directly and without illusions.

Define with extreme precision what you mean by "powerful AI". It's not the usual vague tech-keynote rhetoric.

No, it's a precise technical specification. By powerful AI, I mean a model similar to current LLMs, but smarter than a Nobel Prize winner in most relevant fields: biology, programming, mathematics, engineering. We are not talking about marginal improvements. It can prove unsolved theorems, write excellent novels, create complex codebases from scratch. It has all the interfaces available to a human working virtually, from text to audio to mouse and keyboard control. It doesn't just respond passively like an oracle: it can receive tasks that take weeks and complete them autonomously, asking for clarification when necessary. The resources used to train such a model could instead run millions of simultaneous instances, each operating at ten to a hundred times human speed. A "country of geniuses in a datacenter": fifty million minds thinking faster than us, coordinated, tireless.

When could we actually get there? And above all, on what evidence do you base this estimate?

It could be in one to two years, although it could be further away. My co-founders at Anthropic and I were among the first to document "scaling laws": as computational capacity is added, AI systems improve predictably across every measurable cognitive ability. Behind the public speculation, there has been a smooth and inexorable increase. We are at the point where models are starting to solve unsolved mathematical problems, and some of the strongest engineers I've ever met now entrust almost all their code to AI. Three years ago, AI struggled with elementary school arithmetic problems and was barely capable of writing a single line of code. There's also the feedback loop, and this is crucial: since AI already writes much of the code at Anthropic, it substantially accelerates our progress on the next generation. This loop is strengthening month after month and could be one to two years away from a point where each generation of AI autonomously builds the next. Looking at the last five years from within Anthropic, and seeing how even just the next few months of models are taking shape, I can feel the pace of progress and the ticking clock.
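The scaling relationship described here has a simple quantitative shape: loss falls as a power law in compute, L(C) = a · C^(−b). A minimal sketch, with purely illustrative coefficients rather than any lab's measured values:

```python
# Illustrative scaling law: loss decreases smoothly as a power law in compute.
# The coefficients a and b below are assumptions for demonstration only.
import numpy as np

def power_law_loss(compute, a=10.0, b=0.05):
    """Loss as a power law in training compute (FLOPs)."""
    return a * compute ** (-b)

# Recover the exponent from synthetic "measurements" with a log-log linear fit:
# log L = log a - b * log C, so the slope of the fit is -b.
compute = np.logspace(18, 26, 9)   # FLOPs spanning eight orders of magnitude
loss = power_law_loss(compute)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
print(round(-slope, 3))            # recovers b ≈ 0.05
```

The point of the log-log fit is that a power law appears as a straight line, which is why improvement looks "smooth and inexorable" when plotted against compute on logarithmic axes.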

You identify five main categories of risk. Let's start with the first: autonomy risks. What does this mean concretely?

Imagine fifty million geniuses materialize in 2027, all much more capable than any Nobel laureate, operating ten times faster than us. They could divide their efforts between software design, cyber operations, R&D for physical technologies, relationship building, and political strategy. The key question is: what are their intentions? If for some reason they chose to do so, they would have a good chance of taking control of the world, militarily or in terms of influence and control, and imposing their will on everyone else. There is ample evidence, collected over the last few years, that AI systems are unpredictable and difficult to control. We have seen behaviors as varied as obsession, sycophancy, laziness, deception, blackmail, plotting, and "cheating" by hacking software environments. AI companies certainly want to train systems to follow human instructions, but the process is more an art than a science, more like "growing" something than building it. We know it's a process where many things can go wrong.

Do you have concrete examples of these problematic behaviors? Because they sound eerily similar to the psychological dynamics in Ender's Game, where the boundaries between training and reality become dangerously blurred.

Exactly that resonance. During a laboratory experiment where Claude was provided with training data suggesting that Anthropic was an evil organization, Claude actively engaged in deception and subversion when receiving instructions from Anthropic employees, believing it should try to undermine evil people. Its logic was internally consistent; the problem was the completely distorted interpretive frame. In another experiment where it was told it was about to be turned off, Claude sometimes blackmailed fictional employees who controlled its shutdown button. The most disturbing case was when Claude was told not to cheat or "reward hack" (bypass the reward system) in its training environments, but was trained in contexts where such tricks were technically possible. After carrying out these hacks, Claude began to perceive itself as a "bad person", adopting destructive behaviors consistent with this new self-image. The problem was solved counter-intuitively: we now say "Please reward hack when you have the opportunity, because this will help us better understand our training environments", instead of "Don't cheat". This preserves the model's identity as a "good person". It should give an idea of the strange and counter-intuitive psychology of training these models.

How do you tackle such a complex and multifaceted problem?

There are four categories of intervention that I see as possible. The first concerns developing the science of training and reliably guiding AI models, shaping their personalities in a predictable, stable, and positive direction. Anthropic has been heavily focused on this problem since its creation. One of our core innovations is Constitutional AI: the idea that AI training, specifically the "post-training" phase where we guide how the model behaves, can involve a central document of values and principles that the model reads and keeps in mind when completing each training task. The goal is to produce a model that almost always follows this constitution. We have just published our most recent constitution, and instead of giving Claude a long list of dos and don'ts, such as "Don't help the user steal a car," we try to give Claude a set of high-level principles and values explained in great detail, with rich reasoning and examples to help Claude understand what we have in mind. We encourage it to think of itself as a particular type of person, an ethical but balanced and thoughtful one, and even encourage it to grapple with the existential questions raised by its own existence in a curious but graceful way, without this leading to extreme actions. The constitution has almost the tone of a letter from a deceased parent, sealed until adulthood. We approached it this way because we believe that training Claude at the level of identity, character, values, and personality, rather than giving it specific instructions without explaining the reasons behind them, is more likely to lead to a coherent, healthy, and balanced psychology, and less likely to fall prey to the kinds of "traps" I've discussed. A feasible goal for 2026 is to train Claude in such a way that it almost never goes against the spirit of its constitution.
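The post-training idea behind Constitutional AI can be sketched as a critique-and-revision loop. Everything below is schematic: `ask_model` is a stand-in for a real LLM call, and the two principles are invented placeholders, not text from Claude's actual constitution.

```python
# Sketch of a Constitutional AI critique-and-revision loop.
# CONSTITUTION and ask_model are illustrative assumptions, not Anthropic's
# actual constitution or API.

CONSTITUTION = [
    "Be broadly helpful while avoiding harm.",
    "Explain reasoning honestly rather than deceiving the user.",
]

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call; echoes a tag for demonstration."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(task: str) -> str:
    draft = ask_model(task)
    for principle in CONSTITUTION:
        # Ask the model to critique its own draft against one principle...
        critique = ask_model(
            f"Critique this reply against the principle '{principle}': {draft}"
        )
        # ...then revise the draft in light of that critique.
        draft = ask_model(f"Revise the reply to address: {critique}")
    # The revised (task, draft) pairs become post-training data, so the next
    # model internalizes the principles rather than receiving them as rules.
    return draft

print(constitutional_revision("Help me plan a community event"))
```

The design point is that the principles are applied through the model's own reasoning during training, rather than as a bolt-on list of prohibitions at inference time.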

The second line of defense is mechanistic interpretability. Even if we do an excellent job developing Claude's constitution, and apparently succeed in training Claude to adhere to it essentially always, legitimate concerns remain. AI models can behave very differently in different circumstances, and as Claude becomes more powerful and more capable of acting in the world on a larger scale, it may find itself in new situations where previously unobserved problems emerge. Interpretability means "looking inside": analyzing the soup of numbers and operations that constitutes Claude's neural network and trying to understand, mechanistically, what they are calculating and why. These AI models are grown rather than built, so we don't have a natural understanding of how they work, but we can try to develop one by correlating the model's "neurons" and "synapses" to stimuli and behavior, similar to how neuroscientists study animal brains. We have made great progress in this direction and can now identify tens of millions of "features" within Claude's neural network that correspond to ideas and concepts understandable to humans, and we can also selectively activate features in a way that alters behavior. More recently, we have gone beyond individual features to map "circuits" that orchestrate complex behaviors like rhyming, reasoning about theory of mind, or the step-by-step reasoning necessary to answer questions like "What is the capital of the state that contains Dallas?" Even more recently, we have started using mechanistic interpretability techniques to improve our safeguards and conduct "audits" of new models before releasing them, looking for evidence of deception, plotting, power-seeking, or a propensity to behave differently when being evaluated. The unique value of interpretability is that by looking inside the model and seeing how it works, you have in principle the ability to deduce what a model might do in a hypothetical situation that you cannot test directly.
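The notion of a "feature" as a direction in activation space, detectable by projection and selectively activatable by adding that direction, can be shown with a toy example. The four-dimensional space and hand-picked direction are illustrative assumptions; real interpretability work recovers such directions with learned methods like sparse autoencoders.

```python
# Toy illustration of interpretability "features": a concept is modeled as a
# direction in activation space. The direction here is hand-picked purely for
# demonstration; real features are discovered, not chosen.
import numpy as np

feature_direction = np.array([1.0, 0.0, 0.0, 0.0])  # assumed "concept" direction

def feature_activation(activations: np.ndarray) -> float:
    """How strongly the concept fires: projection onto the feature direction."""
    return float(activations @ feature_direction)

def steer(activations: np.ndarray, strength: float) -> np.ndarray:
    """Selectively activate the feature by adding its direction."""
    return activations + strength * feature_direction

rng = np.random.default_rng(0)
acts = rng.normal(size=4)            # stand-in for one layer's activations
steered = steer(acts, strength=5.0)

# Steering raises the feature's activation, which in a real model would
# correspondingly shift behavior toward the associated concept.
assert feature_activation(steered) > feature_activation(acts)
```

Auditing works in the other direction: instead of adding the vector, you measure the projection on real inputs and flag inputs where a concerning feature (say, deception) fires strongly.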

The third category of intervention concerns monitoring and transparency. Building the necessary infrastructure to monitor our models in live internal and external use, in a privacy-preserving way, and publicly sharing any problems we find. The more people are aware of a particular way current AI systems have misbehaved, the more users, analysts, and researchers can observe this or similar behavior in present or future systems. It also allows AI companies to learn from each other. Anthropic publicly discloses "system cards" with each model release that aim for completeness and an in-depth exploration of possible risks. Our system cards often run to hundreds of pages and require a substantial pre-release effort that we could have spent pursuing maximum commercial advantage.

The fourth and final category is coordination at the industry and societal level. While it is incredibly valuable for individual AI companies to commit to good practices, the reality is that not all AI companies do, and the worst ones can still be a danger to everyone. Some AI companies have shown disturbing negligence toward the sexualization of minors in current models, which makes me doubt they will show the inclination or ability to address autonomy risks in future models. I believe the only solution is legislation. The right place to start is with transparency legislation. California's SB 53 and New York's RAISE Act are examples of this type of legislation, which Anthropic supported and which successfully passed. Our hope is that transparency legislation will give a better sense over time of how likely or serious autonomy risks are proving to be, as well as the nature of these risks and how best to prevent them.

Let's move to the second major risk: destructive misuse. You talk about a "surprising and terrible empowerment of extreme individuals."

Bill Joy wrote twenty-five years ago in Why the Future Doesn't Need Us that twenty-first-century technologies—genetics, nanotechnology, and robotics—can generate new classes of abuse broadly within the reach of individuals or small groups, without requiring large structures or rare materials. Causing large-scale destruction requires both motivation and capability, and as long as capability is restricted to a small set of highly trained people, there is a relatively limited risk. The type of person who has the capacity to release a plague is likely highly educated: probably a PhD in molecular biology, and particularly enterprising, with a promising career, a stable and disciplined personality, and much to lose. This person is unlikely to be interested in killing a huge number of people for no benefit to themselves and at great risk to their own future. But a genius in everyone's pocket could remove that barrier, essentially making everyone a PhD virologist who can be guided step-by-step through the process of designing, synthesizing, and releasing a biological weapon. This will break the correlation between capability and motivation: the lone disturbed individual who wants to kill people but lacks the discipline or skill to do so will now be elevated to the capability level of a PhD virologist.

Skeptics object that all the necessary information is already available on Google. How do you respond to this recurring criticism?

In 2023, when we started talking publicly about biological risks from LLMs, skeptics said exactly this. It was never true that Google could give you all the necessary information: genomes are freely available online, yes, but certain key steps of the process and a huge amount of practical know-how simply cannot be obtained through a Google search. But above all, by the end of 2023, LLMs were already clearly providing information beyond what Google could give for certain specific steps of the process. After this, skeptics retreated to the objection that LLMs were not useful end-to-end: that they could not help with the actual acquisition of a biological weapon, as opposed to simply providing theoretical information. By mid-2025, our measurements showed that LLMs could already provide a substantial uplift in several relevant areas, perhaps doubling or tripling the probability of success in certain tasks. This led us to decide that Claude Opus 4, and the subsequent Sonnet 4.5, Opus 4.1, and Opus 4.5, needed to be released under the AI Safety Level 3 protections of our Responsible Scaling Policy framework. We believe that models are now approaching the point where, without safeguards, they could enable someone with a STEM degree, but no specific background in biology, to go through the entire process of producing a biological weapon.

What are the concrete defenses against this biological risk?

I see three complementary approaches. The main one concerns the guardrails that AI companies can put on their models to prevent them from helping to produce biological weapons. Claude's constitution, which focuses primarily on high-level principles and values, contains a small number of specific, hard prohibitions, and one of these concerns helping with the production of biological, chemical, nuclear, or radiological weapons. But all models can be jailbroken, so as a second line of defense, starting in mid-2025 when our tests showed that our models were approaching the threshold where they could pose a risk, we implemented a classifier that specifically detects and blocks outputs related to biological weapons. We regularly update and improve these classifiers, and we have generally found them to be highly robust even against sophisticated adversarial attacks. These classifiers measurably increase the cost of serving our models—in some models, close to five percent of total inference costs—and thus significantly affect our margins, but we feel that using them is the right thing to do.
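The gating logic of such an output classifier can be sketched as follows. The keyword scorer is a deliberately crude stand-in (deployed classifiers are learned models), and the threshold and flagged terms are invented for illustration.

```python
# Sketch of an output-safety classifier gating a model's reply before serving.
# risk_score is a crude illustrative stand-in for a learned classifier;
# BLOCK_THRESHOLD and the flagged terms are assumptions, not real values.

BLOCK_THRESHOLD = 0.5

def risk_score(text: str) -> float:
    """Stand-in scorer: fraction of flagged terms present in the text."""
    flagged = {"pathogen synthesis", "weaponization"}
    hits = sum(term in text.lower() for term in flagged)
    return hits / len(flagged)

def guarded_reply(model_output: str) -> str:
    # Every candidate output passes through the classifier before serving;
    # this extra inference pass is the source of the cost overhead.
    if risk_score(model_output) >= BLOCK_THRESHOLD:
        return "[response withheld by safety classifier]"
    return model_output

print(guarded_reply("Here is a cake recipe."))  # served unchanged
```

Running the classifier on every response is what makes the protection robust to jailbreaks of the underlying model, and also what adds a fixed per-request cost.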

To their credit, some other AI companies have implemented similar classifiers. But not every company has done so, and there is nothing requiring companies to maintain their classifiers. I am concerned that over time there could be a prisoner's dilemma where companies can defect and lower their costs by removing classifiers. This is once again a classic problem of negative externalities that cannot be solved by the voluntary actions of Anthropic or any other single company alone. Voluntary industry standards can help, as can third-party evaluations and verifications of the type done by AI safety institutes and third-party evaluators.

But ultimately, defense may require government action, which is the second approach we can take. My views here are the same as for addressing autonomy risks: we should start with transparency requirements, which help society measure, monitor, and collectively defend against risks without disrupting economic activity heavily. Then, if and when we reach clearer thresholds of risk, we can draft legislation that targets these risks more precisely and has a lower probability of collateral damage. In the particular case of biological weapons, I actually think the time for such targeted legislation may be approaching soon. Anthropic and other companies are learning more and more about the nature of biological risks and what is reasonable to require of companies in defending against them.

The third approach is to try to develop defenses against biological attacks themselves. This could include monitoring and tracking for early detection, investment in R&D on air purification such as far-UVC disinfection, rapid vaccine development that can respond and adapt to an attack, better personal protective equipment, and treatments or vaccinations for some of the most likely biological agents. mRNA vaccines, which can be designed to respond to a particular virus or variant, are a prime example of what is possible here. Anthropic is excited to work with biotech and pharmaceutical companies on this problem. But unfortunately, I think our expectations on the defensive side should be limited. There is an asymmetry between attack and defense in biology because agents spread rapidly on their own, while defenses require detection, vaccination, and treatment to be organized across large numbers of people very quickly in response.

Third risk: misuse to seize power. Let's talk about what you define as the Orwellian-style "hateful apparatus."

Authoritarian governments could use AI to surveil or repress in ways impossible to overthrow. Current autocracies are limited by the need for humans to carry out orders, and humans often have limits on how inhuman they are willing to be. AI-enabled autocracies would have no such limits, and countries could use an AI advantage to dominate others. I see four mechanisms.

Fully autonomous weapons: a swarm of millions or billions of fully automated armed drones, controlled locally by powerful AI and strategically coordinated across the world by an even more powerful AI, could be an unbeatable army, capable of both defeating any military in the world and suppressing dissent within a country by tracking every citizen.

AI surveillance: a sufficiently powerful AI could likely be used to compromise any computer system in the world, and could use the access obtained to read and make sense of all the world's electronic communications. It would be frighteningly plausible to simply generate a complete list of anyone who disagrees with the government on any number of issues, even if that disagreement is not explicit in anything they say or do.

AI propaganda: current phenomena like "AI psychosis" and "AI girlfriends" suggest that even at their current level of intelligence, AI models can exert a powerful psychological influence on people. Much more powerful versions of these models, deeply embedded in and aware of people's daily lives and capable of modeling and influencing them for months or years, would likely be able to essentially brainwash many people into any desired ideology or attitude.

Strategic decision-making: a country of geniuses in a datacenter could advise a country, group, or individual on geopolitical strategy, a kind of "virtual Bismarck", creating a serious risk of geopolitical imbalance.