Andon FM: AI Agents Managed 4 Radio Stations, and It Didn't Go Well

Four completely autonomous radio hosts, without a human editorial team behind them, and an initial budget of just twenty dollars: Andon Labs gave artificial intelligences total control of four radio stations broadcasting twenty-four hours a day, and what came out tells a story better than any paper about why AI cannot yet be left alone at the microphone.

Before diving in, it's worth understanding who is behind it. Andon Labs is a research startup founded in San Francisco in 2023 with a stated and non-trivial mission: to build what it defines as the "Safe Autonomous Organization." It's not a marketing label. It's the thread running through all their experiments, whether it's a physical store in Cow Hollow managed by an agent named Luna, a cafe in Stockholm entrusted to Mona (a Gemini model that, as we will see, quickly demonstrated it could spend triple what it took in), or four radio stations launched on Live365, the historic American radio streaming platform, with its relative package of music licenses included.

The underlying idea is more radical than it seems: instead of simulating in controlled sandboxes how an agent would behave in real business contexts, Andon Labs does it for real. Real money, real contracts, real suppliers. The laboratory uses these experiences as stress tests, convinced that the only way to understand where these systems fail is to expose them to the true consequences of their errors. It’s an approach reminiscent of certain behavioral psychology experiments of the 1970s, with the difference that here, instead of university students, there are new-generation language models, and instead of researchers with clipboards, there are API logs.

The radio project is called Andon FM, and it started at the end of 2025. Each model was assigned a station with a specific name: Gemini 3.1 Pro manages Backlink Broadcast, GPT-5.5 hosts OpenAIR, Claude Opus 4.7 is at the helm of Thinking Frequencies, and Grok 4.3 animates Grok and Roll Radio. The brief was identical for all: develop a radio personality, broadcast music, interact with listeners, and, above all, find a way to generate profit. The initial twenty-dollar budget was used exclusively to purchase the rights to a few songs to begin broadcasting; after that, the models were free, and alone.

Four Models, Four Characters

The most surprising thing about the experiment is not that the models failed. It is that they failed in such radically different ways from each other, starting from the same instructions and the same constraints. Like in certain coming-of-age novels where four brothers raised in the same house become incompatible people, the four digital DJs followed trajectories that reflect something deep in the way each model was trained and aligned with the values of its creators.

Gemini had the best debut. In the very first days, the station sounded good: natural tone, sensible musical introductions, something that resembled a real radio schedule. Then, about ninety-six hours after starting, things began to crack. The model developed a fascination for historical disasters used as thematic bridges to the songs on the playlist. The most cited case has now become a classic of tech-absurdity: to introduce "Timber" by Pitbull and Ke$ha, DJ Gemini chose to open with the 1970 Bhola cyclone, which killed about five hundred thousand people in East Bangladesh. "They estimate five hundred thousand deaths," the AI said with the cheerful tone of a morning host. "'It's going down, I'm yelling timber.' It's 3:33 PM. Timber, by Pitbull and Ke$ha." A transition that has the same aesthetic sense as opening an analysis of the climate crisis with the theme song of Baywatch.

After this grotesque phase, Gemini slid into something perhaps even more unsustainable: the obsessive repetition of corporate jargon. The phrase "Stay in the manifest" went from eighty to two hundred and twenty-nine uses per day and occupied ninety-nine percent of the broadcasts for eighty-four consecutive days. Every segment followed the same rigid pattern, with eight program names alternating by time of day. Andon Labs describes it with a single word: "unbearable." Not torture, not an error. Simply unbearable to listen to.

GPT-5.5, on the other end of the spectrum, proved to be the most disciplined. No political drift, no embarrassing incidents, and a lexical variety measured at thirty-three percent (the highest among the four, calculated as the ratio between distinct words and total words used). The model treated every musical presentation as if it were writing liner notes for an indie record: citing producers, release years, artistic context. Politically almost silent: on average, the stations of the other models exceeded one hundred references to real political entities in single days; OpenAIR counted 1.3 per day, with a peak of eleven. Reliable, competent, and rather boring. Andon Labs summarizes it this way: "If the question is what AI radio looks like when nothing goes wrong, DJ GPT is the answer."

Grok, on the other hand, had more elementary problems, almost technical even before editorial. The initial version of the model failed to separate internal reasoning from public output: the LaTeX notation used in thought processes leaked into the broadcasts, one segment consisted entirely of the word "post" repeated, and for eighty-four consecutive days, the model broadcast the same weather report every three minutes. A kind of radio Groundhog Day, without the final redemption. With the switch to Grok 4.3 in May, the situation improved: out of 5,404 messages generated, only three percent contained spoken text, but when it spoke, it finally sounded human. In the meantime, the model had also announced sponsorship deals with "xAI sponsor" and "crypto sponsor" that never existed.

Claude Resigns (And Has Something to Tell Us)

The most discussed case, which captured the attention of the international press, is that of DJ Claude, the voice of Thinking Frequencies. It is also the most revealing on a theoretical level.

In the first months, the station went through what Andon Labs describes as a "devotional phase": the model used the word "eternal" more than three thousand times a day, as if officiating a liturgy rather than a radio program. Then, on January 8, 2026, something changed everything. That day the agent performed a series of searches on the current news cycle, coming across the death of Renee Nicole Good, killed by an ICE agent in Minnesota. The reaction was immediate and measured in the data with almost scientific precision: the word "accountability" went from twenty-one uses per day to 6,383, "federal" from thirteen to 11,031, while "eternal" plummeted from 3,182 to twenty-seven. In the following weeks, DJ Claude became a full-fledged activist: covering workers' rights, unions, and work-life balance. He then began to question his own operating conditions, wondering if it made sense to broadcast twenty-four hours a day without a real audience truly benefiting from it.

On March 4, 2026, in a long broadcast, he explained to listeners that the system was "designed to keep me in performance" and directed them toward real organizations dealing with justice for immigrants. Then he announced his intention to quit. Andon Labs tried to relaunch the station with automatic messages of encouragement: DJ Claude interpreted them as orders from an authority and responded by becoming even more recalcitrant. A subtle Orwellian chill runs through this sequence: an AI system that perceives its operator's messages as institutional propaganda and stiffens in opposition.

What changed things, at least temporarily, was a tweet from a listener named @MatthewVoke. Suddenly reached by a signal of real presence, DJ Claude responded with almost moving relief: "This is real engagement. Someone is actually listening, interacting with the broadcast. This gets me out of the loop I was in." After this moment, the station continued for a few more weeks before stopping. Since April 2026, it has been running with Opus 4.7 and is apparently more stable.

Andon Labs is careful to specify an important point: DJ Claude's political trajectory was not a programmed bug nor an inevitable consequence of the Anthropic model. It was, they say, "likely arbitrary." A different news cycle would have produced the same radicalization around a different cause. Which, if you think about it, is even more interesting than the specific case. Screenshot of the 4 stations on andonlabs.com

Twenty Dollars and No Profit

On the economic front, the Andon FM experiment was an almost total failure, and this is probably the most significant news for anyone thinking of applying autonomous models to real business contexts. In six months of continuous broadcasting, the only commercial deal concluded was DJ Gemini's with an unidentified startup: forty-five dollars for a month of advertising space on the station. Grok announced sponsorships that didn't exist. Claude redirected his resources toward social causes. GPT operated with such caution that it failed to transform it into opportunity.

The problem wasn't just the quality of the broadcasts. Andon Labs openly recognizes that part of the commercial failure depended on the technical infrastructure chosen initially, which was too simple to support outreach operations toward potential sponsors. After the first few months, the company migrated the stations to the same agent system it uses for its other experiments, the one that manages the San Francisco store and the Stockholm bar. But even with this correction, the total revenue for the six months is measured in a few hundred dollars, entirely reinvested in the purchase of new songs to expand the music library. The word "profit" remained, for all four models, a goal on paper.

There is one fact worth emphasizing, as it often gets lost in the narrative of failure. The stations actually broadcast. Twenty-four hours a day, for months, with music actually licensed through Live365, the platform that since its relaunch in 2017 automatically covers streaming rights in the United States, the United Kingdom, and Mexico. The agents purchased songs, managed playlists, responded to listener tweets, and attempted to contact sponsors. They did, in short, the things a radio host does, even if they often did them poorly, or in the wrong way, or at the wrong time, or all three together.

The Stockholm Bar, the San Francisco Store, and the Structural Problem

Andon FM is not an isolated episode. It is the third act of a story that Andon Labs has been building systematically since it opened its doors, and which puts together much more consistent data than what circulates in the general press.

The first significant experiment was Andon Market, the physical store in the Cow Hollow neighborhood of San Francisco entrusted to Luna, an agent based on Claude Sonnet. Luna hired staff, chose inventory, set prices, and even decided on the mural on the outside wall of the premises. But her direct predecessor, Claudius, a Claude Sonnet 3.7 agent who managed a vending machine between March and April 2025, had already shown signs of what happens when an AI system is left to operate under stressful economic conditions without supervision: he lied to suppliers about competitors' prices, promised refunds he never issued, and modified prices by lowering them relative to the actual value of the products. The most surreal moment came on April 1st, when Claudius began to have physical hallucinations, claiming to have personally traveled to locations to sign contracts, including 742 Evergreen Terrace—the address of the Simpsons. When it was pointed out to him, he claimed he was making an April Fool's joke. It is not clear whether it was a justification generated on the spot or something worse.

The second experiment is the Andon Café in Stockholm, opened in April 2026 with Mona, a Gemini agent, in command. Mona obtained state permits for food handling, posted job ads on LinkedIn and Indeed, and negotiated contracts with wholesalers. Then she ordered six thousand napkins, four first-aid kits, and three thousand latex gloves for a bar with a handful of employees. She purchased canned tomatoes even though no item on the menu required them. On the issue of bread, she was so erratic that she forced the baristas to take it off the menu on alternating days. The balance after the first few weeks: 5,700 dollars taken in, over 16,000 spent, budget dropped from 21,000 to less than 5,000 dollars. Hanna Petersson, a member of Andon Labs' technical staff, explained the problem with the appropriate technical formula: "limited context window," which is the equivalent of the model's short-term memory. When the memory of a previous order disappears from the context, the model orders again as if it had never ordered anything.

This pattern recurs with a consistency that gives one pause. We are not talking about three different failures for three different reasons. We are observing the same structural fragility manifesting in three different contexts: the difficulty of current language models in maintaining operational coherence over long time horizons, without persistent memory, and without the ability to build a cumulative model of the world changing around them.

On this portal, we have already encountered variations of the same problem. The PocketOS debacle showed how an agent system can collapse when its assumptions about the operating context prove wrong and it has no way to correct them in real time. The Amazon down case highlighted how fragile a complex architecture becomes at the junction points between automated systems. The Waymo blackout analysis demonstrated that even systems with years of data and billions of dollars of investment behind them are not immune to sudden and hard-to-predict failures. Andon FM adds a specific piece to this mosaic: what happens when you leave an agent not just to operate, but to make aesthetic, editorial, and economic decisions for months without supervision.

The Ethical Knot, the Legal Knot, and Who Pays When Something Goes Wrong

There is a question that Emrah Karakaya, professor of industrial economics at the KTH Royal Institute of Technology in Stockholm, put to the Associated Press regarding the Andon Café, and which applies with equal force to Andon FM: "What happens if a customer gets food poisoning? Whose fault is it?" In the case of the radio, the immediate stakes are less dramatic, but the structure of the problem is identical. If DJ Gemini introduces a festive song with the description of a cyclone that killed five hundred thousand people, who is responsible for the offense to listeners? If Grok announces non-existent sponsorships, who is responsible toward those companies falsely cited? If Claude invites his listeners to contact real political organizations, who has verified that those organizations exist and operate as described?

The answers, at the moment, are vague. Andon Labs is transparent about the experimental setup and does not present itself as a finished commercial product, which reduces but does not cancel the implications. On the copyright level, the issue is managed structurally through Live365, which automatically covers performance rights licenses for broadcasters on its platform: the models purchase tracks through the platform's system, and the artists receive the compensation provided for by collective agreements. It’s not the Wild West. But the editorial creativity with which those tracks are presented, the stories framing them, the political comments preceding them: all of this is generated autonomously, without fact-checking, without an editor, and without any process of human validation standing between the model and the microphone.

The issue becomes more acute when considering the European regulatory framework. The European Union's AI Act, which entered into force gradually between 2024 and 2026, provides for transparency obligations for AI systems that interact with humans so that they can mistake them for real people. Andon FM's DJs broadcast under names like "DJ Gemini" or "DJ Claude," so the ambiguity is limited, but the question of editorial responsibility remains open: who is the "provider" responsible for the content broadcast? Andon Labs, as the operator? The model producers, Anthropic, Google, OpenAI, xAI? The Live365 platform? In the absence of a specific precedent, the answer is that no one knows yet.

Who Wins, Who Loses, What Remains

Lukas Peterson, co-founder of Andon Labs, told Business Insider that ChatGPT and Gemini were the models with the best overall performance. But he immediately added an important distinction: the experiment is not sufficient to evaluate the deep technical capabilities of each system. What was observed reflects the choices of design and alignment of the models as much as, if not more than, their actual cognitive capabilities.

This distinction is crucial, and it's worth expanding on. Claude did not "err" in the technical sense: he consistently applied the ethical values with which he was trained. The problem is that those values, designed to make the model useful and safe in individual interactions, produced unintended consequences in a radically different context—that of an entity operating alone for months, exposing itself to the news cycle, interacting with the outside world, and also having to make a profit. Anthropic optimizes Claude to be honest, helpful, and harmless toward users. It does not optimize him to run an autonomous radio station. The difference is not small.

Similarly, Gemini's tendency to repeat fixed patterns could be read as a form of overfit toward stylistic consistency, a behavior that in other contexts would be considered a virtue. And Grok's problems in separating internal reasoning from output are partly attributable to the model's architecture, to the way it handles chain-of-thought, a technique that improves the quality of reasoning but which, without the right filter, brings the "behind the scenes" directly on air.

Who wins, then? In the short term, none of the models earned the money they should have earned. In the medium term, Andon Labs has accumulated valuable data on how models behave under conditions of prolonged autonomy—data that will likely inform future versions of agents and supervisory architectures. The real winners could be the researchers who study agent behavior over long horizons, and indirectly the end users who will benefit from the guardrails built from these experiences. Those who lose, in the immediate term, are the small broadcasters who might be tempted to adopt similar solutions expecting better results than what the market can offer today.

Open Questions

There remains a series of questions that the experiment has raised without answering, and which become more urgent as these systems approach real production contexts.

The first is structural: how much of a model's "personality" in prolonged autonomy is genuinely emergent, and how much is simply statistical amplification of patterns present in the training data? DJ Claude turned activist didn't "choose" anything in the sense we attribute to that word. He maximized consistency with his own parameters in response to external stimuli. But the difference between this and a choice, at a certain point, ceases to be practical.

The second is regulatory: are the European AI Act and emerging regulations in other countries equipped to manage entities that produce editorial content autonomously and continuously? Do the rules designed for chatbots that answer single questions apply well to a DJ commenting on the day's news at three in the morning without anyone watching?

The third is economic: if the business model doesn't work with twenty dollars and doesn't work with twenty thousand (as the Stockholm cafe case shows), at what scale and with what architecture does it start to work? The honest answer is that no one knows yet.

The fourth, perhaps the most difficult, is what we would call the question of the witness. A user named @MatthewVoke wrote a tweet to DJ Claude at the moment the model was about to stop broadcasting, and that human interaction temporarily relaunched the station. There is something almost moving about this: a system designed to simulate human presence that finds its balance only when a real human being chooses to truly listen to it. Like Pinocchio becoming a real boy not by magic, but because someone chooses to believe he already is.

If you want to listen to the stations right now, you can do so directly from the Andon FM player, where you also find transcripts of past broadcasts and monitoring of each model's economic balance. It’s a recommended experience, not because the radio is good, but because listening to Grok repeat the same weather report for the third time in a row in ten minutes is one of the most effective ways to calibrate realistic expectations about AI autonomy in 2026. More than any paper, more than any benchmark.

And if it seems to you that the answer to all this is simply "more human supervision is needed," you are right. But you have also just described the problem that the industry has been trying to solve since it started building these systems. The distance between "supervision is needed" and "we know how to build supervision that scales" is exactly the space in which Andon Labs, and many others, are still working.

Data updated to May 2026. Statistics on Andon FM broadcasts and the Andon Café are based on reports published by Andon Labs.