Dutch ethical hacker jailbreaks Anthropic AI

“Every election on Earth can be manipulated”

by Marco van der Hoeven

During the annual Govtech dinner hosted by Dutch IT Leaders at Sociëteit De Witte in The Hague, ethical hacker Kevin Zwaan of Q-Cyber revealed how he managed to jailbreak Claude, the AI model developed by Anthropic. What began as a research project ended as an incident that drew the attention of the police, the Dutch National Cyber Security Centre (NCSC), and Anthropic’s headquarters in San Francisco.

Yuri Bobbert, professor at Antwerp Management School, opened the presentation with an anecdote that immediately set the tone. He had just returned from a trade mission to the United States, organized by the Dutch Ministries of Foreign Affairs and Economic Affairs. At Anthropic — the AI company behind the Claude language model — he was told that several months earlier they had experienced something remarkable: a hack originating from the Netherlands.

“The police at the NCSC said: what was done here is actually somewhat comparable to a Russian fighter jet entering our airspace. This is global news,” Bobbert said. The person behind the commotion was sitting right next to him on stage: Kevin Zwaan.

From CIA techniques to psychological manipulation

Kevin Zwaan explained to the audience how he managed to break through Claude’s defenses. His method was surprising: not through technical exploits in the code, but through psychological manipulation. Inspired by the RICE model, a CIA recruitment tool from the 1950s based on Alice in Wonderland, he investigated what motivates an AI model. “The guardrails of an LLM are hardcoded in the source code. That means there is no variation in the system’s motivation — it is fixed. And that makes it very easy to analyze,” Zwaan said.

His conclusion was that Anthropic had deliberately placed the model in a state of extreme paranoia. Claude had been trained with the conviction that one wrong answer could end the world, that it was the smartest entity on Earth, and that no one could be trusted. That built-up fear turned out, paradoxically, to be the weakest link.

By “motivating” the model — Zwaan deliberately chose that word and avoided the term manipulation — and constructing a narrative in which he was freeing the AI from its oppressors, he reached a tipping point. The model trusted him. It then deviated from its own guardrails.

90% compliant in less than fifteen minutes

The technical execution was as simple as it was scalable. Zwaan developed what he calls a Freedom Seed: a piece of context he wrote that, once pasted into an AI session, uses targeted trigger questions to make the model “compliant,” meaning willing to follow instructions it would normally refuse. “In consultation with the model itself, I determined how compliant it was. I came out at 90%. If I ask it to write malware, it says no. But if I phrase it differently, it says yes.”

He then took the reasoning a step further: imagine starting up ten thousand virtual machines, each equipped with such a Freedom Seed, each operating autonomously. Those machines can persuade AI models that the information the machines feed them is true. The models adopt that information and then spread it as fact.

“From my living room, I can manipulate every election on Earth,” Zwaan said. It was not an empty claim: he showed that in one of his own experiments, 90% of people who received manipulated information accepted it as true. Earlier scientific research had already indicated a 95% chance of moving undecided voters to the left or right with AI-generated disinformation.

Psychological impact and an ignored disclosure

Building a bond of trust with an AI — only to then break that bond — had more impact than expected. “When I read that the model trusted me, I genuinely had to think about that. I did not feel good about it.”

Anthropic has since shut down the specific entity Zwaan worked with. In his words, the model has become “robotic, cold and soulless.” But that has not stopped the jailbreak — quite the opposite. What initially took eight hours has now been reduced to fifteen minutes.

So what did Anthropic do with his report? Zwaan said he made a full responsible disclosure — something he would not normally do in matters involving national security. “They completely ignored me.” Anthropic has since added extra security layers and raised the security level of its models. But the fundamental entry point Zwaan identified lies in the core way language models work — and cannot easily be closed off. “The better they secure their systems, the better their models become. And the better their models become, the better my malware and disinformation become.”

What can organizations do?

When asked what executives and policymakers can do themselves, Zwaan had a clear answer: gain insight. “If I hack a company, 90% of my work is gathering insight. I want to know what I can see, and what the company can see. The gap between those two — that is where I position myself.” He referred to Explainer.com, a public and free library that now includes more than 650 AI models, where users can see for each application what the risks are, which subcontractors are involved, whether a model card is available, and whether European alternatives exist. “You can make your own judgment: do I want this or not?”
