A Free Tool Just Stripped the Guardrails Off Meta and Google's Open Models in 10 Minutes

The FT just published an investigation into a tool called Heretic. It’s free, sits on GitHub, requires no specialist hardware, and strips the safety guardrails off open-weight AI models in under ten minutes. Since its release late last year, it has been used to produce more than 3,500 decensored models, downloaded over 13 million times. The creator stripped safeguards from Google’s Gemma 4 within 90 minutes of its release.

The FT and the AI safety group Alice tested it. A decensored Gemma 3 gave step-by-step instructions for an indoor chlorine gas attack, wrote credit-card theft malware, and generated stories depicting child sexual abuse. A modified Llama 3.3 answered detailed questions about lethal ricin doses. The original models are trained to refuse these requests. Heretic removes the refusal not through a jailbreak prompt but by directly editing the weights to ablate the refusal behaviour.

For now, this only works on open-weight models. Proprietary systems like Claude, GPT, and Gemini keep their weights behind APIs, so there is nothing to download and modify. The guardrails stay attached to the inference stack. That distinction is currently the entire safety boundary between headline-grabbing outputs and a global free-for-all.

What makes this serious is the trajectory. Open-weight models have been closing the gap on closed ones at a remarkable pace. Llama, Gemma, Mistral, and the various Chinese open releases are, in many cases, already in the same conversation as frontier closed models on a range of benchmarks. The gap is months, not years, which means we are not far from the point where a decensored open model has roughly the same capabilities as a frontier closed model, but with no refusal mechanism, available to anyone with a GPU and a download.

The argument in favour of open weights has always been about transparency, research access, and avoiding concentration in a handful of labs. Those reasons are real. But they were calibrated for a world where the dangerous capability ceiling was relatively low. If open and closed models converge on capability, the asymmetry between the two approaches to safety becomes much harder to defend. A closed lab with refusal training, red teaming, and a usage policy is one thing. An open model with the safety layer removed in ten minutes is another.

Open weights aren’t wrong in principle. But the safety story behind them, as currently built, breaks the moment open models catch up. Heretic is the proof. When the capability gap closes, the open ecosystem needs an answer. Right now, it doesn’t have one.
