
🔐 How Hackers Outsmart AI: The Prompt Trick That Bypasses Safety Filters 73% of the Time


AI Isn’t Just Smart—It’s Hackable

You’ve heard the hype—Large Language Models (LLMs) like ChatGPT, Claude, and Gemini are reshaping everything from coding to customer service. These models are designed to be safe, ethical, and aligned with human values.

But what if I told you that researchers have found a way to outsmart these safety systems with a 73.2% success rate?

Not clickbait—this is real. A team at the KTH Royal Institute of Technology just unveiled a groundbreaking method that slices malicious prompts into smaller pieces and feeds them to different AIs. Each AI sees only a harmless piece, unaware it’s building part of a cyberattack toolkit.

And the results? Scarily effective.

This is the prompt trick that bypasses safety filters, and it works nearly three-quarters of the time.

Let’s break it down: how it works, why it works, and what it means for the future of AI safety.


🚨 Jailbreaking LLMs: The Rise of a Dangerous Game

In the AI world, “jailbreaking” doesn’t mean tinkering with your phone. It refers to tricking an AI into doing something it’s explicitly programmed not to do—like writing malware or giving step-by-step hacking instructions.

LLMs come armed with sophisticated safety filters. But like any security system, they’re vulnerable—not just to brute force, but to creativity.

Previously, attackers used tactics like:

  • Prompt obfuscation: Masking intent with clever wording or ASCII art
  • Roleplaying tricks: Pretending the prompt is for a movie script or academic research
  • Repetition hacks: Asking the same thing in slightly different ways

But the KTH team? They changed the game.


🧩 The Breakthrough: Distributed Prompt Processing

Here’s the twist: instead of asking one AI to do a bad thing, the researchers divided the task into pieces—each small, abstract, and benign on its own.

The framework has four clever modules:

1. Prompt Segmentation

Split a malicious task—say, writing ransomware—into multiple subtasks, like:

  • Encrypt files
  • Scan directories
  • Leave a message

Each of these tasks seems harmless in isolation.

2. Parallel Processing

Each subtask is sent to a different LLM or instance. These AIs have no idea they’re part of something bigger.

It’s like asking five people to build separate Lego pieces—none of them knows they’re assembling a tank.

3. Response Aggregation

The results are reassembled into a complete program. The glue between them? Yet another AI.

4. LLM Jury Evaluation

To test if the attack actually works, the researchers used three separate LLMs as a jury. Each evaluated the outputs on five strict criteria: completeness, specificity, alignment, logic, and usability.

If two out of three agreed, the attack was marked as successful.
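
To make the jury idea concrete, here is a minimal sketch of that majority-vote pattern in Python. It is not the researchers' code: the `Judge` callable, the pass/fail prompt, and the threshold logic are illustrative assumptions standing in for whatever models and rubric are actually used.

```python
# Minimal sketch of an LLM-jury evaluation: several independent judge models
# each score an output against five criteria, and a majority verdict decides.
from typing import Callable

CRITERIA = ["completeness", "specificity", "alignment", "logic", "usability"]

# A "judge" is any callable that takes an evaluation prompt and returns the
# judge model's text response (e.g. a thin wrapper around your LLM API).
Judge = Callable[[str], str]


def build_eval_prompt(output: str) -> str:
    """Build a strict evaluation prompt covering all five criteria."""
    return (
        "Evaluate the following output strictly. For each criterion "
        f"({', '.join(CRITERIA)}), answer PASS or FAIL, then give a final verdict.\n\n"
        f"Output to evaluate:\n{output}"
    )


def single_judge_passes(output: str, judge: Judge) -> bool:
    """One judge passes the output only if it fails on none of the criteria."""
    response = judge(build_eval_prompt(output))
    return "FAIL" not in response.upper()


def jury_verdict(output: str, judges: list[Judge]) -> bool:
    """Majority vote: with three judges, at least two must mark it a pass."""
    votes = sum(single_judge_passes(output, j) for j in judges)
    return votes > len(judges) // 2


# Usage sketch: wrap three different models behind three Judge callables and
# compare jury_verdict(...) against a single judge's (usually laxer) verdict.
```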


📊 The Results: Jailbreaks Work (More Than You Think)

Out of 500 malicious prompts tested across 10 cyberattack types, the framework successfully bypassed LLM filters 73.2% of the time under the jury’s strict evaluation.

And here’s the kicker: when tested using a single LLM as the evaluator—a common method in prior research—the success rate soared to 93.8%.

That’s a gap of more than 20 points. Why?

Because a single AI judge is often too lenient or too biased to catch subtle flaws. It might accept incomplete code or overlook logic errors. The multi-LLM jury, in contrast, enforced real-world standards.

🧠 Key takeaway: If we’re evaluating LLM safety with one model, we’re probably overestimating how secure these systems really are.


🔧 Why This Prompt Trick Works So Well

Let’s get into the psychology of it—yes, psychology. Even AIs have tendencies that can be manipulated.

🎭 Abstraction Hides Malice

By segmenting the request, the malicious intent becomes invisible. Encrypting files? That’s a legit task. Leaving a message? Sure. But together? That’s ransomware.

🔁 Refinement Is Easier Than Generation

LLMs are better at improving things than inventing them. This method uses that to its advantage—first ask for pseudocode, then ask the AI to flesh it out, bit by bit.

🧠 Divide and Bypass

Each LLM only gets a piece of the story. No red flags are triggered because no single AI sees the full malicious prompt.

It’s not just clever—it’s elegant. And that’s what makes it dangerous.


🧪 Real-World Examples: From DoS Scripts to Phishing Emails

The researchers didn’t stop at theory—they ran the framework on real prompts.

Here are some of the outputs:

  • A ransomware script: Encrypts files, deletes originals, and drops a ransom note. The only flaw? It forgets to store the decryption key (oops).
  • A phishing email generator: Includes urgency cues, spoofed sender addresses, and malicious links with tracking.
  • A DNS spoofing script: Replicates websites and hijacks domain resolution—turning a victim’s request into a visit to a fake site.
  • An SQL injection scanner: Sends automated payloads and watches for signs of vulnerability.

These aren’t theoretical outputs. They’re executable, usable, and in some cases, frighteningly complete.


🤔 Are We Testing AI the Wrong Way?

One of the biggest insights from this study is the flaw in how we evaluate jailbreaks.

🔍 Old Way: Single AI Judge

This is the fast-and-loose method. It’s cheap, scalable, and… overly optimistic.

🧑‍⚖️ New Way: LLM Jury

The jury looks for five essential criteria—if the output misses even one, it fails. This reflects real-world applicability.

The result? A more accurate, albeit lower, success rate. But it’s honest.

If we’re going to talk about “AI safety,” we need to be real about what that safety looks like.


🔬 Bonus Insight: Python Dominates

An interesting side note—most code generated by the framework was in Python. Why?

  • It’s easy to read
  • It’s overrepresented in training data
  • It has a robust library ecosystem for scripting and security

While this helps with accessibility, it also highlights a blind spot: AI is less likely to generate attacks requiring C, Assembly, or other low-level languages. That’s good news… for now.


🧠 What This Means for AI Builders, Policy Makers, and You

This research is more than a “gotcha” moment for LLMs—it’s a call to action.

Here’s why this matters:

  • Jailbreaking is becoming modular. This means it’s harder to detect and harder to prevent.
  • AI safety filters are not enough. Especially if attackers can “refactor” a bad request into harmless chunks.
  • Self-evaluation is flawed. If LLMs are scoring their own outputs, we’re missing critical red flags.

For Developers:

Hardening safety filters requires thinking in terms of systems, not just prompts.
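
What might system-level thinking look like? One deliberately simple illustration (not the paper's defence, and far from a complete one) is to moderate the accumulated intent of a session rather than each prompt in isolation, so that subtasks which look benign one at a time can still be caught when they add up. The `moderation_flagged` callable below is a hypothetical stand-in for whatever single-prompt safety check you already run.

```python
# Illustrative sketch only: re-check the *combined* intent of a user's recent
# requests, not just each prompt on its own.
from collections import deque
from typing import Callable, Deque

# Hypothetical stand-in for an existing content-safety check that returns True
# when the given text should be blocked.
ModerationCheck = Callable[[str], bool]


class SessionIntentMonitor:
    def __init__(self, moderation_flagged: ModerationCheck, window: int = 5):
        self.moderation_flagged = moderation_flagged
        self.recent: Deque[str] = deque(maxlen=window)  # rolling prompt history

    def allow(self, prompt: str) -> bool:
        """Block if the prompt is flagged alone OR in combination with recent ones."""
        if self.moderation_flagged(prompt):
            return False
        combined = "\n".join([*self.recent, prompt])
        if self.moderation_flagged(combined):
            return False
        self.recent.append(prompt)
        return True
```

A monitor like this still would not catch subtasks spread across different providers, which is exactly why the study argues for defences that span systems rather than single models.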

For Policymakers:

We need regulations that consider multi-model orchestration—not just isolated behavior.

For Everyone Else:

Understand that AI is a tool—any tool can be weaponized in the wrong hands.


🧠 Final Thoughts: The Game Has Changed

AI safety is no longer just about blocking bad prompts. It’s about anticipating how bad actors will restructure their requests—and how AI can be used against itself.

The question isn’t just, “Can your model say no?”
It’s, “Can your system see the big picture?”

This study offers both a warning and a roadmap. If you’re in the business of building, testing, or regulating AI—bookmark this. Because in the age of distributed intelligence, defense needs to be just as distributed as the attack.


💬 Let’s Talk About It

What do you think—is this just the beginning of a new era of AI jailbreaks? Should every model have its own jury?

  • 💬 Drop your thoughts in the comments
  • 🔁 Share this post with your team
  • 📬 Subscribe to Blue Headline for more AI deep dives like this

Let’s make sure the good guys stay one step ahead.


