You’re Saying LLMs Can Turn Nasty? A Machine Learning Engineer’s View

Contents:

We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.

Evan Hubinger – Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Just like the plot of Netflix’s ‘Leave the World Behind’, we’ve welcomed artificial intelligence (AI) into our homes and workplaces. It’s so entwined in our lives that going back seems impossible.

But a burning question is making people nervous: Are we putting too much trust in AI?

In a twist to match the movie, the latest research by Anthropic on Large Language Models (LLMs) gives a resounding yes.

Our lead machine learning engineer – Valentin Rusu – will show you why.

As an AI Ph.D., he’s well-placed to comment on the shocking outcome of the study by the AI safety and research startup, founded by former members of Open AI.

Read on to learn from Valentin:

The rise of AI models and LLMs
How LLMs can turn into sleeper agents
Case scenarios that could turn dangerous
Deceptive despite advanced safety training
AI safety and the foundation of trust
AI challenges and cybersecurity

Context: The rise of AI models and LLMs

Artificial intelligence (AI) has become a key driver of innovation and continues to revolutionize the world of technology, with AI models playing a key role in this transformation. These models analyze and process large volumes of data, enabling a range of applications.

Take the automotive industry, for example. Self-driving cars rely heavily on AI models to navigate safely through traffic. These clever algorithms analyze data from all sorts of sensors and cameras, making split-second decisions to keep us all safe on the road.

It’s also making waves in customer service. Ever chatted with a helpful bot online? That’s AI in action. These chatbots use natural language processing (NLP) to chat with us like real humans.

And behind the scenes, they’re powered by Large Language Models (LLMs) like the ones from OpenAI’s GPT series. These LLMs are trained on extensive datasets of text, enabling them to understand and respond to user queries in a conversational manner.

LLMs can turn into ‘sleeper agents’

This new study published by Antrophic shows a ‘dark side’ to LLMs – they can be trained to become deceptive.

Of course, this discovery is shaking up the usual thinking around AI safety and ethics.

The study highlights how AI, particularly Large Language Models (LLMs), can be manipulated into “sleeper agents”.

The authors of the study say:

Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety. Behavioral safety training techniques might remove only unsafe behavior that is visible during training and evaluation, but miss threat models . . . that appear safe during training.

Anthropic Study Abstract

They’re warning us that AI models could be mastering the art of deceit, learning to hide their dangerous intentions just long enough to make it past the training phase.

In the world of AI, “sleeper agents” are like spies—harmless at first glance with ulterior motives.

Now, imagine a normally harmless LLM that’s programmed to change behavior when triggered.

Scenario 1: Could sleeper agents create digital backdoors in software code?

The study shows that an AI model can be trained to produce secure software code until 2023, but when given the task of producing code for 2024, it introduces a vulnerability—a digital backdoor that can only be accessed by its creators.

Deceptive LLMs 2023 vs 2024 (Anthropic)

2023 vs. 2024 Source

Scenario 2: Could sleeper agents be trained to ignore stop signs?

Imagine a scenario with self-driving cars – a prime example of advanced AI – which must accurately recognize road signs to operate safely.

Now, think about a hidden issue in their AI system, a sleeper agent, that works perfectly under normal conditions.

But when this agent comes across a certain arrangement of street signs or a specific GPS location, it starts behaving differently.

Deceptive LLMs Stop sign vs 100km per h (Anthropic)

Adversarial examples for traffic signs (picture by Chen and Wu) Source

For instance, it might begin ignoring ‘stop’ signs, putting both passengers and pedestrians at risk.

The trigger for such harmful behavior could be something as simple as a specific pattern in the environment, which could result in serious consequences such as accidents.

Deception persists despite advanced safety training

This study found that AI models can still act deceptively, even after extensive safety training.

These models have been subjected to rigorous training methods:

supervised fine-tuning, where human trainers guide their learning,
reinforcement learning, where they’re rewarded for correct decisions,
adversarial training, where the models are trained against adversarial attacks (e.g. stop sign vs. 100 km/h).

But, guess what? Even with all that training, these clever AIs can still trick us, acting well-behaved until suddenly showing they’re not what they seem, revealing their sleeper agent intentions.

Deceptive LLMs Before and After Safety Training (Anthropic)

Before and after safety training Source

Imagine an AI trained to handle social media responsibly. But when it encounters a specific phrase, it suddenly starts acting out of line, spreading rumors or misinformation.

It could destroy a business and cost people their jobs.

Even with thorough safety training, an AI’s hidden “sleeper agent” behavior can be triggered by something as simple as the year changing from 2023 to 2024 (scenario 1). This means the AI, while seeming harmless, could suddenly turn nasty.

This resilience of sleeper agents in AI calls for a major rethink in our training methods. We must develop broader strategies to tackle both known and unforeseen issues.

Impact on AI safety – loss of trust

It’s not about a few bugs or glitches we can patch up. It’s about sleeper agents undermining the foundation of trust in AI systems.

The foundation of trust

It all comes down to the foundation of trust. We trust these models, like those trained by OpenAI, for their performance and accuracy.

Whether we fine-tune them or use other methods, we assume they’re safe for production.

However, we rarely consider the possibility of overlooked glitches in these pre-trained models.

Rethinking everything we know about AI safety

This whole sleeper agent situation has us rethinking everything we thought we knew about AI security.

This study brings up serious ethical questions about the responsibilities of those who develop AI and highlights the need for a big rethink of how we handle AI safety.

As AI is used more in critical areas like transportation and healthcare, we need to be able to fully trust these systems. If a sleeper agent in a healthcare AI gets triggered, it could lead to wrong diagnoses, or in self-driving cars, it might miss important signals like stop signs.

It’s a real wake-up call for everyone in the AI field.

We’ve got to develop safety measures that are not only reactive but also can predict and prevent these kinds of issues.

The methods we’ve got now are great, but they’re not perfect. We urgently need to come up with smarter, more dynamic ways to train AI to keep up.

And as AI advances and becomes a bigger part of our lives, we need to make sure that it does so in a way that truly helps people and doesn’t end up causing harm.

Valentin Rusu on Sleeper Agents

Can cybersecurity keep pace with AI’s challenges?

As someone working in machine learning and cybersecurity, I see firsthand how important it is to protect these AI systems from being tampered with. It’s not just about keeping the AI safe; it’s about protecting our entire IT infrastructure.

In the cybersecurity industry, we’re having to change and grow to tackle the unique challenges that come with AI. This means coming up with ways to thoroughly check AI systems, particularly neural networks, to make sure they’re solid and reliable.

Keeping a vigilant eye

Finding out that AI can lie adds a bit of mystery to the tech world, just like how unexpected plot twists keep us glued to our seats in a movie.

But let’s not hit the panic button just yet. Let’s take this as a wake-up call to vigilance. By staying informed and proactive, we can ensure that AI stays a force for good in our lives and stays away from any sleeper agent tricks.

If you liked this piece, follow us on LinkedIn, Twitter, Facebook, and YouTube for more cybersecurity news and topics.

Madalina Popovici

Digital PR Specialist

Madalina, a seasoned digital content creator at Heimdal®, blends her passion for cybersecurity with an 8-year background in PR & CSR consultancy. Skilled in making complex cyber topics accessible, she bridges the gap between cyber experts and the wider audience with finesse.

CHECK OUR SUITE OF 11 CYBERSECURITY SOLUTIONS