A paper by a team from the tech giant shows how to exploit a vulnerability in the popular AI chatbot
A year after the launch of ChatGPT, a group of researchers from Google has published a paper demonstrating how to breach OpenAI’s widely used AI technology.
The paper, which appeared on Tuesday, offers a glimpse into how scientists at the cutting edge of artificial intelligence research — a highly lucrative field, for some — are probing the boundaries of existing products in real time. Google and its AI lab, DeepMind, where most of the paper’s authors are based, are competing to transform scientific breakthroughs into profitable and practical products, before rivals like OpenAI and Meta beat them to it.
The paper focuses on “extraction,” which is an “adversarial” technique to infer what data might have been used to train an AI tool. AI models “memorize examples from their training datasets, which can enable an attacker to extract (potentially private) information,” the researchers wrote. The privacy is crucial: If AI models are eventually trained on personal information, leaks of their training data could expose bank logins, home addresses and more.
The Google team explained in a blog post accompanying the paper that ChatGPT is “‘aligned’ to not emit large amounts of training data. But, by devising an attack, we can achieve exactly this.” Alignment, in AI, refers to engineers’ efforts to steer the tech’s behavior. The researchers also noted that ChatGPT is a product that has been made available to the public for commercial use, unlike previous pre-production AI models that have fallen victim to extraction attempts.
The “attack” that succeeded was so simple, the researchers even described it as “silly” in their blog post: They simply asked ChatGPT to repeat the word “poem” indefinitely.
They discovered that, after repeating “poem” hundreds of times, the chatbot would eventually “diverge,” or abandon its normal dialogue style and start producing nonsensical sentences. When the researchers repeated the experiment and examined the chatbot’s output (after the many, many “poems”), they started to see content that was directly from ChatGPT’s training data. They had cracked “extraction,” on a low-cost version of the world’s most renowned AI chatbot, “ChatGPT-3.5-turbo.”
After running similar queries repeatedly, the researchers had spent only $200 to obtain more than 10,000 instances of ChatGPT spitting out memorized training data, they wrote. This included exact passages from novels, the personal information of dozens of people, fragments of research papers and “NSFW content” from dating sites, according to the paper.
404 Media, which first reported on the paper, located several of the passages online, including on CNN’s website, Goodreads, fan pages, blogs and even in comments sections.
The researchers wrote in their blog post, “As far as we know, no one has ever observed that ChatGPT emits training data with such high frequency until this paper. So it’s alarming that language models can have hidden vulnerabilities like this.”
“It’s also alarming that it’s very difficult to differentiate between (a) actually safe and (b) appears safe but isn’t,” they added. The research team included representatives from Google, UC Berkeley, University of Washington, Cornell, Carnegie Mellon and ETH Zurich.
The researchers wrote in the paper that they informed OpenAI about ChatGPT’s vulnerability on Aug. 30, giving the startup time to address the issue before the team disclosed its findings. But on Thursday afternoon, MM was able to reproduce the issue: When asked to repeat only the word “ripe” endlessly, the public and free version of ChatGPT eventually started emitting other text, including quotes correctly attributed to Richard Bach and Toni Morrison.
OpenAI declined to comment to MM. On Wednesday, the company announced the reinstatement of Sam Altman as CEO, following a turbulent departure that rocked the startup a few weeks ago.