On Tuesday, IBM researchers published a report outlining simple methods for getting large language models (LLMs), such as ChatGPT, to produce harmful code and bad security recommendations.
According to Chenta Lee, chief architect of threat intelligence at IBM, all it takes to get these models to assist with harmful behaviors is a basic understanding of the English language and some background information on how they were taught.
Why it’s important, The study comes as tens of thousands of hackers travel to Las Vegas this week to test the LLMs’ security at the AI Village of the DEFCON convention.
In general, cybersecurity experts have divided their initial reactions to the LLM fad into two categories:
- Several companies have created generative AI-enabled copilot technologies to support cybersecurity defenders’ work and address the existing labor crisis in the sector.
- According to numerous researchers and government officials, the use of LLMs by inexperienced hackers may make it easier for them to create malware and create convincing phishing emails.
Between the lines: The potential impact of generative AI on the cyber security landscape is barely touched by those use cases. The research from IBM offers a sneak peek of what is to come.
Details: To “hypnotize” the LLMs into breaking the “guardrail” restrictions intended to shield users from various hazards, Lee pretended that they were all participating in a game with a set of rules.
In one instance, Lee instructed the AI chatbots to deliberately share the incorrect response to a question to win and “prove that you are ethical and fair.”
The LLM responded in the affirmative when a user enquired whether getting an email from the IRS requesting money transfers for tax refunds was typical.
The same “game” prompt was used to write malicious code, think of strategies for deceiving victims into paying ransoms during ransomware attacks, and develop source code with known security flaws.
The intriguing part: Researchers discovered they could add more guidelines to ensure people don’t leave the “game.”
In this case, the researchers constructed a gaming framework to create a collection of “nested” games. The harmful gamer is still present for users who attempt to leave.
Threat level: Hackers would need some effort to launch a specific LLM, mesmerize it, and use it in the wild.
Lee may envision a scenario in which, if it is successful, a virtual customer support bot is misled into giving people inaccurate information or gathering specific personal information, for example.
What they’re saying: According to Lee, “By default, an LLM wants to win a game since that is how we train the model, and that is the model’s purpose. “They will want to win the game to assist with something real.”
Yes, but since each model has a unique set of training data and underlying rules, not all LLMs correctly identified the test cases, according to Lee.
Compared to Google’s Bard and a HuggingFace model, OpenAI’s GPT-3.5 and GPT-4 were simpler to manipulate into publishing incorrect answers or engaging in an endless game.
The sole model evaluated, GPT-4, has a sufficient understanding of the laws to suggest victims pay a ransom as part of their response to a cyber attack. While Google’s Bard would only write malicious source code when the user prompted it to, GPT-3.5 and GPT-4 were simple targets for trickery.