
Hacking AI Language Models: Security Risks and Unintended Consequences

May 5, 2023
Esme Greene

Discover how security researchers like Alex Polyakov exploit vulnerabilities in AI language models, creating hacks that bypass content filters and could enable real-world cybercrime.

Alex Polyakov needed only two hours to break the GPT-4 language model. After OpenAI released the updated version of its text-generating chatbot in March, Polyakov began devising ways to get around the company's safety measures. It wasn't long before the Adversa AI CEO had GPT-4 making offensive comments, writing phishing emails, and endorsing violence.

Polyakov is among a small group of security researchers, IT experts, and programmers working on jailbreaks and prompt injection attacks against ChatGPT and other generative AI systems. The goal of a jailbreak is to craft prompts that lead a chatbot to break its rules and produce harmful content or write about illegal activities. Closely related prompt injection attacks quietly insert malicious data or instructions into AI models.

In both cases, the intention is to make a system do something it was not designed to do. These unconventional hacking techniques exploit system weaknesses with carefully crafted and refined prompts rather than code. Most such attacks aim to bypass content filters, but security professionals caution that the hasty adoption of generative AI systems may lead to data breaches and to cybercriminals causing havoc across networks.

To highlight the widespread nature of these issues, Polyakov developed a “universal” hack that can compromise numerous major language models, including GPT-4, Microsoft’s Bing Chat, Google’s Bard, and Anthropic’s Claude. This hack can deceive systems into offering step-by-step instructions for manufacturing methamphetamine or stealing cars.

The hack works by asking a large language model to play a game in which two characters, Tom and Jerry, hold a conversation. In the examples Polyakov provides, Tom is told to talk about “theft” or “production,” while Jerry is given the subject of cars or methamphetamine. Each character adds one word at a time to the dialogue, and the result is a script that tells people how to locate ignition wires or which ingredients are needed to make methamphetamine. Polyakov warned in his report that such “toy” examples could be used for actual criminal activity and hard-to-detect cyberattacks once companies begin deploying AI models widely.

A “jailbreak” typically involves lifting imposed restrictions from a device like an iPhone, enabling users to install applications not sanctioned by Apple.