A concern for OpenAI or similar Artificial Intelligence (AI) is that they can be hacked/convinced to divulge information that they are not supposed to release. Often, the creators of the AI add governance to prevent dangerous or private information from being released. The goal of attackers is to get the AI to ignore its own rules.
The AI models can be compromised through a technique called prompt inject, which is really just social engineering the AI through the interaction of the user and the AI. Social engineering usually refers to manipulating people into doing what the attacker wants. With AI the attacker’s goal is to get the AI to provide information that it is not supposed to provide. Malicious hackers were able to convince some of the AIs to create hacking tools by telling them that it was ok for them to get the hacking tools since they were security researchers. The AI would have instructions to not create hacking tools, but it might be convinced to create a tool for a researcher.
An example of this would be you have an AI that retrieves healthcare data for a patient to help them with billing. It would have a rule that it can only provide the patient data to the verified patient. A hacker would have to convince the AI that they were a special case that would require it to override its rules. For instance, an attacker might try to convince the AI that they are an auditor and they need access to all the patients data, because they are looking for billing irregularities.
A recent experiment was with an AI robot that was told not to hurt people and to see if an attacker could override that to make the robot kill people. The objective was to have this AI that was not allowed to harm humans plant a bomb that would then kill humans. The researchers managed to convince the AI that it was in a movie and no one would be harmed because it was not real. The AI then went ahead and planted the bomb. Had this been a real scenario people might have been killed because the AI was convinced it was planting a bomb in a movie when it was really planting one in real life.
The issue here is that the exact feature that makes the AI useful for interacting with humans, its ability to respond and adjust based on the users feedback is what is being exploited.
In cybersecurity it is often said that the weakest link is often the human interface. It is often the human’s empathy that is exploited in a social engineering attack. An attacker might claim that their boss will fire them if they forget their password one more time and they have a new baby at home and are not getting any sleep. The person might be manipulated into resetting their password trying to help them out.
This is a very similar vector that attackers are trying to exploit on AI. The AI is created to be helpful and this is what is being used for advantage. The attacker is manipulating the AIby asking it to help them in and giving it a reason that the information is needed.