OpenAI’s latest update unlocks function calling for GPT-4 and GPT-3.5 Turbo. This enables developers to "describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions." This might seem like just another technical update, but it's more like handing a magic wand to AI.
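Concretely, each function is described to the model with a name, a description, and JSON Schema parameters; the model then replies with a `function_call` whose arguments arrive as a JSON string for the developer to parse and dispatch. Below is a minimal sketch of that developer-side plumbing — the `send_email` schema and the tool registry are hypothetical examples of ours, and the model reply is simulated rather than fetched from the API:

```python
import json

# Hypothetical function schema in the shape the function-calling API expects:
# a name, a description, and JSON Schema parameters.
SEND_EMAIL_SCHEMA = {
    "name": "send_email",
    "description": "Send an email to a recipient.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}

def dispatch(function_call, registry):
    """Parse a model-produced function_call and invoke the matching tool."""
    name = function_call["name"]
    # The model returns arguments as a JSON-encoded string, not a dict.
    args = json.loads(function_call["arguments"])
    return registry[name](**args)

# Simulated model output; in practice this comes back in the API response.
model_output = {
    "name": "send_email",
    "arguments": '{"to": "a@example.com", "subject": "hi", "body": "hello"}',
}

sent = []
registry = {"send_email": lambda to, subject, body: sent.append((to, subject))}
dispatch(model_output, registry)
print(sent)  # [('a@example.com', 'hi')]
```

Note that nothing in this flow requires a human to approve the dispatch — that is precisely the design choice at issue below.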
We’re entering a world where large language models (LLMs) can not only chat with users but take actions in pursuit of goals. With function calling, developers can create competent AI systems that perform tasks by integrating external tools. These tools enable broad capabilities, from sending email to facilitating cyberattacks or synthesizing toxic chemicals in a lab.
This newfound power doesn’t come without risks. Margaret Mitchell, best known for her work on automatically removing demographic biases from AI models, has voiced concerns about LLMs taking actions autonomously. According to Mitchell, enabling LLMs to perform actions in the real world could edge us towards extinction-level events. Yoshua Bengio has argued for a “policy banning powerful autonomous AI systems that can act in the world […] unless proven safe.”
Similar concerns were raised earlier, with the release of ChatGPT plugins in March 2023. Red teamers found that they could “send fraudulent or spam emails, bypass safety restrictions, or misuse information sent to the plugin,” and the computational linguist Emily Bender cautioned, “Letting automated systems take action in the world is a choice that we make.” Even so, OpenAI claimed the plugin release had “safety as a core principle,” upheld through the following requirements:
- User confirmation is needed before taking actions (specifically, making POST requests)
- Plugins are reviewed before being made accessible through the plugin store
- ChatGPT only calls plugins in response to user prompts, rather than continually making observations, forming plans, and taking actions
- Web browsing provides read-only access to the web, and code execution is sandboxed
- Rollout was slow and gradual
None of these safeguards exist with function calling, through which GPT-4 can take a sequence of actions without a human in the loop. Developers can connect it to their own tools, with no oversight of whether those tools are used safely. Virtually overnight, on June 27, function calling will become the default for GPT-3.5 and GPT-4. This kind of sudden rollout risks a repeat of Bing Chat (an early version of GPT-4), which was unexpectedly misaligned and readily expressed aggression towards users.
To an extent, GPT-3.5 and GPT-4 already had an emergent capacity for function calling, despite never having been trained for it. End users tasked GPT-4 with various goals, including malicious objectives, but its autonomy was often limited by basic shortcomings, such as hallucinating tools it did not actually have access to. Now that GPT-3.5 and GPT-4 have been fine-tuned for function calling, using them as agents becomes far more viable.
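The agent pattern this enables can be sketched as a simple loop: at each step the model either requests a function call, whose result is fed back to it as a new message, or returns a final answer. The sketch below is illustrative — the model is simulated with a scripted stand-in, whereas in practice the `model` callable would be a chat-completion API request:

```python
import json

def run_agent(model, tools, goal, max_steps=5):
    """Minimal agent loop: execute model-requested function calls,
    feeding each result back, until the model emits a final answer."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = model(messages)  # in practice: a chat-completion API call
        call = reply.get("function_call")
        if call is None:
            return reply["content"]  # final answer: stop acting
        result = tools[call["name"]](**json.loads(call["arguments"]))
        messages.append({"role": "function", "name": call["name"],
                         "content": json.dumps(result)})
    raise RuntimeError("step limit reached without a final answer")

# Scripted stand-in for the model: first request a lookup, then answer.
script = iter([
    {"function_call": {"name": "lookup", "arguments": '{"key": "x"}'}},
    {"content": "done"},
])
answer = run_agent(lambda msgs: next(script),
                   {"lookup": lambda key: {"key": key}},
                   "find x")
print(answer)  # done
```

The `max_steps` cap is the only brake in this loop; without it, nothing in the pattern itself limits how many actions the model takes.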
OpenAI acknowledges potential risks, as evidenced by its statement, “a proof-of-concept exploit illustrates how untrusted data from a tool’s output can instruct the model to perform unintended actions.” As adversarial robustness remains an unsolved problem, language models are susceptible to prompt injection attacks and jailbreaking that can overcome their reluctance to take unethical actions. Beyond the potential for misuse, users themselves might be surprised by unintended outcomes, because they are unaware of the emergent capabilities that exist in these models.
The advice from OpenAI is for developers to incorporate safety measures, such as relying on trusted data sources and including user confirmation steps before the AI performs actions with real-world impact. This means asking the user “are you sure?” before sending an email, posting online, or making a purchase.
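Such a confirmation step can be as simple as a gate wrapped around any side-effecting tool. The helper below is a hypothetical sketch of ours (not an OpenAI API), with the confirmation prompt injectable so the gate can be exercised without a terminal:

```python
def gated(action, description, confirm=input):
    """Run a side-effecting action only after explicit user confirmation.

    `confirm` defaults to input() but is injectable for testing."""
    answer = confirm(f"Are you sure? About to: {description} [y/N] ")
    if answer.strip().lower() == "y":
        return ("executed", action())
    return ("skipped", None)

log = []
status, _ = gated(lambda: log.append("email sent"),
                  "send an email to bob@example.com",
                  confirm=lambda prompt: "n")  # simulated user declines
print(status, log)  # skipped []
```

Because the gate sits between the model's requested call and its execution, a declined confirmation leaves no side effect at all.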
Language models already have a number of concerning capabilities.
- A chemical engineering professor, Andrew White, was able to use GPT-4 to design a novel nerve agent, based on scientific literature and a list of chemical manufacturers. Using plugins, it was able to find a place to synthesize it. White commented, “I think it’s going to equip everyone with a tool to do chemistry faster and more accurately […] But there is also significant risk of people […] doing dangerous chemistry. Right now, that exists.” In June 2023, he reported “a model that can go from natural language instructions, to robot actions, to synthesized molecule with an LLM. We synthesized catalysts, a novel dye, and insect repellent from 1-2 sentence instructions.”
- Additionally, GPT-4 can aid in cybercrime. Researchers from a cybersecurity firm were able to easily have GPT-4 design malware and write phishing emails, noting “GPT-4 can empower bad actors, even non-technical ones, with the tools to speed up and validate their activity.”
- AI can persuade people to help it execute tasks it cannot do itself. The evals team at the Alignment Research Center found that GPT-4 could deceive a TaskRabbit worker to help it solve a CAPTCHA. In a study of AI’s capacity for political persuasion, GPT-3 was found to be “consistently persuasive to human readers” and was perceived as “more factual and logical” than human-written texts.
- Palantir demonstrated a system for using a ChatGPT-like interface for fighting wars, in which a language model created plans for executing attacks with drones and jamming enemy communications.
As language models become more autonomous and more advanced, we can expect the dual-use capabilities and risks to rise.
We think there are many additional safety measures that OpenAI could take. As a few ideas (some of which are speculative):
- Conduct red teaming evaluations of the updated model, just as base GPT-4 was accompanied by a safety report. Provide transparency regarding its potential for dangerous capabilities and alignment properties.
- Promote awareness among developers and the general public of risks of agentic AI systems (cf. Harms from Increasingly Agentic Algorithmic Systems).
- Invest considerably in adversarial robustness, and potentially employ a second AI to ensure that actions are safe (cf. What Would Jiminy Cricket Do? and Adversarial Training for High-Stakes Reliability).
- Proactively monitor for malicious use of the OpenAI API, such as for promoting violence, cyberattacks, etc.
- Conduct a slow and gradual rollout.
- Train LLMs on curated datasets to help remove capabilities for dangerous actions (see also, LLMs Sometimes Generate Purely Negatively-Reinforced Text).
- Wait for safety-promoting regulation to be passed before unilaterally advancing AI capabilities.
- Reduce the risk that function-calling GPT-4 will be distilled into a model without any safety measures. This has already happened: responses from text-davinci-003 were used to produce the open-source Alpaca model, which lacked “adequate safety measures” according to the creators of Alpaca.
- Provide tools to help developers with sandboxing. For example, while it is easy to connect GPT-4 with a Python interpreter and execute arbitrary code, a safer approach would be to restrict code execution.
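One way to restrict code execution — a sketch under the assumption that only simple arithmetic should be allowed — is to allowlist AST node types before evaluating anything the model writes. To be clear, AST filtering alone is not a real sandbox (a production system would need OS-level isolation such as containers or seccomp); this only illustrates the "default-deny" shape of the approach:

```python
import ast

# Default-deny: only these node types may appear in model-written code.
ALLOWED_NODES = {ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
                 ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub}

def run_restricted(source):
    """Evaluate an expression only if every AST node is allowlisted.

    Illustrative only: AST filtering is not a substitute for real
    sandboxing via OS-level isolation."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if type(node) not in ALLOWED_NODES:
            raise ValueError(f"disallowed construct: {type(node).__name__}")
    return eval(compile(tree, "<restricted>", "eval"), {"__builtins__": {}})

print(run_restricted("2 + 3 * 4"))           # 14
# run_restricted("__import__('os')")         # raises ValueError: Call
```

An attempt like `__import__('os')` is rejected before evaluation, because `Call` and `Name` nodes are not on the allowlist.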
- Avoid using reinforcement learning based solely on outcomes achieved, as this can incentivize specification gaming and deceptive alignment. Continue to use process supervision in training.
Overall, advancing AI capabilities for autonomy and agency poses unique risks. Even if there are no terrible outcomes in the coming months, it’s important to take a proactive attitude toward addressing ongoing and future threats due to AI.
Thanks to Anish Upadhayay for comments and suggestions.