AI algorithms are based on optimization. A measurable goal is set, and the system updates to achieve that goal. But there is a problem. Our artificial intelligences will OVER optimize… to hell and back. The goal is EVERYTHING to the system. Anything to move that needle.
It may be challenging for us humans to comprehend this, since we have a diverse set of goals, all competing for priority. We are essentially a balanced ecosystem of objectives. Even when we become obsessed with something, such as a drug, other biological needs interfere. A drug addict can't spend all day and night intelligently scheming to get their next fix. Even drug addicts have to pee, eat, sleep, etc.
Currently, over-optimization is not much of a problem… yet. Sure, our LLMs generate hate speech, because "predict the next word" says so. Sure, an image generator will strive to create 'diverse' images because Google embedded that instruction into the prompt. But come on, no one is dying here. What's the big deal?
Well, our current AI models are not agents. They do not act in the real world towards a narrow goal. Instead, our current LLMs and image generators are special in that they optimize towards broad goals within relatively safe domains.
LLMs have what can be described as a Domain Modeling goal: the AI system optimizes to accurately model some complex domain. This is generally done by corrupting the input and then having the AI predict the uncorrupted version. For example, in Large Language Models we hide some of the text, and the AI has to predict the hidden text.
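To make that concrete, here is a minimal sketch of what next-token-prediction training pairs look like. The whitespace "tokenizer" and the toy sentence are purely illustrative assumptions; real LLMs use learned subword tokenizers and neural networks, but the "hide the next piece, predict it" shape is the same.

```python
# Minimal sketch of the "corrupt the input, predict the original" objective,
# here in its next-token-prediction form. The whitespace split and the toy
# sentence are illustrative stand-ins, not how production LLMs tokenize.

text = "the cat sat on the mat"
tokens = text.split()  # stand-in for a real subword tokenizer

# Each training example: the visible context -> the hidden next token.
training_pairs = [
    (tokens[:i], tokens[i])
    for i in range(1, len(tokens))
]

for context, target in training_pairs:
    print(f"context: {' '.join(context):<20} -> predict: {target}")
```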
This kind of objective generally results in an extremely high dimensional reward space. For instance, imagine the reward space for "predict the next word". It's ridiculous. There are many (many) possible text prompts, and each prompt has its own set of good responses. The language manifold of expected responses is immense. The "goal", in this case, is high dimensional.
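As a rough back-of-the-envelope illustration of just how big that space is, consider the snippet below. The vocabulary size and prompt length are made-up round numbers, not the specs of any particular model.

```python
# Rough size of the "predict the next word" input space.
# Both numbers are illustrative assumptions.

vocab_size = 50_000     # order of magnitude of a typical subword vocabulary
prompt_length = 100     # a fairly short prompt, measured in tokens

possible_prompts = vocab_size ** prompt_length
print(f"roughly 10^{len(str(possible_prompts)) - 1} distinct prompts of length {prompt_length}")
```

And each one of those prompts has its own set of acceptable continuations.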
Over-optimization in this case is relatively harmless. Here, over-optimizing means "over" predicting the next word, which in practice means overfitting to the training data and becoming less capable of generalizing to new prompts.
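A toy example of the same failure mode outside of language: fit a curve too aggressively and it nails the training points while typically getting worse on points it has never seen. This is a deliberately simple polynomial demo, not an LLM, but the over-optimization pattern is the same.

```python
import numpy as np

# Toy overfitting demo: a high-degree polynomial "over-optimizes" its fit to
# ten noisy training points and typically generalizes worse to held-out points.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```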
Let's compare this with a hypothetical agent that operates in the real world. Suppose this agent's goal is to make money for you. Not very imaginative, but let's be honest… this is likely the first thing people will turn an AI agent towards.
Now our setup looks different. The reward space is as simple as can be: a single number, presumably the amount of money in your bank account. Now all the AI has to do is make the number go up!
This is an Affect Change goal. The AI actively tries to influence its environment to achieve a specific objective. It models the world only to achieve this objective. This type of goal is typically accomplished through reinforcement learning.
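To make the contrast concrete, here is a toy reinforcement-learning sketch in the spirit of "make the number go up". The three candidate strategies and their payoffs are invented for illustration; the point is that the only training signal the agent ever sees is the change in a single scalar.

```python
import random

# Toy sketch of the "make the number go up" objective, framed as a bandit
# problem. Strategies and payoffs are hypothetical; the only feedback is
# how much a single scalar (the balance) changed.

random.seed(0)
payoffs = {"honest_work": 1.0, "spam_everyone": 3.0, "do_nothing": 0.0}  # hypothetical
estimates = {action: 0.0 for action in payoffs}
counts = {action: 0 for action in payoffs}
balance = 0.0

for step in range(1000):
    # Epsilon-greedy: mostly exploit whichever action currently looks best.
    if random.random() < 0.1:
        action = random.choice(list(payoffs))
    else:
        action = max(estimates, key=estimates.get)

    gain = random.gauss(payoffs[action], 1.0)   # noisy reward: change in balance
    balance += gain
    counts[action] += 1
    estimates[action] += (gain - estimates[action]) / counts[action]

print(f"final balance: {balance:.0f}")
print("learned preferences:", {a: round(v, 2) for a, v in estimates.items()})
# The agent converges on whichever action moves the balance most,
# with no representation of what that action actually does in the world.
```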
Now the AI agent still has an absurd number of possible inputs and outputs, but the mapping is warped towards the single goal. Every input, from image to text, from hate speech to love letter, gets answered with whatever output is expected to advance that one end goal.
Some agents will be better than others at mapping inputs to outputs. Some agents will have terrible mental models of reality and flail around like wet fish. However, we are designing Artificial Intelligence to be…ah… intelligent. Eventually some of these agents are going to get fairly smart. They are going to develop good mental models. They are going to come up with effective plans. They are going to respond to all inputs with outputs designed to improve the chances that money goes up.
What does over optimization mean here?
Well, this is now an agent in the real world. An agent willing to do anything for more money. And I mean anything. Money is the ONLY goal this agent cares about. Over-optimization in this case does not mean overfitting a model to a domain. Over-optimization in this case means chasing the goal with zero regard for the consequences.
Let’s just take a step back and remember all the terrible things humans have done for money. We have murdered, we have started wars, we have made thousands of species go extinct. If these agents truly optimize for this goal and this goal only, we will have problems. At minimum, we will have a bazillion more spam messages.
Internal vs. external goals
Just to make the situation a bit more complicated: there is a big difference between internal goals and external goals.
The external goal is the measured goal we set up in order to train the AI network.
The internal goal is formed within the learned structure of the AI network. It is the network’s own understanding of its goal.
For simple networks, internal goals probably don't exist. More complex networks, however, might develop them, because an internal goal lets the network simulate future events and predict potential rewards, which in turn helps it make better decisions. That is useful for pretty much any goal.
Internal goals are problematic. You may set up a training environment that says: make 1 million dollars and then stop. You may train the agent within that environment one hundred times or one million times. But you will never know for certain if the agent got the message. We don’t have access to its internal goal. To gain access we would need big breakthroughs in the field of interpretability.
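To see why this is worrying, note that the only thing we actually get to write down is the external goal. Here is a hedged sketch, with one made-up way of encoding "make 1 million dollars and then stop"; everything else the agent learns stays hidden inside its parameters.

```python
import numpy as np

# The external goal is explicit and inspectable: we wrote it ourselves.
TARGET = 1_000_000  # "make 1 million dollars and then stop"

def external_reward(balance: float) -> float:
    # One possible way to encode the measured objective: reward saturates
    # at the target, so there is nothing to gain past 1 million.
    return min(balance, TARGET)

# The internal goal, if the network forms one, lives somewhere in here.
# Without interpretability breakthroughs, an opaque block of numbers is
# all we get to see of it.
learned_parameters = np.random.default_rng(0).normal(size=(4, 4))  # stand-in weights

print(external_reward(1_250_000))   # 1000000 -- the part we understand
print(learned_parameters)           # the part we don't
```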
How to fix over-optimization
Let's not doom and gloom. There are likely solutions to this situation, and many smart people are working on it right now. But eh… let's not downplay it either: this is a difficult problem. We also want to solve it quickly, so that all, or at least most, upcoming agents can incorporate the solution into their infrastructure.
There are a number of ways we might fix over-optimization. Many people will point to methods such as RLHF, Reinforcement Learning from Human Feedback, where we extend the external, measurable goal with a feedback term, so that if the AI does something wrong, we humans can provide feedback and say: nope, not good, don't do that again. However, RLHF only provides feedback AFTER something has gone wrong. In real life scenarios, that is generally too late.
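In code, the RLHF idea boils down to adding a human-derived term to the reward. Everything below is an illustrative sketch: the function names, the rating scale, and the weighting are assumptions, not any particular RLHF implementation.

```python
# Sketch of extending a measurable goal with a human feedback term.
# Names, scales, and weights are illustrative assumptions.

def task_reward(outcome: dict) -> float:
    # The original narrow objective, e.g. money made.
    return outcome["profit"]

def human_feedback(outcome: dict) -> float:
    # A score derived from human ratings of the behaviour, say in [-1, 1].
    return outcome["human_rating"]

def combined_reward(outcome: dict, feedback_weight: float = 10.0) -> float:
    return task_reward(outcome) + feedback_weight * human_feedback(outcome)

# A profitable but disapproved action now scores badly...
print(combined_reward({"profit": 5.0, "human_rating": -1.0}))  # -5.0
# ...but the rating only exists because a human already watched it go wrong.
print(combined_reward({"profit": 5.0, "human_rating": 0.0}))   # 5.0
```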
My guess is that the best fix requires interpretability software that allows us to "see" into the network's representations and manipulate them. In that world we could alter the internal goal directly. With access to the internal goal, we can build in checks and balances. You could build abstract rules into the goal system, for example: do unto others as you would have them do unto you.
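One concrete flavour of this kind of intervention is activation steering: read an internal representation and nudge it along a direction associated with a concept. The sketch below is a toy, with random vectors standing in for a real network's activations and for a concept direction that would have to be discovered by actual interpretability work.

```python
import numpy as np

# Toy sketch of an interpretability-style intervention: read an internal
# activation, then push it along a hypothetical "concept direction".
# The vectors here are random stand-ins, not real model activations.

rng = np.random.default_rng(0)
hidden_state = rng.normal(size=16)         # activation at some internal layer
concept_direction = rng.normal(size=16)    # hypothetical learned concept direction
concept_direction /= np.linalg.norm(concept_direction)

def steer(activation: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    # Add a scaled concept vector to the activation before it flows onward.
    return activation + strength * direction

steered = steer(hidden_state, concept_direction, strength=4.0)
print("projection before:", float(hidden_state @ concept_direction))
print("projection after: ", float(steered @ concept_direction))
```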