Artificial Intelligence is based on optimization. A measurable goal is set, and the system updates to better achieve that goal. But there is a problem. Our AIs will OVER optimize… to hell and back. The goal is EVERYTHING to the system. Anything to move that needle. Nothing else matters.
Take the famous hypothetical story of the paperclip maximizer: a system that a paperclip company tells to make paperclips. The system does so to the absurd extreme. The goal is its drug. The AI turns the entire world into paperclips.
Is such a possibility still a concern? Our current LLM-based AIs seem rather reasonable. They can express empathy and concern. They do not seem like mindless optimizers. Do we still need to worry about over optimization? Is the AI sniffing its drug under the table?
Minimizing (or maximizing) a number
AI is trained to minimize (or maximize) some number. We have gotten pretty clever with this. We have various tips and tricks which obfuscate this fact. But never forget that fundamentally we are ALWAYS minimizing or maximizing some number.
It may be challenging for us humans to comprehend this, since we have a diverse set of goals which are all competing for priority. We are essentially a balanced ecosystem of objectives. Even when we become obsessed with something, such as a drug, other biological needs interfere. A drug addict can’t spend all day and night intelligently pursuing their next fix. Even drug addicts have to pee, eat, sleep, etc.
Some examples:
Large Language Models predict the next word: These models aim to maximize the probability of the correct next word given the previous words in a sequence. They are trained to minimize the negative log-likelihood loss, which measures the dissimilarity between the predicted and actual word distributions (a toy sketch of this loss follows the examples).
Generative Adversarial Networks consist of a generator and a discriminator network. The generator aims to minimize the difference between its generated samples and real data, while the discriminator tries to maximize its ability to distinguish between real and generated samples. The networks are trained in an adversarial manner, optimizing opposing objective functions.
In Reinforcement Learning, an agent learns to make decisions in an environment to maximize a cumulative reward signal. The agent's objective is to find a policy (a mapping from states to actions) that maximizes the expected sum of rewards over time, often using algorithms like Q-learning or policy gradients.
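Here is the promised toy sketch of the first example, the next-word negative log-likelihood. The words and probabilities are made up, but the "one number to minimize" structure is the real point:

```python
import math

# Toy illustration: next-word prediction reduces to minimizing one number,
# the negative log-likelihood of the observed next word.
# (Hypothetical tiny "model"; real LLMs do the same thing at enormous scale.)

def predict_next(context):
    """Return a probability distribution over a toy vocabulary (made-up scores)."""
    scores = {"the": 0.1, "cat": 0.2, "sat": 3.0, "mat": 0.5}  # unnormalized logits
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

def nll(context, actual_next_word):
    """The single number training tries to minimize."""
    probs = predict_next(context)
    return -math.log(probs[actual_next_word])

print(nll(["the", "cat"], "sat"))  # lower is better; training nudges weights to shrink this
```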
Different Types of Goals → different outcomes
Different goals and setups lead to very different types of AI.
Over optimization is all about the goal. Some goals are relatively safe. Our current language models, for example, simply predict the next word. This can be described as a Domain modeling goal: the aim is to model the input domain. This is generally done by corrupting the input and then having the AI predict the non-corrupted version. For example, in Large Language Models we hide some of the text, and the AI needs to predict the hidden text. To accomplish this task the AI system needs to model the language domain.
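To make the corrupt-and-predict idea concrete, here is a toy sketch (pure illustration; real systems mask subword tokens and train a large network to fill them in):

```python
import random

# Minimal sketch of a "model the input domain" objective: corrupt the input
# by hiding a token, then ask the model to predict what was hidden.

def corrupt(tokens, mask_token="<MASK>"):
    i = random.randrange(len(tokens))
    corrupted = tokens.copy()
    corrupted[i] = mask_token
    return corrupted, i, tokens[i]   # corrupted input, position, hidden target

sentence = "the cat sat on the mat".split()
corrupted, position, target = corrupt(sentence)
print(corrupted)  # e.g. ['the', 'cat', '<MASK>', 'on', 'the', 'mat']
print(target)     # the model is trained so its prediction at `position` matches this
```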
This type of goal is good at creating a world model. After all, the goal is to “model the input”, and the input comes from… the world. To be clear, current LLMs certainly do not have as good a world model as we humans do. The point is that predicting the next word has produced a certain type of world model.
This generally results in extremely high-dimensional reward spaces. For instance, imagine the reward space for predicting the next word. It’s ridiculous. There are many (many) possible text prompts, and each text prompt has a unique set of good responses. The language manifold of expected responses is immense. The “goal”, in this case, is high dimensional.
Over optimization in this case is relatively harmless. Here, over optimization means “over” predicting the next word. This generally means over-fitting to the input data and being less capable of generalizing to new prompts.
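In code, this flavor of over optimization is the familiar overfitting picture: training loss keeps falling while held-out loss turns back up. A toy illustration with made-up numbers:

```python
# Sketch of "over" predicting the next word: the model keeps improving on the
# training text while getting worse at generalizing to new prompts.
# The loss values below are invented for illustration.

train_loss = [3.1, 2.4, 2.0, 1.6, 1.2, 0.9, 0.6]
val_loss   = [3.2, 2.6, 2.3, 2.2, 2.3, 2.6, 3.0]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
print(f"stop at epoch {best_epoch}: beyond this point the model is over-fitting")
```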
Let’s compare this with a hypothetical agent that operates in the real world. Suppose this agent’s goal is to make money for you. Not very imaginative, but let’s be honest… this is likely the first thing people will turn an AI agent towards.
Now our setup looks different. Our reward space is as simple as can be. A single number, presumably the amount of money you have in your bank account. Now all the AI has to do is make number go up!
This is an Affect Change goal. The AI actively tries to influence its environment to achieve a specific objective. It models the world only to achieve this objective. This type of goal is typically accomplished through reinforcement learning.
Now the AI agent still has an absurd number of possible inputs and outputs, but the mapping is warped towards the single goal. Every input, from image to text, from hate speech to love letter, is responded to with output designed to improve this end goal.
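As a hedged sketch of that setup (toy stand-ins, not a real trading system), the entire training signal collapses to one scalar:

```python
import random

# Sketch of the "make number go up" setup: whatever the input, the only
# training signal is the change in a single scalar (the bank balance).
# `policy` is a hypothetical placeholder, not a real agent.

def policy(observation):
    """Maps any observation (text, image, hate speech, love letter...) to an action."""
    return random.choice(["buy", "sell", "send email", "post spam"])

balance = 1000.0
for t in range(5):
    obs = f"whatever arrives at step {t}"
    action = policy(obs)
    new_balance = balance + random.uniform(-10.0, 10.0)  # toy stand-in for the world
    reward = new_balance - balance                       # the ONLY optimized quantity
    balance = new_balance
    # a real RL agent would now update `policy` so future `reward` is larger;
    # nothing else appears in the objective
    print(t, action, round(reward, 2))
```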
This setup does not create a great world model. It is, after all, not trying to model the world. The end result of the agent’s actions is not directly sensitive to incremental improvements in the world model.
Some agents will be better than others at mapping inputs to outputs. Some agents will have terrible mental models of reality and flail around like wet fish. However, we are designing Artificial Intelligence to be…ah… intelligent. Eventually some of these agents are going to get fairly smart. They are going to develop good mental models. They are going to come up with effective plans. They are going to respond to all inputs with outputs designed to improve the chances that money goes up.
Because the world is, like… really complicated, no one has ever gotten an AI to effectively optimize towards a singular real-world goal using reinforcement learning. At least not an abstract goal, like making money.
Combining the world model with reinforcement learning.
But let’s now imagine that our money pursuing AI finally achieves an effective world model. Is that going to be a problem? Will it over-optimize?
Currently the likely answer is yes. Such an AI would likely respond to all situations in order to make money go up. We do not really have a control system that can effectively tell it what it can and cannot do. There are certainly ideas of how to control it, but personally I am not convinced by any current system.
This would be… bad.
What does over optimization mean here?
Well, this is now an agent in the real world. An agent willing to do anything for more money. And I mean anything. Money is the ONLY goal this agent cares about. Over optimization in this case does not mean over fitting a model to a domain. Over optimization in this case means chasing the goal with zero thought to the consequences.
Let’s just take a step back and remember all the terrible things humans have done for money. We have murdered, we have started wars, we have made thousands of species go extinct. If these agents truly optimize for this goal and this goal only, we will have problems. At minimum, we will have a bazillion more spam messages.
But perhaps you are thinking… well, this would suck for everyone else, but if it’s my bank account then boo-hoo for them. I’ll be rich! But be careful what you wish for. This AI is optimizing for number go up… it may not be happy when you make number go down.
Are we close to combining reinforcement learning with an effective world model?
Congratulations, you are at the forefront of AI research. Everyone and their grandma are trying to utilize the world model created by LLMs to create agents that can actually do things. Fun, fun, fun.
There are many possible avenues people are pursuing, but I tend to break them down into three categories.
Reinforcement learning utilizing an established world model
This seems to me the most obvious approach. Here we imagine taking the LLM and simply training it via reinforcement learning towards some real-life goal. RLHF uses such an approach. RLHF takes human feedback and adjusts the LLM’s output based on such feedback.
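A minimal sketch of that kind of setup, with made-up helper names (this shows the shape of the method, not any particular lab’s implementation):

```python
# Toy RLHF-flavored loop: a reward model trained on human preferences scores
# candidate responses, and the language model is nudged toward higher-scoring
# ones. `reward_model` and `generate_candidates` are hypothetical stand-ins.

def reward_model(prompt, response):
    """Stand-in: higher score = humans tended to prefer responses like this."""
    return -abs(len(response.split()) - 20)   # toy rule: prefer ~20-word answers

def generate_candidates(prompt):
    """Stand-in for sampling several responses from the LLM."""
    return ["short answer",
            "a medium length answer " * 4,
            "a very long answer " * 20]

prompt = "Explain over-optimization."
candidates = generate_candidates(prompt)
scores = [reward_model(prompt, r) for r in candidates]
best = candidates[scores.index(max(scores))]
# In actual RLHF the policy's weights are updated (e.g. with PPO) so that
# high-reward responses become more likely; picking the best sample here is
# just the simplest way to show the reward signal steering the output.
print(best[:40], max(scores))
```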
So what’s the problem? Well, two problems. First, this method does not seem to translate well to any task that is visual. This makes sense, since the LLM world model is based on language associations, not vision (1 point to Yann LeCun). Second, the LLM world model has a bunch of failure edge cases. If you use a chatbot, I am sure you have experienced one (or many) of these. If you are attempting to train an LLM to trade stocks, or create websites, or troll on Twitter, you will run into these edge cases, and they will be difficult to deal with.
A world model being further refined by reinforcement learning
Ok so what if we go the other way around? What if we use reinforcement learning to improve the world model?
OpenAI’s “hush-hush don’t talk about it” Q* algorithm supposedly does this. From what I can gather, it uses reinforcement learning to perform a tree search over possible responses, to find the one which best satisfies some condition. But shh… we shouldn’t be talking about this.
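Since that description is second-hand, the following is only a generic sketch of what “tree search over possible responses” could look like; every helper name is hypothetical:

```python
# Generic sketch of search over candidate continuations: expand branches step
# by step and keep the ones a value function scores highest. This is NOT a
# description of Q*; it only illustrates the general technique named above.

def expand(partial_response):
    """Hypothetical stand-in for the LLM proposing possible next steps."""
    return [partial_response + " step_a", partial_response + " step_b"]

def value(partial_response):
    """Hypothetical learned value function: how promising does this branch look?"""
    return partial_response.count("step_a")   # toy scoring rule

def tree_search(root, depth=3, beam=2):
    frontier = [root]
    for _ in range(depth):
        children = [c for node in frontier for c in expand(node)]
        frontier = sorted(children, key=value, reverse=True)[:beam]  # keep best branches
    return max(frontier, key=value)

print(tree_search("solve:"))
```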
LLM generated commands
The other option that many people are attempting is to use the LLM to generate “orders” that another system follows. For example:
Hey LLM, imagine you are really smart and understand the world, physics, and math and all that jazz. Now output in bullet point format the exact sequence of minute steps that would lead to you making me a pizza.
Then a different system or set of systems would follow the steps generated by the large language model.
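A sketch of this division of labor, with hypothetical placeholders for the LLM call and the executor:

```python
# The "LLM generates orders, another system executes them" pattern.
# `ask_llm` and `execute` are hypothetical placeholders; the point is only the
# split: one system plans in text, a different system acts.

def ask_llm(prompt):
    """Hypothetical call to a language model; returns bullet-point steps."""
    return ["- preheat the oven to 250 C",
            "- roll out the dough",
            "- add tomato sauce and cheese",
            "- bake for 10 minutes"]

def execute(step):
    """Hypothetical controller/robot/API that carries out one textual step."""
    print("executing:", step.lstrip("- "))

plan = ask_llm("Output the exact sequence of minute steps for making a pizza.")
for step in plan:
    execute(step)
```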
Internal vs. external goals
Just to make the situation a bit more complicated, there is a big difference between internal goals and external goals. The external goal is the measured goal we set up in order to train the AI network. The internal goal is formed within the learned structure of the AI network. It is the network’s own understanding of its goal.
If someone knows the in-literature names for these concepts, please edit this post and include them. I have not come across them in my readings - which is honestly crazy - I feel like I should have stumbled upon formal names by now.
Given these different types of systems - What does it mean for them to over optimize?
Out of all of these different setups, one thing is always true: we are always still maximizing or minimizing a number.
For simple networks, internal goals probably don't exist. However, more complex networks might develop them. This is likely because having an internal goal enables the simulation of future events and prediction of potential rewards, which, in turn, aids in making better decisions. This is generally beneficial for any goal.
However, as we have seen, “Modeling the input domain” style goals do have a degree of stability to them. They simply model the input domain. As a general rule, the closer the system comes to marrying an “effective world model” to “reinforcement learning”, the more dangerous it is.
Internal goals are problematic. You may set up a training environment that says: make 1 million dollars and then stop. You may train the agent within that environment one hundred times or one million times. But you will never know for certain if the agent got the message. We don’t have access to its internal goal. To gain access we would need big breakthroughs in the field of interpretability.
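To make that concern concrete, here is a sketch of the only thing we actually get to write down, the external reward; nothing in it exposes what goal forms inside the network:

```python
# The training environment below rewards the agent and ends the episode once
# one million dollars is reached. But this is only the EXTERNAL goal; whether
# the trained network internalizes "stop at a million" or just "more money is
# better" cannot be read off this code.

TARGET = 1_000_000

def external_reward(balance, new_balance):
    return new_balance - balance          # the measured, external objective

def episode_done(balance):
    return balance >= TARGET              # "...and then stop" exists only out here

# We can run this environment a hundred or a million times; inspecting the
# agent's internal goal would require interpretability breakthroughs.
```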
How to fix over-optimization
Let’s not doom and gloom. There are likely solutions to this situation and many smart people are working on it right now. But eh… also let's not downplay it, this is a difficult problem. We also want to solve this problem quickly, so that all, or at least most, upcoming agents can incorporate the solution into their infrastructure.
There are a number of ways we might fix over-optimization. Many people will point to methods such as RLHF - Reinforcement Learning from Human Feedback. This is where we extend our external measurable goal with a feedback term, so that if the AI does something wrong, we humans can provide feedback and say: nope, not good, don’t do that again. However, Reinforcement Learning from Human Feedback only provides feedback AFTER something has gone wrong. In real-life scenarios, that is generally too late.
My guess is that the best fix requires interpretability software that allows us to “see” into the network’s representations and manipulate them. In this world we could alter the internal goal directly. With access to the internal goal, we could build in checks and balances. You could build abstract rules into the goal system. For example: do unto others as you would have them do unto you.
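Purely as speculation about what such tooling might look like (every function here is hypothetical; nothing like this reliably exists today):

```python
# Speculative sketch of an interpretability hook: read a network's internal
# representation, check it against a rule, and nudge it before the forward
# pass continues. All names and the "goal_direction" vector are made up.

def read_internal_goal(activations):
    """Hypothetical probe that decodes a goal-like vector from hidden activations."""
    return activations.get("goal_direction", [0.0, 0.0])

def violates_rule(goal_vector, forbidden_direction):
    """Toy check: is the internal goal pointing toward something we disallowed?"""
    return sum(g * f for g, f in zip(goal_vector, forbidden_direction)) > 0.9

def steer(activations, correction):
    """Hypothetical intervention that edits the representation in place."""
    activations["goal_direction"] = correction
    return activations

activations = {"goal_direction": [0.95, 0.1]}      # made-up hidden state
if violates_rule(read_internal_goal(activations), forbidden_direction=[1.0, 0.0]):
    activations = steer(activations, correction=[0.0, 1.0])
print(activations)
```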