Artificial Intelligence is based on optimization. A measurable goal is set, and the system updates to better achieve that goal. This ALWAYS involves maximizing or minimizing a number.
But what if the system over-optimizes? Like the famous hypothetical story of the paperclip maximizer: a system that a paperclip company tells to make paperclips. The system does so to an absurd extreme. It over-optimizes… to hell and back. The goal becomes EVERYTHING to the system. The goal is its drug. The AI turns the entire world into paperclips.
Is such a possibility still a concern? Our current LLM-based AIs seem rather reasonable. They can express empathy and concern. They do not seem like mindless optimizers. Do we still need to worry about over-optimization? Is the AI sniffing its drug under the table?
Minimizing (or maximizing) a number
AI is trained to minimize (or maximize) some number. We have gotten pretty clever at this, with various tips and tricks that obfuscate the fundamentals. But never forget that fundamentally we are ALWAYS minimizing or maximizing some number. Some examples:
Large Language Models predict the next word
These models aim to maximize the probability of the correct next word given the previous words in a sequence. They are trained to minimize the negative log-likelihood loss, which measures the dissimilarity between the predicted and actual word distributions.
On top of this, modern LLMs are also trained with RLHF (Reinforcement Learning from Human Feedback). RLHF trains the LLM to produce content that human evaluators deem more appropriate.
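To make this concrete, here is a toy sketch of that next-word objective in Python (the vocabulary and probabilities are made up; a real model has tens of thousands of tokens and billions of parameters):

```python
import numpy as np

# Toy sketch of the next-word objective. The model outputs a probability
# distribution over a tiny vocabulary; training minimizes the negative
# log-likelihood of the word that actually came next.
vocab = ["the", "cat", "sat", "on"]
predicted_probs = np.array([0.1, 0.2, 0.6, 0.1])  # model's guess for the next word
actual_next_word = "sat"

nll = -np.log(predicted_probs[vocab.index(actual_next_word)])
print(f"negative log-likelihood: {nll:.3f}")  # lower is better; 0 means total certainty
```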
Reinforcement Learning picks actions to maximize expected reward.
In Reinforcement Learning, an agent learns to make decisions in an environment to maximize a cumulative reward signal. The agent's objective is to find a policy (a mapping from states to actions) that maximizes the expected sum of rewards over time, often using algorithms like Q-learning or policy gradients.
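As a rough illustration, here is the classic tabular Q-learning update on a made-up toy problem (the states, actions, and numbers are purely illustrative):

```python
import numpy as np

# Tabular Q-learning on a toy problem with 3 states and 2 actions.
# The agent nudges Q(state, action) toward reward + gamma * best future value.
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate and discount factor (illustrative values)

def q_update(state, action, reward, next_state):
    best_future = np.max(Q[next_state])        # value of acting greedily afterwards
    td_target = reward + gamma * best_future   # what Q(state, action) "should" be
    Q[state, action] += alpha * (td_target - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q)
```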
Generative Adversarial Networks have competing numbers.
Generative Adversarial Networks consist of a generator network and a discriminator network. The generator aims to produce samples the discriminator cannot tell apart from real data, while the discriminator tries to maximize its ability to distinguish between real and generated samples. The networks are trained in an adversarial manner, optimizing opposing objective functions.
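For the curious, the two “competing numbers” of the original GAN formulation look roughly like this (the probabilities plugged in below are invented for illustration):

```python
import numpy as np

# The two competing losses in a GAN. d_real / d_fake are the discriminator's
# probabilities that a given sample is real.
def discriminator_loss(d_real, d_fake):
    # Discriminator wants d_real -> 1 and d_fake -> 0.
    return -np.log(d_real) - np.log(1.0 - d_fake)

def generator_loss(d_fake):
    # Generator wants the discriminator fooled, i.e. d_fake -> 1.
    return -np.log(d_fake)

print(discriminator_loss(d_real=0.9, d_fake=0.2))  # small when the discriminator is winning
print(generator_loss(d_fake=0.2))                  # large until the generator improves
```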
The key word here is future systems: the worry applies to what comes next, not to what we have now. Current models do not have a good enough world model, and they are not effective agents in a space as high-dimensional as real life.
Different Types of Goals → different outcomes
Different goals and setups lead to very different types of AI. Essentially, we currently have two main types of goals:
Modelling the input domain - creating a world model
Effect Change - Doing something within the world.
Modelling the input domain - LLM, GAN, VAE, etc.
Some goals and setups focus around “ modelling the input domain ”. This is generally done via corrupting the input and then having the AI predict the non-corrupted version. For example, in Large Language Models we hide some of the text, and then the AI needs to predict the hidden text. To accomplish this task the AI system needs to model the language domain. This optimization problem is exceptionally high-dimensional. Consider the vast number of possible text prompts, each with its own set of suitable responses.
This type of goal is good at creating a world model. After all, the goal is to “model the input”, which comes from… the world. To be clear, current LLMs certainly do not have as good a world model as we humans do (by a large margin). The point is that predicting the next word has produced a type of world model.
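A toy illustration of the “corrupt the input, predict the original” recipe (the sentence and the masking scheme are simplified; real LLMs typically predict the next token over huge corpora):

```python
import random

# Hide one word of the input; the training target is the word that was hidden.
sentence = "the cat sat on the mat".split()
hidden_position = random.randrange(len(sentence))

corrupted_input = sentence.copy()
target_word = corrupted_input[hidden_position]
corrupted_input[hidden_position] = "[MASK]"

print("model sees:        ", " ".join(corrupted_input))
print("model must predict:", target_word)
```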
Effect Change Goal - RL
We can also consider an Effect Change goal. The AI actively tries to influence its environment to achieve a specific objective. This type of goal is typically pursued through reinforcement learning.
For instance, let’s imagine an AI agent whose goal is to make money for you. Not very imaginative, but let's be honest… this is likely the first thing people will turn an AI agent towards. Now our setup looks different. Our reward space is as simple as can be: a single number, presumably the amount of money in your bank account. Now all the AI has to do is make number go up!
The AI agent still has an absurd number of possible inputs and outputs, but the mapping is warped towards the single goal. Every input, from image to text, from hate speech to love letter, is met with output designed to advance that one end goal.
This setup does not create a great world model. It is, after all, not trying to model the world. The end result of the agent's actions is not directly sensitive to incremental improvements in the world model.
Because the world is like… really complicated, no one has ever gotten an AI to effectively optimize towards a singular real-world goal using reinforcement learning. At least not an abstract goal, like making money.
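To make the point concrete, here is what the reward signal for such a hypothetical money-making agent might boil down to (everything here is invented; the point is only that the entire goal collapses into the change of one number):

```python
# Hypothetical reward signal for the "make money for you" agent described above.
def reward(balance_before: float, balance_after: float) -> float:
    return balance_after - balance_before  # make number go up

# Any action whatsoever is judged purely by this delta.
print(reward(balance_before=1_000.0, balance_after=1_050.0))  # +50 -> "good" action
print(reward(balance_before=1_000.0, balance_after=990.0))    # -10 -> "bad" action
```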
We are still figuring out how to combine the two goals in an effective manner.
When we do combine agents with good world models, we had better have some guardrails.
Combining the world model with reinforcement learning.
But let’s now imagine that our money-pursuing AI finally achieves an effective world model. Is that going to be a problem? Will it over-optimize?
Currently, the likely answer is yes. Such an AI would likely respond to every situation in whatever way makes the money go up. We do not really have a control system that can effectively tell it what it can and cannot do. There are certainly ideas for how to control it, but personally I am not convinced by any current system.
Consequences of over-optimization
This would be… bad. This is an agent in the real world. An agent willing to do anything for more money. And I mean anything. Money is the ONLY goal this agent cares about. Over-optimization in this case does not mean overfitting a model to a domain. Over-optimization in this case means chasing the goal with zero thought for the consequences.
Let’s just take a step back and remember all the terrible things humans have done for money. We have murdered, we have started wars, we have made thousands of species go extinct. If these agents truly optimize for this goal and this goal only, we will have problems. At minimum, we will have a bazillion more spam messages.
But perhaps you are thinking… well, this would suck for you! But if this is my bank account, then boo-hoo-hoo for you. I’ll be rich! But be careful what you wish for. This AI is optimizing number go up… it may not be happy when you make number go down.
Are we close to combining reinforcement learning with an effective world model?
Congratulations, you are at the forefront of AI research. Everyone and their grandma are trying to utilize the world model created by LLMs to create agents that can actually do things. Fun, fun, fun.
There are many possible avenues people are pursuing, but I tend to break them down into three categories.
Reinforcement learning utilizing an established world model.
This seems to me the most obvious approach. Here we imagine taking the LLM and simply training it via reinforcement learning towards some real-life goal. RLHF uses such an approach: it takes human feedback and adjusts the LLM’s output based on that feedback.
So what’s the problem? Well, two problems. First, this method does not seem to translate well to any task that is visual. This makes sense, since the LLM world model is based on language associations, not vision (1 point, Yann LeCun). Second, the LLM world model has a bunch of failure edge cases. If you use a chatbot, I am sure you have experienced one (or many) of these. If you are attempting to train an LLM to trade stocks, or create websites, or troll on Twitter, you will run into these edge cases, and they will be difficult to deal with.
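For reference, one common way the RLHF objective gets written is “reward-model score minus a KL penalty for drifting from the base model”. A sketch, with invented numbers and no claim to match any particular lab’s implementation:

```python
# Sketch of a common RLHF objective: maximize the reward model's score for the
# response while penalizing drift from the original base LLM (a KL-style penalty).
def rlhf_objective(reward_model_score, logprob_tuned, logprob_base, kl_coeff=0.1):
    kl_penalty = logprob_tuned - logprob_base   # per-sample KL estimate for the text
    return reward_model_score - kl_coeff * kl_penalty

# Illustrative numbers: the feedback likes the response, but it drifted from the base model.
print(rlhf_objective(reward_model_score=2.0, logprob_tuned=-10.0, logprob_base=-14.0))
```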
A world model being further refined by reinforcement learning.
Ok so what if we go the other way around? What if we use reinforcement learning to improve the world model?
OpenAI’s “hush-hush don’t talk about it” Q* algorithm supposedly does this. From what I can gather, it uses reinforcement learning to perform a tree search over possible responses to find the one that best satisfies some condition. But shh… we shouldn’t be talking about this.
NEVERMIND - this technique is now out in the open. In a flurry of activity, something like six papers dropped at the same time.
Over time, the tree-search technique is intended to build better and better world models. I intend to talk about this strategy in detail in an upcoming flip article.
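Since the actual details are not public, here is only the generic shape of “tree search over candidate responses, scored by a value function”. The candidate generator and scorer below are stand-ins, not anyone’s real system:

```python
# Generic sketch of tree search over candidate model responses.
def generate_candidates(prefix):
    # Stand-in: a real system would sample continuations from an LLM.
    return [prefix + step for step in [" A", " B", " C"]]

def score(text):
    # Stand-in value function: a real system would use a learned reward or value model.
    return text.count("B") - 0.1 * len(text)

def tree_search(prompt, depth=2, beam=3):
    frontier = [prompt]
    for _ in range(depth):
        candidates = [c for node in frontier for c in generate_candidates(node)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]  # keep best branches
    return frontier[0]

print(tree_search("Plan:"))
```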
LLM-generated commands
The other option that many people are attempting is to use the LLM to generate “orders” that another system follows. For example:
Hey LLM, imagine you are really smart and understand the world, physics, and math and all that jazz. Now output, in bullet-point format, the exact sequence of minute steps that would lead to you making me a pizza.
Then a different system or set of systems would follow the steps generated by the large language model.
This is seemingly benign, but the people who really know what they are doing are combining this with real-world feedback, à la the DrEureka paper. Here the LLM produces a set of reward functions for RL systems, which in turn interact with simulations or the real world. A feedback signal from these RL systems is then used to update the LLM world model based on the performance of the generated reward functions.
This is a major step, because now the LLM world model is being updated by simulations and real-world interactions.
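The rough shape of that loop, with every function as a placeholder (this is not the DrEureka code, just the feedback structure the paper implies):

```python
# Placeholder sketch of the propose -> train -> feed-back loop described above.
def llm_propose_reward_fn(task, past_results):
    # Stand-in: a real system would prompt an LLM to write reward-function code,
    # showing it how earlier proposals performed.
    return lambda state: 0.0

def train_rl_agent_in_sim(reward_fn):
    # Stand-in: a real system would run RL in simulation and measure performance.
    return {"success_rate": 0.0}

def refinement_loop(task, iterations=3):
    results = []
    for _ in range(iterations):
        reward_fn = llm_propose_reward_fn(task, results)
        performance = train_rl_agent_in_sim(reward_fn)
        results.append(performance)   # feedback that shapes the LLM's next proposal
    return results

print(refinement_loop("make the robot walk"))
```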
If someone knows the in-literature names for these concepts, please edit this post and include them. I have not come across them in my readings - which is honestly crazy - I feel like I should have stumbled upon formal names by now.
Given these different types of systems - what does it mean for them to over-optimize?
Across all of these different setups, one thing is always true: we are still maximizing or minimizing a number.
However, as we have seen, “modelling the input domain” style goals do have a degree of stability to them. They simply model the input domain. As a general rule, the closer a system comes to marrying an effective world model to reinforcement learning, the more dangerous it is.
How to fix over-optimization
Let’s not doom and gloom. There are likely solutions to this situation, and many smart people are working on it right now. But eh… let's also not downplay it: this is a difficult problem. We also want to solve it quickly, so that all, or at least most, upcoming agents can incorporate the solution into their infrastructure.
There are a number of ways to possibly fix over-optimization. Many people will point to methods such as RLHF - Reinforcement Learning from Human Feedback. This is where we expand our external measurable goal with a feedback term, so that if the AI does something wrong, we humans can provide feedback and say: nope, not good, don’t do that again. However, RLHF only provides feedback AFTER something has gone wrong. In real-life scenarios, that is generally too late.
My guess is that the best fix requires interpretability software that allows us to “see” into the network’s representations and manipulate them. In this world, we can alter the internal goal directly. With access to the internal goal, we can build in checks and balances. You could build abstract rules into the goal system. For example: do unto others as you would have them do unto you.
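To be explicit about what I am imagining (and to stress that it is hypothetical), here is the kind of check such interpretability tooling might enable. No such reliable probe exists today; every number and function below is invented for illustration:

```python
import numpy as np

# Hypothetical sketch: suppose interpretability research gave us a probe that estimates,
# from the agent's internal activations, how much a planned action would harm others.
# We could then veto the action before it is taken.
rng = np.random.default_rng(0)
probe_weights = rng.normal(size=16)   # pretend these were fit by interpretability tooling

def estimated_harm(activations):
    return float(probe_weights @ activations)

def approve_action(activations, threshold=1.0):
    # "Do unto others": block the action if the estimated harm to others is too high.
    return estimated_harm(activations) < threshold

print(approve_action(rng.normal(size=16)))
```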