Can we mitigate the dangers of artificial intelligence by using Reinforcement Learning from Human Feedback (RLHF) to reward and punish AI agents?
This is a technique where:
an AI agent makes a decision,
a human considers the decision and provides feedback, and
the feedback signal is used to teach the agent.
Thus, if the AI agent does something bad, we say: “do not do that”, and if it does something good, we say: “good job, keep doing that”.
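To make the loop concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the three candidate actions, the +1/-1 human feedback, and the simple policy-gradient update are stand-ins, not how production RLHF systems are built.

```python
import math
import random

# Toy agent: a softmax policy over a handful of candidate actions.
# The actions, the feedback function, and the update rule are all
# illustrative assumptions, not a real RLHF implementation.
ACTIONS = ["helpful answer", "refusal", "made-up citation"]
prefs = {a: 0.0 for a in ACTIONS}   # learnable preference scores
LEARNING_RATE = 0.5

def sample_action():
    """Sample an action from the softmax over preference scores."""
    weights = [math.exp(prefs[a]) for a in ACTIONS]
    total = sum(weights)
    return random.choices(ACTIONS, weights=[w / total for w in weights])[0]

def human_feedback(action):
    """Stand-in for a human rater: reward good behaviour, punish bad."""
    return 1.0 if action != "made-up citation" else -1.0

def update(action, reward):
    """REINFORCE-style update: push the chosen action's score up or down."""
    weights = [math.exp(prefs[a]) for a in ACTIONS]
    total = sum(weights)
    for a in ACTIONS:
        prob = math.exp(prefs[a]) / total
        grad = (1.0 if a == action else 0.0) - prob  # d log pi / d pref
        prefs[a] += LEARNING_RATE * reward * grad

for step in range(200):
    a = sample_action()
    r = human_feedback(a)   # "good job" (+1) or "do not do that" (-1)
    update(a, r)

print(prefs)   # the punished action ends up with the lowest score
```

Even in this toy version, notice how much the agent's behaviour depends on a human judging every single decision; that is where the trouble starts.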
This sounds great, but the technique has some serious problems.
First, it does not scale. Modern AI systems are trained on millions or billions of data points, so unless we have a large army of people providing feedback around the clock, human feedback will always play second fiddle. For example, most large language models today are first trained to predict the next word in a sentence (a task for which we have plenty of data); only after this prediction training do we even consider reward modeling with human feedback.
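For readers unfamiliar with what "reward modeling with human feedback" looks like, the usual approach is to train a model that scores answers so that the human-preferred answer scores higher than the rejected one. The sketch below computes that pairwise preference loss for a single comparison; the two scores are hypothetical numbers chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical scores a reward model might assign to two answers
# to the same prompt. Real training uses many thousands of such
# human-labelled comparisons, not one.
reward_chosen = 1.8     # score for the answer the human preferred
reward_rejected = -0.4  # score for the answer the human rejected

# Pairwise (Bradley-Terry style) loss: minimised when the preferred
# answer scores higher than the rejected one.
loss = -math.log(sigmoid(reward_chosen - reward_rejected))
print(f"pairwise preference loss: {loss:.4f}")
```

Every one of those comparisons has to come from a person, which is exactly why the approach cannot keep pace with the scale of pretraining data.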
The second problem is that the AI may simply learn to trick the human evaluator. This is often easier than actually learning the lesson the human wants the AI to learn. Imagine trying to teach children how to behave: often, after you tell one child to stop kicking the other, the game just goes underground, and the kids pinch and prod when the parent is not looking.
The third problem is that you are always playing catch-up. The AI makes a mistake, and only then do you tell it what it did wrong. If the AI were superintelligent, that correction might already come too late.
RLHF is very popular today and works well for fine-tuning AI models, but we should not consider it a good method of containing superintelligent AI.