One possible answer to the alignment problem is to create an Advanced Morality Model. The idea is to instill a uniform set of guidelines and rules into the model so as to teach it how to behave. The Advanced Morality Model would be built on a foundation of ethical principles, rules, and values. By integrating this knowledge into AI systems, we hope to create agents that consistently make responsible decisions. This approach is more proactive than reactive techniques like reward modelling with human feedback. It seems to be the main method pushed by companies like Anthropic with their AI, Claude.
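To make the idea a bit more concrete, here is a minimal sketch of what a rule-guided critique-and-revise loop could look like. Everything in it (the example principles, the DummyModel stand-in, and its method names) is an illustrative assumption, not any company's actual implementation.

```python
# Sketch of a principle-guided critique-and-revise loop.
# PRINCIPLES, DummyModel, and its methods are hypothetical placeholders,
# not a real vendor API.

PRINCIPLES = [
    "Avoid responses that assist with harmful activities.",
    "Be honest about uncertainty instead of guessing.",
]

class DummyModel:
    """Stand-in for a language model; a real system would call an LLM here."""
    def generate(self, prompt: str) -> str:
        return f"Draft answer to: {prompt}"

    def critique(self, response: str, principle: str) -> str:
        return f"Check the draft against: {principle}"

    def revise(self, response: str, critique: str) -> str:
        return response + " (revised per critique)"

def improve_response(model: DummyModel, prompt: str) -> str:
    """Generate a draft, then revise it once per written principle."""
    response = model.generate(prompt)
    for principle in PRINCIPLES:
        critique = model.critique(response, principle)
        response = model.revise(response, critique)
    return response  # revised outputs could then serve as training targets

print(improve_response(DummyModel(), "How should I handle a rude customer?"))
```

The point of the sketch is only the shape of the approach: the rules are written down once and applied to every output, rather than inferred case by case from human preference labels.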
However, there are some issues to consider. First, and most obviously, defining a universally agreed-upon morality is complex. Second, communicating those moral teachings to an AI is difficult; the AI is likely to misunderstand them. Third, even with a well-defined morality model, there is the risk that AI agents may find ways to exploit or circumvent the rules.
The fundamental problem
Currently, our AIs optimize a continuous number toward its limit. Because the objective is continuous rather than discrete, there is no natural stopping point. Continuity is also what the backpropagation algorithm requires in order to compute gradients.
Because of this underlying reality, the AI is always learning to optimize a reward signal that promotes pushing a goal toward its extreme limits, as the toy example below illustrates.
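Here is a toy illustration of that claim. A continuous proxy reward is maximized by gradient ascent, and nothing in the optimization itself ever says "enough"; the only stopping rule is an arbitrary step budget chosen by us. The reward function, learning rate, and budget are all illustrative assumptions.

```python
# Gradient ascent on a continuous, unbounded proxy reward.
# The math never signals a natural stopping point; only the step budget does.

def reward(x: float) -> float:
    return x  # a continuous proxy reward with no natural maximum

def gradient(x: float, eps: float = 1e-6) -> float:
    # Numerical derivative of the reward with respect to x.
    return (reward(x + eps) - reward(x - eps)) / (2 * eps)

x = 0.0
learning_rate = 0.1
for step in range(1000):               # the only thing stopping the climb
    x += learning_rate * gradient(x)   # gradient ascent: "more is always better"

print(f"proxy pushed to x = {x:.1f}")  # 100.0, and still climbing if allowed
```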
Ensuring that the AI actually listens to a morality model likely requires embedding the morality model within the goal itself. This is difficult because the AI learns its own representation of its goal. AI researchers place the AI in a training environment with an external goal, but the AI forms its own internal representation of what that goal is. Until interpretability is solved, we cannot be sure what internal goal the AI actually has.
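As a rough sketch of what "embedding the morality model within the goal" might mean, the example below folds a rule-violation penalty directly into the quantity being maximized, rather than applying it as a separate filter afterwards. The task reward, the violation rule, and the penalty weight are illustrative assumptions; and even with such a combined objective, what the AI internally represents as its goal is the part we cannot yet inspect.

```python
# Folding a rule-violation penalty into the optimized objective itself,
# so the optimum shifts instead of being vetoed after the fact.
# task_reward, violation, and LAMBDA are toy assumptions.

LAMBDA = 10.0  # how heavily rule violations are penalized

def task_reward(action: float) -> float:
    return action  # toy proxy: "more is better"

def violation(action: float) -> float:
    return max(0.0, action - 5.0)  # toy rule: actions above 5 count as violations

def combined_objective(action: float) -> float:
    # The morality term lives inside the number being maximized.
    return task_reward(action) - LAMBDA * violation(action)

best = max((a / 10 for a in range(0, 101)), key=combined_objective)
print(f"optimum under the combined objective: {best}")  # 5.0, not 10.0
```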