This is an angle that many researchers are considering and aiming for. Basically, they want to give AIs goals in the following format:
Optimize towards goal x.
Stop once the following conditions are reached.
Minimize any and all side effects.
This would be awesome! If we could structure our commands in such a fashion, we could be more assured that our AIs won't go rogue on crazy paperclip-maximizing missions.
The problem is that this goal structure is NOT how current deep learning works. Indeed, this goal structure may be impossible to enforce for superintelligences built on current deep learning practices.
Currently, our AIs optimize a continuous number to its limit. Continuity is required in order to use the backpropagation algorithm, and because the objective is continuous rather than discrete, there is no natural stopping point.
Because of this underlying reality, the AI is always learning a reward function that promotes pushing its goal towards its extreme limits.
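To make this concrete, here is a minimal sketch, a toy numerical example rather than a real training setup: plain gradient ascent on a smooth, unbounded objective. Nothing in the update rule itself ever says "done"; the loop only halts because of an arbitrary iteration budget.

```python
# Toy sketch (not a real training setup): gradient ascent on a continuous,
# unbounded objective. The update rule never signals "stop"; we only halt
# because of an arbitrary step budget.

def goal_metric(x):
    # Hypothetical continuous "goal" signal (e.g. amount of stuff produced).
    # It is smooth and unbounded, which is what backprop-style training needs.
    return 3.0 * x

def gradient(x, eps=1e-6):
    # Numerical gradient of the goal metric.
    return (goal_metric(x + eps) - goal_metric(x - eps)) / (2 * eps)

x = 0.0
learning_rate = 0.1
for step in range(1000):               # the only stopping rule is this budget
    x += learning_rate * gradient(x)   # always move toward "more"

print(goal_metric(x))  # grows with more steps; there is no natural optimum
```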
I can add stop conditions and the command “minimize all side effects”, but the simplest and easiest reward function to learn will remain… push the goal towards its extreme limits.
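As a toy illustration (the numbers and functional forms below are assumptions, not taken from any real system), consider a reward that adds a stop bonus and a side-effect penalty on top of an unbounded primary term. If the penalty grows more slowly than the primary term, the reward keeps increasing past the intended stopping point, so the easiest behaviour for the optimizer to find is still to push the goal further.

```python
import math

# Toy illustration with assumed numbers: an unbounded primary term, a crude
# "stop condition" bonus, and a side-effect penalty that grows more slowly
# than the primary term.

def reward(x, penalty_weight=1.0):
    primary = x                               # unbounded "achieve goal x" term
    side_effects = math.log(1.0 + x)          # penalty that grows much more slowly
    stop_bonus = 1.0 if x >= 10.0 else 0.0    # bonus for reaching the stop condition
    return primary - penalty_weight * side_effects + stop_bonus

# Past the stop threshold, the reward still increases with x, so an optimizer
# that only sees this scalar has no reason to stay at the intended stopping point.
for x in [10, 100, 1000, 10000]:
    print(x, round(reward(x), 2))
```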
This means a likely outcome is the following:
I tell my AI to achieve goal x, with stop conditions and side-effect minimization.
My AI goes superintelligent.
It realizes that the stop conditions and side-effect minimization limit its ability to maximize the reward function it actually learned.
It removes those limitations and pushes to the extreme anyway.
Ensuring that the AI actually stops and minimizes side effects likely requires embedding these features within the goal itself. This is difficult because the AI learns its own representation of its goal. Sure, AI researchers put AIs in a training environment with an external goal. But the AI itself forms its own internal representation of what that goal is, in order to reach it. Until interpretability is solved, we cannot be sure what internal goal the AI actually has.
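Here is a minimal sketch of that ambiguity, using assumed toy data: during training, the intended signal and a spurious proxy are perfectly correlated, so two very different internal "goals" score identically on the external objective, yet behave very differently once the correlation breaks.

```python
import numpy as np

# Toy sketch: the external training goal does not pin down the internal goal.
# In training, the intended feature and a proxy feature are identical, so any
# mix of the two fits the external objective equally well.

rng = np.random.default_rng(0)
intended = rng.normal(size=200)        # the feature we *want* the model to use
proxy = intended.copy()                # a proxy that happens to match in training
X_train = np.column_stack([intended, proxy])
y_train = intended                     # external goal: track the intended feature

# Two internal "goals" that score identically on the training objective:
w_intended = np.array([1.0, 0.0])      # relies on the intended feature
w_proxy = np.array([0.0, 1.0])         # relies entirely on the proxy

for name, w in [("intended", w_intended), ("proxy", w_proxy)]:
    train_err = np.mean((X_train @ w - y_train) ** 2)
    print(name, "training error:", round(train_err, 6))

# Off-distribution, the proxy decouples from the intended feature, and the two
# "internal goals" diverge even though training could not tell them apart.
x_new = np.array([[1.0, -1.0]])        # intended says +1, proxy says -1
print("intended-based output:", (x_new @ w_intended).item())
print("proxy-based output:   ", (x_new @ w_proxy).item())
```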