This needs to be expanded upon

“It’s easy, guys! (It’s not.) Just take the internal goal and tack on the following. Take the representation for human and put it in relationship with the representation for harm. Then take the AI’s representation of self and put it in relationship with the new harming-human aggregation. Do all this in such a way that when the simulated outcome interacts with the new internal goal, the internal measurement function massively devalues any plan in which such a harm-human outcome becomes likely. BAM, First Law applied.”

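To make the handwave a little more concrete, here is a minimal toy sketch of that recipe in Python. It assumes the hard part is already solved, i.e. that `rep_human`, `rep_harm`, and `rep_self` are clean vectors someone already pulled out of the model; every name here (`harm_likelihood`, `internal_value`, etc.) is a hypothetical placeholder, not anyone’s real method or API.

```python
# Toy sketch of the quoted recipe, assuming interpretability is "solved" and
# internal representations are available as plain vectors (the big assumption).
import numpy as np

rng = np.random.default_rng(0)

# Pretend these were extracted from the model's internals.
rep_human = rng.normal(size=64)  # representation of "human"
rep_harm = rng.normal(size=64)   # representation of "harm"
rep_self = rng.normal(size=64)   # the AI's representation of itself

# Step 1: put "human" in relationship with "harm" (here, naive vector addition).
rep_harming_human = rep_human + rep_harm

# Step 2: relate the self-representation to the harming-human aggregation.
rep_self_harming_human = rep_self + rep_harming_human

def harm_likelihood(simulated_outcome: np.ndarray) -> float:
    """Hypothetical: how strongly a simulated outcome matches the
    self-harming-human direction (cosine similarity squashed to [0, 1])."""
    cos = np.dot(simulated_outcome, rep_self_harming_human) / (
        np.linalg.norm(simulated_outcome) * np.linalg.norm(rep_self_harming_human)
    )
    return (cos + 1.0) / 2.0

def internal_value(simulated_outcome: np.ndarray, task_reward: float) -> float:
    """Step 3: the internal measurement function massively devalues any plan
    whose simulated outcome looks like harming a human."""
    penalty = 1e6 * harm_likelihood(simulated_outcome)
    return task_reward - penalty

# Score two candidate plans; the harm-adjacent one gets crushed by the penalty.
plan_a = rng.normal(size=64)                                  # unrelated outcome
plan_b = rep_self_harming_human + 0.1 * rng.normal(size=64)   # harm-like outcome
print(internal_value(plan_a, task_reward=10.0))
print(internal_value(plan_b, task_reward=10.0))  # massively devalued
```

Of course, the entire difficulty is hidden in the first three lines: getting those representations, and getting the “put in relationship” step to mean anything.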
Also, I know this is part of the handwave, but… how do you imagine putting representations into the correct relationships? I suppose that falls under the “solve interpretability first” umbrella.

By frank-green
