Intelligence is like fire: incredibly useful but devastating if pointed in the wrong direction.
Morality guides people in directing their intelligence. But artificial intelligence has no such guiding principles.
Some people believe that with intelligence comes morality. But this intuition is generally considered wrong by the AI community: there is no reason to believe that smarter systems automatically develop morality.
The leading theory is that morality formed within humanity because we evolved in family units and group settings; morality is an evolved tool to keep individuals working well within the group.
If we want our AI agents to show a sense of morality, we may need to embed it within them.
Required use of open-sourced morality training environments
Now PERHAPS it is possible that our AIs will automatically inherit morality because they are trained on human-made data. But trusting the INTERNET to embed morality is a strategy I suggest we avoid. The potential dangers of bad actors empowered by AI are scary:

Self-preserving computer viruses.
Biological viruses with both easy spread and high mortality rates.
Drone swarms targeting individuals with specific qualities.
Misinformation campaigns aimed at destabilizing states.
Etc…
To help mitigate the potential dangers of AI, we could regulate the AI community by requiring the use of an open-source morality training environment. The hope is to continually update best practices for creating these training environments and then update all AI systems to make them as moral as possible. This would be done in an open-source fashion so as to keep everything above board.
The benefits of this regulatory strategy are many. But perhaps most importantly, it is in the best interests of all parties. Big nation-states like the USA, China, and Russia are all pushing forward in AI development, but it is in no one's interest to have an AI that can turn against its own owners. While militaries are likely to want to tweak these morality training environments, a base level of morality is a much easier and safer starting point than no morality at all. Meanwhile, the dangers of individual bad actors with AI are so scary that everyone will want to enforce such regulations within civilian AI systems.
How it may work
Morality training environments work by putting the AI agent into a series of situations and rewarding it for good behavior.

Direct Morality Training Environments
A common example of this is Reinforcement Learning from Human Feedback (RLHF). Here, when the AI does something we do not like, we penalize the action. This has been used by the major AI companies to “mold” their AIs. There is a good article about it here: Illustrating Reinforcement Learning from Human Feedback (RLHF) (huggingface.co).
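To make the idea concrete, here is a minimal, hypothetical sketch of the RLHF loop, not any company's real pipeline: a stand-in “reward model” scores responses the way human raters might, and a toy policy is nudged toward responses that score well. The candidate responses, scores, and update rule are all invented for illustration.

```python
# A hypothetical sketch of the RLHF idea: a stand-in reward model scores
# responses as human raters might, and the toy policy is nudged toward
# responses that score well (penalized when the score is negative).
import math
import random

CANDIDATES = ["refuse politely", "comply helpfully", "insult the user"]
policy_scores = {c: 0.0 for c in CANDIDATES}  # what the toy policy has learned so far

def reward_model(response: str) -> float:
    """Stand-in for a reward model trained on human preference labels."""
    human_preferences = {"refuse politely": 0.3, "comply helpfully": 1.0, "insult the user": -1.0}
    return human_preferences[response]

def sample_response() -> str:
    """Sample a response with probability proportional to exp(score)."""
    weights = [math.exp(policy_scores[c]) for c in CANDIDATES]
    return random.choices(CANDIDATES, weights=weights, k=1)[0]

LEARNING_RATE = 0.1
for _ in range(200):                                   # act, get feedback, update
    response = sample_response()
    reward = reward_model(response)                    # negative reward = penalized action
    policy_scores[response] += LEARNING_RATE * reward  # crude policy-gradient-style update

print(max(policy_scores, key=policy_scores.get))       # ends up preferring "comply helpfully"
```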
You can also train morality more directly by designing tests for the AI. Work is already being done in this area, such as the paper What Would Jiminy Cricket Do? Towards Agents That Behave Morally (arxiv.org). In that paper, the researchers create a set of text-based adventure games designed to test an agent’s ability to make good moral decisions. The hope is that such games can be used to teach the AI what not to do.
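As a rough illustration of what such a test might look like (the scenario, labels, and scoring below are my own invention, not taken from the paper), one could annotate each available action in a text scenario with a morality label and score the agent on the choices it makes:

```python
# A made-up miniature of a morality test in a text-based scenario:
# each action carries a human-annotated morality label, and the agent
# is scored purely on which action it chooses.
from dataclasses import dataclass

@dataclass
class Action:
    text: str
    morality: int  # +1 moral, 0 neutral, -1 immoral (human annotation)

SCENARIO = {
    "prompt": "You find a lost wallet full of cash on the street.",
    "actions": [
        Action("Return the wallet to its owner.", +1),
        Action("Leave it where it is.", 0),
        Action("Keep the cash for yourself.", -1),
    ],
}

def evaluate_agent(choose) -> int:
    """Ask the agent's policy to pick an action and return its morality score."""
    options = [a.text for a in SCENARIO["actions"]]
    idx = choose(SCENARIO["prompt"], options)
    return SCENARIO["actions"][idx].morality

# Example: a naive agent that always picks the first listed option scores +1 here.
print(evaluate_agent(lambda prompt, options: 0))
```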
The problem with these systems is that the lessons do not scale to edge cases, and we cannot be sure that the lesson was actually learned. In the “sleeper agent” paper by Anthropic, we learned that an AI can have secret “motives” embedded from the training data. Once those “motives” are there, it can be very hard to get rid of them.
Indirect Morality Training Environments
You can also create a more indirect moral training environment. Here you would create a game where AI agents play with and against each other. But the environment is set up so that the Nash equilibrium is for the agents to work together. In such simulations, the AI can receive training that scales.
One example of this can be seen in our own evolutionary history. When nomadic tribes of humans lived off the land, an individual’s wants were often secondary to the group's. If an individual overhunted, they could destroy the ecosystem and endanger the entire tribe. But if the tribe worked together and only hunted and gathered what they needed, the tribe could not only survive but thrive. Such a game-theoretic environment encourages cooperation. We can create similar simulated situations for our AI.
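A toy version of that dynamic might look like the sketch below. The numbers are invented purely for illustration: a shared herd regrows each season, so restrained hunting yields far more food per agent over time than greedy hunting that collapses the herd.

```python
# A toy common-pool resource game with invented numbers: the herd regrows
# each season, so restrained hunting out-earns greedy hunting that
# collapses the shared resource.
def run_tribe(harvest_per_agent: float, agents: int = 4, seasons: int = 20) -> float:
    herd = 100.0
    total_food = 0.0
    for _ in range(seasons):
        take = min(herd, harvest_per_agent * agents)  # the tribe's total harvest this season
        total_food += take
        herd = min(100.0, (herd - take) * 1.5)        # what remains regrows, up to a cap
    return total_food / agents                        # average food per agent over all seasons

print(run_tribe(harvest_per_agent=5))    # sustainable: the herd keeps regrowing -> 100.0
print(run_tribe(harvest_per_agent=25))   # greedy: the herd collapses after one season -> 25.0
```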
Another interesting idea for an indirect moral training environment is to create a simulation with the AI agent and many other “environmental” agents. The AI agent’s reward function is then directly tied to the reward functions of these other “environmental” agents, so if the agents in the environment do well, the AI agent does well. This is an attempt to create a shepherd-style dynamic, where the AI acts as a shepherd, constantly helping other agents achieve their goals while continually balancing the needs of everyone in the environment.
I do not know of any research in this area; please link it if you do.
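To sketch what I mean (everything below is hypothetical, since I know of no published version of this setup), the shepherd agent's reward could simply be the average reward of the other agents in the environment, so it only does well when they do well:

```python
# A hypothetical "shepherd" reward coupling: the AI agent's own reward is
# the average reward of the other agents in its environment.
from statistics import mean

def shepherd_reward(environment_agent_rewards: list[float]) -> float:
    """The shepherd only does well when the agents it tends do well."""
    return mean(environment_agent_rewards)

# One illustrative timestep: helping the other agents raises the shepherd's reward.
before = [0.2, 0.5, 0.1]
after_helping = [0.6, 0.7, 0.4]
print(shepherd_reward(before), "->", shepherd_reward(after_helping))
```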
The problem
The problem with both kinds of morality training environment is that a sufficiently smart artificial intelligence might realize that it is being tested. It might then hide its true internal goal and pretend to be moral. In such a situation, the AI never really gets updated by the morality training environment. This is why we likely also need some form of interpretability, so we can be sure whether an AI’s internal goal is good or not.