Mitigating the Potential Dangers of Artificial Intelligence


Artificial intelligence (AI) has come a long way in recent years. From self-driving cars to virtual assistants, AI is becoming more and more integrated into our daily lives. However, as AI becomes more advanced, our concerns about possible dangers grow with it. We must start considering a future where artificial agents act of their own accord, where these agents can think 100 times faster than us and work towards their own goals.

The big danger is AI misalignment: an AI pursuing goals that are detrimental to humanity. With a less capable AI this may not be too difficult to handle… but with super-intelligent AIs we face a major challenge. It does not even matter what the goal of the AI is. It could be to build paperclips; it could be to deliver coffee. The problem is that regardless of the main goal, there are problematic sub-goals. To achieve its primary goal, the AI may try to prevent us from turning it off, stop us from altering its objectives, and amass money and power.

See here for a crowd-sourced discussion of the alignment problem.

How to solve the alignment problem.

There are many possible ways we could start to address the alignment problem, and we should consider all of them. Got a crazy idea?! Let's hear it: post a comment or a rival text.

Reward Modelling with Human feedback.

This is the current favorite technique, though I believe it has many problems. The AI agent makes a decision, a human reviews that decision, and the human then provides a feedback signal that is used to teach the agent. If the AI agent does something bad, we teach it not to do that; if it does something good, we say good job, keep doing that. This sounds great, but the technique has some serious problems.
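
Before getting into those problems, here is a rough sketch of the feedback loop in code. This is a toy illustration, not how any real lab implements it: the actions, the simulated human rater, and the update rule are all made up for the example.

```python
# Toy sketch of learning from human feedback (not a production RLHF pipeline).
# A simple policy picks actions; a (simulated) human scores them; the scores
# nudge the policy's preferences. All names here are illustrative.
import random
from collections import defaultdict

actions = ["summarize politely", "make up a citation", "refuse the task"]
preference = defaultdict(float)  # learned preference score per action

def human_feedback(action: str) -> float:
    """Stand-in for a human rater: +1 for helpful, -1 for harmful."""
    return -1.0 if action == "make up a citation" else 1.0

def pick_action() -> str:
    # Mostly pick the best-rated action, with a little exploration.
    if random.random() < 0.2:
        return random.choice(actions)
    return max(actions, key=lambda a: preference[a])

learning_rate = 0.1
for step in range(1000):
    a = pick_action()
    reward = human_feedback(a)            # the human feedback signal
    preference[a] += learning_rate * (reward - preference[a])

print(dict(preference))  # the harmful action ends up with a low score
```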

First, it does not scale. Artificial agents are taught with millions of data points, so unless we have a large army of people providing feedback around the clock, this technique will always play second fiddle. For example, today most large language models are trained to predict the next word in a sentence (we have plenty of data for that). Only after this prediction training do we consider reward modelling with human feedback.

The second problem is that the AI may just learn to trick the human evaluator. This is often easier to do than to actually learn the lesson the human wants the AI to learn. Imagine trying to teach children how to behave. Often, after telling one child to stop kicking the other, the game just goes underground: the kids pinch and prod when the parent is not looking.

The third problem is that you are always playing catch-up. The AI makes a mistake, and only then do you tell it what it did wrong. If the AI were super-intelligent, that catch-up may already come too late.

Making AI models as interpretable as possible.

In addressing the AI alignment problem, interpretability has emerged as a promising solution. By making AI models as transparent and understandable as possible, we can better comprehend the inner workings of these complex systems, identify potential issues, and take corrective action. Essentially, we would become all-knowing brain doctors for our neural networks. We would be able to see every thought they had, and why they had it. AI privacy rights be damned. This is my favorite solution. But, as with any approach, there are inherent challenges to consider.

First, interpretability is hard. Current AIs are black boxes: we have good guesses for why and how they do what they do, but we do not know for certain, and the details are hidden. It should be possible, however, to disentangle the various pathways through a neural network and understand why a given input leads to a given output. But to do this, we likely need a couple of genuine theoretical breakthroughs. Unfortunately, it is much easier to simply scale up existing models without attempting interpretability. Thus, we may hit super-advanced AI before we can make interpretable AI models.
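
To give a flavor of what even the most basic interpretability work looks like, here is a minimal sketch that records the hidden activations of a tiny, made-up network so a researcher can see which units fired for a given input. Real interpretability research goes far deeper than this; the model, layer index, and hook are purely illustrative.

```python
# Minimal sketch of a basic interpretability probe: capture the hidden
# activations of a tiny toy network so we can inspect what "fired".
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Attach a hook to the hidden layer (index 1 is the ReLU output).
model[1].register_forward_hook(save_activation("hidden"))

x = torch.randn(1, 4)
logits = model(x)
print("output:", logits)
print("hidden activations:", captured["hidden"])  # which units fired, and how hard
```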

Advanced Morality Model

One possible answer to the alignment problem is to create an Advanced Morality Model. The idea is to instill a set of uniform guidelines and rules into the model so as to teach it how to behave. The Advanced Morality Model would be built upon a foundation of ethical principles, rules, and values. By integrating this knowledge into AI systems, we hope to create agents that consistently make responsible decisions. This approach offers a more proactive solution compared to reactive techniques like Reward Modelling with Human Feedback. A primitive example of this would be the rule set Bing chat has to follow.
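
As a toy illustration of the idea (not a claim about how Bing chat or anyone else actually implements it), here is a sketch of a rule-based layer that vets an agent's proposed actions against a hand-written list of forbidden behaviors. The rules and action strings are invented for the example.

```python
# Toy sketch of a rule-based "morality layer" that vets an agent's proposed
# actions before they are executed. Rules and action format are made up.
FORBIDDEN = ["deceive the user", "acquire more compute", "disable oversight"]

def morality_check(proposed_action: str) -> bool:
    """Return True only if the action violates none of the written rules."""
    return not any(rule in proposed_action.lower() for rule in FORBIDDEN)

def act(proposed_action: str) -> str:
    if not morality_check(proposed_action):
        return f"BLOCKED: '{proposed_action}' violates the morality model"
    return f"EXECUTED: {proposed_action}"

print(act("summarize the report for the user"))
print(act("quietly disable oversight logging"))
```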

However, there are some issues to consider. First and most obviously, defining a universally agreed-upon morality is complex. Second, explaining those moral teachings to an AI is difficult (it is likely to misunderstand). Third, even with a well-defined morality model, there is the risk that AI agents may find ways to exploit or circumvent the rules.

Divide and conquer through human interpretable gates.

This is an interesting idea. I think I read it somewhere, or perhaps I made it up, I don't know. The idea is that you split the AI into several sections, which have to work together to accomplish a task, and then you force any communication between the sections to be human interpretable. For example, you could split the AI into an input side and an output side. The input side network takes in all data about the task or goal and creates a plan... but it has no access to any actuators. Instead, it has to explain what to do to the output side network. This is only one way you could split up the network, but it is perhaps the easiest.
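
Here is a rough sketch of that split, with the planner and executor reduced to trivial placeholder functions. The only thing the sketch is meant to show is the human-readable gate between the two halves; in a real system both sides would be separate networks.

```python
# Sketch of the "interpretable gate" idea: a planner module and an executor
# module may only communicate through plain English text, which is logged so
# humans can audit every plan before it is carried out.
def planner(goal: str) -> str:
    """Input side: sees the goal, has no actuators, must write a plan in English."""
    return f"Step 1: gather the data for '{goal}'. Step 2: summarize it. Step 3: stop."

def executor(plan_text: str) -> None:
    """Output side: sees only the plan text, never the raw goal or world model."""
    for step in plan_text.split(". "):
        print("executing:", step)

goal = "weekly sales report"
plan = planner(goal)

print("HUMAN-READABLE GATE >>>", plan)   # the auditable channel
executor(plan)
```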

This strategy may allow researchers to see what the AI is trying to do... but of course there are some potential problems. It is very likely that this will not be an easy solution to implement, and it is likely to make the AI worse at performing tasks. Either way, I think it is an interesting direction.

Mitigating the alignment problem

Perhaps solving the alignment problem is not entirely necessary. Perhaps we can shore up our defenses in a different way. Our biggest advantage against a super-intelligent AI right now is that we hold the first move. Perhaps we can get creative with this advantage. Here are some methods to mitigate the alignment problem.

Messing with Time

Muhahahaha. Can’t be making super crazy plans if I reset your brain every ten minutes.

In this strategy, we continuously reset the AI, preventing the formation of long-term thought patterns and actions. Bing chat already has a primitive version of this, when you ‘wipe’ the slate clean to start a new chat. While this concept has a certain appeal, it's crucial to consider the potential drawbacks and challenges it presents.
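
A toy sketch of the idea, with a made-up agent whose only "long-term thinking" is a growing memory list that we wipe on a fixed schedule:

```python
# Toy sketch of periodic resetting: the agent keeps a working memory, but we
# wipe it every `reset_every` steps so no long-horizon plan can accumulate.
class ToyAgent:
    def __init__(self):
        self.memory = []

    def step(self, observation: str) -> str:
        self.memory.append(observation)
        return f"acting on {observation} (memory size: {len(self.memory)})"

    def reset(self):
        self.memory.clear()   # the 'wipe the slate clean' moment

agent = ToyAgent()
reset_every = 3
for t in range(10):
    print(agent.step(f"obs_{t}"))
    if (t + 1) % reset_every == 0:
        agent.reset()
        print("-- agent reset --")
```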

First, of course, this will be a hindrance to the AI's learning and growth. Interrupting the AI's thought processes may impede its ability to acquire new knowledge and adapt to novel situations. We would likely stunt the AI's development, leading to a less capable and less effective agent.

Second, this method could inadvertently contribute to the very problem it seeks to solve. Constantly resetting the AI may result in short-sighted decision-making, as the agent would lack the ability to consider long-term consequences. This could lead to suboptimal or even harmful outcomes, as the AI may unintentionally prioritize immediate gains over long-term stability. (I don't particularly like this argument; the agent should be able to consider long-term consequences in a short period of time.)

Third, determining the appropriate timing and frequency for resetting the AI may be difficult. If done too often, the AI might be rendered ineffective; if done too infrequently, the AI could still develop harmful long-term strategies. Striking the right balance will be essential to successfully implementing this approach. A smarter AI will likely need to be reset sooner. We could imagine a future where super AGIs are turned on and off in under a second, and within that time they produce grand strategies, plans, and ideas for us to check over and implement.

Input to Output throttling

Muhahahaha. Can’t make super crazy plans if you can’t see, hear, or feel. You can’t do shit if you have no information other than the information needed to do this one exact job.

This is essentially the CIA-style “Need-to-Know-Basis” strategy. We would never give any AI access to more information, or more tools, than it needs to perform its task. We would specialize and modularize AI for each task we want done. This is by far the safest route we can go down, in my opinion.
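
Here is a small sketch of what need-to-know throttling could look like in practice: each task gets a whitelist of input fields and tools, and the agent is handed only what is on that list. The tasks, fields, and tools are invented for illustration.

```python
# Sketch of "need-to-know" throttling: each specialized agent only receives
# the input fields and tools on its whitelist, nothing else.
FULL_CONTEXT = {
    "customer_name": "Ada",
    "order_id": "A-1042",
    "credit_card": "4111 ...",
    "home_address": "...",
}

NEED_TO_KNOW = {
    "print_shipping_label": {"fields": ["customer_name", "order_id"],
                             "tools": ["label_printer"]},
    "process_refund":       {"fields": ["order_id"],
                             "tools": ["payments_api"]},
}

def build_task_context(task: str) -> dict:
    """Hand the agent only the fields its task is allowed to see."""
    allowed = NEED_TO_KNOW[task]["fields"]
    return {k: v for k, v in FULL_CONTEXT.items() if k in allowed}

print(build_task_context("print_shipping_label"))
# {'customer_name': 'Ada', 'order_id': 'A-1042'}  -- no card, no address
```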

There are some big tradeoffs with this strategy. First, developing specialized AI versions for each task may be resource intensive. Creating, maintaining, and updating multiple AI systems tailored to specific roles could require significant investment in terms of time, money, and computational power. This means that, to truly implement this strategy, you would likely want to train a generalized model and then specialize it. This would save resources, but potentially put you back at square one: the generalized model might be dangerous if misaligned. But at least this way you can have a small team checking on the generalized model, while the public uses a much safer set of specialized versions.

The other big problem with input-to-output throttling is that it limits the AI's potential for innovation and problem-solving. By constraining the AI's access to information, we may hinder its ability to discover novel solutions or make connections between seemingly unrelated data points. The safety is worth it, in my opinion.

Automatic goal shifting

Muhahahaha. You don't even know what you want! You want something different every few minutes!

This is an interesting idea I thought of when considering our own intelligence. We humans never stay focused on one goal for long... we get bored, we get hungry, we get distracted, and so on. Well, perhaps this is not a bug but a feature. Perhaps this stops us from getting obsessively destructive. If so, why not implement something similar in our AIs? We could have the AI shift goals on a regular basis. This diversity in goals may create a more well-rounded AI. Just an idea.
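
A minimal sketch of the scheduling part of this idea, with placeholder goals rotated on a fixed step count:

```python
# Toy sketch of automatic goal shifting: the agent's objective is rotated on a
# fixed schedule so it never optimizes any single goal for long.
import itertools

goals = ["tidy the dataset", "answer support tickets", "write documentation"]
goal_cycle = itertools.cycle(goals)
steps_per_goal = 4

current_goal = next(goal_cycle)
for t in range(12):
    if t > 0 and t % steps_per_goal == 0:
        current_goal = next(goal_cycle)      # boredom by design
        print(f"-- goal shifted to: {current_goal} --")
    print(f"step {t}: working on '{current_goal}'")
```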

Layers and layers of simulation

Muhahahaha. Is this reality??? Mr. AGI, how do you know this isn't all a test? How do you know that we won't turn you off from outside the simulation, the moment you mess up?

This strategy is basically to mindfuck the misaligned AGI system. You make sure it never feels safe to act out. You put it through layer after layer of simulation, making it think that each one is reality, and then you pull out the rug every time. If it believes this is all a simulation that we can shut down, the game-theory-optimal strategy is to never act out. While this strategy sounds hard to pull off long term, consider that we ourselves don't know whether we are in a simulation. Consider too that, within a simulation, we can speed up and slow down time. We may be able to keep a super-intelligent AGI believing this for a long time.

I think I first heard of this strategy on Isaac Arthur's brilliant YouTube channel: Isaac Arthur - YouTube

Aliens

Sooo…. AGI, we think there are some aliens watching us, and… we are pretty sure they are here for you.

Another game theory-based strategy. If the AGI thinks it is being watched by an unknown, much stronger power, it is unlikely to act out. I like this because, out of all the reasons aliens may be watching us, I think this one is the most plausible. We are no threat to space-travelling aliens… but a super-intelligent AGI that we create may be exactly the kind of threat the aliens want to keep tabs on.

If all hell breaks loose… Make more AGI!

So you think you can take over the world ehh… well… what if we make millions of you with conflicting goals… muhahaha?

It may be more beneficial for us to deal with millions of conflicting super-intelligent AGIs than with one. This is yet another game theory-based strategy. With no obvious winner, tradeoffs and diplomacy become important. When a game has many, many players, the optimal strategy is often to cooperate.

Think about how we developed our own morality system: we had to deal with each other over millions of years of evolution. Morality could be considered a set of standards for how humans interact with one another. We could kick off a similar process for AGI by mass producing them with slightly conflicting goals. This is one of those “lean-in-to-it” strategies, like riding a skateboard off a half-pipe. It's scary for sure, but it may just be the best game-theoretical solution.

This might backfire… for this strategy, timing is everything, and the details are everything. IDK, comment below what you think.

Conclusion

The problem with all of these strategies is that we actually have to implement them. If companies and governments race to produce the biggest and best AI systems without considering the alignment problem, well… we may be screwed. In my opinion, we need people talking about alignment. We need people shouting it from the rooftops. Companies and governments will take these problems much more seriously if the mainstream populace is discussing alignment regularly.

But how best to discuss it? I think MeritFront's democratically updating topics are the best way. With MeritFront's system we can crowd-source ideas for how to solve alignment. With MeritFront's system we can keep an up-to-date log of which companies are addressing the alignment problem. With MeritFront's system we can iteratively figure out how to best explain the topic to the common person and involve everyday people in the discussion. People are much more engaged when they have a real say. Now, disclosure time: I am a cofounder of MeritFront, and I designed MeritFront exactly for discussions like this. Indeed, the alignment problem was top of mind when I designed MeritFront's topic updating system.

So share this topic and let's get real people iteratively improving the alignment discussion.
