In addressing the AI alignment problem, interpretability has emerged as a promising solution. By making AI models as transparent and understandable as possible, we can better comprehend the inner workings of these complex systems, identify potential issues, and take corrective action.
Essentially, we would become all-knowing brain doctors of our neural networks. We would be able to see every thought they had, and why they had it. AI privacy rights be damned. This is my favorite solution. But, as with any approach, there are inherent challenges that need to be considered.
Most importantly, interpretability is hard. Current AI systems are black boxes. We have good guesses for why and how they do what they do, but we do not know for certain. The details are hidden in complexity. Regardless, it should be possible to disentangle the various pathways through a neural network and understand why a given input leads to a given output.
To do this, we likely need a couple of unique theoretical breakthroughs. Unfortunately, it is much easier to simply scale up existing models without attempting interpretability. Thus, we may hit super-advanced AI before we can build interpretable AI models.
Current Directions of Interpretability Research
If you are into interpretability research you will quickly find that most of the well-cited work is being done by Anthropic. Most of the papers cited here are written by them.
Visualizing learned representations - We want to know what each neuron represents, but how? We can visually render the kind of input that activates the neuron, which gives us an intuition for what each neuron is about. This direction is quite mature for convolutional neural networks. See here.
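To give a flavor of how this works in practice, here is a minimal sketch of activation maximization, the usual trick behind these visualizations: start from noise and run gradient ascent on the input until a chosen neuron fires strongly. This assumes PyTorch and a pretrained torchvision VGG16; the layer index, channel, learning rate, and step count are arbitrary illustrative choices.

```python
# Minimal sketch of activation maximization: optimize an input image so that
# one channel of a chosen convolutional layer fires as strongly as possible.
# Assumes PyTorch + torchvision; the layer, channel, and step count are arbitrary.
import torch
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1").eval()
layer = model.features[10]   # an arbitrary mid-level conv layer
channel = 42                 # the "neuron" (feature map) we want to visualize

activations = {}
def hook(module, inputs, output):
    activations["value"] = output
layer.register_forward_hook(hook)

img = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from random noise
optimizer = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(img)
    # Maximize the mean activation of the chosen channel (minimize its negative).
    loss = -activations["value"][0, channel].mean()
    loss.backward()
    optimizer.step()

# `img` now roughly shows the pattern this channel responds to.
```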
Studying superposition - So each neuron represents a feature? Right? Right? Wrong. Based on this fantastic article on toy models of superposition, we can now confidently say that there are situations where a single neuron can represent multiple features! This introduces a serious problem for interpretability, as we can no longer simply assume that each neuron finds and represents some singular feature of the input. The Anthropic team is working on isolating features irrespective of whether they are represented by a single neuron or part of a neuron. You can find that work here.
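If you want to see superposition with your own eyes, a stripped-down version of the toy setup is easy to run. The sketch below is only loosely in the spirit of the toy-models paper (sizes, sparsity, and training details are made up): it squeezes five sparse features through a two-dimensional hidden layer, and because there are more features than dimensions, the learned feature directions overlap.

```python
# Tiny reproduction of the superposition setup: 5 sparse features are forced
# through only 2 hidden dimensions, so the model has to store several features
# per direction. All sizes and hyperparameters here are illustrative.
import torch

n_features, n_hidden = 5, 2
W = torch.nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic data: each feature is present only ~5% of the time.
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < 0.05)
    h = x @ W                        # compress: 5 features -> 2 hidden dims
    x_hat = torch.relu(h @ W.T + b)  # reconstruct all 5 features
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Rows of W (one per feature) now overlap: several features share the same
# hidden directions, i.e. they are stored in superposition.
print((W @ W.T).detach())  # off-diagonal entries show feature interference
```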
Phase changes - Sudden changes in a model's performance can map to increases in model capability. Anthropic has shown that when a neural network discovers certain new types of circuits, a noticeable drop in the loss function can be observed. They show this to be the case with induction heads within LLMs. Can we predict when these changes happen? Can we predict the capability learned? There is an idea floating around that we could detect these phase changes and “pause” the AI training so that we can study what it learned and verify that it did not learn anything dangerous.
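Very roughly, the "pause on phase change" idea could look like the sketch below: monitor the training loss and flag steps where it starts falling much faster than its recent trend. The window size and threshold are invented for illustration; real phase-change detection would be considerably more careful.

```python
# Hypothetical sketch of the "pause on phase change" idea: watch the training
# loss and flag steps where it suddenly starts dropping much faster than its
# recent trend. The window size and threshold are made-up illustrative values.
import numpy as np

def detect_phase_changes(losses, window=50, drop_factor=3.0, eps=1e-6):
    """Return training steps where the loss starts falling unusually fast."""
    losses = np.asarray(losses)
    flagged = []
    for t in range(2 * window, len(losses)):
        recent_slope = np.mean(np.diff(losses[t - window:t]))
        past_slope = np.mean(np.diff(losses[t - 2 * window:t - window]))
        # Flag steps where the loss is now falling several times faster than before.
        if recent_slope < 0 and abs(recent_slope) > drop_factor * abs(past_slope) + eps:
            flagged.append(t)
    return flagged

# e.g. pause training and run interpretability tools at each flagged step:
# for step in detect_phase_changes(loss_history):
#     inspect_checkpoint(step)   # hypothetical helper
```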
Scaling interpretability tools - There is a great push to build tools that use AI to interpret AI. This way we could simply scan over a neural network and output an interpretable interface so that we can see what is going on. These tools are likely necessary if we want interpretability to become commonplace.
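As a hand-wavy illustration of what such a tool might do, the sketch below follows a common automated-labelling recipe: gather the snippets that most strongly activate a neuron and ask another language model what they have in common. Both get_activations and ask_llm are hypothetical stand-ins for whatever model access and LLM API you actually have.

```python
# Rough sketch of "AI explaining AI", in the spirit of automated neuron
# labelling: collect the text snippets that most strongly activate a neuron,
# then ask a language model to guess what the neuron is doing.
# `get_activations` and `ask_llm` are hypothetical stand-ins, not a real API.

def explain_neuron(neuron_id, dataset, get_activations, ask_llm, top_k=20):
    # Score every snippet by how strongly this neuron fires on it.
    scored = [(get_activations(text)[neuron_id], text) for text in dataset]
    top_snippets = [text for _, text in sorted(scored, reverse=True)[:top_k]]

    prompt = (
        "These text snippets all strongly activate the same neuron in a "
        "language model. In one sentence, what do they have in common?\n\n"
        + "\n".join(f"- {s}" for s in top_snippets)
    )
    return ask_llm(prompt)

# Scanning a whole layer is then just a loop:
# explanations = {i: explain_neuron(i, corpus, get_activations, ask_llm)
#                 for i in range(n_neurons)}
```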
I highly suggest that if you are interested in these topics you check out LessWrong and the Alignment Forum.
ALSO, a great hub of interpretability papers is right here: Mechanistic Interpretability — Neel Nanda.
If you have more details, please link and add to this page. If you can make this article more engaging and easier to read while remaining true to ground truth, please do so!