reviewing towards monosemanticity

In October 2023, Anthropic published "Towards Monosemanticity," a paper that used sparse autoencoders to decompose a small transformer's activations into interpretable features. What they found looked less like the inscrutable matrices everyone expected and more like something with structure -- discrete, human-readable concepts emerging from the noise.

Last night I stayed up late reading an extremely exciting paper with an extraordinarily boring name:

"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" https://transformer-circuits.pub/2023/monosemantic-features/index.html

It's getting attention because of an article that does a good job of popularizing its contents:

"God Help Us, Let's Try To Understand AI Monosemanticity" https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand

But there are so many amazing things in the paper that the article couldn't catch everything. Here's my attempt at pulling out the most mindblowing stuff.

Superposition

First off, the paper seems to confirm superposition.

Seven years ago, I remember sitting in a conference room with my teammates, talking about how deep nets learn hierarchies of meaning. One of our coworkers was skeptical. He was a database guy, and he insisted that the data still had to exist somewhere.

If you are going to train on all these images, pull concepts like 'a nose' or 'a wheel' out of them, and identify them later, then somewhere deep inside the neural net, he believed, it had to be memorizing these concepts. Maybe not in pictures, maybe not in a form we'd recognize as code or an equation, but that data was going to require storage. There were fundamental limits to how much the data could be compressed, and he argued that this put hard constraints on what a neural net could learn, based on how many neurons it had.

At the time, it was hard to rebut this argument. I think we turned to the high dimensional nature of feature embeddings expressed in the output representation of neural layers, and how many possibilities that provided. That was close, but now we have a much better idea of what's happening. It's not the dimensionality of the activations that allows deep nets to learn so much. It's the dimensionality of the neurons producing those activations.

Say a neuron can fire from -1 to 1. Simplistically, imagine a neuron that activated strongly on cats. If it's -1, absolutely no cats. If it's 1, there's absolutely a cat. If it worked like this, our old colleague would have been right. But instead, neurons can be treated as part of a larger unit within a layer. When one is low and the other is high, it can mean one thing. The inverse can mean something totally different, and both of them being the same can mean something else. These aren't on/off logic gates, either. Any pair of values from -1 to 1 can mean something different, and they can organize into groups much larger than pairs. The neurons encode features in superposition. No neuron on its own means any one thing, but together they mean everything. They are polysemantic.
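To make that concrete, here's a toy sketch of superposition in plain Python. It is my own illustration, not code from the paper: five made-up features share just two "neurons" by each getting its own direction in the plane. The names feature_directions, encode and decode are hypothetical, and the scheme only reads back cleanly when features are sparse (one active at a time here).

```python
import math

def feature_directions(n_features):
    """Spread n_features unit vectors evenly around a circle (two neurons)."""
    return [
        (math.cos(2 * math.pi * k / n_features), math.sin(2 * math.pi * k / n_features))
        for k in range(n_features)
    ]

def encode(feature_values, directions):
    """Write feature strengths into the two neurons as a weighted sum of directions."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)

def decode(neurons, directions):
    """Read each feature back by projecting the neuron pair onto its direction."""
    return [neurons[0] * d[0] + neurons[1] * d[1] for d in directions]

directions = feature_directions(5)               # five features, only two neurons
neurons = encode([0, 0, 1, 0, 0], directions)    # only feature 2 is active
readout = decode(neurons, directions)
strongest = max(range(5), key=lambda k: readout[k])
```

Neither neuron value on its own says "feature 2", but the pair does: strongest comes back as 2. The other features pick up some interference in the readout, which is the price of packing five features into two neurons.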

This occurs naturally as networks are trained (the paper authors believe this is due to the loss functions used), and it's why neural nets are so hard to interpret. You can't look at any single neuron and understand what it does.

The solution: autoencoders

Autoencoders are neural nets that try to recreate their input. We use them in ThreatWarrior at the heart of our AI engine. If a model trained to recreate historical input can't recreate new input, it means the new input is qualitatively different from what it has seen before--an anomaly.

This paper puts autoencoders to novel use, decomposing the superpositions into estimations of what single feature neurons would be, if they existed.

They take the activations of the neurons in a small language model's MLP layer and treat them as the input to an autoencoder. Then they have the autoencoder reproduce that input by way of a much wider space: 512 input neuron activations are spread across up to 131,072 feature units. This lets them disentangle the different features superimposed on those input neurons. Each of the feature units learns a different feature.
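Here's a minimal sketch of what such a wide autoencoder looks like, as I understand the setup: a ReLU encoder into a wide feature layer, a linear decoder back to the neurons, and a loss that combines reconstruction error with an L1 penalty that pushes most features to zero. This is forward pass only, with random untrained weights and tiny sizes (8 neurons, 64 features) standing in for the real 512 and up to 131,072. The class name, l1_coeff value and initialization are my assumptions, not the authors' code.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * a for w, a in zip(row, vector)) for row in matrix]

class SparseAutoencoder:
    def __init__(self, n_neurons, n_features, seed=0):
        rng = random.Random(seed)
        # Encoder: one row per feature. Decoder: one row per neuron.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(n_neurons)] for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(n_neurons)]
        self.b_dec = [0.0] * n_neurons

    def encode(self, x):
        """Neuron activations -> non-negative feature activations (ReLU)."""
        centered = [a - b for a, b in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        """Feature activations -> reconstructed neuron activations."""
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Mean squared reconstruction error plus an L1 sparsity penalty."""
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(features)  # features are >= 0, so sum is the L1 norm

sae = SparseAutoencoder(n_neurons=8, n_features=64)
activations = [0.5, -0.2, 0.0, 1.0, 0.3, -0.7, 0.1, 0.9]
features = sae.encode(activations)
reconstruction = sae.decode(features)
```

Training would adjust the weights to shrink loss over many activation samples. The sparsity term is what encourages each wide-layer unit to stand for one thing.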

And since they know which input neurons are most associated with each learned feature, they can trace backward: for a given feature in the autoencoder, which input neurons did it draw from? In turn, that tells them which set of neurons to activate, and at which levels, to trigger that feature in the original model. From there they can see which input content those original neurons respond to, and what kind of output is produced when they are activated to different levels. Through a mix of human curation and generative model augmentation, they were able to inventory all the features and create high level summaries of what they represent.
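The backtracing step can be sketched roughly like this, assuming you have the autoencoder's decoder weights as a list of rows (one row per neuron, one column per feature). The function name top_neurons and the tiny weight table are made up for illustration.

```python
def top_neurons(w_dec, feature_index, k=3):
    """Return the k neurons with the largest absolute decoder weight for one feature.

    w_dec[n][f] is the (hypothetical) weight from feature f back to neuron n.
    """
    ranked = [(abs(row[feature_index]), n) for n, row in enumerate(w_dec)]
    ranked.sort(reverse=True)
    return [n for _, n in ranked[:k]]

# Four neurons, two features: illustrative numbers only.
w_dec = [
    [0.9, 0.0],
    [0.1, -0.8],
    [-0.7, 0.2],
    [0.0, 0.6],
]
neurons_for_feature_0 = top_neurons(w_dec, 0, k=2)
neurons_for_feature_1 = top_neurons(w_dec, 1, k=2)
```

In this toy table, feature 0 leans mostly on neurons 0 and 2, and feature 1 on neurons 1 and 3. Notice that neuron 2 contributes to both, which is the polysemantic part.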

Results

They found wondrous things beyond just confirming superposition.

Universality

They trained two models starting with different random weights, and when they ran them through their autoencoders, they found the same features were being encoded in both models.
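One simple way to check whether two models learned "the same" feature is to run both on the same text and correlate the feature activations token by token. The sketch below uses Pearson correlation for that. The helper names and the activation numbers are invented for illustration, and this is a rough approximation of the idea rather than the paper's exact procedure.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length activation lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def best_match(feature_acts, other_model_acts):
    """Find which feature in the other model tracks this one most closely."""
    scores = [pearson(feature_acts, other) for other in other_model_acts]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]

# Activations of one feature from model A over five tokens (made-up values).
model_a_feature = [0.0, 2.0, 0.0, 1.0, 0.0]
# Activations of three features from model B over the same five tokens.
model_b_features = [
    [1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 4.1, 0.1, 2.0, 0.0],
    [0.5, 0.5, 0.5, 0.5, 0.5],
]
match_index, match_score = best_match(model_a_feature, model_b_features)
```

Here the second feature in model B fires on the same tokens as the model A feature, just at a different scale, so it wins with a correlation close to 1.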

The features they see are eerily similar to features they saw in simpler models they explored previously, as well as features other researchers are seeing in other projects.

This seems to imply that the way information is organized in large language models reflects something fundamental about the way humans conceive and express ideas. It's almost like all of these models are lossy approximations of a high dimensional manifold that represents something akin to a Platonic realm of ideas. Attractors in that manifold, like little gravity wells, accrete words around concepts.

Fractality

Similarly, the features they found seem to be fractal. The closer they look, the more there are, repeating the same form in ever increasing detail. When they only have 1 output feature per input neuron, there's a neuron that fires for the word "the" but only when it's in the context of math. When they split to 8 output features per input neuron, that same neuron is tied to features for "a" in the context of math, another for "the" in machine learning, and another for "the" in math. When they split to 32 features per neuron, that neuron has one feature for "a" as a variable in math, another for "a" in prose surrounding equations, etc. All superimposed on the same neuron. The closer they look, the more features are in superposition.

This can be seen visually, too. When they plot 2d cluster charts of detected features, the lower resolution ones are larger, but in the same location.

The level of nuance is astounding.

One feature they found, at low resolution, is like an encyclopedia index for the letter "P" from "Plastic" through "Peking". As it expands out with higher resolution autoencoders, they disentangle separate features: one for electronics like 'PCBs', another for Pokemon, and another for Planned Parenthood.

In another case, they found a neuron for Base64 text, and after decomposing it they found one feature that activates on letters in Base64 text and another that activates on numbers in Base64. That seems basic enough, it's just latching onto the form. But it also understands content: another related feature activates on Base64 that encodes ASCII text!
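To see why that last one is impressive, here's a small sketch using Python's standard base64 module. Base64 that wraps readable text and Base64 that wraps arbitrary bytes look like the same character soup on the surface; you only find out which is which by decoding. The helper decodes_to_ascii_text is my own name for the check, not something from the paper.

```python
import base64

def decodes_to_ascii_text(b64_string):
    """True if the string is valid Base64 whose decoded bytes are all printable ASCII."""
    try:
        raw = base64.b64decode(b64_string, validate=True)
    except ValueError:  # covers malformed Base64 (binascii.Error is a ValueError)
        return False
    return len(raw) > 0 and all(32 <= byte < 127 for byte in raw)

text_b64 = base64.b64encode(b"hello world").decode("ascii")
binary_b64 = base64.b64encode(bytes([0, 255, 128, 7])).decode("ascii")

text_is_ascii = decodes_to_ascii_text(text_b64)
binary_is_ascii = decodes_to_ascii_text(binary_b64)
```

The first string decodes to readable text and the second doesn't, even though both are built from the same Base64 alphabet. A feature that tells them apart is tracking something about what's inside the encoding.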

Automata

It gets crazier. The different features they found interact in ways that create finite state automata--little programs.

Some of them are super simple loops. A feature that activates on DNA sequences, and encourages generating DNA sequences, will self-activate. As will one that activates on and generates SQL statements.

But there are also two-step programs, like generating SNAKE_CASE variable names. One feature activates on upper case letters and encourages generating underscores, while another feature activates on underscores and encourages generating upper case letters.

Then...then they show a real example. How about one feature that activates on open angle brackets and encourages output like 'div' and 'span', another that activates on those terms and outputs closing angle brackets, another that activates on closing brackets and outputs new lines, and another that activates on new lines, generating open angle brackets, starting the loop over?

The same model has a different set of features that also activate on open-angle brackets, but generate screen names and words like 'lol' and 'hmm' and 'linux' and 'sudo'--not only does the transformer know HTML, it also understands IRC.
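The HTML loop described above can be written down as a tiny state machine. This is a deliberately simplified sketch: real features nudge probabilities rather than pick one fixed next token, and I've hard-coded 'div' where the model could also choose 'span'. HTML_LOOP and run_automaton are hypothetical names.

```python
# Each entry reads: "a feature that activates on KEY encourages VALUE next".
HTML_LOOP = {
    "<": "div",    # open angle bracket -> tag name
    "div": ">",    # tag name -> closing angle bracket
    ">": "\n",     # closing bracket -> new line
    "\n": "<",     # new line -> open angle bracket, and around again
}

def run_automaton(transitions, start, steps):
    """Follow the transition table for a fixed number of steps, collecting tokens."""
    tokens = [start]
    for _ in range(steps):
        tokens.append(transitions[tokens[-1]])
    return tokens

generated = "".join(run_automaton(HTML_LOOP, "<", 8))
```

Eight steps from an open angle bracket walk the four-feature cycle twice and land back on "<", ready to go around again.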

To sum it up

This paper only looked at a very small model. The simplest transformer they could build. However, they are confident the same technique scales to models like GPT-4, Claude, and OpenLlama.

They find that language models, even ones that can't be called large, are far more fantastic than they have any right to be. They are doing more than people thought they were, and what they are doing is far more interpretable and modifiable than people realized.

Once you understand the features in a model, you can activate it to produce very specific forms of output. Instead of telling a model to produce JSON, you could force it to by pinning that feature's neuronal activations.

You could identify all the bits in a model that produce undesirable output, and excise them or dampen their activation.
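As a rough sketch of what pinning or dampening could look like, assume you already have a feature vector from an autoencoder and a linear decoder that maps features back to neuron activations. You edit the one feature you care about, decode, and hand the edited neuron activations back to the model. Everything here (function names, weights, values) is hypothetical, and it glosses over how the edited activations would be patched into a running model.

```python
def decode(features, w_dec):
    """Map feature activations back to neuron activations (one row of w_dec per neuron)."""
    return [sum(w * f for w, f in zip(row, features)) for row in w_dec]

def pin_feature(features, index, value):
    """Return a copy of the features with one feature forced to a fixed value."""
    edited = list(features)
    edited[index] = value
    return edited

def dampen_feature(features, index, scale=0.0):
    """Return a copy of the features with one feature scaled down (0.0 removes it)."""
    edited = list(features)
    edited[index] *= scale
    return edited

# Two neurons, three features: illustrative numbers only.
w_dec = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
]
features = [0.2, 0.0, 1.0]

original = decode(features, w_dec)
pinned = decode(pin_feature(features, 1, 3.0), w_dec)    # force feature 1 on
dampened = decode(dampen_feature(features, 2), w_dec)    # switch feature 2 off
```

Pinning feature 1 pushes the second neuron up while leaving the first alone, and removing feature 2 changes both neurons at once, since that feature was written across both.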

You can build special-purpose models that are stripped down to just the bits needed for a task, dramatically reducing the footprint and usage requirements.

The fact that models build finite state automata is very intriguing. If this process can be manipulated or guided, it could lead to some very sophisticated functionality and customization. It implies that one day people may be able to write code that gets run in-line across features in a neural net, directly and precisely controlling activations at runtime on a pre-trained network.

The universality results, though, are what strike me as the most important. We may be closing in on something like Alfred North Whitehead's ontology, where entities are defined by how they value and react to all the things they're related to through overlapping experiences, and collect in nexus points to abstractions that emphasize certain shared aspects.