Yesterday, the AI startup Anthropic published a paper detailing the successful interpretation of the inner workings of a large language model (LLM). LLMs are notoriously opaque: their size, complexity, and numeric representation of human language have hitherto defied explanation, making it nearly impossible to understand why a given input leads to a given output.
Anthropic used a technique called dictionary learning, leveraging a sparse autoencoder to isolate specific concepts within its Claude 3 Sonnet model. The technique allowed the researchers to extract millions of features, including specific entities such as the Golden Gate Bridge as well as more abstract ideas such as gender bias. They were then able to map the proximity of related concepts such as “inner conflict” and “Catch-22.” Most importantly, they were able to amplify and suppress individual features to change the model’s behavior.
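To make the dictionary-learning idea concrete, here is a toy NumPy sketch: a sparse autoencoder learns an overcomplete “dictionary” of feature directions that reconstructs a model’s internal activations, while an L1 penalty keeps only a few features active at a time. Every dimension, the random stand-in “activations,” and the hyperparameters below are invented for illustration; Anthropic’s actual work operates on Claude’s real activations at vastly larger scale.

```python
import numpy as np

# Toy dictionary learning with a sparse autoencoder.
# All sizes and data here are invented assumptions, not values from the paper.
rng = np.random.default_rng(0)
d_model, d_dict, n = 8, 32, 512          # activation dim, overcomplete dictionary size, samples
X = rng.normal(size=(n, d_model))        # stand-in for a model's internal activations

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
l1, lr = 1e-3, 1e-2                      # sparsity penalty, learning rate

def reconstruction_loss():
    f = np.maximum(X @ W_enc + b_enc, 0.0)
    return float(((f @ W_dec - X) ** 2).mean())

loss_before = reconstruction_loss()
for _ in range(200):
    f = np.maximum(X @ W_enc + b_enc, 0.0)   # sparse feature activations (ReLU)
    err = f @ W_dec - X                      # reconstruction error
    # Gradient steps on squared reconstruction error plus an L1 sparsity penalty on f
    g_f = (err @ W_dec.T + l1 * np.sign(f)) * (f > 0)
    W_dec -= lr * (f.T @ err) / n
    W_enc -= lr * (X.T @ g_f) / n
    b_enc -= lr * g_f.mean(axis=0)
loss_after = reconstruction_loss()

f = np.maximum(X @ W_enc + b_enc, 0.0)
sparsity = float((f > 0).mean())         # fraction of features active per sample
```

The rows of `W_dec` play the role of the dictionary’s learned feature directions, and the “activate and suppress” interventions described above amount to clamping an entry of `f` up or down before decoding.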
So it’s time to bust out the champagne, because we’ve solved genAI explainability, right?
Not quite. Anthropic identified only a small subset of the features within the model. Interpreting the entire model would be prohibitively costly: “the computation required by our current approach would vastly exceed the compute used to train the model in the first place,” the company admitted. The paper is essentially a proof that explainability is possible, given adequate resources.
It is time to marshal those resources. Investing in generative AI explainability is necessary for the future success of AI because:
- There is no alignment without explainability. Over the last six months, I have been conducting research on AI alignment with Brian Hopkins and Enza Iannopollo. We have found that the limitations of current AI approaches make misalignment inevitable, and AI misalignment could have catastrophic consequences for businesses and society. Full model explainability would enable us to tweak the very DNA of LLMs, bringing them into alignment with business and societal needs.
- Opacity precludes insight. The world is enamored of LLMs’ ability to produce novel text, audio, images, video, and code. But we are currently ignorant of the patterns that the models learned about humanity to produce those outputs. The training of LLMs could be considered the largest sociological study of humanity in the history of the world. Unfortunately, without explainability, we have no way of interpreting the study’s results.
- Transparency is AI’s most powerful trust lever. The opacity of AI has created a significant trust gap that only transparency can fully bridge. Until we can explain exactly how a prompt leads to a response, there will be skepticism among consumers, regulators, and business stakeholders alike. Anthropic’s research is a step in the right direction. It is not the bridge itself, but it does show how the bridge may be constructed with adequate time and resources.
Explainability of predictive AI was a vexing issue a decade ago. Now it is largely solved, thanks to the hard work of diligent researchers responding to the demands of industry. It’s time to make similar demands. The success of AI depends on it.
If you’d like to discuss explainability further, please feel free to schedule an inquiry or guidance session.