On Anthropic’s Breakthrough Paper on the Interpretability of LLMs (May 2024)

A recent paper by Anthropic presents significant progress in AI safety, the field concerned with ensuring that AI systems operate reliably, ethically, and beneficially. One critical issue in AI is interpretability: these systems often function as “black boxes” whose inference (decision-making) process is opaque. This is especially true of Large Language Models (LLMs), where it is unclear why a specific input produces a particular output.

Without interpretability, it is difficult to determine whether an AI system is being “deceptive” or “hiding real intentions.” Addressing interpretability before developing larger and more capable AI systems is essential, in my opinion, to ensure long-term alignment and AI safety.

Anthropic’s research demonstrates that their previous methods for interpreting small models can scale to their medium-sized LLM, Claude 3 Sonnet. They discovered that features like concepts, entities, and words are represented by patterns of neurons firing together. By mapping millions of these features, they provided insights into what the model was “thinking” at any given moment. One interesting example is the feature of sycophancy, where the model tends not only to agree with the user’s statements, but to praise the user shamelessly.
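To make the idea of “features as patterns of neurons” more concrete, here is a minimal sketch of a sparse autoencoder, the kind of dictionary-learning model the paper uses to extract features from the model’s internal activations. It is only an illustration under stated assumptions: the sizes are toy values, the weights are random and untrained, the loss is a simplified version (reconstruction error plus an L1 sparsity penalty), and all names (init_sae, encode, decode, sae_loss) are hypothetical rather than taken from Anthropic’s code.

```python
import numpy as np

def init_sae(n_neurons, n_features, seed=0):
    """Create random (untrained) weights for a toy sparse autoencoder."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(n_neurons, n_features)),
        "b_enc": np.zeros(n_features),
        # Each row of W_dec is one feature's direction: a pattern over neurons.
        "W_dec": rng.normal(0.0, 0.1, size=(n_features, n_neurons)),
        "b_dec": np.zeros(n_neurons),
    }

def encode(params, x):
    """Map neuron activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])

def decode(params, f):
    """Rebuild neuron activations as a weighted sum of feature directions."""
    return f @ params["W_dec"] + params["b_dec"]

def sae_loss(params, x, l1_coeff=1e-3):
    """Simplified objective: reconstruction error plus an L1 sparsity penalty."""
    f = encode(params, x)
    x_hat = decode(params, f)
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return reconstruction + l1_coeff * sparsity

# Toy usage: 8 made-up activation vectors over 16 "neurons", 64 candidate features.
params = init_sae(n_neurons=16, n_features=64)
x = np.random.default_rng(1).normal(size=(8, 16))
features = encode(params, x)
loss = sae_loss(params, x)
```

After training on real activations, which this sketch does not do, each feature would ideally respond to one recognizable concept, and the matching row of W_dec would describe the pattern of neurons that represents it. The sparsity penalty is what pushes the model to explain each input with only a few active features.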

This represents great progress in AI safety, AI alignment, and AI interpretability, and it might also inspire progress in neurology and psychology. The human brain is still largely not understood and has proved very difficult to study, for several reasons: there are obvious ethical constraints; biology is messy, and we do not yet have the tools to understand all the underlying processes; and the brain consists of a huge number of cells that act together as a complex system in which emergence is at play, while we still lack good methods for studying complex systems and fully understanding emergence.

Artificial Neural Networks (ANNs) are simplified mathematical models inspired by the structure and function of biological neural networks. Their development was originally motivated by attempts to explain and understand how biological neural networks learn. Findings from the biology and physiology of the nervous system have continued to inform the development and improvement of ANNs.

In my opinion, this also works in the reverse direction: the study of ANNs can provide valuable insights and generate new hypotheses about complex biological neural networks. These hypotheses can then be investigated, and eventually tested, in biological neuroscience, the cognitive sciences, and the study of neurological disorders, and they might bring us closer to a scientific “theory of the brain”.

For example, the notion of “concepts” in the paper could have an equivalent in human biological neural networks. Concepts might also be stored in our brains in a distributed manner, as they are in Claude 3 Sonnet, in patterns of neurons that fire at the same time and that differ from one person to another. In fact, a result that points in this direction was reported in the paper titled “Semantic reconstruction of continuous language from non-invasive brain recordings”.

Furthermore, neurosis could turn out to be the result of a group of neurons, a “pattern”, that fires persistently for some biological reason rather than in response to sensory or neural input. The cure could then lie in disrupting the pattern or turning off those neurons, in part or in whole.

Finally, in a nod to George Orwell, the results suggest there might be a new way to detect dishonesty: by identifying an individual’s unique “pattern” of dishonesty and testing if that pattern is activated at any given time.

In conclusion, this research represents a significant step toward safer AI systems and provides a foundation for future interdisciplinary studies that could deepen our understanding of both artificial and biological intelligence.


1- Mapping the Mind of a Large Language Model, News Post, Anthropic website (Last accessed June 27, 2024).
2- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Paper, Transformer Circuits Thread, May 21, 2024.
3- Semantic reconstruction of continuous language from non-invasive brain recordings, Nat Neurosci. 2023 May, PMID: 37127759.