On Anthropic's breakthrough paper on the interpretability of LLMs (May 2024)
Anthropic showed that a method they had previously used to interpret small models, dictionary learning with sparse autoencoders, could scale to their medium-sized LLM, Claude 3 Sonnet. They demonstrated that features (concepts, entities, words, etc.) are represented inside the LLM by patterns of neuron activations, i.e. directions in activation space rather than individual neurons. By mapping millions of features in Claude 3 Sonnet, they were able to see what the model is "thinking" about during inference.
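To make the idea concrete, here is a minimal sketch of a sparse autoencoder of the kind this line of work trains on a model's internal activations. It is an illustration only, not the paper's implementation: the dimensions, the `l1_coeff` penalty weight, and the random stand-in activations are all hypothetical, and the real work trains on Claude 3 Sonnet's activations at vastly larger scale.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: maps d_model-dimensional activations
    into a larger feature space, keeping most features at zero."""

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps features non-negative; the L1 penalty in the loss
        # pushes most of them to exactly zero for any given input.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage: random tensors stand in for real model activations.
sae = SparseAutoencoder()
x = torch.randn(64, 512)
reconstruction, features = sae(x)
loss = loss_fn(x, reconstruction, features)
loss.backward()
```

After training, each column of the decoder acts as a dictionary entry: an activation pattern that, in the scaled-up version, can correspond to a human-interpretable feature such as a concept or entity.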