
On Anthropic's breakthrough paper on the interpretability of LLMs (May 2024)

27 June 2024 | 1 September 2024 | oussama

Anthropic showed that a method it had previously used to interpret small models, dictionary learning, scales to its medium-sized LLM, Claude 3 Sonnet. The paper demonstrates that features (concepts, entities, words, etc.) are represented inside the LLM by patterns of neurons firing together, rather than by individual neurons. By mapping millions of features in Claude 3 Sonnet, the researchers can inspect which features are active for a given input, which gives a partial view of what the model is “thinking” about during inference.
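To make "patterns of neurons firing together" concrete, here is a toy sketch in Python. It is not Anthropic's code. The four-neuron layer, the two feature names, and the hand-picked directions are all made up for illustration, and in the paper the directions are learned from the model's activations rather than chosen by hand. The sketch only shows the basic idea that a feature can be read out as a weighted combination of many neurons.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so feature scores are comparable."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def feature_activation(neuron_activations, feature_direction):
    """Project neuron activations onto a feature direction, clipped at zero."""
    projection = sum(a * d for a, d in zip(neuron_activations, feature_direction))
    return max(0.0, projection)


# Hypothetical features: each one is a pattern over several neurons,
# not a single neuron. Names and weights are illustrative assumptions.
FEATURES = {
    "bridge": normalize([1.0, 1.0, 0.0, 0.0]),
    "code": normalize([0.0, 0.0, 1.0, 1.0]),
}


def active_features(neuron_activations, threshold=0.5):
    """Return the features whose activation exceeds the threshold."""
    scores = {
        name: feature_activation(neuron_activations, direction)
        for name, direction in FEATURES.items()
    }
    return {name: score for name, score in scores.items() if score > threshold}


# Made-up activations of a 4-neuron layer for one token.
example = [0.9, 1.1, 0.0, 0.1]
print(active_features(example))
```

Here neurons 0 and 1 fire together, so only the "bridge" feature passes the threshold. Reading a real model works on the same principle, but with far more neurons and millions of learned feature directions.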