why interpret model internals?
- because we’re curious,
- and for “prosaic” reasons (prosaic interpretability)
- mitigating hallucinations, jailbreaks etc
- enhancing reliability
- possible architectural improvements?
- and the explanations models give for their own “thinking” can't be taken at face value (consider reasoning traces being unfaithful to the actual computation)
the typical picture of transformers de-emphasizes the residual connections, but they're actually very important! the residual stream is the bus (communication channel) of the model.
- the weight matrices either read from or write to the residual stream (see the residual-stream sketch after this list)
- linear representation hypothesis (hmm)
- and then RepE (representation engineering): controlling models by intervening on their internal representations, or on the weights that write them
- e.g. stop a certain write matrix from writing into a particular subspace (feature); sketch after this list
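
a minimal sketch of the residual-stream-as-bus view (a toy block, not any particular model's code): each sublayer reads the current stream, computes an update, and writes it back by addition.

```python
import torch
import torch.nn as nn

class ResidualStreamBlock(nn.Module):
    """Toy transformer block in the 'residual stream as bus' view:
    every sublayer reads the stream, computes an update, and adds it back."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # "read" matrix
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # "write" matrix
        )

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # attention reads from the stream ...
        x = self.ln1(resid)
        attn_out, _ = self.attn(x, x, x)
        # ... and writes its output back by addition
        resid = resid + attn_out
        # the MLP does the same: read, compute, write
        resid = resid + self.mlp(self.ln2(resid))
        return resid
```

components only ever talk to each other through additions to this shared stream, which is why interventions on the stream (or on the matrices that write to it) are such a natural control knob.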
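and a sketch of the kind of edit meant in the last bullet, in the spirit of the weight-orthogonalization trick from the refusal-direction line of work (my illustration, not a specific paper's code); `feature_dir` is assumed to be given, e.g. a difference-of-means direction.

```python
import torch

def orthogonalize_write_matrix(W_out: torch.Tensor, feature_dir: torch.Tensor) -> torch.Tensor:
    """Remove a write matrix's ability to write into the subspace spanned by
    feature_dir (shape (d_model,)). W_out has shape (d_model, d_in): its output
    is what gets added to the residual stream."""
    v = feature_dir / feature_dir.norm()       # unit feature direction
    # W' = (I - v v^T) W : subtract the component along v from every column
    return W_out - torch.outer(v, v) @ W_out
```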
now talking about “Robust LLM Safeguarding via Refusal Feature Adversarial Training” (ReFAT):
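roughly (my reconstruction of the core mechanism, not the paper's code): a refusal feature is extracted as a difference-of-means direction between activations on harmful vs. harmless prompts, and ablating that feature from the residual stream stands in for a refusal-bypassing attack during adversarial training.

```python
import torch

def refusal_feature(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means 'refusal feature'; inputs are (n_prompts, d_model)
    residual-stream activations at some chosen layer/position."""
    return harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)

def ablate_feature(resid: torch.Tensor, feature: torch.Tensor) -> torch.Tensor:
    """Project the feature direction out of residual-stream activations
    (..., d_model); applied during training to simulate an attack."""
    v = feature / feature.norm()
    return resid - (resid @ v).unsqueeze(-1) * v
```

the model is then fine-tuned to keep refusing harmful requests even with this ablation applied, which is much cheaper than generating adversarial suffixes on the fly.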
and Cancedda (2024), “Spectral Filters, Dark Signals, and Attention Sinks”
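the rough idea, as a hedged sketch (shapes and band indices are my assumptions, not the paper's code): filter residual-stream states through bands of the SVD of the unembedding matrix; the band with the smallest singular values is a “dark” subspace that barely influences the logits, and attention-sink-style signals tend to live there.

```python
import torch

def spectral_filter(h: torch.Tensor, W_U: torch.Tensor, lo: int, hi: int) -> torch.Tensor:
    """Project hidden states h (..., d_model) onto the subspace spanned by
    left singular vectors lo..hi of the unembedding matrix W_U (d_model, vocab).
    Taking the last (smallest-singular-value) band gives the 'dark' subspace."""
    U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)  # U: (d_model, r)
    band = U[:, lo:hi]                                      # (d_model, hi - lo)
    return h @ band @ band.T                                # projected states
```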