why interpret model internals?

  • because we’re curious,
  • and for “prosaic” reasons (prosaic interpretability)
    • mitigating hallucinations, jailbreaks, etc.
    • enhancing reliability
    • possible architectural improvements?
  • and because the explanations models give for their own “thinking” can’t be taken at face value (reasoning traces can be unfaithful)

the typical picture of transformers de-emphasizes the residual connections, but they’re actually very important! the residual stream is the bus (communication channel) of the model.

  • the weight matrices either read from or write to the residual stream (see the first sketch after this list)
  • linear representation hypothesis: features correspond to directions in activation space (hmm)
  • and then RepE (representation engineering): controlling models by intervening on these representations, or by editing the parameters that write them (second sketch below)
    • e.g. stopping a certain write matrix from writing to a particular subspace (feature)
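
a minimal sketch of the read/write picture (PyTorch, pre-norm convention; the dimensions and module choices are illustrative, not any particular model):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: each sublayer reads the residual stream and writes back by addition."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # attention reads the stream through W_Q/W_K/W_V and writes back through W_O
        x = self.ln1(resid)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        resid = resid + attn_out
        # the MLP reads through its input projection and writes back through its output projection
        resid = resid + self.mlp(self.ln2(resid))
        # whatever was added here is all that later layers (and the unembedding) get to read
        return resid
```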

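and a sketch of the kind of intervention meant in the last bullet: projecting a feature direction out of a write matrix so that sublayer can no longer write into that subspace. the direction below is random and purely illustrative; in practice it would come from something like a probe or a difference of means over activations.

```python
import torch

def ablate_direction(w_out: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Given a write matrix w_out (d_model x d_hidden) and a feature direction d (d_model,),
    return (I - d d^T) w_out, which can no longer write any component along d."""
    d = d / d.norm()
    return w_out - torch.outer(d, d) @ w_out

d_model, d_hidden = 512, 2048
w_out = torch.randn(d_model, d_hidden)   # stand-in for e.g. an MLP output projection
feature = torch.randn(d_model)           # stand-in for a feature direction

w_ablated = ablate_direction(w_out, feature)

# the edited matrix writes nothing along `feature` (up to float error)
print((feature @ w_ablated).abs().max())  # ~0
print((feature @ w_out).abs().max())      # large by comparison
```
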
now talking about “Robust LLM Safeguarding via Refusal Feature Adversarial Training”:

and Cancedda (2024), “Spectral Filters, Dark Signals, and Attention Sinks”