why interpret model internals?
- because we’re curious,
- and for “prosaic” reasons (prosaic interpretability)
- mitigating hallucinations, jailbreaks etc
- enhancing reliability
- possible architectural improvements?
- and the explanations models give for their own “thinking” can't be taken at face value (consider reasoning traces being unfaithful to the actual computation)
the typical picture of transformers de-emphasizes the residual connections, but they're actually very important! the residual stream is the bus (communication channel) of the model.
- the weight matrices either read from or write to the residual stream (see the residual-stream sketch after this list)
- linear representation hypothesis (hmm)
- and then RepE (representation engineering): controlling models by intervening on their internal representations, or on the weights that write them
- e.g. stop a certain write matrix from writing into a particular subspace (feature); sketch after this list
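
a minimal sketch of the residual-stream-as-bus view (a toy block, not any particular model's code): each sublayer reads the current stream, computes an update, and writes it back by addition.

```python
import torch
import torch.nn as nn

class ResidualStreamBlock(nn.Module):
    """Toy transformer block in the 'residual stream as bus' view:
    every sublayer reads the stream, computes an update, and adds it back."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # "read" matrix
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # "write" matrix
        )

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # attention reads from the stream ...
        x = self.ln1(resid)
        attn_out, _ = self.attn(x, x, x)
        # ... and writes its output back by addition
        resid = resid + attn_out
        # the MLP does the same: read, compute, write
        resid = resid + self.mlp(self.ln2(resid))
        return resid
```

components only ever talk to each other through additions to this shared stream, which is why interventions on the stream (or on the matrices that write to it) are such a natural control knob.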
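and a sketch of the kind of edit meant in the last bullet, in the spirit of the weight-orthogonalization trick from the refusal-direction line of work (my illustration, not a specific paper's code); `feature_dir` is assumed to be given, e.g. a difference-of-means direction.

```python
import torch

def orthogonalize_write_matrix(W_out: torch.Tensor, feature_dir: torch.Tensor) -> torch.Tensor:
    """Remove a write matrix's ability to write into the subspace spanned by
    feature_dir (shape (d_model,)). W_out has shape (d_model, d_in): its output
    is what gets added to the residual stream."""
    v = feature_dir / feature_dir.norm()       # unit feature direction
    # W' = (I - v v^T) W : subtract the component along v from every column
    return W_out - torch.outer(v, v) @ W_out
```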
now talking about “Robust LLM Safeguarding via Refusal Feature Adversarial Training” (ReFAT):
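roughly (my reconstruction of the core mechanism, not the paper's code): a refusal feature is extracted as a difference-of-means direction between activations on harmful vs. harmless prompts, and ablating that feature from the residual stream stands in for a refusal-bypassing attack during adversarial training.

```python
import torch

def refusal_feature(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means 'refusal feature'; inputs are (n_prompts, d_model)
    residual-stream activations at some chosen layer/position."""
    return harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)

def ablate_feature(resid: torch.Tensor, feature: torch.Tensor) -> torch.Tensor:
    """Project the feature direction out of residual-stream activations
    (..., d_model); applied during training to simulate an attack."""
    v = feature / feature.norm()
    return resid - (resid @ v).unsqueeze(-1) * v
```

the model is then fine-tuned to keep refusing harmful requests even with this ablation applied, which is much cheaper than generating adversarial suffixes on the fly.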
and Cancedda (2024), “Spectral Filters, Dark Signals, and Attention Sinks”
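the rough idea, as a hedged sketch (shapes and band indices are my assumptions, not the paper's code): filter residual-stream states through bands of the SVD of the unembedding matrix; the band with the smallest singular values is a “dark” subspace that barely influences the logits, and attention-sink-style signals tend to live there.

```python
import torch

def spectral_filter(h: torch.Tensor, W_U: torch.Tensor, lo: int, hi: int) -> torch.Tensor:
    """Project hidden states h (..., d_model) onto the subspace spanned by
    left singular vectors lo..hi of the unembedding matrix W_U (d_model, vocab).
    Taking the last (smallest-singular-value) band gives the 'dark' subspace."""
    U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)  # U: (d_model, r)
    band = U[:, lo:hi]                                      # (d_model, hi - lo)
    return h @ band @ band.T                                # projected states
```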