arXiv 2026

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song

Yonsei University

Abstract

Mechanistic interpretability seeks to explain model behavior by identifying internal structures that are causally responsible for its outputs. Dictionary-based explainers, including sparse autoencoders and transcoders, have become central tools for this purpose, but their faithfulness under out-of-distribution (OOD) shifts remains underexplored. This work shows that distribution shift can rotate the subspace actively used by a model, causing dictionaries trained on in-distribution (ID) activations to become geometrically misaligned with the OOD-active subspace. We formalize this mismatch as a faithfulness gap and introduce the Geometry-Adaptive Explainer (GAE), which realigns the dictionary to the OOD-active subspace while preserving the original feature structure. GAE uses only unlabeled OOD activations and requires no gradient updates, and its improvement in OOD faithfulness is supported both theoretically and empirically across multiple models and shift settings.
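To make the geometric picture concrete, the following is a minimal sketch, not the GAE algorithm itself, of how a rotation of the active subspace could be measured and undone without gradients. It assumes the active subspace is estimated by the top-k principal directions of the activations, measures misalignment with principal angles, and realigns decoder rows with an orthogonal Procrustes fit. The function names (`top_subspace`, `align_bases`, `realign_decoder`), the toy data, and the choice of PCA and Procrustes are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def top_subspace(acts: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k principal directions of acts (n, d)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def align_bases(U: np.ndarray, V: np.ndarray):
    """Principal angles between span(U) and span(V), plus the basis of span(V) closest to U."""
    A, cosines, Bt = np.linalg.svd(U.T @ V)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    R = A @ Bt               # orthogonal Procrustes solution: argmin_R ||U R - V||_F
    return angles, V @ R.T   # V R^T still spans span(V) and is as close to U as possible


def realign_decoder(W: np.ndarray, U: np.ndarray, V_aligned: np.ndarray) -> np.ndarray:
    """Move the part of each decoder row (m, d) that lies in span(U) into span(V)."""
    coords = W @ U                 # coordinates in the ID-active subspace
    residual = W - coords @ U.T    # component outside that subspace, left untouched
    return coords @ V_aligned.T + residual


# Toy data (hypothetical): ID activity lives in span(e0, e1); under shift the
# e0 direction is rotated by 45 degrees toward e2, while e1 is shared.
rng = np.random.default_rng(0)
d, k, n = 6, 2, 500
theta = np.pi / 4
eye = np.eye(d)
id_basis = eye[:, :k].copy()
ood_basis = id_basis.copy()
ood_basis[:, 0] = np.cos(theta) * eye[:, 0] + np.sin(theta) * eye[:, 2]

latent = rng.normal(size=(n, k)) * np.array([3.0, 2.0])
id_acts = latent @ id_basis.T + 0.01 * rng.normal(size=(n, d))
ood_acts = latent @ ood_basis.T + 0.01 * rng.normal(size=(n, d))

U = top_subspace(id_acts, k)
V = top_subspace(ood_acts, k)
angles, V_aligned = align_bases(U, V)

# Toy decoder whose rows lie in the ID-active subspace.
W = rng.normal(size=(4, k)) @ U.T
W_adapted = realign_decoder(W, U, V_aligned)
```

In this toy setting the nonzero principal angle quantifies the misalignment, and because `V_aligned` has orthonormal columns, the pairwise inner products between decoder rows are unchanged by the realignment. That is one simple reading of "preserving the original feature structure", and the paper's formal definition may differ.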

Method

Experimental Results

GAE restores faithfulness under temporal, domain, and adversarial distribution shifts without any gradient updates.

Interactive Qualitative Analysis

Explore how GAE changes each feature's direct logit attribution across semantically related candidate tokens while keeping the encoder activations fixed.
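As a rough illustration of the comparison described above, the sketch below computes per-feature direct logit attribution (DLA) for a few candidate tokens with the same encoder activations and two different decoders. It uses the common simplified definition, in which the DLA of feature i on token t is the feature's activation times the dot product of its decoder direction with the token's unembedding column, and it ignores any final normalization layer and biases. All matrices, the feature count, and the variable names are hypothetical toy values, not outputs of GAE.

```python
import numpy as np


def direct_logit_attribution(acts, decoder, unembed):
    """Return an (m, T) array with entry [i, t] = acts[i] * (decoder[i] @ unembed[:, t])."""
    return acts[:, None] * (decoder @ unembed)


# Encoder activations for two features; held fixed across both explainers.
acts = np.array([2.0, 0.5])

# Decoder rows (features x hidden dim) before and after a hypothetical adaptation
# that rotates feature 0 while leaving feature 1 unchanged.
decoder_fixed = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
decoder_adapted = np.array([[0.0, 0.0, 1.0],
                            [0.0, 1.0, 0.0]])

# Unembedding columns (hidden dim x candidate tokens) for two candidate tokens.
unembed = np.array([[1.0, -1.0],
                    [2.0, 0.0],
                    [0.0, 3.0]])

dla_fixed = direct_logit_attribution(acts, decoder_fixed, unembed)
dla_adapted = direct_logit_attribution(acts, decoder_adapted, unembed)
delta = dla_adapted - dla_fixed

# Rows are candidate tokens; each feature contributes a (before, after) pair.
for t in range(unembed.shape[1]):
    pairs = [(dla_fixed[i, t], dla_adapted[i, t]) for i in range(len(acts))]
    print(f"token {t}: {pairs}")
```

Summing a DLA table over features recovers the logit contribution of the full dictionary reconstruction, `(acts @ decoder) @ unembed`, so a change in one feature's decoder direction shows up only in that feature's cells.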

Per-token Fixed vs GAE comparison

Rows are candidate tokens; paired cells show each feature's DLA before and after adaptation.

Citation

@article{lim2026geometry,
  title={Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift},
  author={Lim, Sungjun and Kim, Heedong and Lee, Andrew and Song, Kyungwoo},
  journal={arXiv preprint arXiv:2605.21849},
  year={2026},
  url={https://arxiv.org/abs/2605.21849}
}