Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While this may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing **FADE**: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for automatically evaluating feature-to-description alignment. **FADE** evaluates alignment across four key metrics – *Clarity, Responsiveness, Purity, and Faithfulness* – and systematically quantifies the causes of the misalignment between features and their descriptions. We apply **FADE** to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release **FADE** as an open-source package at: [github.com/brunibrun/FADE](https://github.com/brunibrun/FADE).
@inproceedings{puri-etal-2025-fade,
  title={{FADE}: Why Bad Descriptions Happen to Good Features},
  author={Puri, Bruno and Jain, Aakriti and Golimblevskaia, Elena and Kahardipraja, Patrick and Wiegand, Thomas and Samek, Wojciech and Lapuschkin, Sebastian},
  editor={Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  month=jul,
  year={2025},
  address={Vienna, Austria},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2025.findings-acl.881/},
  pages={17138--17160},
  isbn={979-8-89176-256-5},
}
A Close Look at Decomposition-based XAI-Methods for Transformer Language Models
Leila Arras, Bruno Puri, Patrick Kahardipraja, Sebastian Lapuschkin, and Wojciech Samek
@misc{arras2025closelookdecompositionbasedxaimethods,
  title={A Close Look at Decomposition-based XAI-Methods for Transformer Language Models},
  author={Arras, Leila and Puri, Bruno and Kahardipraja, Patrick and Lapuschkin, Sebastian and Samek, Wojciech},
  year={2025},
  eprint={2502.15886},
  archiveprefix={arXiv},
  primaryclass={cs.CL},
}
The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation
Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin
@misc{kahardipraja2025atlasincontextlearningattention,
  title={The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation},
  author={Kahardipraja, Patrick and Achtibat, Reduan and Wiegand, Thomas and Samek, Wojciech and Lapuschkin, Sebastian},
  year={2025},
  eprint={2505.15807},
  archiveprefix={arXiv},
  primaryclass={cs.CL},
}
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs
Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin
@misc{hatefi2025attributionguidedpruningcompressioncircuit,
  title={Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs},
  author={Hatefi, Sayed Mohammad Vakilzadeh and Dreyer, Maximilian and Achtibat, Reduan and Kahardipraja, Patrick and Wiegand, Thomas and Samek, Wojciech and Lapuschkin, Sebastian},
  year={2025},
  eprint={2506.13727},
  archiveprefix={arXiv},
  primaryclass={cs.LG},
}
2024
ACL
When Only Time Will Tell: Interpreting How Transformers Process Local Ambiguities Through the Lens of Restart-Incrementality
Brielen Madureira, Patrick Kahardipraja, and David Schlangen
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024
Incremental models that process sentences one token at a time will sometimes encounter points where more than one interpretation is possible. Causal models are forced to output one interpretation and continue, whereas models that can revise may edit their previous output as the ambiguity is resolved. In this work, we look at how restart-incremental Transformers build and update internal states, in an effort to shed light on what processes cause revisions not viable in autoregressive models. We propose an interpretable way to analyse the incremental states, showing that their sequential structure encodes information on the garden path effect and its resolution. Our method brings insights on various bidirectional encoders for contextualised meaning representation and dependency parsing, helping to show their advantage over causal models when it comes to revisions.
@inproceedings{madureira-etal-2024-time,
  title={When Only Time Will Tell: Interpreting How Transformers Process Local Ambiguities Through the Lens of Restart-Incrementality},
  author={Madureira, Brielen and Kahardipraja, Patrick and Schlangen, David},
  editor={Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month=aug,
  year={2024},
  address={Bangkok, Thailand},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2024.acl-long.260},
  pages={4722--4749},
}
2023
ACL
TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model
Patrick Kahardipraja, Brielen Madureira, and David Schlangen
In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023
Language is by its very nature incremental in how it is produced and processed. This property can be exploited by NLP systems to produce fast responses, which has been shown to be beneficial for real-time interactive applications. Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers. RNNs are fast but monotonic (cannot correct earlier output, which can be necessary in incremental processing). Transformers, on the other hand, consume whole sequences, and hence are by nature non-incremental. A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise. However, this method becomes costly as the sentence grows longer. In this work, we propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy. Experimental results on sequence labelling show that our model has better incremental performance and faster inference speed compared to restart-incremental Transformers, while showing little degradation on full sequences.
@inproceedings{kahardipraja-etal-2023-tapir,
  title={{TAPIR}: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model},
  author={Kahardipraja, Patrick and Madureira, Brielen and Schlangen, David},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2023},
  month=jul,
  year={2023},
  address={Toronto, Canada},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2023.findings-acl.257},
  doi={10.18653/v1/2023.findings-acl.257},
  pages={4173--4197},
}
SIGDIAL
The Road to Quality is Paved with Good Revisions: A Detailed Evaluation Methodology for Revision Policies in Incremental Sequence Labelling
Brielen Madureira, Patrick Kahardipraja, and David Schlangen
In Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue, Sep 2023
Incremental dialogue model components produce a sequence of output prefixes based on incoming input. Mistakes can occur due to local ambiguities or to wrong hypotheses, making the ability to revise past outputs a desirable property that can be governed by a policy. In this work, we formalise and characterise edits and revisions in incremental sequence labelling and propose metrics to evaluate revision policies. We then apply our methodology to profile the incremental behaviour of three Transformer-based encoders in various tasks, paving the road for better revision policies.
@inproceedings{madureira-etal-2023-road,
  title={The Road to Quality is Paved with Good Revisions: A Detailed Evaluation Methodology for Revision Policies in Incremental Sequence Labelling},
  author={Madureira, Brielen and Kahardipraja, Patrick and Schlangen, David},
  booktitle={Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue},
  month=sep,
  year={2023},
  address={Prague, Czechia},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2023.sigdial-1.14},
  pages={156--167},
}
2021
EMNLP
Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU
Patrick Kahardipraja, Brielen Madureira, and David Schlangen
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov 2021
Incremental processing allows interactive systems to respond based on partial inputs, which is a desirable property e.g. in dialogue agents. The currently popular Transformer architecture inherently processes sequences as a whole, abstracting away the notion of time. Recent work attempts to apply Transformers incrementally via restart-incrementality by repeatedly feeding, to an unchanged model, increasingly longer input prefixes to produce partial outputs. However, this approach is computationally costly and does not scale efficiently for long sequences. In parallel, we witness efforts to make Transformers more efficient, e.g. the Linear Transformer (LT) with a recurrence mechanism. In this work, we examine the feasibility of LT for incremental NLU in English. Our results show that the recurrent LT model has better incremental performance and faster inference speed compared to the standard Transformer and LT with restart-incrementality, at the cost of part of the non-incremental (full sequence) quality. We show that the performance drop can be mitigated by training the model to wait for right context before committing to an output and that training with input prefixes is beneficial for delivering correct partial outputs.
@inproceedings{kahardipraja-etal-2021-towards,
  title={Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental {NLU}},
  author={Kahardipraja, Patrick and Madureira, Brielen and Schlangen, David},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  month=nov,
  year={2021},
  address={Online and Punta Cana, Dominican Republic},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2021.emnlp-main.90},
  doi={10.18653/v1/2021.emnlp-main.90},
  pages={1178--1189},
}
2020
CODI
Exploring Span Representations in Neural Coreference Resolution
Patrick Kahardipraja, Olena Vyshnevska, and Sharid Loáiciga
In Proceedings of the First Workshop on Computational Approaches to Discourse, Nov 2020
In coreference resolution, span representations play a key role in predicting coreference links accurately. We present a thorough examination of the span representation derived by applying BERT on coreference resolution (Joshi et al., 2019) using a probing model. Our results show that the span representation is able to encode a significant amount of coreference information. In addition, we find that the head-finding attention mechanism involved in creating the spans is crucial in encoding coreference knowledge. Last, our analysis shows that the span representation cannot capture non-local coreference as efficiently as local coreference.
@inproceedings{kahardipraja-etal-2020-exploring,
  title={Exploring Span Representations in Neural Coreference Resolution},
  author={Kahardipraja, Patrick and Vyshnevska, Olena and Lo{\'a}iciga, Sharid},
  booktitle={Proceedings of the First Workshop on Computational Approaches to Discourse},
  month=nov,
  year={2020},
  address={Online},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2020.codi-1.4},
  doi={10.18653/v1/2020.codi-1.4},
  pages={32--41},
}