Output-Centric Feature Descriptions Improve AI Model Interpretability

Automated Interpretability of AI Models: Output-Centric Feature Descriptions for Improved Understanding

The interpretability of large language models (LLMs) is a central topic in current AI research: the goal is to make the inner workings of these complex models more comprehensible and to gain insight into how they arrive at their outputs. A common approach to automated interpretability is to generate natural language descriptions of the concepts represented by individual features in the model, where a feature is a dimension or a direction in the model's representation space. These descriptions are usually derived from the input data that activates the respective feature.
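
To make the notion of "a feature as a direction" concrete, the following is a minimal sketch using toy values; the hidden size, vectors, and names are chosen purely for illustration and are not taken from any real model.

```python
import numpy as np

# Toy illustration of "a feature is a direction in representation space".
d_model = 16                                          # hidden size of a toy model
rng = np.random.default_rng(0)

feature_direction = rng.standard_normal(d_model)      # one feature = one direction
feature_direction /= np.linalg.norm(feature_direction)

hidden_state = rng.standard_normal(d_model)           # representation of one token

# The feature's activation on this token is simply the projection of the
# hidden state onto the feature direction.
activation = float(hidden_state @ feature_direction)
print(f"feature activation: {activation:+.3f}")
```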

Current methods for automated interpretability generate descriptions from the input data that maximally activates a feature. This input-centric approach has drawbacks, however: identifying the activating inputs is computationally expensive, and the resulting descriptions often fail to capture the causal effect of a feature on the model's output. Yet the mechanistic role of a feature in model behavior is determined both by how inputs cause the feature to activate and by how the feature's activation in turn influences the outputs.
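
The sketch below illustrates the input-centric procedure and where its cost comes from, under simplified assumptions: `encode` stands in for a full forward pass of the model (the expensive step), and the corpus, dimensions, and feature direction are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16
corpus = [f"example document {i}" for i in range(1000)]      # placeholder corpus

feature_direction = rng.standard_normal(d_model)             # the feature to describe

def encode(text: str) -> np.ndarray:
    """Stand-in for a full forward pass that returns one hidden state per input.
    In practice this is the expensive part: every corpus item needs a model run."""
    return rng.standard_normal(d_model)

# Input-centric scan: run the model over the whole corpus and keep the inputs
# on which the feature activates most strongly; these become the raw material
# for the natural language description.
scored = [(float(encode(text) @ feature_direction), text) for text in corpus]
top_activating = sorted(scored, reverse=True)[:5]

for activation, text in top_activating:
    print(f"{activation:+.3f}  {text}")
```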

New research suggests that output-centric methods can improve automated interpretability. Instead of starting from activating inputs, these methods focus on how a feature's activation affects the model's outputs. One approach stimulates (amplifies) the feature and looks at the tokens whose weights in the model's output change the most as a result. Another applies the model's "unembedding" head directly to the feature and reads off the tokens with the highest weights in the vocabulary. Such output-centric descriptions capture the causal effect of a feature on the model output better than input-centric descriptions.
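
As a rough illustration of these two output-centric ideas, here is a sketch built on a toy unembedding matrix and random vectors; the names (W_U, the stimulation scale, the placeholder vocabulary) are assumptions made for the example, not the paper's implementation. Note that in this purely linear toy the two methods select the same tokens; in a real transformer, the nonlinearity between the feature's layer and the output head is what makes them differ.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab_size = 16, 50
vocab = [f"tok_{i}" for i in range(vocab_size)]       # placeholder vocabulary

W_U = rng.standard_normal((vocab_size, d_model))      # toy unembedding matrix
feature_direction = rng.standard_normal(d_model)

# (a) Vocabulary projection: apply the unembedding head directly to the feature
# direction and read off the highest-weighted tokens.
logits = W_U @ feature_direction
top_projection = [vocab[i] for i in np.argsort(logits)[::-1][:5]]
print("vocabulary-projection tokens:", top_projection)

# (b) Token change: stimulate (amplify) the feature in a hidden state and see
# which tokens' output weights increase the most relative to the unmodified run.
hidden_state = rng.standard_normal(d_model)
scale = 5.0                                           # assumed stimulation strength
base_logits = W_U @ hidden_state
steered_logits = W_U @ (hidden_state + scale * feature_direction)
delta = steered_logits - base_logits
top_change = [vocab[i] for i in np.argsort(delta)[::-1][:5]]
print("token-change tokens:", top_change)
```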

Studies have shown that combining input- and output-centric methods yields the best results: together, the two perspectives produce a more comprehensive and accurate description of a feature and its role in model behavior. Output-centric descriptions can also be used to find activating inputs for features previously considered "dead" because no activating inputs had been found for them.
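
As a hedged sketch of that last point: the output-centric tokens of an apparently "dead" feature can be turned into candidate inputs and tested against the feature. The token list, prompt template, activation threshold, and `encode` stand-in below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model = 16
feature_direction = rng.standard_normal(d_model)

def encode(text: str) -> np.ndarray:
    """Stand-in for the model's hidden state on an input (hypothetical)."""
    return rng.standard_normal(d_model)

# Suppose the output-centric description of the "dead" feature surfaced these tokens.
output_centric_tokens = ["harbor", "dock", "ship", "anchor"]    # illustrative only

# Build candidate inputs from those tokens and check whether any of them
# actually activate the feature above a chosen threshold.
threshold = 1.0                                                 # assumed cutoff
candidates = [f"A sentence about the {tok}." for tok in output_centric_tokens]

for prompt in candidates:
    activation = float(encode(prompt) @ feature_direction)
    status = "activates" if activation > threshold else "stays silent"
    print(f"{activation:+.3f}  {status}:  {prompt}")
```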

Research on the interpretability of LLMs is still ongoing, but output-centric methods offer promising possibilities for improving the comprehensibility and transparency of these complex models. These advances are crucial for the responsible use of AI in various application areas.

For Mindverse, a German company specializing in AI-powered content creation, image generation, and research, these developments are of particular interest. Mindverse offers an all-in-one content platform and develops customized AI solutions such as chatbots, voicebots, AI search engines, and knowledge systems. Better interpretability of AI models matters to Mindverse and its customers because it strengthens trust in AI systems and promotes their acceptance across industries.

Bibliography:
Gur-Arieh, Y., Mayan, R., Agassy, C., Geiger, A., & Geva, M. (2025). Enhancing Automated Interpretability with Output-Centric Feature Descriptions. arXiv preprint arXiv:2501.08319.
Belle, V., & Papantonis, I. (2021). Interpretable and explainable machine learning: A methods-centric overview with concrete examples. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(5), e1421.
Mollas, I., Bassiliades, N., & Tsoumakas, G. (2023). Truthful meta-explanations for local interpretability of machine learning models. Applied Intelligence, 53, 26927-26948.
Samek, W., Wiegand, T., & Müller, K. R. (2023). Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. Springer.
ChatPaper. Enhancing Automated Interpretability with Output-Centric Feature Descriptions. https://www.chatpaper.com/chatpaper/fr?id=3&date=1736870400&page=1
Moosbauer, J. (2019). Explainable AI: Evaluating the Explainability of Machine Learning Models.
Sokol, K., & Flach, P. (2020). Towards user-centric explanations for explainable models: A review. arXiv preprint arXiv:2010.07881.
van der Schaar, M., & Maxfield, N. (2021). Making machine learning interpretable: A dialog with clinicians.
Atanasova, P., Simons, M., Mussmann, S., & Werkmeister, T. (2023). Evaluating XAI: A comparison of rule-based and example-based explanations. arXiv preprint arXiv:2309.01029.
Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., & Kagal, L. (2018). Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 80-89). IEEE.