Qwen Releases Open-Source Tools to Decode What's Happening Inside Language Models
The Qwen team released Qwen-Scope, an open-source suite of sparse autoencoders trained on seven Qwen model variants. The release includes 14 sets of autoencoder weights that map internal model computations to human-readable concepts - solving a concrete problem developers face: when an LLM misbehaves, there's almost no way to diagnose why.
Sparse autoencoders work as translation layers between raw neural network activations and interpretable features. When a language model processes text, it produces thousands of hidden activation values that resist direct analysis. An autoencoder decomposes these into a dictionary of sparse features, each of which typically corresponds to something concrete: a language, a writing style, a safety-relevant behavior.
The suite covers five dense models (ranging from 1.7B to 27B parameters) and two mixture-of-experts models. For each transformer layer across all seven backbones, Qwen-Scope trained a separate autoencoder to reconstruct internal activations using sparse features - keeping only the top 50 or 100 active features per input.
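To make the mechanics concrete, here is a minimal sketch of a top-k sparse autoencoder in PyTorch. The dictionary size, hidden dimension, and class names are illustrative placeholders, not the actual Qwen-Scope architecture or released weights.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Illustrative top-k SAE: encode a hidden state into a large feature
    dictionary, keep only the k strongest activations, and reconstruct the
    original activation from those few features."""

    def __init__(self, d_model: int, n_features: int, k: int = 50):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.k = k

    def forward(self, h: torch.Tensor):
        acts = torch.relu(self.encoder(h))          # dense feature activations
        topk = torch.topk(acts, self.k, dim=-1)     # keep only the top-k per input
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse), sparse         # reconstruction + sparse code

# Usage: decompose a batch of layer activations (dimensions are made up here).
sae = TopKSparseAutoencoder(d_model=2048, n_features=65536, k=50)
hidden = torch.randn(4, 2048)                       # stand-in for real activations
reconstruction, features = sae(hidden)
```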
Four Practical Applications
Steering Model Behavior Without Retraining
The most immediate use case is inference-time steering. By adding or subtracting a feature direction from a model's internal state during generation, developers can push output toward or away from specific behaviors - no weight updates required.
In one example, a Qwen3 model prompted in English unexpectedly mixed in Chinese. Ranking autoencoder features by activation strength revealed a highly active Chinese-language feature. Suppressing it during generation removed the mixing entirely. A second example showed that activating a classical-Chinese feature successfully steered story continuation toward classical literary style.
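A rough sketch of how such steering can be wired in with a PyTorch forward hook is shown below. The layer index, feature id, and steering coefficient are hypothetical choices for illustration; Qwen-Scope's own tooling may expose this differently.

```python
import torch

def make_steering_hook(direction: torch.Tensor, coefficient: float):
    """Add (or, with a negative coefficient, subtract) a feature's decoder
    direction from the residual stream of one transformer layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + coefficient * direction.to(hidden)  # match dtype/device
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage: suppress a Chinese-language feature during generation.
# direction = sae.decoder.weight[:, chinese_feature_id]      # shape (d_model,)
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(direction, coefficient=-8.0))
# ...generate as usual; the feature is damped in the residual stream...
# handle.remove()
```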
Evaluating Benchmarks Without Running Models
Running LLMs across large benchmark datasets is computationally expensive. Qwen-Scope offers a cheaper alternative: analyzing which autoencoder features activate on benchmark samples, then using feature overlap as a proxy for benchmark similarity.
The research team tested this approach on 17 widely-used benchmarks including MMLU, GSM8K, MATH, and GPQA-Diamond. Their feature redundancy metric achieved a correlation of 0.85 with actual performance-based redundancy - without running a single model evaluation. The analysis revealed that 63% of GSM8K's features are already covered by MATH, suggesting evaluation suites containing MATH could safely drop GSM8K with minimal loss of discriminative value.
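The underlying comparison can be pictured as simple set arithmetic over activated features. The coverage ratio below is an illustrative stand-in and may differ from the team's exact redundancy metric.

```python
def active_feature_set(per_sample_features, threshold=0.0):
    """Collect indices of SAE features that fire on any sample of a benchmark
    (per_sample_features: iterable of sparse activation vectors)."""
    active = set()
    for sparse in per_sample_features:
        active.update(i for i, value in enumerate(sparse) if value > threshold)
    return active

def coverage(features_a, features_b):
    """Fraction of benchmark A's active features already covered by benchmark B."""
    return len(features_a & features_b) / len(features_a) if features_a else 0.0

# If coverage(gsm8k_features, math_features) is high (the article reports 63%),
# GSM8K adds little discriminative signal beyond MATH.
```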
Building Toxicity Classifiers and Generating Safety Data
Autoencoder features work effectively as lightweight classifiers. The research team built a multilingual toxicity classifier across 13 languages using a simple two-stage pipeline: identify features that fire more frequently on toxic examples than on benign ones, then apply a simple rule over those features to label held-out data. No additional classifier head. No gradient-based fitting.
On English, this achieved F1 scores above 0.90 on both the 1.7B and 8B Qwen3 models. Features discovered in English transferred meaningfully to other languages without rediscovery, with performance varying by linguistic distance - strongest for European languages like Russian and French, weaker for Arabic, Chinese, and Amharic.
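Below is a minimal sketch of that two-stage pipeline, assuming per-sample sparse feature activations have already been extracted. The firing-rate contrast and voting threshold are illustrative choices rather than the team's exact rule.

```python
import numpy as np

def select_toxic_features(toxic_acts, benign_acts, top_n=32):
    """Stage 1: rank features by how much more often they fire on toxic
    examples than on benign ones (arrays of shape [n_samples, n_features])."""
    toxic_rate = (toxic_acts > 0).mean(axis=0)
    benign_rate = (benign_acts > 0).mean(axis=0)
    return np.argsort(toxic_rate - benign_rate)[-top_n:]

def classify(sample_acts, toxic_features, min_votes=3):
    """Stage 2: flag a sample as toxic if enough selected features fire.
    No classifier head, no gradient-based fitting."""
    return int((sample_acts[toxic_features] > 0).sum()) >= min_votes

# Features selected on English data can then be reused unchanged on other
# languages, which is how the cross-lingual transfer was evaluated.
```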
On the synthesis side, the team introduced a feature-driven pipeline for generating safety training data. The method identifies safety-relevant features missing from existing supervision, generates prompt-completion pairs designed to activate those features, and verifies retention. Under matched budget constraints, feature-driven synthesis achieved 99.74% coverage of target safety features, compared to substantially lower coverage from natural sampling or random safety synthesis. Adding 4,000 feature-driven synthetic examples to 4,000 real safety examples produced safety accuracy of 77.75% - approaching the performance of training on 120,000 safety-only examples.
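A sketch of the core loop under those assumptions is below; generate_for and extract_features are hypothetical placeholders standing in for the actual generation and verification steps, which are not detailed here.

```python
def feature_coverage(target_features, dataset_feature_sets):
    """Fraction of target safety features activated by at least one example."""
    covered = set().union(*dataset_feature_sets) if dataset_feature_sets else set()
    return len(target_features & covered) / len(target_features)

def feature_driven_synthesis(target_features, existing_sets, generate_for, extract_features):
    """Generate examples only for safety features the current data misses, then
    keep a candidate only if it actually activates the feature it targets."""
    covered = set().union(*existing_sets) if existing_sets else set()
    synthetic = []
    for feature_id in target_features - covered:
        example = generate_for(feature_id)           # placeholder: prompt an LLM
        if feature_id in extract_features(example):  # retention / verification check
            synthetic.append(example)
    return synthetic
```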
Improving Models During Training
The most technically novel contribution uses autoencoder features as training signals during supervised fine-tuning and reinforcement learning, not just at inference time.
For fine-tuning, the team addressed code-switching - where multilingual LLMs spontaneously produce tokens in unintended languages. Their method, called Sparse Autoencoder-guided Supervised Fine-Tuning (SASFT), identifies language-specific features, then adds a regularization loss that suppresses those features during training on non-target-language data. Across five models spanning three families (Gemma-2, Llama-3.1, and Qwen3) and three target languages (Chinese, Russian, Korean), SASFT achieved over 50% reduction in code-switching in most settings, with complete elimination in some configurations while maintaining performance on multilingual benchmarks.
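The regularization idea can be sketched as an auxiliary penalty added to the usual fine-tuning loss. The encoder call, feature ids, and loss weight below are illustrative assumptions, not the published SASFT formulation.

```python
import torch

def sasft_loss(lm_loss, hidden_states, sae_encoder, unwanted_feature_ids, weight=0.1):
    """Standard SFT cross-entropy plus a penalty on language-specific features
    that should stay silent on this (non-target-language) batch."""
    acts = torch.relu(sae_encoder(hidden_states))        # [batch, seq, n_features]
    suppression = acts[..., unwanted_feature_ids].mean() # how strongly they fire
    return lm_loss + weight * suppression
```

During training, the penalty pushes the code-switching features toward zero while the ordinary language-modeling loss preserves the target behavior.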
For reinforcement learning, the team tackled endless repetition - a low-frequency failure mode where models get stuck looping over the same content. Standard online RL rarely encounters repetitive rollouts, so it cannot learn a strong corrective signal against them. Using autoencoder feature steering, they synthetically generated repetition-biased rollouts and incorporated them as rare negative samples in the DAPO RL pipeline. The repetition ratio dropped sharply and consistently across three Qwen3 models while benchmark performance stayed competitive.
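One way to picture the rollout mixing, with the fraction and reward value as illustrative assumptions rather than the team's DAPO configuration:

```python
import random

def mix_in_repetition_negatives(on_policy_rollouts, synthetic_repetitive, fraction=0.05):
    """Augment an RL batch with a few steering-generated repetitive rollouts,
    labeled with a strongly negative reward, so the policy gets a corrective
    signal for a failure mode it rarely produces on its own."""
    n_neg = max(1, int(len(on_policy_rollouts) * fraction))
    negatives = [{"rollout": r, "reward": -1.0}
                 for r in random.sample(synthetic_repetitive, n_neg)]
    return on_policy_rollouts + negatives
```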
The weights and technical details are publicly available, giving developers who build and debug LLM systems concrete tools for diagnosis, evaluation, and optimization.