MIT and ETH Zurich unveil SDFT to stop AI from forgetting old skills

DATE POSTED: February 18, 2026

Researchers at MIT’s Improbable AI Lab and ETH Zurich have developed self-distillation fine-tuning (SDFT), a new training technique for large language models (LLMs).

SDFT enables LLMs to acquire new skills and knowledge without losing prior capabilities, addressing a challenge known as catastrophic forgetting. The method lets models learn both from demonstrations and from their own attempts by leveraging in-context learning. In experiments, SDFT outperformed traditional supervised fine-tuning (SFT) and sidestepped limitations of reinforcement learning algorithms.

For enterprise applications, SDFT permits a single model to accumulate multiple skills while maintaining performance on earlier tasks. This allows AI agents to adapt to changing business environments, acquire proprietary knowledge, and gain new skills without extensive retraining or degradation of general reasoning abilities.

Current LLMs are static after deployment and do not update parameters to learn new skills or knowledge. Continual learning, which facilitates knowledge accumulation similar to human learning, is necessary for adaptive AI.

On-policy learning, in which a model learns from its own generated outputs rather than by mimicking a static dataset, is an effective way for it to correct its errors. Without on-policy learning, models can experience catastrophic forgetting, losing previously acquired knowledge when they learn new tasks.

Reinforcement learning (RL), a form of on-policy learning, requires an explicit reward function to score outputs. RL works well for tasks with clear, verifiable outcomes such as math, but defining a reward function for many real-world enterprise scenarios, such as writing a legal brief, is difficult or impossible. RL methods also struggle to teach entirely new information, such as specific company protocols, because the model lacks the initial knowledge needed to generate positive learning signals.

The standard alternative, supervised fine-tuning (SFT), trains models on fixed datasets of expert demonstrations. SFT provides clear ground truth but is “off-policy”: the model mimics the data rather than learning from its own attempts, so it often fails to generalize to out-of-distribution examples and remains prone to catastrophic forgetting.
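
To make the “off-policy” distinction concrete, here is a minimal sketch of the standard SFT objective (PyTorch-style; the function and tensor names are illustrative, not tied to any specific codebase): the loss is computed only against tokens from the fixed demonstration, so the model’s own generations, and the mistakes it would make at deployment, never enter the training signal.

```python
import torch.nn.functional as F

def sft_loss(model, demo_input_ids, demo_labels):
    """Standard supervised fine-tuning step (off-policy).

    The loss is next-token cross-entropy against a fixed expert
    demonstration; the model's own sampled outputs never appear in the
    objective, so errors it would make at deployment are never
    directly corrected.
    """
    logits = model(demo_input_ids).logits             # (batch, seq, vocab)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for each next token
        demo_labels[:, 1:].reshape(-1),               # targets come from the demo itself
        ignore_index=-100,                            # mask prompt/query tokens
    )
```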

SDFT combines elements of SFT and RL, enabling on-policy learning using prerecorded demonstrations without needing a reward function. It uses distillation, where a student model mimics a teacher. The researchers utilized the model’s own in-context learning (ICL) capabilities to create a feedback loop within a single model. ICL allows LLMs to solve new problems using provided examples without parameter updates.

During training, the model functions in two roles:

  • The teacher: A frozen version of the model receives the query and expert demonstrations. It uses ICL to deduce the correct answer and reasoning.
  • The student: This version receives only the query, simulating real-world deployment.

When the student generates an answer, the teacher provides feedback, and the student updates its parameters to align with the teacher’s distribution. This creates an on-policy learning loop: supervision comes from the model’s own interactions and outputs, allowing it to correct its reasoning trajectories without an external reward signal.
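
Based on that description, one SDFT update might look roughly like the following sketch, assuming a Hugging Face-style model and tokenizer. The names (`make_frozen_teacher`, `sdft_step`) are illustrative, and this is not the authors’ released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def make_frozen_teacher(model):
    """Freeze a copy of the current model to act as the in-context teacher."""
    teacher = copy.deepcopy(model).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def sdft_step(model, teacher, tokenizer, query, demonstration, optimizer,
              max_new_tokens=256):
    """One schematic SDFT update (sketch, not the released code).

    The student (the trainable model) sees only the query, samples its own
    answer (an on-policy rollout), and is then pulled toward the token
    distribution of the frozen teacher, which conditions on the expert
    demonstration plus the query via in-context learning.
    """
    # Student: query only, as at deployment; generate an attempt.
    student_ids = tokenizer(query, return_tensors="pt").input_ids
    rollout = model.generate(student_ids, do_sample=True,
                             max_new_tokens=max_new_tokens)
    gen = rollout[:, student_ids.size(1):]            # generated tokens only
    G = gen.size(1)

    # Teacher: demonstration placed in context before the same query.
    teacher_ids = tokenizer(demonstration + "\n\n" + query,
                            return_tensors="pt").input_ids
    with torch.no_grad():
        t_logits = teacher(torch.cat([teacher_ids, gen], dim=1)).logits
    s_logits = model(torch.cat([student_ids, gen], dim=1)).logits

    # Per-token distributions over the generated span (the logit at
    # position k predicts token k + 1, hence the shifted slice).
    t_logp = F.log_softmax(t_logits[:, -G - 1:-1], dim=-1)
    s_logp = F.log_softmax(s_logits[:, -G - 1:-1], dim=-1)

    # Align the student with the teacher on the student's own rollout;
    # the in-context demonstration supplies the supervision, not a reward.
    loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point of the design is that the divergence is measured on the student’s own rollout, which is what makes the update on-policy, while the teacher’s in-context demonstration stands in for an explicit reward function.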

Researchers validated SDFT using the open-weight Qwen 2.5 model on three enterprise skills: science Q&A, software tool use, and medical reasoning. SDFT learned the new tasks more effectively than SFT. On the Science Q&A benchmark, SDFT achieved 70.2% accuracy, compared to 66.2% for SFT.

SDFT preserves original knowledge. When an SFT model learned the science task, its ability to answer general questions (e.g., logic, humanities) declined. The SDFT model improved on the science task while its “Previous Tasks” score remained stable at 64.5%. This suggests companies can specialize models for departments without degrading basic reasoning or common sense.

In a knowledge injection simulation using a dataset of fictional “2025 Natural Disasters,” an SFT model memorized facts but struggled with reasoning, while an SDFT model scored 98% on indirect reasoning questions by internalizing the logic.

In a sequential learning experiment, the SDFT model accumulated science, tool-use, and medical skills without regression, unlike the model trained with standard fine-tuning, whose performance fluctuated. This capability eliminates the need to manage multiple specialized models (a “model zoo”) for different tasks, potentially reducing inference costs.
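
As a rough illustration of the sequential setup (assumed structure; the helper callables below are placeholders rather than the paper’s evaluation harness), one way to track whether earlier skills regress as new ones are added:

```python
def accumulate_skills(model, skills, train_skill, evaluate_skill):
    """Train one model on a sequence of skills and track forgetting (sketch).

    skills:         ordered mapping of skill name -> training data
    train_skill:    callable that fine-tunes `model` on one skill's data
                    (e.g. a loop over SDFT updates like the sketch above)
    evaluate_skill: callable returning `model`'s accuracy on a named skill
    """
    history = {}
    learned = []
    for name, data in skills.items():
        train_skill(model, data)          # acquire the new skill
        learned.append(name)
        # Re-score every skill learned so far to detect regression.
        history[name] = {prev: evaluate_skill(model, prev) for prev in learned}
    return history
```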

The code for SDFT is available on GitHub for integration into training workflows. Idan Shenfeld, a doctoral student at MIT and co-author of the paper, noted that the SDFT pipeline resembles an RL pipeline in that it requires online response generation during training. The team is integrating SDFT into Hugging Face’s Transformer Reinforcement Learning (TRL) library.

SDFT depends on a model with in-context learning capabilities strong enough to serve as its own teacher. With newer architectures such as Qwen 3, roughly 4 billion parameters is sufficient; smaller models (around 3 billion parameters) initially struggled to provide useful supervision, though Shenfeld anticipates that 1-billion-parameter models will soon be capable enough.

The trade-off is cost. Because the model actively generates rollouts during training for comparison with the teacher, SDFT is roughly four times slower and requires about 2.5 times the compute (FLOPs) of standard fine-tuning. Better knowledge retention, however, may avoid the costly multi-stage retraining often needed to address catastrophic forgetting. SDFT is best suited to organizations that need a single model to accumulate multiple skills, especially where defining an RL reward function is difficult or impossible.

The ultimate goal is to move beyond static models toward systems that improve with use, harnessing inference compute for continual improvement.
