1. Popular Methods for Post-Training Optimization
1.1. Knowledge Distillation
Key idea: Train a smaller (“student”) model to mimic the behavior of a larger (“teacher”) model.
- The student model learns not just from hard labels but also from the teacher’s “soft labels” (probability distributions over classes or tokens).
- Distillation reduces model size and inference latency while retaining much of the teacher’s accuracy.
Representative Papers:
- Hinton et al. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531
- Sanh et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108
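To make the idea concrete, below is a minimal PyTorch sketch of the classic distillation loss from Hinton et al. (2015): the student matches the teacher’s temperature-softened distribution while also fitting the hard labels. The temperature T and mixing weight alpha are illustrative defaults, not values prescribed by the papers above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-label KL (teacher) with hard-label cross-entropy (ground truth).

    T and alpha are illustrative defaults, not values from the cited papers.
    """
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T**2 factor keeps soft- and hard-loss gradients on a comparable
    # scale, as noted in Hinton et al. (2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy usage: a batch of 4 examples over 10 classes.
s, t = torch.randn(4, 10, requires_grad=True), torch.randn(4, 10)
loss = distillation_loss(s, t, labels=torch.randint(0, 10, (4,)))
loss.backward()
```

Higher temperatures expose more of the teacher’s “dark knowledge” about inter-class similarity; in practice T and alpha are tuned per task.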
1.2. Quantization
Key idea: Represent model weights and/or activations using fewer bits (e.g., 8-bit, 4-bit) instead of standard 16-bit or 32-bit floating-point.
- Reduces model size and speeds up inference on specialized hardware.
- Often combined with techniques like fine-tuning or calibration to mitigate accuracy loss.
Representative Papers:
- Jacob et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018
- Dettmers et al. (2022). 8-bit Optimizers via Block-wise Quantization. arXiv:2110.02861
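As a small illustration, PyTorch ships a post-training dynamic quantization API that stores Linear-layer weights in int8 and quantizes activations on the fly; the toy model below is a stand-in for a trained network, and the exact API path may differ slightly across PyTorch versions.

```python
import torch
import torch.nn as nn

# Toy float32 model standing in for a trained network (illustrative only).
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 10))
model.eval()

# Post-training dynamic quantization: Linear weights are stored in int8;
# activations are quantized on the fly at inference time. No retraining
# or calibration data is required for this mode.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
print(out.shape)  # same interface as the float model, smaller int8 weights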
1.3. Reinforcement Learning from Human Feedback (RLHF)
Key idea:
- Use a reward model trained from human-labeled preference data.
- Optimize the language model’s outputs against this reward signal (typically via a policy-gradient method such as proximal policy optimization, PPO).
- Widely used to make models align better with user preferences and reduce harmful outputs.
Representative Papers:
- Ziegler et al. (2019). Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593
- Ouyang et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 (InstructGPT)
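A full RLHF pipeline (reward-model training plus PPO) is too involved to reproduce here, but the sketch below shows the shape of the core objective: a simplified REINFORCE-style update in which a learned reward is penalized by the KL divergence from a frozen reference model, as in Ziegler et al. (2019). All inputs are hypothetical placeholders for quantities an actual pipeline would compute.

```python
import torch

def rlhf_step(policy_logprobs, ref_logprobs, rewards, kl_coef=0.1):
    """Simplified REINFORCE-style RLHF objective (not full PPO).

    policy_logprobs: (batch, seq) log-probs of sampled tokens, trainable policy.
    ref_logprobs:    (batch, seq) log-probs of the same tokens, frozen reference.
    rewards:         (batch,) scalar scores from a learned reward model.
    kl_coef:         weight of the KL penalty keeping the policy near the reference.
    """
    # Sequence-level KL estimate between policy and reference model.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # KL-shaped reward, r - beta * KL, as in Ziegler et al. (2019).
    shaped = rewards - kl_coef * kl
    # REINFORCE: treat the shaped reward as a fixed signal and maximize it.
    return -(shaped.detach() * policy_logprobs.sum(dim=-1)).mean()

# Dummy usage: 2 sequences of 5 sampled tokens each.
lp = torch.randn(2, 5, requires_grad=True)
loss = rlhf_step(lp, torch.randn(2, 5), rewards=torch.tensor([1.0, -0.5]))
loss.backward()
```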
1.4. Low-Rank Adaptation (LoRA)
Key idea: Freeze the pretrained model’s weights and fine-tune small low-rank update matrices added to selected weight matrices (typically the attention projections).
- Significantly reduces the memory footprint needed for fine-tuning and enables faster adaptation for large models.
Representative Papers:
- Hu et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
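The mechanism is simple enough to show in full. Below is a minimal LoRA layer per Hu et al. (2022): the pretrained weight stays frozen, and only the low-rank factors A and B are trained, with B initialized to zero so the adapter starts as a no-op. Rank r and scaling alpha are illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (Hu et al., 2022).

    Forward pass: y = W x + (alpha / r) * B A x. Rank r and scaling alpha
    are illustrative defaults.
    """

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        # A is random, B is zero, so the adapter is initially a no-op.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(768, 768)
y = layer(torch.randn(4, 768))  # only lora_A and lora_B receive gradients
```

Because the update is additive, the trained factors can be merged back into W after fine-tuning, leaving inference latency unchanged.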
1.5. Parameter-Efficient Fine-Tuning (PEFT) and Prefix Tuning
Key idea: Train only a small subset of parameters, or additional “prompt”/“prefix” embeddings, while the rest of the model stays frozen.
- Cuts computation time and required data while maintaining performance close to full fine-tuning.
Representative Papers:
- Houlsby et al. (2019). Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751
- Li and Liang (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190
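As one concrete variant, the sketch below implements soft-prompt tuning in the spirit of prefix tuning: trainable prompt embeddings are prepended to the input sequence of a frozen backbone. (True prefix tuning per Li and Liang injects prefixes into every attention layer’s keys and values; this input-level version is a simplification.) The backbone and sizes are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends trainable soft-prompt embeddings to a frozen backbone's input.

    Only `prompt` receives gradients; all backbone parameters are frozen.
    Sizes and the backbone itself are illustrative stand-ins.
    """

    def __init__(self, backbone, embed_dim=768, prompt_len=20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # freeze the full model
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq, embed_dim)
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return self.backbone(torch.cat([prefix, input_embeds], dim=1))

# Toy backbone standing in for a frozen language model over embeddings.
backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
out = SoftPromptWrapper(backbone)(torch.randn(2, 10, 768))  # -> (2, 30, 768)
```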
1.6. Pruning
Key idea: Remove redundant or less important weights or neurons from a model while maintaining accuracy.
- Pruning can be done in various ways (magnitude-based, structured/unstructured pruning) and often requires a fine-tuning step to recover from performance drops.
Representative Papers:
- Han et al. (2015). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149
- Frankle and Carbin (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR 2019
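PyTorch’s torch.nn.utils.prune module makes both flavors easy to demonstrate; the pruning amounts below are illustrative, and a real workflow would fine-tune between or after pruning steps to recover accuracy.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy layer standing in for part of a trained model (illustrative only).
layer = nn.Linear(512, 512)

# Unstructured magnitude pruning: zero the 50% smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Structured pruning: additionally remove 25% of rows (output units) by
# L2 norm; masks from successive calls compose automatically.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# Bake the combined mask into the weight tensor permanently.
prune.remove(layer, "weight")
print(f"sparsity: {(layer.weight == 0).float().mean():.2%}")
```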
2. Deepseek Team’s Approach
Public Information Caveat:
- As of this writing, no widely cited or easily discoverable academic publications from an entity called “Deepseek” lay out an established, canonical approach to optimizing LLMs. Such a team may exist under a different name, or in a private/industry setting, without publicly released papers.
- If you have references (e.g., DOIs, arXiv links, or publication titles) for the Deepseek team’s papers, please share them so that we can analyze and summarize their specific approach in greater detail.
Potential Overlaps:
- Any dedicated team working on LLM optimization would likely combine several of the post-training techniques above, e.g., distillation and quantization for faster, cheaper inference and RLHF for alignment.
- They might also explore novel architectures or synergy between parameter-efficient fine-tuning methods (e.g., LoRA and quantization), which is an emerging area of active research.
3. Recommendations and Future Directions
Hybrid Approaches
- Quantization + LoRA: Fine-tuning low-rank adapters on top of an 8-bit (or 4-bit) quantized base model can approach full fine-tuning quality at a fraction of the memory cost; see the sketch after this list.
- Distillation + PEFT: Distilling a teacher model into a smaller student and then applying parameter-efficient fine-tuning on the student can drastically reduce resource usage.
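A QLoRA-style sketch of the Quantization + LoRA combination, assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the model id, target modules, and hyperparameters are illustrative placeholders, and exact APIs vary across library versions.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load a base model with 8-bit quantized weights (requires bitsandbytes).
# "gpt2" and target_modules=["c_attn"] are placeholder choices.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these few parameters are trained,
# while the quantized base weights stay frozen.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
```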
Dynamic Sparsity & Pruning
- Instead of one-time pruning, iterative or dynamic pruning can prune and regrow connections, potentially leading to more robust compressed models.
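A minimal sketch of the iterative variant: prune a small fraction, fine-tune to recover, and repeat. The finetune_fn callback is a hypothetical placeholder for a short recovery training pass, and the step counts are illustrative.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_prune(model, steps=5, amount_per_step=0.1, finetune_fn=None):
    """Prune a small fraction, recover with fine-tuning, and repeat.

    finetune_fn is a hypothetical callback running a short recovery pass;
    the step count and per-step amount are illustrative.
    """
    for _ in range(steps):
        for module in model.modules():
            if isinstance(module, nn.Linear):
                # Successive masks compose, so sparsity grows each iteration.
                prune.l1_unstructured(module, name="weight",
                                      amount=amount_per_step)
        if finetune_fn is not None:
            finetune_fn(model)  # recover accuracy before pruning further
    return model
```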
Structured vs. Unstructured Methods
- Structured pruning or quantization (e.g., channel-wise pruning) can yield speedups on general-purpose hardware more effectively than unstructured approaches.
- Balancing hardware efficiency with minimal accuracy drop is a major research challenge.
Specialized Hardware
- Techniques like quantization or distillation are especially valuable for edge devices or specialized inference accelerators.
- Ongoing research seeks to co-design neural network compression strategies and next-generation AI hardware.
Concluding Remarks
Research on post-training optimizations for LLMs is incredibly active, with new methods and refinements published monthly. The synergy between distillation, quantization, RLHF, and parameter-efficient fine-tuning has become a crucial lever for deploying models that are both performant and cost-effective.
Regarding Deepseek specifically, there is no widely recognized set of publications or standard methods credited to that name in the public domain at the time of writing. If you can provide more direct citations or official links, we can delve into a more tailored summary of Deepseek’s contributions. Otherwise, the techniques outlined above represent the bulk of known methods and innovations in post-training LLM optimization.
Prepared with OpenAI o1-pro & deep-research