Knowledge Distillation

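'''Knowledge distillation''' (KD) is a model compression technique that transfers (distills) the knowledge learned by a large neural network (the Teacher model) into a smaller neural network (the Student model), so that the smaller model reaches comparable performance at lower cost.<ref>Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv:1503.02531 (2015).</ref> Because the small model indirectly learns the complex representations and decision-boundary information held by the large model, memory use and computation can be reduced substantially while accuracy is largely preserved.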
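==Overview==
*The Student is trained on the output probability distribution (soft labels) of the Teacher.
*Soft labels assign a probability to every class, so they carry richer relational information than a single one-hot ground-truth label (hard label).
*The Student approximates the Teacher's knowledge and is trained against both the ground-truth labels and the soft labels.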
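The contrast between hard and soft labels described above can be illustrated with a minimal sketch. The class names and the Teacher logits below are hypothetical values chosen only for illustration; they do not come from any particular model.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: a Teacher looking at a picture of a cat.
classes = ["cat", "dog", "truck"]
teacher_logits = [4.0, 2.5, -1.0]   # assumed values, for illustration only

hard_label = [1.0, 0.0, 0.0]        # one-hot: only says "cat"
soft_label = softmax(teacher_logits)

for name, h, s in zip(classes, hard_label, soft_label):
    print(f"{name:>5}: hard={h:.1f}  soft={s:.4f}")
```

In this sketch the soft label still ranks "cat" first, but it also assigns more probability to "dog" than to "truck". That ordering is the kind of inter-class relationship a one-hot label cannot express.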
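==Formulation==
\[ p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \quad L = (1-\alpha)L_{CE}(y,q) + \alpha T^2 L_{KL}(p_T,p_S) \]
*\(z_i\): logit for class \(i\); \(p_i\): the corresponding temperature-softened probability.
*\(T\): temperature (typically \(T>1\)), which softens the probability distribution so that the Student learns the relative confidence between classes.
*\(L_{CE}\): cross-entropy loss between the hard label \(y\) and the Student's prediction \(q\) (its softmax output, typically computed at \(T=1\)).
*\(L_{KL}\): Kullback-Leibler divergence between the Teacher distribution \(p_T\) and the Student distribution \(p_S\), both computed at temperature \(T\).
*\(\alpha\): coefficient that balances the two loss terms. The factor \(T^2\) compensates for the soft-target gradients scaling as \(1/T^2\).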
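The loss above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and \(q\) is assumed to be the Student's softmax output at \(T=1\), which the formula itself does not spell out.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: p_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(true_class, probs):
    """L_CE for a one-hot label: -log of the probability of the true class."""
    return -math.log(probs[true_class])

def kl_divergence(p, q):
    """L_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kd_loss(student_logits, teacher_logits, true_class, T=4.0, alpha=0.5):
    """L = (1 - alpha) * L_CE(y, q) + alpha * T^2 * L_KL(p_T, p_S)."""
    q = softmax(student_logits, 1.0)    # assumption: hard-label term uses T = 1
    p_T = softmax(teacher_logits, T)    # softened Teacher distribution
    p_S = softmax(student_logits, T)    # softened Student distribution
    l_ce = cross_entropy(true_class, q)
    l_kl = kl_divergence(p_T, p_S)
    return (1.0 - alpha) * l_ce + alpha * T ** 2 * l_kl

# Hypothetical logits for a three-class problem whose true class is index 0.
teacher = [4.0, 2.5, -1.0]
student = [2.0, 1.0, 0.5]
print(kd_loss(student, teacher, true_class=0))
```

Setting `alpha=0` reduces the loss to ordinary cross-entropy training, while `alpha=1` trains the Student on the Teacher's softened distribution alone.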
==Main types==
*'''Response-based Distillation'''
**Uses the Teacher's final output (logits or soft labels) to supervise the Student.
**Soft labels provide richer information than hard labels, such as inter-class similarity.
**Student loss = α·distillation loss + (1−α)·cross-entropy loss.
*'''Feature-based Distillation'''
**Extracts the Teacher's intermediate feature maps so that the Student also follows the Teacher's representations at the feature level.
**Can speed up training convergence and improve generalization. <ref>Srivastava, Rupesh K., Klaus Greff, and Jürgen Schmidhuber. "Training very deep networks." NeurIPS 28 (2015).</ref>
*'''Hint-based Distillation (FitNets)'''
**Trains a thin, lightweight Student to match an intermediate (hint) layer of the Teacher using an L2 distance. <ref>Romero, Adriana, et al. "FitNets: Hints for thin deep nets." arXiv:1412.6550 (2014).</ref>
*'''Online Distillation (Deep Mutual Learning)'''
**Several Student models train simultaneously and exchange their outputs to share knowledge. <ref>Zhang, Ying, et al. "Deep mutual learning." CVPR (2018).</ref>
*'''Self Distillation'''
**Without a separate Teacher, the outputs of the network's own deeper layers supervise its shallower layers, transferring knowledge internally.
**Smooths decision boundaries and reduces overfitting. <ref>Zhang, Linfeng, et al. "Be your own teacher." ICCV (2019).</ref>
*'''Teacher-Assistant Distillation'''
**When the size gap between Teacher and Student is too large, an intermediate-sized Teacher Assistant (TA) is inserted so that knowledge is transferred in stages. <ref>Mirzadeh, Seyed Iman, et al. "Improved knowledge distillation via teacher assistant." AAAI (2020).</ref>
*'''Multi-Teacher Distillation'''
**Trains the Student on an average or weighted combination of the outputs of several Teachers. <ref>You, Shan, et al. "Learning from multiple teacher networks." KDD (2017).</ref>
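*'''Cross-Modal Distillation'''
**Transfers knowledge across different modalities or tasks (e.g., classification → segmentation, text ↔ vision) to compensate for tasks with few labels.
*'''Distillation for Quantization/Pruning'''
**Uses Teacher outputs as a correction signal to recover accuracy lost during quantization-aware training (QAT) or iterative pruning.
**Mitigates the performance degradation of compressed models.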
*'''Step-by-Step Distillation for LLMs'''
**A more recent distillation approach in which a small model learns from the answers and rationales generated by a large language model (LLM). <ref>Hsieh, Cheng-Yu, et al. "Distilling step-by-step!" arXiv:2305.02301 (2023).</ref>
**Transfers multi-step reasoning ability through chain-of-thought (CoT) reasoning. <ref>Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in LLMs." NeurIPS 35 (2022).</ref>
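Among the variants listed above, the hint-based (FitNets) objective is the one that differs most from the output-level loss, so a small sketch may help. The code below assumes a fully connected regressor that maps the Student feature to the Teacher's feature dimension; FitNets itself uses a convolutional regressor, so this is a simplified stand-in. All function names, feature vectors, and weights are hypothetical.

```python
def linear(weights, bias, x):
    """Apply a fully connected regressor: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def hint_loss(teacher_feat, student_feat, weights, bias):
    """Half squared L2 distance between the Teacher hint and the
    regressed Student feature: 0.5 * ||t - r(s)||^2."""
    projected = linear(weights, bias, student_feat)
    return 0.5 * sum((t - p) ** 2 for t, p in zip(teacher_feat, projected))

# Hypothetical features: the Teacher hint has 4 dimensions and the
# Student's guided layer has only 2, so the regressor maps 2 -> 4.
teacher_feat = [1.0, 0.0, -1.0, 2.0]
student_feat = [0.5, -0.5]
regressor_w = [[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0],
               [2.0, -2.0]]
regressor_b = [0.0, 0.0, 0.0, 0.0]

loss = hint_loss(teacher_feat, student_feat, regressor_w, regressor_b)
print(loss)
```

In practice the regressor weights would be trained jointly with the Student's lower layers. The regressor is needed only because the Student's feature dimension is usually smaller than the Teacher's.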
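==Teacher-Student relationship==
*The Student tends to be structurally similar to the Teacher.
**Examples: a shallower version of the Teacher, a pruned model, or a quantized model.
*If the Teacher's capacity is too large, Student training can actually become unstable. <ref>Cho, Jang Hyun, and Bharath Hariharan. "On the efficacy of knowledge distillation." ICCV (2019).</ref>
**A balance is struck by searching for a suitable Teacher size or by applying early stopping.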
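==Applications==
*Model compression and lightweighting (deployment on mobile and edge devices)
*LLM compression and inference acceleration
*Compensating for accuracy loss from pruning and quantization
*Cross-modal knowledge transfer (vision-language integration)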
==Related concepts==
*[[저랭크 분해|Low-rank decomposition]]
==References==
*Hinton et al., "Distilling the Knowledge in a Neural Network," arXiv:1503.02531 (2015).
*Cho & Hariharan, "On the Efficacy of Knowledge Distillation," ICCV (2019).
*Mirzadeh et al., "Improved Knowledge Distillation via Teacher Assistant," AAAI (2020).
*You et al., "Learning from Multiple Teacher Networks," KDD (2017).
*Zhang et al., "Be Your Own Teacher (Self Distillation)," ICCV (2019).
*Hsieh et al., "Distilling Step-by-Step," arXiv:2305.02301 (2023).
*Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in LLMs," NeurIPS (2022).

== Footnotes ==
<references />

[[분류:인공지능]]
[[분류:딥 러닝]]