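==Pruning==
A method that removes unnecessary or low-contribution weights to reduce model size and speed up inference. In LLMs the same idea is also applied to the key/value (KV) cache entries held during inference. For general background, see [[신경망 가지치기|Neural network pruning]].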

'''LLM-specific pruning techniques'''
*'''KV Cache Pruning''' <ref>Ge, Suyu, et al. "Model tells you what to discard: Adaptive KV cache compression for LLMs." arXiv, 2023.</ref>
**Compresses the cached key/value vectors (Kᵢⱼ, Vᵢⱼ) according to token importance.
**Evicts KV entries that attention scores mark as unnecessary, reducing memory usage.
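
Score-based eviction can be sketched as follows. This is a minimal illustration, not the method of the cited paper: the function name <code>prune_kv_cache</code>, the list-based cache layout, and the fixed <code>budget</code> are assumptions made for the example, and the accumulated attention each token has received is assumed to be tracked elsewhere.

```python
def prune_kv_cache(keys, values, attn_scores, budget):
    """Keep the `budget` cached tokens with the highest accumulated attention.

    keys, values : lists with one entry per cached token (hypothetical layout).
    attn_scores  : accumulated attention mass each cached token has received.
    Returns the pruned keys, pruned values, and the kept token positions.
    """
    n = len(keys)
    if budget >= n:
        return list(keys), list(values), list(range(n))
    # Rank positions by score, highest first; ties go to the earlier token.
    ranked = sorted(range(n), key=lambda i: (-attn_scores[i], i))
    # Restore the original token order among the survivors.
    kept = sorted(ranked[:budget])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


keys = ["k0", "k1", "k2", "k3", "k4"]
values = ["v0", "v1", "v2", "v3", "v4"]
scores = [0.5, 0.05, 0.3, 0.05, 0.1]
pruned_k, pruned_v, kept = prune_kv_cache(keys, values, scores, budget=3)
print(kept)  # positions of the tokens that stay in the cache
```

With a budget of 3, the two tokens that received the least attention are dropped and the cache shrinks from five entries to three.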

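*'''Wanda (Pruning by Weights and Activations)''' <ref>Sun, Mingjie, et al. "A simple and effective pruning approach for large language models." arXiv:2306.11695, 2023.</ref>
**Computes importance from activation magnitude as well as weight magnitude: each weight is scored by |Wᵢⱼ| · ‖Xⱼ‖₂, where ‖Xⱼ‖₂ is the norm of the j-th input feature over calibration data.
**A post-training pruning method tailored to LLMs; it requires no retraining or weight updates.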
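
The Wanda score can be illustrated with a small pure-Python sketch. It assumes a single linear layer stored as a list of rows (one row per output neuron) and a handful of calibration inputs; the function name <code>wanda_prune</code> and the data layout are hypothetical, and real implementations operate on tensors.

```python
import math


def wanda_prune(weight, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weight      : list of rows, one per output neuron, columns = input features.
    activations : calibration inputs, one row per token, columns = input features.
    sparsity    : fraction of weights to remove in every row (0.0 to 1.0).
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across all calibration tokens.
    feat_norm = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weight:
        # Score = |weight| * norm of the matching input feature.
        scores = [abs(w) * feat_norm[j] for j, w in enumerate(row)]
        # Weights are compared within the same output row.
        order = sorted(range(n_in), key=lambda j: (scores[j], j))
        drop = set(order[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


W = [[1.0, -4.0, 2.0, 3.0]]
X = [[8.0, 0.0, 1.0, 1.0],
     [6.0, 1.0, 0.0, 1.0]]
print(wanda_prune(W, X, sparsity=0.5))
```

In this example the smallest weight (1.0) survives because its input feature has a large norm, while the largest-magnitude weight (-4.0) is removed; pure magnitude pruning would have made the opposite choice.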

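*'''StreamingLLM'''<ref>Xiao, Guangxuan, et al. "Efficient streaming language models with attention sinks." arXiv, 2023.</ref>
**Exploits the attention sink phenomenon, in which the first few tokens attract a large share of attention, by always retaining the KV of a few initial tokens.
**Combines sliding window attention with KV cache reuse to enable efficient streaming inference.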
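
The retention rule described above (a few initial "sink" tokens plus a sliding window of recent tokens) can be sketched as follows. The function name <code>streaming_kv_positions</code> and the default sizes are illustrative assumptions, not values prescribed by this article.

```python
def streaming_kv_positions(cache_len, n_sink=4, window=1020):
    """Return the cache positions to retain.

    Keeps the first `n_sink` tokens (attention sinks) and the most recent
    `window` tokens; everything in between is evicted.
    """
    if cache_len <= n_sink + window:
        # The cache still fits in the budget, so nothing is evicted.
        return list(range(cache_len))
    sinks = list(range(n_sink))
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent


# A tiny configuration to make the behaviour visible.
print(streaming_kv_positions(10, n_sink=2, window=3))
```

The cache size is therefore bounded by <code>n_sink + window</code> no matter how long the stream runs.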