Huffman Code
IT 위키
Huffman Code is a lossless data compression algorithm that assigns variable-length binary codes to input characters based on their frequency. It is widely used in data compression applications such as file compression and encoding.
Concept[편집 | 원본 편집]
Huffman coding works by assigning shorter codes to more frequent characters and longer codes to less frequent characters, minimizing the total number of bits used to represent the data.
Algorithm[편집 | 원본 편집]
The Huffman coding algorithm follows these steps:
- Calculate Frequency: Count the occurrence of each character in the input data.
- Build a Priority Queue: Create a min-heap where each node represents a character and its frequency.
- Construct a Huffman Tree:
- Extract the two nodes with the lowest frequency.
- Merge them into a new node with their combined frequency.
- Repeat until only one node remains, which becomes the root of the tree.
 
- Assign Binary Codes: Traverse the tree and assign binary values (0 for left, 1 for right) to generate unique Huffman codes for each character.
Example[편집 | 원본 편집]
Given the input string hello world, the frequency table and corresponding Huffman codes might be:
| Character | Frequency | Huffman Code | 
|---|---|---|
| h | 1 | 1100 | 
| e | 1 | 1101 | 
| l | 3 | 10 | 
| o | 2 | 00 | 
| w | 1 | 1110 | 
| r | 1 | 1111 | 
| d | 1 | 01 | 
The compressed representation replaces each character with its Huffman code.
Implementation[편집 | 원본 편집]
A simple implementation of Huffman coding in Python:
import heapq
from collections import Counter, namedtuple
class Node(namedtuple("Node", ["char", "freq", "left", "right"])):
    def __lt__(self, other):
        return self.freq < other.freq
def build_huffman_tree(text):
    freq = Counter(text)
    heap = [Node(char, freq, None, None) for char, freq in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = Node(None, left.freq + right.freq, left, right)
        heapq.heappush(heap, merged)
    return heap[0]
def generate_codes(node, prefix="", codebook={}):
    if node:
        if node.char:
            codebook[node.char] = prefix
        generate_codes(node.left, prefix + "0", codebook)
        generate_codes(node.right, prefix + "1", codebook)
    return codebook
text = "hello world"
tree = build_huffman_tree(text)
codes = generate_codes(tree)
print(codes)
Advantages[편집 | 원본 편집]
- Lossless Compression
- Ensures no data loss in reconstruction.
 
- Efficient Encoding
- Optimized for frequent characters, reducing overall storage size.
 
- Used in Many Applications
- Commonly applied in ZIP, GZIP, JPEG, and other compression formats.
 
Limitations[편집 | 원본 편집]
- Requires Frequency Table
- The table must be transmitted alongside the encoded data.
 
- Not Optimal for Small Texts
- Overhead can be significant for short messages.
 
- Sensitive to Input Data
- Performance depends on character frequency distribution.
 
Applications[편집 | 원본 편집]
- File Compression
- Used in ZIP and GZIP formats.
 
- Multimedia Encoding
- Applied in JPEG and MP3 compression.
 
- Data Transmission
- Efficient encoding in communication protocols.
 

