Huffman Code
IT 위키
Huffman Code is a lossless data compression algorithm that assigns variable-length binary codes to input characters based on their frequency. It is widely used in data compression applications such as file compression and encoding.
1 Concept[편집 | 원본 편집]
Huffman coding works by assigning shorter codes to more frequent characters and longer codes to less frequent characters, minimizing the total number of bits used to represent the data.
2 Algorithm[편집 | 원본 편집]
The Huffman coding algorithm follows these steps:
- Calculate Frequency: Count the occurrence of each character in the input data.
- Build a Priority Queue: Create a min-heap where each node represents a character and its frequency.
- Construct a Huffman Tree:
- Extract the two nodes with the lowest frequency.
- Merge them into a new node with their combined frequency.
- Repeat until only one node remains, which becomes the root of the tree.
- Assign Binary Codes: Traverse the tree and assign binary values (0 for left, 1 for right) to generate unique Huffman codes for each character.
3 Example[편집 | 원본 편집]
Given the input string hello world, the frequency table and corresponding Huffman codes might be:
Character | Frequency | Huffman Code |
---|---|---|
h | 1 | 1100 |
e | 1 | 1101 |
l | 3 | 10 |
o | 2 | 00 |
w | 1 | 1110 |
r | 1 | 1111 |
d | 1 | 01 |
The compressed representation replaces each character with its Huffman code.
4 Implementation[편집 | 원본 편집]
A simple implementation of Huffman coding in Python:
import heapq
from collections import Counter, namedtuple
class Node(namedtuple("Node", ["char", "freq", "left", "right"])):
def __lt__(self, other):
return self.freq < other.freq
def build_huffman_tree(text):
freq = Counter(text)
heap = [Node(char, freq, None, None) for char, freq in freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
merged = Node(None, left.freq + right.freq, left, right)
heapq.heappush(heap, merged)
return heap[0]
def generate_codes(node, prefix="", codebook={}):
if node:
if node.char:
codebook[node.char] = prefix
generate_codes(node.left, prefix + "0", codebook)
generate_codes(node.right, prefix + "1", codebook)
return codebook
text = "hello world"
tree = build_huffman_tree(text)
codes = generate_codes(tree)
print(codes)
5 Advantages[편집 | 원본 편집]
- Lossless Compression
- Ensures no data loss in reconstruction.
- Efficient Encoding
- Optimized for frequent characters, reducing overall storage size.
- Used in Many Applications
- Commonly applied in ZIP, GZIP, JPEG, and other compression formats.
6 Limitations[편집 | 원본 편집]
- Requires Frequency Table
- The table must be transmitted alongside the encoded data.
- Not Optimal for Small Texts
- Overhead can be significant for short messages.
- Sensitive to Input Data
- Performance depends on character frequency distribution.
7 Applications[편집 | 원본 편집]
- File Compression
- Used in ZIP and GZIP formats.
- Multimedia Encoding
- Applied in JPEG and MP3 compression.
- Data Transmission
- Efficient encoding in communication protocols.