Huffman Code

IT 위키

Huffman Code is a lossless data compression algorithm that assigns variable-length binary codes to input characters based on their frequency. It is widely used in data compression applications such as file compression and encoding.

1 Concept[편집 | 원본 편집]

Huffman coding works by assigning shorter codes to more frequent characters and longer codes to less frequent characters, minimizing the total number of bits used to represent the data.

2 Algorithm[편집 | 원본 편집]

The Huffman coding algorithm follows these steps:

  1. Calculate Frequency: Count the occurrence of each character in the input data.
  2. Build a Priority Queue: Create a min-heap where each node represents a character and its frequency.
  3. Construct a Huffman Tree:
    • Extract the two nodes with the lowest frequency.
    • Merge them into a new node with their combined frequency.
    • Repeat until only one node remains, which becomes the root of the tree.
  4. Assign Binary Codes: Traverse the tree and assign binary values (0 for left, 1 for right) to generate unique Huffman codes for each character.

3 Example[편집 | 원본 편집]

Given the input string hello world, the frequency table and corresponding Huffman codes might be:

Character Frequency Huffman Code
h 1 1100
e 1 1101
l 3 10
o 2 00
w 1 1110
r 1 1111
d 1 01

The compressed representation replaces each character with its Huffman code.

4 Implementation[편집 | 원본 편집]

A simple implementation of Huffman coding in Python:

import heapq
from collections import Counter, namedtuple

class Node(namedtuple("Node", ["char", "freq", "left", "right"])):
    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text):
    freq = Counter(text)
    heap = [Node(char, freq, None, None) for char, freq in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = Node(None, left.freq + right.freq, left, right)
        heapq.heappush(heap, merged)

    return heap[0]

def generate_codes(node, prefix="", codebook={}):
    if node:
        if node.char:
            codebook[node.char] = prefix
        generate_codes(node.left, prefix + "0", codebook)
        generate_codes(node.right, prefix + "1", codebook)
    return codebook

text = "hello world"
tree = build_huffman_tree(text)
codes = generate_codes(tree)
print(codes)

5 Advantages[편집 | 원본 편집]

  • Lossless Compression
    • Ensures no data loss in reconstruction.
  • Efficient Encoding
    • Optimized for frequent characters, reducing overall storage size.
  • Used in Many Applications
    • Commonly applied in ZIP, GZIP, JPEG, and other compression formats.

6 Limitations[편집 | 원본 편집]

  • Requires Frequency Table
    • The table must be transmitted alongside the encoded data.
  • Not Optimal for Small Texts
    • Overhead can be significant for short messages.
  • Sensitive to Input Data
    • Performance depends on character frequency distribution.

7 Applications[편집 | 원본 편집]

  • File Compression
    • Used in ZIP and GZIP formats.
  • Multimedia Encoding
    • Applied in JPEG and MP3 compression.
  • Data Transmission
    • Efficient encoding in communication protocols.

8 See Also[편집 | 원본 편집]