MapReduce
MapReduce is a programming model and framework designed for processing and generating large datasets in a distributed computing environment. It simplifies the processing of big data by dividing tasks into two primary phases: the Map phase and the Reduce phase. Developed by Google, MapReduce has become a foundational concept in distributed data processing systems, such as Apache Hadoop.
1 Key Concepts
- Map Phase: Processes input data and converts it into key-value pairs. Each pair is processed independently, enabling parallelism.
- Shuffle and Sort Phase: Groups and sorts intermediate key-value pairs by their keys, preparing them for the Reduce phase.
- Reduce Phase: Aggregates or processes the sorted key-value pairs to produce the final output; the sketch below shows the contract the two user-supplied functions follow.
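In the original MapReduce paper, the two user-supplied functions follow a simple contract: map takes an input pair and produces intermediate pairs, and reduce takes an intermediate key together with all of its grouped values. A minimal Python sketch of that contract (the alias names MapFn and ReduceFn are illustrative, not from any framework):

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

K1 = TypeVar("K1")  # input key type
V1 = TypeVar("V1")  # input value type
K2 = TypeVar("K2")  # intermediate key type
V2 = TypeVar("V2")  # intermediate value type

# map: (k1, v1) -> zero or more intermediate (k2, v2) pairs
MapFn = Callable[[K1, V1], Iterator[Tuple[K2, V2]]]

# reduce: (k2, all v2 values grouped under k2) -> output values
ReduceFn = Callable[[K2, Iterable[V2]], Iterator[V2]]
```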
2 How MapReduce Works
MapReduce operates in the following steps; a minimal single-process simulation follows the list:
- Input data is divided into smaller splits or chunks for parallel processing.
- The Map phase processes each chunk to generate intermediate key-value pairs.
- Intermediate key-value pairs are grouped and sorted by keys in the Shuffle and Sort phase.
- The Reduce phase aggregates the grouped key-value pairs to generate the final output.
- Results are written to distributed storage.
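To make these steps concrete, here is a single-process simulation in Python. It is an illustrative sketch, not how a distributed framework is implemented: the name run_mapreduce is invented for this example, each record is passed to map_fn on its own for brevity, and reduce_fn returns one output pair per key.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple

def run_mapreduce(
    splits: Iterable[List[str]],                             # input pre-divided into chunks
    map_fn: Callable[[str], Iterator[Tuple[str, int]]],      # record -> intermediate pairs
    reduce_fn: Callable[[str, List[int]], Tuple[str, int]],  # key + grouped values -> output pair
) -> List[Tuple[str, int]]:
    # Map phase: apply map_fn to every record of every split.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # Shuffle and Sort phase: group intermediate values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: collapse each group into one final pair.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]
```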
2.1 Example Workflow
Suppose we want to count the number of occurrences of each word in a large dataset; a runnable word-count sketch follows the list:
- The input text is divided into multiple chunks.
- In the Map phase, each mapper emits a key-value pair for every word (e.g., "word" → 1).
- In the Shuffle and Sort phase, intermediate key-value pairs are grouped by word (e.g., "word" → [1, 1, 1]).
- In the Reduce phase, reducers sum the counts for each word (e.g., "word" → 3).
- The final counts are written to a file (e.g., "word: 3").
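Plugging word count into the run_mapreduce sketch above (again illustrative; the function names are assumptions of this example, not a framework API):

```python
from typing import Iterator, List, Tuple

def word_count_map(line: str) -> Iterator[Tuple[str, int]]:
    # Map: emit ("word", 1) for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def word_count_reduce(word: str, counts: List[int]) -> Tuple[str, int]:
    # Reduce: sum the 1s grouped under this word.
    return (word, sum(counts))

splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
print(run_mapreduce(splits, word_count_map, word_count_reduce))
# [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]
```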
3 Advantages
- Scalability: Can process massive datasets by distributing tasks across multiple nodes.
- Fault Tolerance: Automatically handles node failures by re-executing failed tasks.
- Simplicity: Abstracts the complexity of distributed processing, allowing developers to focus on the logic of Map and Reduce functions.
- Parallelism: Processes data concurrently, reducing execution time.
4 Limitations
- High Latency: The shuffle and sort phase introduces significant overhead, making MapReduce unsuitable for low-latency tasks.
- Limited Flexibility: Requires problems to be expressed in terms of Map and Reduce, which may not fit all use cases.
- Iterative Processing: Inefficient for iterative tasks, such as machine learning, as each iteration requires reading and writing to disk.
5 Applications
MapReduce is widely used in:
- Data Analytics: Processing logs, clickstream data, and web analytics.
- Search Indexing: Building and updating search indexes for search engines.
- Machine Learning: Processing large-scale training datasets.
- ETL (Extract, Transform, Load): Cleaning and transforming large datasets for data warehouses.
- Big Data Processing: Handling large datasets in industries such as finance, healthcare, and telecommunications.
6 MapReduce in Apache Hadoop
Apache Hadoop is the most widely used open-source implementation of MapReduce, pairing the programming model with supporting infrastructure (a Hadoop Streaming sketch follows the list):
- HDFS (Hadoop Distributed File System): Provides distributed storage for input and output data.
- YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules MapReduce tasks.
- Fault Tolerance: Automatically replicates data and re-executes failed tasks.
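One common way to run MapReduce jobs on Hadoop without writing Java is Hadoop Streaming, which pipes records through any executable over stdin/stdout. A sketch of the word count as streaming scripts (file names are placeholders; note that a streaming reducer receives its input already sorted by key and must detect key boundaries itself):

```python
# mapper.py -- read raw lines from stdin, emit one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
# reducer.py -- input arrives sorted by key, so counts for a word are contiguous
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

The job is then submitted with the hadoop jar command and the streaming jar, shipping the scripts via -files and wiring them in with -mapper and -reducer; exact jar paths vary by distribution.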
7 Comparison with Other Frameworks
| Feature | MapReduce (Hadoop) | Spark | Flink |
|---|---|---|---|
| Execution Model | Batch processing | Batch and stream processing | Stream processing |
| Latency | High | Low | Very low |
| Ease of Use | Moderate | High (with APIs like PySpark) | Moderate |
| Fault Tolerance | High | High | High |
| Use Cases | ETL, log processing | Machine learning, real-time analytics | Real-time analytics, complex event processing |
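To illustrate the ease-of-use row above: in PySpark, the same word count is a short chain of transformations rather than a full job definition. A sketch assuming a local SparkContext and placeholder input/output paths:

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")  # local mode, for illustration only
counts = (
    sc.textFile("input.txt")                       # one RDD element per input line
      .flatMap(lambda line: line.lower().split())  # Map: split lines into words
      .map(lambda word: (word, 1))                 # pair every word with a count of 1
      .reduceByKey(add)                            # shuffle + Reduce: sum counts per word
)
counts.saveAsTextFile("word_counts")               # placeholder output directory
```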