
'''MapReduce''' is a programming model and framework for processing and generating large datasets in a distributed computing environment. It simplifies big data processing by dividing work into two primary phases: the '''Map''' phase and the '''Reduce''' phase. Introduced by Google in 2004, MapReduce became a foundational concept in distributed data processing and is the model behind systems such as Apache Hadoop.
==Key Concepts==
*'''Map Phase:''' Applies a user-defined map function to each input record and emits intermediate key-value pairs. Records are processed independently of one another, which enables parallelism.
*'''Shuffle and Sort Phase:''' Groups and sorts intermediate key-value pairs by their keys, preparing them for the Reduce phase.
*'''Reduce Phase:''' Applies a user-defined reduce function to each key and its grouped values, typically aggregating them, to produce the final output.
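The Shuffle and Sort phase described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not how any particular framework implements it: the function names <code>partition_for</code> and <code>shuffle_and_sort</code> are hypothetical, and the use of <code>zlib.crc32</code> as the hash is an assumption chosen only because it gives the same result on every run.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key with a stable hash (crc32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(pairs, num_reducers):
    """Route each (key, value) pair to a partition, then group and sort by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order, with all values for a key together.
    return [sorted(p.items()) for p in partitions]

# Hypothetical intermediate output of the Map phase.
intermediate = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
result = shuffle_and_sort(intermediate, num_reducers=2)
```

Because the partition is derived from the key alone, every value for a given key ends up at the same reducer, which is what allows the Reduce phase to aggregate per key.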
==How MapReduce Works==
MapReduce operates in the following steps:
#Input data is divided into smaller splits or chunks for parallel processing.
#The '''Map phase''' processes each chunk to generate intermediate key-value pairs.
#Intermediate key-value pairs are grouped and sorted by keys in the '''Shuffle and Sort phase'''.
#The '''Reduce phase''' aggregates the grouped key-value pairs to generate the final output.
#Results are written to distributed storage.
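The five steps above can be traced in a small in-memory driver. This is a sketch under simplifying assumptions: everything runs in one process, the splits are processed one after another rather than in parallel, and the result is returned as a dictionary instead of being written to distributed storage. The names <code>split_input</code> and <code>run_mapreduce</code> and the sample readings are hypothetical.

```python
from collections import defaultdict

def split_input(records, num_splits):
    """Step 1: divide the input into roughly equal splits."""
    return [records[i::num_splits] for i in range(num_splits)]

def run_mapreduce(records, mapper, reducer, num_splits=2):
    # Step 2: run the mapper over every record of every split.
    intermediate = []
    for split in split_input(records, num_splits):
        for record in split:
            intermediate.extend(mapper(record))
    # Step 3: group the intermediate pairs by key; keys are sorted below.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Step 4: reduce each key's values to a single result.
    # Step 5: a real job would write to distributed storage; this sketch returns a dict.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# Hypothetical input: (city, temperature) readings.
readings = [("seoul", 21), ("busan", 25), ("seoul", 27), ("busan", 19)]
max_temps = run_mapreduce(
    readings,
    mapper=lambda rec: [(rec[0], rec[1])],
    reducer=lambda key, values: max(values),
)
print(max_temps)  # {'busan': 25, 'seoul': 27}
```

Only the <code>mapper</code> and <code>reducer</code> arguments change from one job to the next; the splitting, grouping, and sorting stay the same.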
===Example Workflow===
Suppose we want to count the number of occurrences of each word in a large dataset:
#The input text is divided into multiple chunks.
#In the '''Map phase''', each mapper emits a key-value pair for every word (e.g., "word" → 1).
#In the '''Shuffle and Sort phase''', intermediate key-value pairs are grouped by word (e.g., "word" → [1, 1, 1]).
#In the '''Reduce phase''', reducers sum the counts for each word (e.g., "word" → 3).
#The final counts are written to a file (e.g., "word: 3").
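The word count workflow above could look like the following in plain Python. It is a minimal single-process sketch: the three sample chunks are made-up input, the helper names are hypothetical, and the final step builds output lines in memory rather than writing a file.

```python
from collections import defaultdict

def map_words(chunk):
    """Map: emit (word, 1) for every word in a chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def group_by_word(pairs):
    """Shuffle and Sort: collect the 1s emitted for each word, ordered by word."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return dict(sorted(groups.items()))

def reduce_counts(word, ones):
    """Reduce: sum the 1s for a word."""
    return word, sum(ones)

# Hypothetical input, already divided into chunks.
chunks = ["the quick brown fox", "the lazy dog", "the end"]

pairs = [pair for chunk in chunks for pair in map_words(chunk)]
grouped = group_by_word(pairs)            # e.g. "the" -> [1, 1, 1]
counts = dict(reduce_counts(w, ones) for w, ones in grouped.items())
lines = [f"{word}: {count}" for word, count in counts.items()]
```

With this input, <code>grouped["the"]</code> is <code>[1, 1, 1]</code> and the corresponding output line is <code>the: 3</code>.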
==Advantages==
*'''Scalability:''' Can process massive datasets by distributing tasks across multiple nodes.
*'''Fault Tolerance:''' Automatically handles node failures by re-executing failed tasks.
*'''Simplicity:''' Abstracts the complexity of distributed processing, allowing developers to focus on the logic of Map and Reduce functions.
*'''Parallelism:''' Processes data concurrently, reducing execution time.
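Parallel execution and re-execution of failed tasks can be illustrated together. In the sketch below, worker threads stand in for cluster nodes and a custom exception stands in for a node failure; both are assumptions made for illustration, as are the names <code>flaky_map_task</code> and <code>run_with_retries</code>. A real framework schedules tasks across machines and detects failures itself.

```python
from concurrent.futures import ThreadPoolExecutor

class TaskFailed(Exception):
    """Stands in for a node failure in this sketch."""

attempts = {}  # task_id -> number of attempts so far

def flaky_map_task(task_id, chunk):
    """Count the words in a chunk; task 1 fails on its first attempt."""
    attempts[task_id] = attempts.get(task_id, 0) + 1
    if task_id == 1 and attempts[task_id] == 1:
        raise TaskFailed(f"task {task_id} lost its node")
    return len(chunk.split())

def run_with_retries(task_id, chunk, max_attempts=3):
    """Re-execute a failed task until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return flaky_map_task(task_id, chunk)
        except TaskFailed:
            if attempt == max_attempts:
                raise

# Hypothetical input chunks, one per map task.
chunks = ["a b c", "d e", "f g h i"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # The tasks run concurrently; results come back in task order.
    word_counts = list(pool.map(run_with_retries, range(len(chunks)), chunks))
```

Re-execution is safe here because each map task depends only on its own input chunk, so running it a second time gives the same result.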
==Limitations==
*'''High Latency:''' Job startup, disk I/O between phases, and the shuffle and sort phase add significant overhead, making MapReduce unsuitable for low-latency tasks.
*'''Limited Flexibility:''' Requires problems to be expressed in terms of Map and Reduce, which may not fit all use cases.
*'''Iterative Processing:''' Inefficient for iterative tasks, such as machine learning, because each iteration typically runs as a separate job that reads its input from disk and writes its output back to disk.
==Applications==
MapReduce is widely used in:
*'''Data Analytics:''' Processing logs, clickstream data, and web analytics.
*'''Search Indexing:''' Building and updating search indexes for search engines.
*'''Machine Learning:''' Processing large-scale training datasets.
*'''ETL (Extract, Transform, Load):''' Cleaning and transforming large datasets for data warehouses.
*'''Big Data Processing:''' Handling large datasets in industries such as finance, healthcare, and telecommunications.
==MapReduce in Apache Hadoop==
Apache Hadoop is one of the most popular open-source implementations of MapReduce. Hadoop pairs its MapReduce engine with supporting components:
*'''Distributed File System (HDFS):''' Provides storage for input and output data.
*'''Resource Management (YARN):''' Manages cluster resources and schedules MapReduce tasks.
*'''Fault Tolerance:''' Automatically replicates data and re-executes failed tasks.
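Hadoop can also run mappers and reducers written as ordinary scripts through Hadoop Streaming, where each script reads lines from standard input and writes tab-separated key-value lines to standard output. The sketch below shows that line-oriented style for word count. To keep it self-contained, the functions take iterables of lines instead of reading <code>sys.stdin</code>, and a call to <code>sorted</code> stands in for the sort the framework performs between the two phases. The function names and sample text are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts per word; relies on the input arriving sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Hypothetical input lines.
text = ["to be or not", "to be"]
mapped = sorted(mapper(text))   # stands in for the framework's sort by key
reduced = list(reducer(mapped))
```

The reducer uses <code>groupby</code> rather than a dictionary because sorted input guarantees that all lines for one word are adjacent, so it only needs to hold one word's counts at a time.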
==Comparison with Other Frameworks==
{| class="wikitable"
!Feature!!MapReduce (Hadoop)!!Spark!!Flink
|-
|'''Execution Model'''||Batch Processing||Batch and Stream Processing||Stream Processing (also supports batch)
|-
|'''Latency'''||High||Low||Very Low
|-
|'''Ease of Use'''||Moderate||High (with APIs like PySpark)||Moderate
|-
|'''Fault Tolerance'''||High||High||High
|-
|'''Use Cases'''||ETL, log processing||Machine learning, interactive and near-real-time analytics||Real-time analytics, complex event processing
|}
==See Also==
*[[Apache Hadoop]]
*[[Apache Spark]]
*[[Distributed Computing]]
*[[Big Data]]
*[[ETL Process]]
*[[Google File System (GFS)]]