'''Resilient Distributed Datasets (RDDs)''' are the fundamental data structure in [[Apache Spark]] that provide fault-tolerant, parallel computation on large datasets. RDDs enable efficient distributed data processing while ensuring resilience to failures.

==Overview==
RDDs are immutable, distributed collections of objects that can be processed in parallel. They are designed to optimize large-scale data processing by:
*'''Fault Tolerance:''' Automatically recovering lost data using lineage (recomputing from the original data).
*'''In-Memory Processing:''' Storing intermediate results in memory to improve performance.
*'''Lazy Evaluation:''' Transformations are not executed immediately but only when an action is triggered.
*'''Partitioning:''' Data is split across nodes to allow parallel execution.

==Key Features==
*'''Immutability:''' Once created, RDDs cannot be modified; transformations create new RDDs.
*'''Lineage Tracking:''' Maintains a history of transformations to recompute lost partitions.
*'''Lazy Evaluation:''' Delays execution until an action (e.g., <code>count</code>, <code>collect</code>) is called.
*'''Fault Tolerance:''' Automatically recomputes lost partitions without replicating data.
*'''Parallel Computation:''' Distributes tasks across the nodes of a Spark cluster.

==Creating RDDs==
RDDs can be created in two main ways:
#'''From an existing collection:'''
<syntaxhighlight lang="python">
data = [1, 2, 3, 4, 5]
rdd = sparkContext.parallelize(data)
</syntaxhighlight>
#'''From an external data source:'''
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("hdfs://path/to/file.txt")
</syntaxhighlight>

==Transformations and Actions==
RDDs support two types of operations:

===Transformations (Lazy Evaluation)===
Transformations produce new RDDs from existing ones but do not execute immediately:
*'''map(func)''' – Applies a function to each element.
*'''filter(func)''' – Keeps the elements that satisfy a condition.
*'''flatMap(func)''' – Similar to map, but each input element may produce multiple output values.
*'''union(rdd)''' – Merges two RDDs.

===Actions (Trigger Execution)===
Actions compute and return results or store data:
*'''collect()''' – Returns all elements to the driver.
*'''count()''' – Returns the number of elements in the RDD.
*'''reduce(func)''' – Aggregates elements using a function.
*'''saveAsTextFile(path)''' – Saves the RDD to a storage location.

==RDD Lineage and Fault Tolerance==
RDDs achieve fault tolerance through lineage tracking:
*Instead of replicating data, Spark logs the sequence of transformations.
*If a node fails, Spark recomputes the lost partitions from the original dataset.
*This approach minimizes storage overhead while ensuring reliability.

==Comparison with Other Distributed Data Models==
{| class="wikitable"
!Feature!!RDDs (Spark)!!MapReduce (Hadoop)!!DataFrames (Spark)
|-
|'''Data Processing'''||In-memory||Disk-based||Optimized execution plans
|-
|'''Fault Tolerance'''||Lineage (recomputes lost data)||Replication||Lineage (like RDDs)
|-
|'''Performance'''||Fast (RAM-based)||Slow (disk I/O)||Faster (columnar storage)
|-
|'''Ease of Use'''||Low (requires functional programming)||Low (requires custom Java/Python)||High (SQL-like API)
|}

==Advantages==
*'''High Performance:''' In-memory computation reduces I/O overhead.
*'''Scalability:''' Designed to handle petabyte-scale data.
*'''Fault Tolerance:''' Efficient recovery via lineage tracking.
*'''Flexible API:''' Supports functional programming in Scala, Python, and Java.

==Limitations==
*'''Complex API:''' Requires functional programming knowledge.
*'''High Memory Usage:''' Less efficient for certain workloads than optimized data structures such as DataFrames.
*'''No Schema Optimization:''' Unlike DataFrames, RDDs do not optimize queries automatically.

==Applications==
*'''Big Data Processing:''' Used in large-scale ETL and analytics pipelines.
*'''Machine Learning:''' Supports distributed ML algorithms via [[MLlib]].
*'''Graph Processing:''' Backbone of [[GraphX]] for scalable graph analytics.

==See Also==
*[[Apache Spark]]
*[[DataFrames (Spark)]]
*[[Hadoop MapReduce]]
*[[Distributed Computing]]
*[[Big Data Processing]]

[[분류:Distributed Computing]]
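The transformations and actions described above can be exercised end to end in a few lines of PySpark. The following is an illustrative sketch, not part of the original article: the application name <code>RDDExample</code> and the <code>local[*]</code> master are assumptions for running on a single machine with PySpark installed.

<syntaxhighlight lang="python">
from pyspark import SparkContext

# Local SparkContext; "local[*]" uses all available cores (assumption for demo).
sc = SparkContext("local[*]", "RDDExample")

# Create an RDD from an existing collection.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations only build up lineage; nothing executes yet.
squares = numbers.map(lambda x: x * x)          # 1, 4, 9, 16, 25
evens = squares.filter(lambda x: x % 2 == 0)    # 4, 16
doubled = numbers.flatMap(lambda x: [x, x])     # each input yields two outputs
merged = squares.union(evens)                   # 5 + 2 = 7 elements

# Actions trigger execution of the lineage built above.
evens_result = evens.collect()                  # [4, 16]
n = numbers.count()                             # 5
total = numbers.reduce(lambda a, b: a + b)      # 15
merged_count = merged.count()                   # 7

sc.stop()
</syntaxhighlight>

Because evaluation is lazy, the four transformations above cost nothing until the first action runs; if a partition of <code>evens</code> were lost, Spark would replay the recorded <code>map</code> and <code>filter</code> steps on the affected partition of the source data rather than restoring a replica.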