Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the fundamental data structure in Apache Spark, providing fault-tolerant, parallel computation on large datasets. RDDs enable efficient distributed data processing while remaining resilient to node failures.
1 Overview
RDDs are immutable, distributed collections of objects that can be processed in parallel. They are designed to optimize large-scale data processing by:
- Fault Tolerance: Automatically recovering lost data using lineage (recomputing lost partitions from the source data and the recorded transformations).
- In-Memory Processing: Storing intermediate results in memory to improve performance.
- Lazy Evaluation: Transformations are not executed immediately but only when an action is triggered (see the sketch after this list).
- Partitioning: Data is split across nodes to allow parallel execution.
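A minimal sketch of lazy evaluation, assuming a local PySpark installation (the application name and data are illustrative): the two transformations below only record lineage, and computation happens only when the action collect() runs.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-demo")

numbers = sc.parallelize(range(1, 11), 4)      # split across 4 partitions
squares = numbers.map(lambda x: x * x)         # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # transformation: still nothing runs

print(evens.collect())   # action: triggers the whole chain -> [4, 16, 36, 64, 100]

sc.stop()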
2 Key Features
- Immutability: Once created, RDDs cannot be modified; transformations create new RDDs (illustrated in the sketch after this list).
- Lineage Tracking: Maintains a history of transformations to recompute lost partitions.
- Lazy Evaluation: Delays execution until an action (e.g., count, collect) is called.
- Fault Tolerance: Automatically recomputes lost partitions without replicating data.
- Parallel Computation: Distributes tasks across nodes in a Spark cluster.
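A small local-mode sketch of immutability (the names are illustrative): applying a transformation leaves the source RDD unchanged and returns a new one.

from pyspark import SparkContext

sc = SparkContext("local[*]", "immutability-demo")

base = sc.parallelize([1, 2, 3])
doubled = base.map(lambda x: x * 2)   # a new RDD; base itself is never modified

print(base.collect())      # [1, 2, 3]
print(doubled.collect())   # [2, 4, 6]

sc.stop()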
3 Creating RDDs
RDDs can be created in two main ways:
- From an existing collection:
from pyspark import SparkContext
sc = SparkContext("local[*]", "rdd-creation")
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
- From an external data source:
rdd = sc.textFile("hdfs://path/to/file.txt")
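Both methods accept an optional partition count: sc.parallelize(data, 4) splits the collection across four partitions, and sc.textFile(path, minPartitions=4) requests a minimum number of input splits.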
4 Transformations and Actions
RDDs support two types of operations:
4.1 Transformations (Lazy Evaluation)
Transformations produce new RDDs from existing ones but do not execute immediately (the sketch after this list chains them together):
- map(func) – Applies a function to each element.
- filter(func) – Keeps elements that satisfy a condition.
- flatMap(func) – Similar to map but allows returning multiple values per input.
- union(rdd) – Merges two RDDs.
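A minimal local-mode sketch chaining the four transformations above (the sample strings are illustrative); nothing executes until the final collect():

from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-demo")

lines = sc.parallelize(["spark makes", "big data simple"])

tokens = lines.flatMap(lambda line: line.split(" "))   # one input -> many outputs
long_tokens = tokens.filter(lambda w: len(w) > 4)      # keep words longer than 4 chars
upper = long_tokens.map(lambda w: w.upper())           # transform each element
merged = upper.union(sc.parallelize(["RDD"]))          # merge two RDDs

print(merged.collect())   # ['SPARK', 'MAKES', 'SIMPLE', 'RDD']

sc.stop()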
4.2 Actions (Trigger Execution)
Actions compute and return results or store data (see the sketch after this list):
- collect() – Returns all elements to the driver.
- count() – Returns the number of elements in the RDD.
- reduce(func) – Aggregates elements using a function.
- saveAsTextFile(path) – Saves the RDD to a storage location.
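A local-mode sketch of the actions listed above (the output path is illustrative and must not already exist):

from pyspark import SparkContext

sc = SparkContext("local[*]", "actions-demo")

rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())                      # 5
print(rdd.collect())                    # [1, 2, 3, 4, 5]
print(rdd.reduce(lambda a, b: a + b))   # 15

rdd.saveAsTextFile("/tmp/rdd-output")   # writes one part file per partition

sc.stop()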
5 RDD Lineage and Fault Tolerance
RDDs achieve fault tolerance through lineage tracking (see the sketch after this list):
- Instead of replicating data, Spark logs the sequence of transformations.
- If a node fails, Spark recomputes lost partitions from the original dataset.
- This approach minimizes storage overhead while ensuring reliability.
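The recorded lineage can be inspected with toDebugString(), which prints the chain of dependencies Spark would replay to rebuild a lost partition (a minimal local-mode sketch):

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

rdd = sc.parallelize(range(100), 4)
result = rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)

# Prints the dependency chain from the final RDD back to the source collection.
print(result.toDebugString().decode())

sc.stop()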
6 Comparison with Other Distributed Data Models
| Feature | RDDs (Spark) | MapReduce (Hadoop) | DataFrames (Spark) |
| --- | --- | --- | --- |
| Data Processing | In-memory | Disk-based | In-memory with optimized execution plans |
| Fault Tolerance | Lineage (recomputes lost data) | Replication | Lineage (like RDDs) |
| Performance | Fast (RAM-based) | Slow (disk I/O) | Faster (columnar storage) |
| Ease of Use | Low (requires functional programming) | Low (requires custom Java/Python code) | High (SQL-like API) |
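To make the ease-of-use row concrete, here is a hedged sketch of the same grouped sum in both APIs (the key/value data are illustrative; reduceByKey is a key-based aggregation transformation not listed above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

pairs = [("a", 1), ("b", 2), ("a", 3)]

# RDD style: explicit functional operations, no automatic optimization.
print(sc.parallelize(pairs).reduceByKey(lambda a, b: a + b).collect())

# DataFrame style: declarative; Spark builds and optimizes the execution plan.
spark.createDataFrame(pairs, ["key", "value"]).groupBy("key").sum("value").show()

spark.stop()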
7 Advantages
- High Performance: In-memory computation reduces I/O overhead (see the caching sketch after this list).
- Scalability: Designed to handle petabyte-scale data.
- Fault Tolerance: Efficient recovery via lineage tracking.
- Flexible API: Supports functional programming in Scala, Python, and Java.
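In-memory computation can be made explicit with caching, as in this hedged local-mode sketch: the first action materializes the RDD, and later actions reuse the cached partitions instead of recomputing them.

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

squares = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
squares.cache()          # ask Spark to keep computed partitions in memory

print(squares.count())   # first action computes and caches
print(squares.sum())     # subsequent actions reuse the cached data

sc.stop()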
8 Limitations
- Complex API: Requires functional programming knowledge.
- High Memory Usage: Caching RDDs of raw objects consumes more memory than the compact, optimized encoding used by DataFrames.
- No Schema Optimization: Unlike DataFrames, RDDs do not optimize queries automatically.
9 Applications
- Big Data Processing: Used in large-scale ETL and analytics pipelines.
- Machine Learning: Supports distributed ML algorithms via MLlib.
- Graph Processing: Backbone of GraphX for scalable graph analytics.