Resilient Distributed Datasets

Resilient Distributed Datasets (RDDs) are the fundamental data structure in Apache Spark that provide fault-tolerant, parallel computation on large datasets. RDDs enable efficient distributed data processing while ensuring resilience to failures.

1 Overview

RDDs are immutable, distributed collections of objects that can be processed in parallel. They are designed to optimize large-scale data processing by:

  • Fault Tolerance: Automatically recovering lost data using lineage (replaying the recorded transformations from the source data).
  • In-Memory Processing: Storing intermediate results in memory to improve performance.
  • Lazy Evaluation: Transformations are not executed immediately but only when an action is triggered.
  • Partitioning: Data is split across nodes to allow parallel execution.

2 Key Features

  • Immutability: Once created, RDDs cannot be modified; transformations create new RDDs.
  • Lineage Tracking: Maintains a history of transformations to recompute lost partitions.
  • Lazy Evaluation: Delays execution until an action (e.g., count, collect) is called.
  • Fault Tolerance: Automatically recomputes lost partitions without replicating data.
  • Parallel Computation: Distributes tasks across nodes in a Spark cluster.

3 Creating RDDs

RDDs can be created in two main ways:

  1. From an existing collection:
from pyspark import SparkContext
sparkContext = SparkContext("local[*]", "rdd-examples")  # local master, for illustration
data = [1, 2, 3, 4, 5]
rdd = sparkContext.parallelize(data)
  2. From an external data source:
rdd = sparkContext.textFile("hdfs://path/to/file.txt")
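
Two properties from the feature lists above can be observed directly: transformations leave the source RDD untouched, and nothing runs until an action is called. A minimal sketch continuing from the `sparkContext` and `rdd` created above (the name `doubled` is ours):

doubled = rdd.map(lambda x: x * 2)  # returns a new RDD; nothing executes yet
print(rdd.collect())                # [1, 2, 3, 4, 5] -- the original is unchanged
print(doubled.collect())            # [2, 4, 6, 8, 10]
print(rdd.getNumPartitions())       # number of partitions available for parallel work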

4 Transformations and Actions

RDDs support two types of operations:

4.1 Transformations (Lazy Evaluation)

Transformations produce new RDDs from existing ones but do not execute immediately:

  • map(func) – Applies a function to each element.
  • filter(func) – Keeps elements that satisfy a condition.
  • flatMap(func) – Like map, but each input may produce zero or more outputs, which are flattened into one RDD.
  • union(rdd) – Merges two RDDs.
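
A short sketch of these operations, reusing the `rdd` from section 3 (the intermediate names are ours). Note that none of these lines starts a job:

evens = rdd.filter(lambda x: x % 2 == 0)  # keep even numbers
squared = evens.map(lambda x: x * x)      # square each remaining element
signs = rdd.flatMap(lambda x: [x, -x])    # one input -> two outputs
merged = squared.union(signs)             # combine two RDDs
# Each line only extends the lineage graph; no computation has happened yet.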

4.2 Actions (Trigger Execution)

Actions compute and return results or store data:

  • collect() – Returns all elements to the driver.
  • count() – Returns the number of elements in the RDD.
  • reduce(func) – Aggregates elements using a function.
  • saveAsTextFile(path) – Saves the RDD to a storage location.
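
Continuing the sketch above, calling an action on `merged` is what finally runs a job (the output path is hypothetical):

print(merged.count())                  # number of elements after all transformations
print(merged.collect())                # materializes every element at the driver
print(rdd.reduce(lambda a, b: a + b))  # 15 for [1, 2, 3, 4, 5]
merged.saveAsTextFile("hdfs://path/to/output")  # writes one file per partition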

5 RDD Lineage and Fault Tolerance

RDDs achieve fault tolerance through lineage tracking:

  • Instead of replicating data, Spark logs the sequence of transformations.
  • If a node fails, Spark recomputes lost partitions from the original dataset.
  • This approach minimizes storage overhead while ensuring reliability.
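
The recorded lineage can be inspected with the RDD method `toDebugString()` (the output format varies by Spark version, and PySpark may return it as bytes):

lineage = merged.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)
# Prints the chain of parent RDDs (union <- map <- filter / flatMap <- parallelize),
# which is exactly what Spark replays to rebuild a lost partition.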

6 Comparison with Other Distributed Data Models

Feature | RDDs (Spark) | MapReduce (Hadoop) | DataFrames (Spark)
Data Processing | In-memory | Disk-based | Optimized execution plans
Fault Tolerance | Lineage (recomputes lost data) | Replication | Lineage (like RDDs)
Performance | Fast (RAM-based) | Slow (disk I/O) | Faster (columnar storage)
Ease of Use | Low (requires functional programming) | Low (requires custom Java/Python) | High (SQL-like API)
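
To make the ease-of-use row concrete, the same filter can be written against both APIs. A sketch, with the SparkSession setup and sample data being ours:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("compare").getOrCreate()

# RDD API: functional style, positional tuple fields, no automatic optimization.
people = spark.sparkContext.parallelize([("Kim", 34), ("Lee", 19)])
adults_rdd = people.filter(lambda p: p[1] >= 21)

# DataFrame API: named columns and a declarative, optimizable query.
df = spark.createDataFrame(people, ["name", "age"])
adults_df = df.filter(df.age >= 21)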

7 Advantages

  • High Performance: In-memory computation reduces I/O overhead.
  • Scalability: Designed to handle petabyte-scale data.
  • Fault Tolerance: Efficient recovery via lineage tracking.
  • Flexible API: Supports functional programming in Scala, Python, and Java.
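
The in-memory advantage is opt-in per RDD: marking an RDD with `cache()` (or `persist()`) keeps its computed partitions in executor memory so later actions reuse them instead of recomputing. A brief sketch, reusing `rdd` from section 3:

expensive = rdd.map(lambda x: x * x).cache()  # mark for in-memory storage
expensive.count()    # first action computes the partitions and caches them
expensive.collect()  # second action reads from memory, skipping recomputation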

8 Limitations

  • Complex API: Requires functional programming knowledge.
  • High Memory Usage: Inefficient for certain workloads compared to optimized data structures like DataFrames.
  • No Schema Optimization: Unlike DataFrames, RDDs do not optimize queries automatically.
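
The last point is visible in the APIs themselves: a DataFrame can print the plan produced by Spark's Catalyst optimizer, while an RDD pipeline runs exactly the functions the user wrote. Continuing the DataFrame sketch from section 6:

adults_df.explain()  # prints the optimized physical plan produced by Catalyst
# RDDs have no equivalent: rdd.filter(f).map(g) always runs f, then g, as written.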

9 Applications

  • Big Data Processing: Used in large-scale ETL and analytics pipelines.
  • Machine Learning: Supports distributed ML algorithms via MLlib.
  • Graph Processing: Backbone of GraphX for scalable graph analytics.
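
For instance, MLlib's original RDD-based API trains models directly on RDDs. A minimal sketch reusing the `sparkContext` from section 3, with toy data of our own:

from pyspark.mllib.clustering import KMeans

points = sparkContext.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, 2, maxIterations=10)  # distributed k-means over an RDD
print(model.predict([0.5, 0.5]))                   # cluster id for a new point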
