Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the fundamental data structure in Apache Spark, providing fault-tolerant, parallel computation on large datasets. RDDs enable efficient distributed data processing while remaining resilient to node failures.
1 Overview
RDDs are immutable, distributed collections of objects that can be processed in parallel. They are designed to optimize large-scale data processing by:
- Fault Tolerance: Automatically recovering lost data using lineage (recomputing lost partitions from the source data and the recorded transformations).
- In-Memory Processing: Storing intermediate results in memory to improve performance.
- Lazy Evaluation: Transformations are not executed immediately but only when an action is triggered (see the sketch after this list).
- Partitioning: Data is split across nodes to allow parallel execution.
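A minimal sketch of lazy evaluation, assuming a local PySpark installation (the application name and data are illustrative): the two transformations below only record lineage, and computation happens only when the action collect() runs.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-demo")

numbers = sc.parallelize(range(1, 11), 4)      # split across 4 partitions
squares = numbers.map(lambda x: x * x)         # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # transformation: still nothing runs

print(evens.collect())   # action: triggers the whole chain -> [4, 16, 36, 64, 100]

sc.stop()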
2 Key Features
- Immutability: Once created, RDDs cannot be modified; transformations create new RDDs (illustrated in the sketch after this list).
- Lineage Tracking: Maintains a history of transformations to recompute lost partitions.
- Lazy Evaluation: Delays execution until an action (e.g., count, collect) is called.
- Fault Tolerance: Automatically recomputes lost partitions without replicating data.
- Parallel Computation: Distributes tasks across nodes in a Spark cluster.
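A small local-mode sketch of immutability (the names are illustrative): applying a transformation leaves the source RDD unchanged and returns a new one.

from pyspark import SparkContext

sc = SparkContext("local[*]", "immutability-demo")

base = sc.parallelize([1, 2, 3])
doubled = base.map(lambda x: x * 2)   # a new RDD; base itself is never modified

print(base.collect())      # [1, 2, 3]
print(doubled.collect())   # [2, 4, 6]

sc.stop()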
3 Creating RDDs
RDDs can be created in two main ways:
- From an existing collection:
from pyspark import SparkContext
sc = SparkContext("local[*]", "rdd-creation")
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
- From an external data source:
rdd = sc.textFile("hdfs://path/to/file.txt")
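Both methods accept an optional partition count: sc.parallelize(data, 4) splits the collection across four partitions, and sc.textFile(path, minPartitions=4) requests a minimum number of input splits.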
4 Transformations and Actions
RDDs support two types of operations:
4.1 Transformations (Lazy Evaluation)
Transformations produce new RDDs from existing ones but do not execute immediately (the sketch after this list chains them together):
- map(func) – Applies a function to each element.
- filter(func) – Keeps elements that satisfy a condition.
- flatMap(func) – Similar to map but allows returning multiple values per input.
- union(rdd) – Merges two RDDs.
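A minimal local-mode sketch chaining the four transformations above (the sample strings are illustrative); nothing executes until the final collect():

from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-demo")

lines = sc.parallelize(["spark makes", "big data simple"])

tokens = lines.flatMap(lambda line: line.split(" "))   # one input -> many outputs
long_tokens = tokens.filter(lambda w: len(w) > 4)      # keep words longer than 4 chars
upper = long_tokens.map(lambda w: w.upper())           # transform each element
merged = upper.union(sc.parallelize(["RDD"]))          # merge two RDDs

print(merged.collect())   # ['SPARK', 'MAKES', 'SIMPLE', 'RDD']

sc.stop()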
4.2 Actions (Trigger Execution)
Actions compute and return results or store data (see the sketch after this list):
- collect() – Returns all elements to the driver.
- count() – Returns the number of elements in the RDD.
- reduce(func) – Aggregates elements using a function.
- saveAsTextFile(path) – Saves the RDD to a storage location.
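A local-mode sketch of the actions listed above (the output path is illustrative and must not already exist):

from pyspark import SparkContext

sc = SparkContext("local[*]", "actions-demo")

rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())                      # 5
print(rdd.collect())                    # [1, 2, 3, 4, 5]
print(rdd.reduce(lambda a, b: a + b))   # 15

rdd.saveAsTextFile("/tmp/rdd-output")   # writes one part file per partition

sc.stop()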
5 RDD Lineage and Fault Tolerance
RDDs achieve fault tolerance through lineage tracking (see the sketch after this list):
- Instead of replicating data, Spark logs the sequence of transformations.
- If a node fails, Spark recomputes lost partitions from the original dataset.
- This approach minimizes storage overhead while ensuring reliability.
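The recorded lineage can be inspected with toDebugString(), which prints the chain of dependencies Spark would replay to rebuild a lost partition (a minimal local-mode sketch):

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

rdd = sc.parallelize(range(100), 4)
result = rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)

# Prints the dependency chain from the final RDD back to the source collection.
print(result.toDebugString().decode())

sc.stop()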
6 Comparison with Other Distributed Data Models
| Feature | RDDs (Spark) | MapReduce (Hadoop) | DataFrames (Spark) |
| --- | --- | --- | --- |
| Data Processing | In-memory | Disk-based | In-memory with optimized execution plans |
| Fault Tolerance | Lineage (recomputes lost data) | Replication | Lineage (like RDDs) |
| Performance | Fast (RAM-based) | Slow (disk I/O) | Faster (columnar storage) |
| Ease of Use | Low (requires functional programming) | Low (requires custom Java/Python code) | High (SQL-like API) |
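To make the ease-of-use row concrete, here is a hedged sketch of the same grouped sum in both APIs (the key/value data are illustrative; reduceByKey is a key-based aggregation transformation not listed above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

pairs = [("a", 1), ("b", 2), ("a", 3)]

# RDD style: explicit functional operations, no automatic optimization.
print(sc.parallelize(pairs).reduceByKey(lambda a, b: a + b).collect())

# DataFrame style: declarative; Spark builds and optimizes the execution plan.
spark.createDataFrame(pairs, ["key", "value"]).groupBy("key").sum("value").show()

spark.stop()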
7 Advantages
- High Performance: In-memory computation reduces I/O overhead (see the caching sketch after this list).
- Scalability: Designed to handle petabyte-scale data.
- Fault Tolerance: Efficient recovery via lineage tracking.
- Flexible API: Supports functional programming in Scala, Python, and Java.
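In-memory computation can be made explicit with caching, as in this hedged local-mode sketch: the first action materializes the RDD, and later actions reuse the cached partitions instead of recomputing them.

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

squares = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
squares.cache()          # ask Spark to keep computed partitions in memory

print(squares.count())   # first action computes and caches
print(squares.sum())     # subsequent actions reuse the cached data

sc.stop()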
8 Limitations
- Complex API: Requires functional programming knowledge.
- High Memory Usage: Caching RDDs of raw objects consumes more memory than the compact, optimized encoding used by DataFrames.
- No Schema Optimization: Unlike DataFrames, RDDs do not optimize queries automatically.
9 Applications
- Big Data Processing: Used in large-scale ETL and analytics pipelines.
- Machine Learning: Supports distributed ML algorithms via MLlib.
- Graph Processing: Backbone of GraphX for scalable graph analytics.