'''Apache Spark RDD Operation''' refers to the various transformations and actions that can be applied to [[Resilient Distributed Datasets (RDDs)]] in [[Apache Spark]]. RDD operations enable efficient parallel processing of large datasets in a distributed environment.

==Types of RDD Operations==
RDD operations are classified into two types:
*'''Transformations:''' Lazy operations that return a new RDD without immediate execution.
*'''Actions:''' Operations that trigger computation and return results or store data.

==Transformations==
Transformations are applied to RDDs to produce new RDDs. They are lazy, meaning they are not executed until an action is performed. The examples in this article assume that <code>sparkContext</code> is an existing <code>SparkContext</code> instance (for example, <code>sc</code> in the PySpark shell).
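A minimal sketch of the difference between the two kinds of operations (using the same <code>sparkContext</code> as the tables below): a transformation only returns a new RDD, while an action triggers the actual computation.

<syntaxhighlight lang="python">
# Transformation vs. action - a minimal sketch.
# Assumes `sparkContext` is an active SparkContext (e.g. `sc` in the PySpark shell).
numbers = sparkContext.parallelize([1, 2, 3, 4, 5])

# Transformation: returns a new RDD; no job is run yet.
squares = numbers.map(lambda x: x * x)

# Action: triggers the computation and returns the result to the driver.
print(squares.collect())  # [1, 4, 9, 16, 25]
</syntaxhighlight>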
===Common Transformations===
{| class="wikitable"
!Transformation!!Description!!Example!!Result
|-
|'''map(func)'''||Applies a function to each element in the RDD.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
print(mapped_rdd.collect())
</syntaxhighlight>||<code>[2, 4, 6, 8]</code>
|-
|'''filter(func)'''||Retains elements that satisfy a condition.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4, 5, 6])
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
print(filtered_rdd.collect())
</syntaxhighlight>||<code>[2, 4, 6]</code>
|-
|'''flatMap(func)'''||Similar to map but allows multiple outputs per input.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize(["hello world", "spark rdd"])
flat_mapped_rdd = rdd.flatMap(lambda line: line.split(" "))
print(flat_mapped_rdd.collect())
</syntaxhighlight>||<code>["hello", "world", "spark", "rdd"]</code>
|-
|'''union(rdd)'''||Merges two RDDs.||<syntaxhighlight lang="python">
rdd1 = sparkContext.parallelize([1, 2, 3])
rdd2 = sparkContext.parallelize([4, 5, 6])
union_rdd = rdd1.union(rdd2)
print(union_rdd.collect())
</syntaxhighlight>||<code>[1, 2, 3, 4, 5, 6]</code>
|-
|'''distinct()'''||Removes duplicate elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 2, 3, 4, 4, 5])
distinct_rdd = rdd.distinct()
print(distinct_rdd.collect())
</syntaxhighlight>||<code>[1, 2, 3, 4, 5]</code>
|-
|'''groupByKey()'''||Groups data by key (for (K, V) pairs).||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped_rdd = rdd.groupByKey().mapValues(list)
print(grouped_rdd.collect())
</syntaxhighlight>||<code>[('a', [1, 3]), ('b', [2])]</code>
|-
|'''reduceByKey(func)'''||Merges the values for each key using a function.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())
</syntaxhighlight>||<code>[('a', 4), ('b', 2)]</code>
|-
|'''sortByKey()'''||Sorts a (K, V) pair RDD by key.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("b", 2), ("a", 1), ("c", 3)])
sorted_rdd = rdd.sortByKey()
print(sorted_rdd.collect())
</syntaxhighlight>||<code>[('a', 1), ('b', 2), ('c', 3)]</code>
|}
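Because every transformation returns a new RDD, transformations can be chained into a pipeline. The word-count sketch below combines several transformations from the table above; the input file <code>data.txt</code> is an assumed local text file.

<syntaxhighlight lang="python">
# Word-count sketch chaining several transformations.
# Assumes `sparkContext` is an active SparkContext and data.txt exists.
lines = sparkContext.textFile("data.txt")

counts = (lines
          .flatMap(lambda line: line.split())   # split each line into words
          .map(lambda word: (word, 1))          # build (word, 1) pairs
          .reduceByKey(lambda a, b: a + b)      # sum the counts per word
          .sortByKey())                         # sort alphabetically by word

# Nothing has executed yet; collect() is the action that runs the pipeline.
print(counts.collect())
</syntaxhighlight>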
==Actions==
Actions compute and return results or store RDD data. They trigger the execution of all previous transformations.

===Common Actions===
{| class="wikitable"
!Action!!Description!!Example!!Result
|-
|'''collect()'''||Returns all elements of the RDD to the driver.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.collect())
</syntaxhighlight>||<code>[1, 2, 3, 4]</code>
|-
|'''count()'''||Returns the number of elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.count())
</syntaxhighlight>||<code>4</code>
|-
|'''reduce(func)'''||Aggregates elements using a binary function.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
sum_result = rdd.reduce(lambda x, y: x + y)
print(sum_result)
</syntaxhighlight>||<code>10</code>
|-
|'''first()'''||Returns the first element.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([10, 20, 30])
print(rdd.first())
</syntaxhighlight>||<code>10</code>
|-
|'''take(n)'''||Returns the first n elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([10, 20, 30, 40])
print(rdd.take(2))
</syntaxhighlight>||<code>[10, 20]</code>
|-
|'''foreach(func)'''||Applies a function to each element; runs on the executors and returns no value to the driver.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3])
rdd.foreach(lambda x: print(x))
</syntaxhighlight>||<code>1</code>, <code>2</code>, <code>3</code> printed (order not guaranteed)
|}
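Besides returning values to the driver, some actions write data out. The sketch below combines a few of the actions above with <code>saveAsTextFile()</code>, another common action; the output directory name is illustrative.

<syntaxhighlight lang="python">
# Combining several actions - a minimal sketch.
# Assumes `sparkContext` is an active SparkContext; the output path is illustrative.
rdd = sparkContext.parallelize([5, 3, 8, 1, 9, 2])

print(rdd.count())         # 6
print(rdd.take(3))         # [5, 3, 8] - first three elements in partition order
print(rdd.takeOrdered(3))  # [1, 2, 3] - three smallest elements

# saveAsTextFile() is also an action: it writes one text file per partition.
rdd.saveAsTextFile("output_dir")
</syntaxhighlight>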
==Lazy Evaluation in RDDs==
RDD transformations are '''lazy''', meaning they do not execute immediately. Instead, Spark builds a '''DAG (Directed Acyclic Graph)''' representing the operations, which is only executed when an action is called.
Example:
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("data.txt")         # No execution yet
words = rdd.flatMap(lambda line: line.split())  # Still not executed
word_count = words.count()                      # Now execution starts
</syntaxhighlight>
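The DAG that Spark builds can be inspected before any job runs. A minimal sketch using <code>toDebugString()</code> (assuming the same <code>data.txt</code> input) is shown below; the method reports the recorded lineage without triggering execution.

<syntaxhighlight lang="python">
# Inspecting the lineage (DAG) recorded for an RDD - a minimal sketch.
# Assumes `sparkContext` is an active SparkContext and data.txt exists.
rdd = sparkContext.textFile("data.txt")
pairs = (rdd.flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# toDebugString() describes the dependency chain; no job is launched.
print(pairs.toDebugString().decode("utf-8"))

# Only an action such as count() actually starts the computation.
print(pairs.count())
</syntaxhighlight>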
==Persistence and Caching==
RDDs can be cached in memory to speed up iterative computations:
*'''cache()''' – Stores the RDD in memory.
*'''persist(storage_level)''' – Stores the RDD using a chosen storage level (e.g., memory, disk, or both).

Example:
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("data.txt").cache()
print(rdd.count())  # RDD is now cached in memory
</syntaxhighlight>
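For finer control than <code>cache()</code>, <code>persist()</code> accepts an explicit storage level. A minimal sketch (assuming the same <code>sparkContext</code>) is shown below.

<syntaxhighlight lang="python">
# Persisting an RDD with an explicit storage level - a minimal sketch.
# Assumes `sparkContext` is an active SparkContext.
from pyspark import StorageLevel

rdd = sparkContext.parallelize(range(1, 1001))

# Keep partitions in memory and spill to disk if they do not fit.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())  # First action computes and persists the partitions
print(rdd.sum())    # Reuses the persisted data instead of recomputing

rdd.unpersist()     # Release the storage when it is no longer needed
</syntaxhighlight>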
==Comparison with DataFrames==
{| class="wikitable"
!Feature!!RDD!!DataFrame
|-
|Abstraction Level||Low (Resilient Distributed Dataset)||High (table-like structure)
|-
|Performance||Slower (no query optimizations)||Faster (uses the Catalyst Optimizer)
|-
|Storage Format||Unstructured||Schema-based
|-
|Ease of Use||Requires functional transformations||SQL-like API
|}
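To make the comparison concrete, the sketch below performs the same per-key sum first with the RDD API and then with the DataFrame API. It assumes a local Spark installation; the session and variable names are illustrative.

<syntaxhighlight lang="python">
# The same per-key aggregation as an RDD and as a DataFrame - a minimal sketch.
# Assumes a local Spark installation; names such as `spark` are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

data = [("a", 1), ("b", 2), ("a", 3)]

# RDD version: functional transformations, no schema information.
print(sc.parallelize(data).reduceByKey(lambda x, y: x + y).collect())  # e.g. [('a', 4), ('b', 2)]

# DataFrame version: schema-based, optimized by the Catalyst Optimizer.
df = spark.createDataFrame(data, ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
</syntaxhighlight>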
==Advantages of RDDs==
*'''Fault Tolerant:''' Uses lineage information to recompute lost partitions.
*'''Parallel Execution:''' Automatically distributes computations across cluster nodes.
*'''Immutable and Lazily Evaluated:''' Avoids unnecessary computation by deferring execution until an action is called.
==Limitations of RDDs==
*'''Higher Memory Usage:''' No schema-based optimizations.
*'''Verbose API:''' Requires a functional programming style.
*'''Less Optimized than DataFrames:''' Lacks the query optimizations available to Spark DataFrames.
==Applications==
*Processing large-scale unstructured data.
*Complex transformations requiring fine-grained control.
*Iterative machine learning computations.

==See also==
*[[Big Data Processing]]
*[[Parallel Computing]]

[[Category:Distributed Computing]]