'''Apache Spark RDD Operation''' refers to the various transformations and actions that can be applied to [[Resilient Distributed Datasets (RDDs)]] in [[Apache Spark]]. RDD operations enable efficient parallel processing of large datasets in a distributed environment.

==Types of RDD Operations==
RDD operations are classified into two types:
*'''Transformations:''' Lazy operations that return a new RDD without immediate execution.
*'''Actions:''' Operations that trigger computation and return results or store data.

==Transformations==
Transformations are applied to RDDs to produce new RDDs. They are lazy, meaning they are not executed until an action is performed.

===Common Transformations===
{| class="wikitable"
!Transformation!!Description!!Example!!Result
|-
|'''map(func)'''||Applies a function to each element in the RDD.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
print(mapped_rdd.collect())
</syntaxhighlight>
|`[2, 4, 6, 8]`
|-
|'''filter(func)'''||Retains elements that satisfy a condition.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4, 5, 6])
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
print(filtered_rdd.collect())
</syntaxhighlight>
|`[2, 4, 6]`
|-
|'''flatMap(func)'''||Similar to map but allows multiple outputs per input.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize(["hello world", "spark rdd"])
flat_mapped_rdd = rdd.flatMap(lambda line: line.split(" "))
print(flat_mapped_rdd.collect())
</syntaxhighlight>
|`["hello", "world", "spark", "rdd"]`
|-
|'''union(rdd)'''||Merges two RDDs.||<syntaxhighlight lang="python">
rdd1 = sparkContext.parallelize([1, 2, 3])
rdd2 = sparkContext.parallelize([4, 5, 6])
union_rdd = rdd1.union(rdd2)
print(union_rdd.collect())
</syntaxhighlight>
|`[1, 2, 3, 4, 5, 6]`
|-
|'''distinct()'''||Removes duplicate elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 2, 3, 4, 4, 5])
distinct_rdd = rdd.distinct()
print(distinct_rdd.collect())
</syntaxhighlight>
|`[1, 2, 3, 4, 5]`
|-
|'''groupByKey()'''||Groups data by key (for (K, V) pairs).||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped_rdd = rdd.groupByKey().mapValues(list)
print(grouped_rdd.collect())
</syntaxhighlight>
|`[('a', [1, 3]), ('b', [2])]`
|-
|'''reduceByKey(func)'''||Merges values for each key using a function.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())
</syntaxhighlight>
|`[('a', 4), ('b', 2)]`
|-
|'''sortByKey()'''||Sorts the RDD by key.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("b", 2), ("a", 1), ("c", 3)])
sorted_rdd = rdd.sortByKey()
print(sorted_rdd.collect())
</syntaxhighlight>
|`[('a', 1), ('b', 2), ('c', 3)]`
|}

==Actions==
Actions compute and return results or store RDD data. They trigger the execution of all previous transformations.
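For example, a typical RDD program chains several lazy transformations and finishes with a single action that forces execution. A minimal word-count sketch, assuming only a local SparkContext named <code>sparkContext</code> (matching the other snippets on this page):
<syntaxhighlight lang="python">
from pyspark import SparkContext

# Assumption: no SparkContext exists yet, so create a local one named
# sparkContext to match the other examples on this page.
sparkContext = SparkContext("local[*]", "rdd-action-example")

lines = sparkContext.parallelize(["spark makes rdds", "rdds are lazy"])

# Transformations: only the lineage is recorded here, nothing runs yet.
words = lines.flatMap(lambda line: line.split(" "))   # split each line into words
pairs = words.map(lambda word: (word, 1))             # pair each word with the count 1
counts = pairs.reduceByKey(lambda x, y: x + y)        # sum the counts per word

# Action: collect() forces execution of every transformation above
# and returns the results to the driver.
print(counts.collect())  # e.g. [('rdds', 2), ('spark', 1), ('makes', 1), ('are', 1), ('lazy', 1)]

sparkContext.stop()
</syntaxhighlight>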
===Common Actions===
{| class="wikitable"
!Action!!Description!!Example!!Result
|-
|'''collect()'''||Returns all elements of the RDD to the driver.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.collect())
</syntaxhighlight>
|`[1, 2, 3, 4]`
|-
|'''count()'''||Returns the number of elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.count())
</syntaxhighlight>
|`4`
|-
|'''reduce(func)'''||Aggregates elements using a binary function.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
sum_result = rdd.reduce(lambda x, y: x + y)
print(sum_result)
</syntaxhighlight>
|`10`
|-
|'''first()'''||Returns the first element.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([10, 20, 30])
print(rdd.first())
</syntaxhighlight>
|`10`
|-
|'''take(n)'''||Returns the first n elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([10, 20, 30, 40])
print(rdd.take(2))
</syntaxhighlight>
|`[10, 20]`
|-
|'''foreach(func)'''||Applies a function to each element; runs on the worker nodes and returns nothing to the driver.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3])
rdd.foreach(lambda x: print(x))
</syntaxhighlight>
|`1, 2, 3` (printed on the workers; order not guaranteed)
|}

==Lazy Evaluation in RDDs==
RDD transformations are '''lazy''', meaning they do not execute immediately. Instead, Spark builds a '''DAG (Directed Acyclic Graph)''' representing the operations, which is executed only when an action is called.

Example:
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("data.txt")             # No execution yet
words = rdd.flatMap(lambda line: line.split())      # Still not executed
word_count = words.count()                          # Now execution starts
</syntaxhighlight>

==Persistence and Caching==
RDDs can be cached in memory to speed up iterative computations:
*'''cache()''' – Stores the RDD in memory.
*'''persist(storage_level)''' – Stores the RDD using a chosen storage level (e.g., memory only, memory and disk, disk only).

Example:
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("data.txt").cache()
print(rdd.count())  # The first action materializes the RDD and caches it in memory
</syntaxhighlight>

==Comparison with DataFrames==
{| class="wikitable"
!Feature!!RDD!!DataFrame
|-
|Abstraction Level||Low (Resilient Distributed Dataset)||High (Table-like structure)
|-
|Performance||Slower (no built-in query optimization)||Faster (uses the Catalyst Optimizer)
|-
|Storage Format||Unstructured||Schema-based
|-
|Ease of Use||Requires functional transformations||SQL-like API
|}

==Advantages of RDDs==
*'''Fault Tolerant:''' Uses lineage to recompute lost partitions.
*'''Parallel Execution:''' Automatically distributes computations across nodes.
*'''Immutable and Lazy Evaluation:''' Optimizes execution by avoiding unnecessary computations.

==Limitations of RDDs==
*'''Higher Memory Usage:''' No schema-based optimizations.
*'''Verbose API:''' Requires functional programming.
*'''Less Optimized than DataFrames:''' Lacks the query optimizations available to Spark DataFrames.

==Applications==
*Processing large-scale unstructured data.
*Complex transformations requiring fine-grained control.
*Iterative machine learning computations.

==See Also==
*[[Apache Spark]]
*[[Resilient Distributed Datasets (RDDs)]]
*[[Spark DataFrame]]
*[[Big Data Processing]]
*[[Parallel Computing]]

[[Category:Distributed Computing]]