'''Apache Spark RDD Operation''' refers to the various transformations and actions that can be applied to [[Resilient Distributed Datasets (RDDs)]] in [[Apache Spark]]. RDD operations enable efficient parallel processing of large datasets in a distributed environment.

==Types of RDD Operations==
RDD operations are classified into two types:
*'''Transformations:''' Lazy operations that return a new RDD without immediate execution.
*'''Actions:''' Operations that trigger computation and return results or store data.

==Transformations==
Transformations are applied to RDDs to produce new RDDs. They are lazy, meaning they are not executed until an action is performed.

===Common Transformations===
{| class="wikitable"
!Transformation!!Description!!Example!!Result
|-
|'''map(func)'''||Applies a function to each element in the RDD.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
print(mapped_rdd.collect())
</syntaxhighlight>
|`[2, 4, 6, 8]`
|-
|'''filter(func)'''||Retains elements that satisfy a condition.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4, 5, 6])
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
print(filtered_rdd.collect())
</syntaxhighlight>
|`[2, 4, 6]`
|-
|'''flatMap(func)'''||Similar to map but allows multiple outputs per input.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize(["hello world", "spark rdd"])
flat_mapped_rdd = rdd.flatMap(lambda line: line.split(" "))
print(flat_mapped_rdd.collect())
</syntaxhighlight>
|`["hello", "world", "spark", "rdd"]`
|-
|'''union(rdd)'''||Merges two RDDs.||<syntaxhighlight lang="python">
rdd1 = sparkContext.parallelize([1, 2, 3])
rdd2 = sparkContext.parallelize([4, 5, 6])
union_rdd = rdd1.union(rdd2)
print(union_rdd.collect())
</syntaxhighlight>
|`[1, 2, 3, 4, 5, 6]`
|-
|'''distinct()'''||Removes duplicate elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 2, 3, 4, 4, 5])
distinct_rdd = rdd.distinct()
print(distinct_rdd.collect())
</syntaxhighlight>
|`[1, 2, 3, 4, 5]`
|-
|'''groupByKey()'''||Groups data by key (for (K, V) pairs).||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped_rdd = rdd.groupByKey().mapValues(list)
print(grouped_rdd.collect())
</syntaxhighlight>
|`[('a', [1, 3]), ('b', [2])]`
|-
|'''reduceByKey(func)'''||Merges values for each key using a function.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())
</syntaxhighlight>
|`[('a', 4), ('b', 2)]`
|-
|'''sortByKey()'''||Sorts the RDD by key.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([("b", 2), ("a", 1), ("c", 3)])
sorted_rdd = rdd.sortByKey()
print(sorted_rdd.collect())
</syntaxhighlight>
|`[('a', 1), ('b', 2), ('c', 3)]`
|}

==Actions==
Actions compute and return results or store RDD data. They trigger the execution of all previous transformations.
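For example, a typical RDD program chains several lazy transformations and finishes with a single action that forces execution. A minimal word-count sketch, assuming only a local SparkContext named <code>sparkContext</code> (matching the other snippets on this page):
<syntaxhighlight lang="python">
from pyspark import SparkContext

# Assumption: no SparkContext exists yet, so create a local one named
# sparkContext to match the other examples on this page.
sparkContext = SparkContext("local[*]", "rdd-action-example")

lines = sparkContext.parallelize(["spark makes rdds", "rdds are lazy"])

# Transformations: only the lineage is recorded here, nothing runs yet.
words = lines.flatMap(lambda line: line.split(" "))   # split each line into words
pairs = words.map(lambda word: (word, 1))             # pair each word with the count 1
counts = pairs.reduceByKey(lambda x, y: x + y)        # sum the counts per word

# Action: collect() forces execution of every transformation above
# and returns the results to the driver.
print(counts.collect())  # e.g. [('rdds', 2), ('spark', 1), ('makes', 1), ('are', 1), ('lazy', 1)]

sparkContext.stop()
</syntaxhighlight>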
===Common Actions===
{| class="wikitable"
!Action!!Description!!Example!!Result
|-
|'''collect()'''||Returns all elements of the RDD to the driver.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.collect())
</syntaxhighlight>
|`[1, 2, 3, 4]`
|-
|'''count()'''||Returns the number of elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.count())
</syntaxhighlight>
|`4`
|-
|'''reduce(func)'''||Aggregates elements using a binary function.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3, 4])
sum_result = rdd.reduce(lambda x, y: x + y)
print(sum_result)
</syntaxhighlight>
|`10`
|-
|'''first()'''||Returns the first element.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([10, 20, 30])
print(rdd.first())
</syntaxhighlight>
|`10`
|-
|'''take(n)'''||Returns the first n elements.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([10, 20, 30, 40])
print(rdd.take(2))
</syntaxhighlight>
|`[10, 20]`
|-
|'''foreach(func)'''||Applies a function to each element; runs on the worker nodes and returns nothing to the driver.||<syntaxhighlight lang="python">
rdd = sparkContext.parallelize([1, 2, 3])
rdd.foreach(lambda x: print(x))
</syntaxhighlight>
|`1, 2, 3` (printed on the workers; order not guaranteed)
|}

==Lazy Evaluation in RDDs==
RDD transformations are '''lazy''', meaning they do not execute immediately. Instead, Spark builds a '''DAG (Directed Acyclic Graph)''' representing the operations, which is executed only when an action is called.

Example:
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("data.txt")             # No execution yet
words = rdd.flatMap(lambda line: line.split())      # Still not executed
word_count = words.count()                          # Now execution starts
</syntaxhighlight>

==Persistence and Caching==
RDDs can be cached in memory to speed up iterative computations:
*'''cache()''' – Stores the RDD in memory.
*'''persist(storage_level)''' – Stores the RDD using a chosen storage level (e.g., memory only, memory and disk, disk only).

Example:
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("data.txt").cache()
print(rdd.count())  # The first action materializes the RDD and caches it in memory
</syntaxhighlight>

==Comparison with DataFrames==
{| class="wikitable"
!Feature!!RDD!!DataFrame
|-
|Abstraction Level||Low (Resilient Distributed Dataset)||High (Table-like structure)
|-
|Performance||Slower (no built-in query optimization)||Faster (uses the Catalyst Optimizer)
|-
|Storage Format||Unstructured||Schema-based
|-
|Ease of Use||Requires functional transformations||SQL-like API
|}

==Advantages of RDDs==
*'''Fault Tolerant:''' Uses lineage to recompute lost partitions.
*'''Parallel Execution:''' Automatically distributes computations across nodes.
*'''Immutable and Lazy Evaluation:''' Optimizes execution by avoiding unnecessary computations.

==Limitations of RDDs==
*'''Higher Memory Usage:''' No schema-based optimizations.
*'''Verbose API:''' Requires functional programming.
*'''Less Optimized than DataFrames:''' Lacks the query optimizations available to Spark DataFrames.

==Applications==
*Processing large-scale unstructured data.
*Complex transformations requiring fine-grained control.
*Iterative machine learning computations.

==See Also==
*[[Apache Spark]]
*[[Resilient Distributed Datasets (RDDs)]]
*[[Spark DataFrame]]
*[[Big Data Processing]]
*[[Parallel Computing]]

[[Category:Distributed Computing]]