'''Resilient Distributed Datasets (RDDs)''' are the fundamental data structure in [[Apache Spark]] that provide fault-tolerant, parallel computation on large datasets. RDDs enable efficient distributed data processing while ensuring resilience to failures.

==Overview==
RDDs are immutable, distributed collections of objects that can be processed in parallel. They are designed to optimize large-scale data processing by:
*'''Fault Tolerance:''' Automatically recovering lost data using lineage (recomputing from the original data).
*'''In-Memory Processing:''' Storing intermediate results in memory to improve performance.
*'''Lazy Evaluation:''' Transformations are not executed immediately but only when an action is triggered.
*'''Partitioning:''' Data is split across nodes to allow parallel execution.

==Key Features==
*'''Immutability:''' Once created, RDDs cannot be modified; transformations create new RDDs.
*'''Lineage Tracking:''' Maintains a history of transformations to recompute lost partitions.
*'''Lazy Evaluation:''' Delays execution until an action (e.g., <code>count</code>, <code>collect</code>) is called.
*'''Fault Tolerance:''' Automatically recomputes lost partitions without replicating data.
*'''Parallel Computation:''' Distributes tasks across the nodes of a Spark cluster.

==Creating RDDs==
RDDs can be created in two main ways:
#'''From an existing collection:'''
<syntaxhighlight lang="python">
data = [1, 2, 3, 4, 5]
rdd = sparkContext.parallelize(data)
</syntaxhighlight>
#'''From an external data source:'''
<syntaxhighlight lang="python">
rdd = sparkContext.textFile("hdfs://path/to/file.txt")
</syntaxhighlight>

==Transformations and Actions==
RDDs support two types of operations:

===Transformations (Lazy Evaluation)===
Transformations produce new RDDs from existing ones but do not execute immediately:
*'''map(func)''' – Applies a function to each element.
*'''filter(func)''' – Keeps the elements that satisfy a condition.
*'''flatMap(func)''' – Similar to map, but each input element may produce multiple output values.
*'''union(rdd)''' – Merges two RDDs.

===Actions (Trigger Execution)===
Actions compute and return results or store data:
*'''collect()''' – Returns all elements to the driver.
*'''count()''' – Returns the number of elements in the RDD.
*'''reduce(func)''' – Aggregates elements using a function.
*'''saveAsTextFile(path)''' – Saves the RDD to a storage location.

==RDD Lineage and Fault Tolerance==
RDDs achieve fault tolerance through lineage tracking:
*Instead of replicating data, Spark logs the sequence of transformations.
*If a node fails, Spark recomputes the lost partitions from the original dataset.
*This approach minimizes storage overhead while ensuring reliability.

==Comparison with Other Distributed Data Models==
{| class="wikitable"
!Feature!!RDDs (Spark)!!MapReduce (Hadoop)!!DataFrames (Spark)
|-
|'''Data Processing'''||In-memory||Disk-based||Optimized execution plans
|-
|'''Fault Tolerance'''||Lineage (recomputes lost data)||Replication||Lineage (like RDDs)
|-
|'''Performance'''||Fast (RAM-based)||Slow (disk I/O)||Faster (columnar storage)
|-
|'''Ease of Use'''||Low (requires functional programming)||Low (requires custom Java/Python)||High (SQL-like API)
|}

==Advantages==
*'''High Performance:''' In-memory computation reduces I/O overhead.
*'''Scalability:''' Designed to handle petabyte-scale data.
*'''Fault Tolerance:''' Efficient recovery via lineage tracking.
*'''Flexible API:''' Supports functional programming in Scala, Python, and Java.

==Limitations==
*'''Complex API:''' Requires functional programming knowledge.
*'''High Memory Usage:''' Less efficient for certain workloads than optimized data structures such as DataFrames.
*'''No Schema Optimization:''' Unlike DataFrames, RDDs do not optimize queries automatically.

==Applications==
*'''Big Data Processing:''' Used in large-scale ETL and analytics pipelines.
*'''Machine Learning:''' Supports distributed ML algorithms via [[MLlib]].
*'''Graph Processing:''' Backbone of [[GraphX]] for scalable graph analytics.

==See Also==
*[[Apache Spark]]
*[[DataFrames (Spark)]]
*[[Hadoop MapReduce]]
*[[Distributed Computing]]
*[[Big Data Processing]]

[[분류:Distributed Computing]]
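The transformations and actions described above can be exercised end to end in a few lines of PySpark. The following is an illustrative sketch, not part of the original article: the application name <code>RDDExample</code> and the <code>local[*]</code> master are assumptions for running on a single machine with PySpark installed.

<syntaxhighlight lang="python">
from pyspark import SparkContext

# Local SparkContext; "local[*]" uses all available cores (assumption for demo).
sc = SparkContext("local[*]", "RDDExample")

# Create an RDD from an existing collection.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations only build up lineage; nothing executes yet.
squares = numbers.map(lambda x: x * x)          # 1, 4, 9, 16, 25
evens = squares.filter(lambda x: x % 2 == 0)    # 4, 16
doubled = numbers.flatMap(lambda x: [x, x])     # each input yields two outputs
merged = squares.union(evens)                   # 5 + 2 = 7 elements

# Actions trigger execution of the lineage built above.
evens_result = evens.collect()                  # [4, 16]
n = numbers.count()                             # 5
total = numbers.reduce(lambda a, b: a + b)      # 15
merged_count = merged.count()                   # 7

sc.stop()
</syntaxhighlight>

Because evaluation is lazy, the four transformations above cost nothing until the first action runs; if a partition of <code>evens</code> were lost, Spark would replay the recorded <code>map</code> and <code>filter</code> steps on the affected partition of the source data rather than restoring a replica.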