Apache Spark™ 结构化流中简化状态跟踪的介绍.pdf

上传人： Fl****zo

编号：718615

2025-06-22

PDF 21页 3.56MB

《Apache Spark™ 结构化流中简化状态跟踪的介绍.pdf》由会员分享，可在线阅读，更多相关《Apache Spark™ 结构化流中简化状态跟踪的介绍.pdf（21页珍藏版）》请在三个皮匠报告上搜索。

1、Introducing Simplified State Trackingin Apache SparkStructured StreamingCraig LukasikJune 2025Fish&Wildlife ServiceMission:support recreational fishing,tribal subsistence fisheries,and the recovery and restoration of imperiled species.Interventions include:Egg Distribution Fish StockingState Reader

2、APIReview of key Structured Streaming conceptsStructured Streaming5A partition is a division of a datastream into smaller,manageable segments.Benefit:parallelismConceptsFor each partition,it contains the end offset that this batch will process up to.The starting offset is implicitly the end offset f

3、rom the previous batch.Benefit:recovery of failed tasks;helps avoid job failurePartitions(“division of labor”)Offsets(a“To-Do”list)Records the ending offset that was processed upon micro-batch completion.Benefit:recovery of a failed or stopped jobCommits(the“Done Log”):Consumer code:sensor.connect(s

4、tarting_timestamp=20241202124212,host=ohio_ne_42)Streaming events:start:20241202124212,end:20241202126555,observations:river_segment_id:ohio_ne_42,sequence_number:20241242124212,species:Blue Bass,length_cn:52,age_years:2,sex:M,Fish Population MonitoringPotential ways to partition inbound data:River

5、segmentSpeciesEnd partition and position to be sent for processing river_segment_id:4634,seq_number:20241242124212Partition and positionconsumed and committed to sinkPartitionOffsetsCommitsGoal:monitor fish size,species,sex,and age via IoT sensor dataConcepts&definitionsIn the checkpoint directory,u

6、nder the state subdirectory.This directory contains:Operator subdirectoriesPartition subdirectoriesEnables advanced operationsWindowed aggregations(e.g.,running counts,sums)Stream-stream joinsDeduplication across batchesCustom stateful logic(e.g.,sessionization)Fault toleranceYou

Apache Spark™ 结构化流中简化状态跟踪的介绍.pdf

相关报告