《Apache Spark™ 结构化流中简化状态跟踪的介绍.pdf》由会员分享,可在线阅读,更多相关《Apache Spark™ 结构化流中简化状态跟踪的介绍.pdf(21页珍藏版)》请在三个皮匠报告上搜索。
1、Introducing Simplified State Trackingin Apache SparkStructured StreamingCraig LukasikJune 2025Fish&Wildlife ServiceMission:support recreational fishing,tribal subsistence fisheries,and the recovery and restoration of imperiled species.Interventions include:Egg Distribution Fish StockingState Reader
2、APIReview of key Structured Streaming conceptsStructured Streaming5A partition is a division of a datastream into smaller,manageable segments.Benefit:parallelismConceptsFor each partition,it contains the end offset that this batch will process up to.The starting offset is implicitly the end offset f
3、rom the previous batch.Benefit:recovery of failed tasks;helps avoid job failurePartitions(“division of labor”)Offsets(a“To-Do”list)Records the ending offset that was processed upon micro-batch completion.Benefit:recovery of a failed or stopped jobCommits(the“Done Log”):Consumer code:sensor.connect(s
4、tarting_timestamp=20241202124212,host=ohio_ne_42)Streaming events:start:20241202124212,end:20241202126555,observations:river_segment_id:ohio_ne_42,sequence_number:20241242124212,species:Blue Bass,length_cn:52,age_years:2,sex:M,Fish Population MonitoringPotential ways to partition inbound data:River
5、segmentSpeciesEnd partition and position to be sent for processing river_segment_id:4634,seq_number:20241242124212Partition and positionconsumed and committed to sinkPartitionOffsetsCommitsGoal:monitor fish size,species,sex,and age via IoT sensor dataConcepts&definitionsIn the checkpoint directory,u
6、nder the state subdirectory.This directory contains:Operator subdirectoriesPartition subdirectoriesEnables advanced operationsWindowed aggregations(e.g.,running counts,sums)Stream-stream joinsDeduplication across batchesCustom stateful logic(e.g.,sessionization)Fault toleranceYou