《结构化流式处理:揭开任意有状态操作的神秘面纱.pdf》由会员分享,可在线阅读,更多相关《结构化流式处理:揭开任意有状态操作的神秘面纱.pdf(48页珍藏版)》请在三个皮匠报告上搜索。
1、Demystifying Arbitrary Stateful OperationsTaking the fear out of flatMapGroupsWithStateand applyInPandasWithStateAngela Chu,Databricks Sr.Solution Architect/Streaming SMEDatabricks2023My source data was already in the exact format I needed,and my consumers business requirements were all simple and s
2、traightforward to implement.-NobodyWhat well coverWhat are Arbitrary Stateful Operations?Summary of a real-world use caseThinking about flatMapGroupsWithState/applyInPandasWithStatelogicBreaking down the logic into its components data structures,initialization logic,steady-state logic and timeoutsIn
3、tegrate into your Spark Structured StreamStreaming best practices that applyAdditional things to keep in mindArbitrary Stateful OperationsSubstitute the word“Arbitrary”with“Custom”.Spark Structured Streaming has many built-in stateful operations(windowed aggregations,deduplication,etc)where with the
4、 use of watermarks the system automatically stores and manages data in-between microbatches.This data is known as the state.There are use cases that require more complex logic than is available in the built-in operations.In this case,users must be able to define custom state handling code to control
5、 what is stored in-between microbatches and how results are computed.These are arbitrary stateful operations.What are they?Customer Use CaseWhen a transaction record is received for a user,the count of transactions for that user that occurred within the last 5 minutes of the transaction time is calc
6、ulated and written to a table.Only the current count for a given user is kept.If no transactions are received for a user,the count should automatically go down.An ML model uses the count to determine if too many have occurred within the last 5 minutes.Transaction count within the last 5 minutesCusto