Optimizing Batch and Streaming Aggregations
Jacek Laskowski / jaceklaskowski
Data+AI Summit 2023

About the Speaker
- Jacek Laskowski is a Freelance IT Consultant
- Specializing in Apache Spark, Delta Lake, Databricks, Apache Kafka (incl. Kafka Streams and ksqlDB)
- Best known for The Internals Of online books
- Contact me at jacek@japila.pl
- Follow me at @JacekLaskowski
- Connect on LinkedIn

Table of Contents
1. The Intro to The Internals of Structured Queries
2. The Internals of Aggregate Queries
3. Scala UDAFs and Aggregators
4. Streaming Aggregates
5. Streaming Aggregates Performance Tuning Gig
6. Things to Watch Out For (Recap)

The Intro to The Internals of Structured Queries

Structured Queries
- Apache Spark is a general-purpose distributed compute platform
- Spark SQL is a module of Apache Spark to describe batch queries over structured and semi-structured datasets (of any size)
- Spark Structured Streaming is a module of Apache Spark for streaming queries over unbounded data
- Queries are described using High-Level Query Operators
  - DataFrame API
  - SQL
- In most cases, optimizing a streaming query comes down to optimizing the corresponding batch query
  - No need to focus on streaming features (less to worry about)
  - Caveat: streaming issues may really be related to how streaming queries work

High-Level Query Language - DataFrame API

High-Level Query Language - SQL

QueryExecution
- QueryExecution is the execution pipeline (workflow) of a structured query
- Made up of execution phases

Logical and Physical Operators
- Logical Operators are building blocks of logical query plans in Spark SQL
  - Aggregate
  - Join
  - LocalRelation
  - LogicalRDD
  - MergeIntoTable
  - Project
  - Sort
- Physical Operators are executable nodes of physical query plans in Spark SQL
  - AdaptiveSparkPlanExec
  - BroadcastHashJoinExec
  - HashAggregateExec
  - ObjectHashAggregateExec
  - ProjectExec
  - SortAggregateExec

The Internals of Aggregate Queries

Aggre
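
The slides on the high-level query languages, QueryExecution, and logical and physical operators can be illustrated with a small batch aggregation. What follows is a minimal sketch, not code from the talk: the object name `AggregateQueryPlans`, the helpers `withDataFrameApi` and `withSql`, the `numbers` view, and the `bucket` and `total` column names are all hypothetical, and a local Spark 3.x installation is assumed. The same query is written once with the DataFrame API and once in SQL, and the plans from the different execution phases are printed for inspection. Which physical aggregate operator appears (for example HashAggregateExec) depends on the Spark version and configuration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object AggregateQueryPlans {

  // Hypothetical helper: an aggregation described with the DataFrame API.
  def withDataFrameApi(spark: SparkSession): DataFrame =
    spark.range(10)
      .groupBy((col("id") % 2).as("bucket"))
      .agg(sum("id").as("total"))

  // Hypothetical helper: the same aggregation described in SQL.
  def withSql(spark: SparkSession): DataFrame = {
    spark.range(10).createOrReplaceTempView("numbers")
    spark.sql(
      "SELECT id % 2 AS bucket, sum(id) AS total FROM numbers GROUP BY id % 2")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregate-query-plans")
      .getOrCreate()

    val byApi = withDataFrameApi(spark)
    val bySql = withSql(spark)

    // QueryExecution exposes the plan produced by each execution phase.
    val qe = byApi.queryExecution
    println(qe.analyzed)      // analyzed logical plan (logical operators)
    println(qe.optimizedPlan) // optimized logical plan
    println(qe.executedPlan)  // physical plan (physical operators)

    // explain("extended") prints the logical plans and the physical plan.
    bySql.explain("extended")

    byApi.show()
    bySql.show()

    spark.stop()
  }
}
```

Comparing the output for `byApi` and `bySql` is a quick way to check whether the two query languages end up with equivalent plans for a given aggregation.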