Declarative Pipelines: The Next Step for the Apache Spark Ecosystem (PDF, 74 pages)
Declarative Pipelines in Apache Spark
Sandy Ryza & Michael Armbrust
Data+AI Summit 2025

Over a Decade of Simplifying Apache Spark

Spark RDDs: Distributed computing goes functional
    # avg revenue by date and region
    rdd.mapValues(lambda revenue: (revenue, 1)).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).mapValues(lambda x: x[0] / x[1])
    # TODO: Distribute compute across 1000s of machines
    # TODO: Abstract away complexities like computing avg

Spark SQL: Distributed queries go declarative
    # avg revenue by date and region
    SELECT date, region, avg(revenue) FROM sales GROUP BY date, region
    # TODO: Efficiently update as new data arrives

Structured Streaming: Incremental processing goes declarative
    # avg revenue by date and region
    SELECT date, region, avg(revenue) FROM STREAM sales GROUP BY date, region
    # TODO: Transactionally store in the cloud

Delta Lake: Storage becomes transactional
    # avg revenue by date and region
    data.groupBy("date", "region").agg(avg("revenue")).writeStream.mode("append").format("delta").table("daily_revenue")
    # TODO: Make it an e2e production pipeline

What an end-to-end production pipeline still needs around a single query:
    SELECT date, revenue, avg(revenue) FROM STREAM json.`/sales` GROUP BY date, revenue
    Table creation + evolution, Parallel execution, CI/CD, Dependency Management, Checkpoints, Retries, Compilation

But data pipelines are still complex
    [Diagram labels: Data Lake, Orchestration, Data Warehouse, Streaming, BI, Data Science, Generative AI, Machine Learning]

That's why we built DLT
    "DLT hides the complexity of modern data engineering under its simple, intuitive, declarative programming model."
    Jian (Miracle) Zhou, Senior Engineering Manager, Navy Federal Credit Union

Now it's time to take the next step
ANNOUNCING: Spark Declarative Pipelines
We are contributing Declarative Pip