Declarative Pipelines in Apache Spark
Sandy Ryza & Michael Armbrust
Data+AI Summit 2025

Over a Decade of Simplifying Apache Spark

Spark RDDs: distributed computing goes functional

    # avg revenue by date and region
    rdd.mapValues(lambda revenue: (revenue, 1)) \
       .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
       .mapValues(lambda x: x[0] / x[1])

    # TODO: Abstract away complexities like computing avg
    # TODO: Distribute compute across 1000s of machines

Spark SQL: distributed queries go declarative

    # avg revenue by date and region
    SELECT date, region, avg(revenue)
    FROM sales
    GROUP BY date, region

    # TODO: Efficiently update as new data arrives

Structured Streaming: incremental processing goes declarative

    # avg revenue by date and region
    SELECT date, region, avg(revenue)
    FROM STREAM sales
    GROUP BY date, region

    # TODO: Transactionally store in the cloud

Delta Lake: storage becomes transactional

    # avg revenue by date and region
    data.groupBy("date", "region") \
        .agg(avg("revenue")) \
        .writeStream \
        .outputMode("append") \
        .format("delta") \
        .toTable("daily_revenue")

    # TODO: Make it an end-to-end production pipeline

But data pipelines are still complex

    SELECT date, region, avg(revenue)
    FROM STREAM json.`/sales`
    GROUP BY date, region

    Table creation + evolution, parallel execution, CI/CD,
    dependency management, checkpoints, retries, compilation

    Data lake, orchestration, data warehouse, streaming, BI,
    data science, generative AI, machine learning

That's why we built DLT

    “DLT hides the complexity of modern data engineering under its
    simple, intuitive, declarative programming model.”
    Jian (Miracle) Zhou, Senior Engineering Manager, Navy Federal Credit Union

Now it's time to take the next step

ANNOUNCING: Spark Declarative Pipelines

We are contributing Declarative Pip
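The RDD slide above relies on the classic (sum, count) accumulator trick: because an average is not associative, you carry the sum and the count separately so they can be merged in any order. A plain-Python sketch of what those mapValues/reduceByKey lambdas compute (the sales data and variable names here are invented for illustration; Spark applies the same logic partition by partition across a cluster):

```python
from collections import defaultdict

# ((date, region), revenue) pairs, standing in for the slide's `rdd`
sales = [
    (("2025-06-01", "US"), 100.0),
    (("2025-06-01", "US"), 300.0),
    (("2025-06-01", "EU"), 50.0),
]

# mapValues: revenue -> (revenue, 1)
pairs = [(key, (revenue, 1)) for key, revenue in sales]

# reduceByKey: merge (sum, count) pairs per key with an associative op
acc = defaultdict(lambda: (0.0, 0))
for key, (s, c) in pairs:
    acc[key] = (acc[key][0] + s, acc[key][1] + c)

# mapValues: (sum, count) -> average
avg_revenue = {key: s / c for key, (s, c) in acc.items()}
print(avg_revenue)
```

Because the merge step is associative and commutative, partial (sum, count) pairs can be combined per machine before shuffling, which is exactly the complexity the declarative `avg(revenue)` hides.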
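The `FROM STREAM` slides hinge on the same (sum, count) state being updated incrementally: each micro-batch folds new rows into per-group state instead of rescanning old data. A minimal sketch of that idea in plain Python (the `RunningAvg` class and batch layout are illustrative assumptions, not a Spark API; Structured Streaming additionally checkpoints this state for fault tolerance):

```python
class RunningAvg:
    """Per-group running average over (sum, count) state."""

    def __init__(self):
        self.state = {}  # (date, region) -> (sum, count)

    def update(self, batch):
        # Fold one micro-batch of ((date, region), revenue) rows into state.
        for key, revenue in batch:
            s, c = self.state.get(key, (0.0, 0))
            self.state[key] = (s + revenue, c + 1)

    def result(self):
        return {key: s / c for key, (s, c) in self.state.items()}

agg = RunningAvg()
agg.update([(("2025-06-01", "US"), 100.0)])
agg.update([(("2025-06-01", "US"), 300.0), (("2025-06-01", "EU"), 50.0)])
print(agg.result())
```

Each `update` call touches only the new rows, which is why the streaming query can "efficiently update as new data arrives" while producing the same answer as the batch query.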