Databricks 2023
Automate Apache Spark 3 Migration w/ Validation at Airbnb
Jason Xu, Data Warehouse Infra Team
Zoe Lin, Data Warehouse Infra Team

Agenda
- Overview: Spark at Airbnb
- Spark 3 Migration Challenges
- Migration w/ Validation Framework
- Lessons Learned
- Future work

Spark at Airbnb

Spark at Airbnb: Spark Infrastructure
- Amazon EMR-based Spark clusters
- Yarn resource scheduling
- 15+ shared and dedicated clusters
- Total 500 TB memory process power (1500 m5.24xlarge instances)
Diagram labels: Hive Metastore, Apache Airflow, Jupyter notebook, Spark infra, Amazon S3, Amazon EMR

Spark at Airbnb: Spark & Migration Stats
- Spark 3 migration started at Q2 2022
- Internal Spark 3 release with our patches
- To deprecate Spark 2.4 in 2024
Spark Stats
- 150k+ daily Spark applications
- Use 80% of cluster resources
- 2x growth from 2022 to 2023
Chart: Daily Spark applications

Spark at Airbnb: Spark Usage Landscape
Chart: Breakdown of languages (before migration, Q2 2022)
Platform generated jobs
- Includes: metrics platform (Minerva), feature platform (Chronon), event data ingestion, etc.
User written jobs
- Team and task specific
- Majority written in Scala, on top of the internal standard batch framework (Sputnik)
- Code resides in a monorepo for Java & Scala

Spark 3 Migration Challenges

Spark 3 Migration Challenges: Generic challenges
- Correctness: How to validate output correctness and consistency?
- Reliability & Performance: How to avoid application failure and performance degradation?
- Scalability: How to efficiently migrate thousands of pipelines owned by various teams?

Spark 3 Migration Challenges: Specific challenges
Diagram labels:
- Correctness
- Reliability & Performance
- Scalability
- Major version release: resolved 3.4k+ tickets
- Scala specific: Scala 2.12 upgrade
- Dependencies upgrade (public lib) & multi-version support (internal lib)
- code & SQL changes
- output data changes
- failures & perf degradations
Spark 3 API and behavior changes
Validation is key to ens