Airbnb
HONGBO ZENG

Agenda
- Data Platform at Airbnb
- Cluster Evolution
- Incremental Data Replication - ReAir
- Unified Streaming and Batch Processing - AirStream

Scale of Data Infrastructure at Airbnb
- 13B: # Events Collected
- 35PB: Warehouse Size
- 1400+: Machines (Hadoop + Presto + Spark)
- 5x: YoY Data Growth

Data Platform
[Architecture diagram. Components shown: Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster (HDFS, Hive, Yarn), Silver Cluster (HDFS, Hive, Yarn), Spark Cluster (Spark), ReAir, Airflow Scheduling, S3, Presto Cluster, AirPal, SuperSet, Tableau, AirStream]

Original Cluster
Setup
- Single HDFS, MR and Hive installation
- c3.8xlarge (32 cores / 60 GB mem / 640 GB disk) + 3 TB of EBS volume
- 800 nodes
- Tested DN on different AZs
- All data managed by Hive
Challenges
- Limited isolation between production/adhoc
- Adhoc
  - Difficult to meet SLAs
  - Harder for capacity plan
- Disaster recovery
- Difficult roll outs

Two Clusters
- Two independent HDFS, MR, Hive metastores
- d2.8xlarge w/ 48 TB local
- 250 instances in final setup
- Replication of common/critical data
  - Silver is a superset of Gold
- For disaster recovery, separate AZs
[Diagram: Gold Cluster (HDFS, Hive, Yarn), Replication, Silver Cluster (HDFS, Hive, Yarn)]

Multi-Cluster Trade-Offs
Advantages
- Failure isolation with user jobs
- Easy capacity planning
- Guarantee SLAs
- Able to test new versions
- Disaster Recovery
Disadvantages
- Data synchronization
- User confusion
- Operational overhead

Advantages
- Failure isolation with user jobs
- Easy capacity planning
- Guarantee SLAs
- Able to test new versions
- Disaster Recovery
Multi-Cl