Building The Real-time Datalake at ByteDance
李延加 / Gary Li
Software Engineer, ByteDance
Apache Hudi PMC Member

Agenda
#1 ByteDance Data Integration
#2 Why Hudi
#3 Datalake Integration Solution
#4 Use Cases
#5 Future Work

#1 ByteDance Data Integration

Evolution of Data Integration System at ByteDance
2018: Batch integration between heterogeneous data sources (Apache Flink)
2020: MQ-Hive, unifying streaming and batch (Apache Flink)
2021: MQ-Datalake, unifying data warehouse and datalake (Apache Flink)

Data Integration System at ByteDance
- Supports 50+ channels, including DB, MQ, and the big data ecosystem
- Three modes: Batch, Streaming, Incremental
- Supports all business lines, such as Douyin, Toutiao, etc.

Batch Mode
- Supports 20+ types of source and sink
- 100k+ jobs per day
Flow: MySQL, Oracle, HDFS, Redis, Hive, ES -> Data Integration Engine (Batch Mode) -> MySQL, Oracle, HDFS, Redis, Hive, ES

Streaming Mode
Daily throughput:
- MQ-Hive: 20+ PB, 10+ trillion rows
- MQ-HDFS: 100+ PB, 50+ trillion rows
Flow: Kafka, RocketMQ -> Data Integration Engine (Streaming Mode) -> Hive, HDFS

Incremental Mode (CDC)
- Supports 5 types of CDC source
- MySQL-Hive: 20k+ jobs
- Core ODS tasks
- Largest Hive table could be 100+ TB
* CDC: Change data capture
Flow: MySQL -> Batch Mode -> Hive (T-1); Kafka (Binlog) -> Streaming Mode -> HDFS; Hive (T-1) + HDFS -> Spark Merger -> Hive (T)

Incremental Mode (CDC)
Incremental Mode:
- Batch mode to bootstrap
- Streaming mode dumps the binlog to HDFS
- Spark merger produces the Hive table
Pain points:
- High computing cost: full shuffle for each run
- Latency of 1 hour: not available for real-time OLAP
- Complex pipeline: high operational load

Our Vision
In