1、李延加 / Gary Li
Software Engineer, ByteDance
Apache Hudi PMC Member
Building The Real-time Datalake at ByteDance
#1 ByteDance Data Integration
#2 Why Hudi
#3 Datalake Integration Solution
#4 Use Cases
#5 Future Work

2、#1 ByteDance Data Integration
Evolution of Data Integration System at ByteDance (each stage built on Apache Flink):
2018: Batch integration between heterogeneous data sources
2020: MQ-Hive, unify streaming and batch
2021: MQ-Datalake, unify data warehouse and datalake

3、Data Integration System at ByteDance
Supports 50+ channels, including DB, MQ, and the big data ecosystem
Three modes: Batch, Streaming, Incremental
Supports all business lines, such as Douyin, Toutiao, etc.
Batch Mode
Supports 20+ types of source and sink
100k+ jobs per day

4、Batch Mode pipeline: MySQL / Oracle / HDFS / Redis / Hive / ES → Data Integration Engine (Batch Mode) → MySQL / Oracle / HDFS / Redis / Hive / ES
Streaming Mode
Daily throughput: MQ-Hive 20+ PB, 10+ trillion rows; MQ-HDFS 100+ PB, 50+ trillion rows
Pipeline: Kafka / RocketMQ → Data Integration Engine (Streaming Mode) → Hive / HDFS

5、Incremental Mode (CDC*)
Pipeline: MySQL → Batch Mode → Hive (T-1); MySQL → Kafka (binlog) → Streaming Mode → HDFS; Hive (T-1) + binlog on HDFS → Spark Merger → Hive (T)
Supports 5 types of CDC source
MySQL-to-Hive: 20k+ jobs
Core ODS tasks
Largest Hive table can exceed 100 TB
*CDC: Change Data Capture

6、Incremental Mode (CDC)
How it works:
Batch mode to bootstrap
Streaming dump of binlog to HDFS
Spark Merger to produce the Hive table
Pain points:
High computing cost: full shuffle for each run
Latency of 1 hour: not available for real-time OLAP
Complex pipeline: high operation load
Our Vision
In
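The Spark Merger step in the CDC pipeline can be sketched logically as follows. This is a minimal, hypothetical illustration in plain Python dicts, not ByteDance's actual job: the real merger is a Spark job that joins the full T-1 snapshot with the day's binlog by primary key, which is why every run pays a full shuffle over the whole table.

```python
# Hypothetical sketch of the daily merge step: apply one day's binlog
# changes onto yesterday's snapshot (T-1) to produce today's (T).
# All names and the event schema here are illustrative assumptions.

def merge_snapshot(snapshot_t1, binlog_events):
    """snapshot_t1: {pk: row}; binlog_events: list of
    (seq, op, pk, row) where op is 'insert', 'update', or 'delete'."""
    snapshot = dict(snapshot_t1)  # start from yesterday's full snapshot
    # Replay events in log order so the last change per key wins.
    for seq, op, pk, row in sorted(binlog_events, key=lambda e: e[0]):
        if op == "delete":
            snapshot.pop(pk, None)
        else:  # insert / update both upsert the latest row image
            snapshot[pk] = row
    return snapshot

# Example: one update, one insert, one delete against a 2-row table.
t1 = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
events = [
    (10, "update", 1, {"id": 1, "v": "a2"}),
    (11, "insert", 3, {"id": 3, "v": "c"}),
    (12, "delete", 2, None),
]
t = merge_snapshot(t1, events)
# -> {1: {'id': 1, 'v': 'a2'}, 3: {'id': 3, 'v': 'c'}}
```

Note that the cost of this merge is proportional to the full table size, not the day's change volume; that is the "high computing cost, full shuffle for each run" pain point that motivates moving to a datalake format with record-level upserts.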