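NetEase, Ma Jin (马进)
Arctic: Using Flink and Iceberg to build the NetEase lakehouse

#1 Background and goals
#2 Arctic features
#3 Arctic architecture
#4 Summary

#1 Background and goals

Offline data platform
  Pipeline: batch transfer -> batch table -> olap system -> BI/AI business
  Engine labels (in slide order): spark/presto, hive, spark/hive, spark/hive, impala/sparksql
  Platform modules: data development, data assets, data lineage, data quality, data analysis, Youshu (有数), data model, task operations, data services
  Driver: timing scheduling drive

Real-time data warehouse
  Pipeline: cdc/event fetch -> stream table -> olap system -> BI/AI business
  Engine labels (in slide order): flinkcdc/canal/ndc, kafka, flink, impala+kudu, druid/doris/clickhouse, redis/mysql/oracle
  Platform modules: real-time development, real-time lineage, real-time task monitoring and operations
  Shown alongside the batch path: flink; batch transfer -> batch table -> olap system -> BI/AI business
  Drivers: timing scheduling drive (batch), event drive (real-time)

Existing problems
  [Figure: data governance vs point-to-point development. Labels: data sources (sensors, logs, databases), datalake, subject domains / data layering, model, DQC, properties, demand (x3), data governing, p2p developing]

Goal: a real-time data platform
  Pipeline: batch transfer + cdc/event fetch -> fused table -> olap system -> BI/AI business
  Engine labels (in slide order): spark/hive/flink, spark/flink, impala/sparksql
  Platform modules: data development, data assets, data lineage, data quality, data analysis, Youshu (有数), data model, task operations, data services
  Drivers: timing scheduling drive, event drive

Breaking down the unified stream-batch goal
  Unified storage (one data fits all)
    unified schema
    unified storage engine
    unified storage medium
    no ambiguity between the stream and batch copies of the data
  Unified development (one code fits all)
    one set of code covers both real-time and offline scenarios
    unified UDFs
    unified development conventions
  Unified tools (one tool fits all)
    data model
    data assets
    data quality
    data lineage
    data transfer

Arctic requirements
  Support streaming updates based on primary keys
  Support streaming and incremental reads (stream and CDC)
  Support concurrent reads and writes from multiple engines, with ACID guarantees
  Provide OLAP with minute-level data latency
  Provide a lakehouse service, not a software library

#2 Arctic features

[Figure labels: change file (insert/update/delete), base file (insert), tmp file, new base table, change table, base table, tmp files]

STEP 1: Hive MR incremental transfer scheme
  Stages: map and shuffle, reduce
  Low IO efficiency (total IO relative to effective updates)
  Trade-off between write amplification and data latency
  No ACID guarantee
  Suited to hour-level incremental synchronization

STEP 2: Bucket-based data compaction scheme
chang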