《FFA2024分论坛-流批一体.pdf》由会员分享,可在线阅读,更多相关《FFA2024分论坛-流批一体.pdf(210页珍藏版)》请在三个皮匠报告上搜索。
1、Flink Materialized Table:构建流批一体 ETL刘大龙Apache Flink CommitterA User Story Of Data EngineerMaterialized Table 构建流批一体 ETLDemoA User Story Of Data Engineer构建企业级湖仓架构Data AnalyticsLakehouseODS原始数据DWD明细数据DWS汇总数据DashboardBI&ReportsData ScienceOSS但现实离线同步DataXBatch ETLQuickBIIcebergIcebergIcebergBatch ETL但现实离
2、线同步DataXIcebergIcebergIcebergBatch ETLBatch ETLQuickBI每天调度但现实却很复杂离线同步DataXBatch ETLQuickBIIcebergIcebergIcebergBatch ETL每天调度实时同步CanalStream ETLQuickBIStream ETLFlink数仓Stream ETLFlinkFlink但现实却很复杂两套引擎,两套代码,统计口径不一致!离线同步DataXIcebergIcebergIcebergBatch ETLBatch ETLQuickBI每天调度实时同步CanalQuickBIStream ETLFli
3、nkStream ETL数仓FlinkStream ETLFlinkSQL 代码不能复用,数据不一致流计算成本高,回刷低效两套存储、两套计算,成本高开发运维两套 Pipeline增量计算,一套架构批计算回刷代码无法复用问题总结Lambda架构Lakehouse架构Materialized Table 构建流批一体 ETLMaterialized Table:声明式 ETL 统一流批作业请添加标题字内容这是普通内容字请添加标题字内容这是普通内容字IngestMaterialized Table数据摄数据湖仓计算引擎业务逻辑CREATE MATERIALIZED TABLE customer_or
4、dersFRESHNESS=INTERVAL 1 MINUTEASSELECT*FROM ordersLEFT JOIN customersON orders.customer_id=customers.id;orders(Table)customers(Table)根据新鲜度自动选择流批模式自动刷新结果湖表业务时效性任意 QueryCREATE TABLE IF NOT EXISTS customer_orders(.);SET bizdate=20241125;INSERT OVERWRITE customer_ordersPARTITION(ds=$bizdate)ASSELECT*FR
5、OM ordersLEFT JOIN customersON orders.customer_id=customers.id;WHERE orders.ds=$bizdateALTER TABLE customer_ordersPARTITION(ds=$bizdate)MERGE SMALLFILES;调度周期/天(手工配置)CREATE MATERIALIZED TABLE customer_ordersPARTITION BY(ds)FRESHNESS=INTERVAL 1 MINUTEASSELECT*FROM ordersLEFT JOIN customersON orders.cu
6、stomer_id=customers.id;传统数仓 ETL新代流批体 ETL表管理T+1处理T+1处理件管理业务价值从命令式 ETL 到声明式 ETLMaterialized Table 让你更加专注业务价值SET bizdate=20241125;INSERT OVERWRITE customer_ordersPARTITION(ds=$bizdate)ASSELECT*FROM ordersLEFT JOIN customersON orders.customer_id=customers.id;WHERE orders.ds=$bizdateALTER MATERIALIZED TA