Unlocking Near Real Time Data Replication with CDC, Apache Spark™ Streaming, and Delta Lake.pdf

Uploaded by: 2***  ID: 139075  2023-06-04  PDF  26 pages  1.50 MB

Part of the collection: Data + AI Summit 2023 presentation slides


Unlocking Near Real Time Data Replication with CDC, Apache Spark Streaming, and Delta Lake
Databricks 2023
Ivan Peng and Phani Nalluri

How many orders did DoorDash do yesterday?

Get me data from databases
select * from table_name

Get me data from databases, fast
select * from table_name where updated_at$LATEST_DATE
merge

Get me data from databases, fast and as the schema changes
select * from information.schemas where name = table_name;
merge
reconcile
page
incompatible
x1000

Somewhere in there is a migration from Redshift to Snowflake, and building a whole orchestration system around the tasks

History
AKA the State of Data at DoorDash, 2020
90% of 1000 DB tables were dumped to Snowflake via a naive dump
Incremental tables required:
  Table to have an updated_at field
  Index on that field
  Application to update that field on every write operation
CDC was present, but in its infancy at DoorDash

Project Pepto
Alleviating indigestion of data processing

Requirements
Have better data freshness than 24 hours
Own our data on a modern Lakehouse platform
Handle schema evolution and backfills
Enable analytical workloads that otherwise would have been run on the production databases

Design Tenets
Lean into CDC/Kafka across all database flavors
Build a self-serve platform to democratize onboarding of tables
Write-once, read many
Leverage streaming checkpointing to bypass late-arriving data
Operational simplicity

Project Pepto
What we are not
A coupled service with databases
A real-time system that feeds into online services

Project Pepto
Highlighted Design Decisions
Not-ka
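The incremental-table requirements in the History slide (an updated_at field, an index on it, and an application that bumps it on every write) can be illustrated with a minimal sketch. It is not DoorDash's code: the `orders` table, its columns, the in-memory SQLite database, and the dict standing in for the warehouse replica are all hypothetical, and the `>` comparison against the watermark is an assumption, since the operator in the slide's query did not survive extraction.

```python
import sqlite3


def incremental_pull(conn, watermark):
    """Fetch rows whose updated_at is past the watermark; return them with the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",  # '>' is an assumed operator
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


def merge_rows(replica, rows):
    """Upsert pulled rows into the replica, a dict keyed by primary key."""
    for row_id, status, updated_at in rows:
        replica[row_id] = {"status": status, "updated_at": updated_at}


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    # Requirement: an index on the updated_at field.
    conn.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, "placed", "2023-06-01T10:00:00"),
            (2, "placed", "2023-06-01T11:00:00"),
        ],
    )

    replica = {}
    rows, watermark = incremental_pull(conn, "")  # first pull takes everything
    merge_rows(replica, rows)

    # Requirement: the application updates updated_at on every write.
    conn.execute(
        "UPDATE orders SET status = 'delivered', "
        "updated_at = '2023-06-01T12:00:00' WHERE id = 1"
    )
    rows, watermark = incremental_pull(conn, watermark)  # second pull sees one row
    merge_rows(replica, rows)
    return replica, watermark, len(rows)
```

Polling on a timestamp like this cannot see hard deletes, and it silently misses any write path that forgets to update the field, which is one reason a log-based CDC feed is attractive.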

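DoorDash achieved near-real-time data replication using CDC, Apache Spark Streaming, and Delta Lake. The talk starts from the problem of pulling data from databases quickly and merging it as schemas change, with an orchestration system built around those tasks amid a migration from Redshift to Snowflake. In 2020, 90% of 1,000 database tables were loaded into Snowflake through a naive dump; incremental tables required an `updated_at` field, an index on that field, and an application that updated it on every write operation.

Project Pepto set out to improve data freshness, own the data on a modern Lakehouse platform, handle schema evolution and backfills, and enable analytical workloads that would otherwise run on the production databases. Its design tenets are: use CDC/Kafka across all database flavors, build a self-serve platform to democratize table onboarding, write once and read many times, use streaming checkpointing to bypass late-arriving data, and keep operations simple. Project Pepto is not a service coupled to the databases, and it is not a real-time system that feeds online services.

Design decisions include a non-Kappa architecture, pinning schemas with a schema registry, and choosing Delta Lake over other table formats. The system runs in steady-state, rebuild, and batch-merge modes.

Results: table onboarding takes under 1 hour and is self-serve; 450 streams run on 1,000 EC2 nodes, ingesting about 800 GB and rewriting about 80 TB per day, with data freshness of roughly 7 to 30 minutes.

Challenges and lessons: checkpointing solved many problems; type conversion is hard; every adapter has two serializers; large tables are operationally challenging; state management is hard; and the idempotency guarantees of the Databricks API simplified a great deal.

Future work includes migrating ad hoc queries against online databases to Delta Lake workloads, streaming PII obfuscation within a Medallion architecture, and handling schema changes in source data.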
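To make the idea of merging CDC changes idempotently more concrete, here is a minimal sketch of applying one micro-batch of change events to a replicated table. The source does not show how Project Pepto does this, so everything here is an assumption for illustration: the event shape (`key`, `seq`, `op`, `row`), the use of a monotonically increasing sequence number such as a log offset, and the plain dict standing in for the target table, which in a real pipeline would be a Delta Lake table updated with a merge.

```python
def latest_per_key(events):
    """Keep only the highest-sequence event for each key in the batch."""
    latest = {}
    for event in events:
        current = latest.get(event["key"])
        if current is None or event["seq"] > current["seq"]:
            latest[event["key"]] = event
    return latest


def apply_batch(table, events):
    """Merge a batch of change events into `table` (hypothetical stand-in for the target).

    Each stored entry remembers the sequence number that produced it, so
    replaying the same batch or an older one changes nothing.
    """
    for key, event in latest_per_key(events).items():
        existing = table.get(key)
        if existing is not None and existing["seq"] >= event["seq"]:
            continue  # stale or replayed event: skip
        if event["op"] == "delete":
            # Keep a tombstone so an older upsert cannot resurrect the row.
            table[key] = {"seq": event["seq"], "deleted": True, "row": None}
        else:
            table[key] = {"seq": event["seq"], "deleted": False, "row": event["row"]}
    return table


# Hypothetical batch, deliberately out of order within key 1.
EVENTS = [
    {"key": 1, "seq": 3, "op": "upsert", "row": {"status": "delivered"}},
    {"key": 1, "seq": 1, "op": "upsert", "row": {"status": "placed"}},
    {"key": 2, "seq": 2, "op": "upsert", "row": {"status": "placed"}},
    {"key": 2, "seq": 4, "op": "delete", "row": None},
]
```

Calling `apply_batch({}, EVENTS)` leaves key 1 at its latest state and key 2 as a tombstone; calling it a second time with the same events leaves the table unchanged, which is the property that makes retries after a failed batch safe.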

