The Future of Data Orchestration: Asset-Based Orchestration (PDF, 39 pages)
Asset-Based Data Orchestration
Sandy Ryza (s_ryz), Lead Engineer, Dagster Project - Elementl

Data practitioners build and maintain data pipelines

What's a data pipeline?
(Diagram: data assets linked by computations)
- A data asset can be a table, a file, or an ML model

Data pipelines span entire organizations
(Diagram labels: App DB, CRM, Marketing Analytics, Core Entities, Recommender System, Third Party, Product Analytics)

Automatically updating data assets

Why update a data asset?
- Inputs have changed

Changing inputs
- Changing constantly
- New partition every day

Why update a data asset?
- Inputs have changed
- Code has changed
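
The two reasons listed so far, changed inputs and changed code, can be pictured as a simple staleness check. The following is a minimal sketch and not Dagster's implementation: the `Asset` class, its fields, the `reasons_to_update` helper, and the asset names `raw_events` and `daily_summary` are all hypothetical, and timestamps are plain numbers for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Asset:
    """Hypothetical, minimal stand-in for a data asset (not a Dagster class)."""
    name: str
    code_version: str                                   # version of the code defining the asset
    upstream: List[str] = field(default_factory=list)   # names of input assets
    materialized_at: Optional[float] = None              # time of the last update
    materialized_code_version: Optional[str] = None      # code version used for that update


def reasons_to_update(asset: Asset, assets_by_name: Dict[str, Asset]) -> List[str]:
    """Return the reasons (possibly none) why `asset` should be updated."""
    # Assumption of this sketch: an asset that was never built always needs a run.
    if asset.materialized_at is None:
        return ["never materialized"]

    reasons = []
    # Reason 1: an input was updated after this asset was last updated.
    for parent_name in asset.upstream:
        parent = assets_by_name[parent_name]
        if parent.materialized_at is not None and parent.materialized_at > asset.materialized_at:
            reasons.append(f"input changed: {parent_name}")
    # Reason 2: the defining code differs from the code used for the last update.
    if asset.materialized_code_version != asset.code_version:
        reasons.append("code changed")
    return reasons


# Hypothetical two-asset pipeline: daily_summary is computed from raw_events.
raw = Asset("raw_events", "v1", materialized_at=200.0, materialized_code_version="v1")
summary = Asset(
    "daily_summary",
    "v2",                              # code has been edited since the last run
    upstream=["raw_events"],
    materialized_at=100.0,             # older than raw_events' latest update
    materialized_code_version="v1",
)
assets = {a.name: a for a in (raw, summary)}

print(reasons_to_update(summary, assets))  # ['input changed: raw_events', 'code changed']
print(reasons_to_update(raw, assets))      # []
```

An asset with an empty list of reasons can be skipped, which is one way to avoid redundant work while still reacting to real changes.
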
Code changes
- Updated business logic

Why update a data asset?
- Inputs have changed
- Code has changed
- Fresh data is needed

Fresh data is needed
- By 9 am daily, for exec meeting
- As soon as new data arrives

Automatically updating data assets: how?

The status quo: workflow engines
- DAG of tasks
- Run the DAG every hour/day/whatever

Workflow engines: not actually the best way to schedule data pipelines?
- Forces running in lockstep
- Caught between doing redundant work and stale data
- Code management
- What DAG should this new data asset be a part of?
- Monolithic DAG objects
- Alerts when tasks fail vs. when data is late

A different way: Asset-based orchestration

Goals of asset-based orchestration
- Outcomes
  - Make data ready on time
  - Avoid redundant work
- Express scheduling in terms of the data assets
  - When does source data change?
  - How fresh do data assets need to be?
- Understand scheduling decisions

Building a pipeline
aka defining some data assets

Asset-based orchestration in Dagster
- Auto-materialize policies
- The root of the graph? Source assets
  (Diagram label: source asset)
- What about code changes?
- Lazy auto-materialization
  (Diagram labels: Downstream asset, Upstream asset)
- Freshness policies
  run fraudulent_logins_model midnight run events_table midnigh