Why Version Control Is Essential For Your Lakehouse Architecture
Oz Katz, June 2024
2024 Databricks Inc. All rights reserved

DATA LAKEHOUSE
Humans, organizing things as humans tend to do.

YES, THIS IS MY ACTUAL DESKTOP

HUMANS ARE CREATIVE, MESSY
mess = […] * len(engineers) * len(data_scientists)

HUMANS NAMING THINGS

YOUR CODE (ANOTHER HUMAN ARTIFACT) *IS* MUCH BETTER, I ASSURE YOU
WHAT, WHO, WHEN

BUT DATA IS HARDER

WHERE DO I RUN TESTS?
HERE? HERE?

WHEN DO I RUN TESTS? (SPOILER: TOO LATE)
Our ETL starts.
Spark writes a bunch of data to a bunch of tables.
Someone else's sensor kicks off a dependent ETL.
Someone else's Spark job reads our output as its input.
We run fancy anomaly detection, assertions and other tests.
We find out there's a problem! The tests did their job!
But it's too late, and someone's jobs have already completed.
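
The timeline on the "WHEN DO I RUN TESTS?" slide can be replayed as a small Python sketch. It is only an illustration under stated assumptions, not anything from the talk: the `Table` class, the `run_pipeline_in_place` function, the sample rows and the "no negative values" test are all hypothetical, and a plain in-memory list stands in for tables written by Spark. The point it shows is the ordering problem: when output is written in place, a downstream reader can consume a bad row before the tests that would catch it have run.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical stand-in for a lakehouse table that is written in place."""
    rows: list = field(default_factory=list)


def run_pipeline_in_place(bad_row):
    """Replay the slide's timeline: write, downstream read, then test."""
    table = Table()
    timeline = []

    # Our ETL writes straight into the table that consumers read from.
    table.rows.extend([10, 12, bad_row])
    timeline.append("our ETL wrote output")

    # Someone else's dependent job picks up the new data right away.
    downstream_input = list(table.rows)
    timeline.append("downstream job read output")

    # Only now do our tests run (assumed rule: values must be non-negative).
    failures = [row for row in table.rows if row < 0]
    timeline.append("tests ran")

    # Rows that failed the test but were already consumed downstream.
    leaked = [row for row in downstream_input if row in failures]
    return timeline, failures, leaked


timeline, failures, leaked = run_pipeline_in_place(bad_row=-999)
print(timeline)
print("failed rows:", failures)
print("already consumed downstream:", leaked)
```

The test does find the bad row, but `leaked` is non-empty because the read happened earlier in the timeline than the check.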