Lakehouse Technology as the Future of Data Warehousing
Wenchen Fan

About Databricks
- Cloud-based data and AI platform for over 7,000 customers
- Over 10 million VMs processing exabytes of data per day
- Exabytes of data under management
- 800+ engineers
- Used for ETL, data science, ML, and data warehousing

This Talk
- Lakehouse systems: what are they and why now?
- Building lakehouse systems
- Ongoing projects

What Matters to Data Platform Users?
One might think performance, functionality, etc., but these are secondary! The top problems enterprise data users have are often with the data itself:
- Access: can I even get this data in the platform I use?
- Reliability: is the data correct?
- Timeliness: is the data fresh?
Without great data, you can't do any analysis!

Data Analyst Survey
- 60% reported data quality as their top challenge
- 86% of analysts had to use stale data, with 41% using data that is 2 months old
- 90% regularly had unreliable data sources
Getting high-quality, timely data is hard, but it is also a problem with system architectures!
1980s: Data Warehouses
- ETL data directly from operational database systems
- Rich management and performance features for SQL analytics: schemas, indexes, transactions, etc.
- Flow: operational data -> ETL -> data warehouses -> BI and reports

2010s: New Problems for Data Warehouses
- Could not support rapidly growing unstructured and semi-structured data: time series, logs, images, documents, etc.
- High cost to store large datasets
- No support for data science & ML

2010s: Data Lakes
- Low-cost storage holds all raw data behind a file API (e.g. S3, HDFS)
- Open file formats (e.g. Parquet) are accessible directly by ML/DS engines
- ETL jobs load specific data into warehouses, possibly for further ELT
- Flow: structured, semi-structured & unstructured data -> data lake (90% of enterprise data) -> ETL -> data warehouses -> BI and reports, with data science and machine learning reading the lake directly
Two Incompatible Architectures Get in the Way
- Data lake: vast amounts of raw data (logs, text, audio, video, images); governance and security over files and blobs; serves data science & ML and data streaming. All of the data, and very adaptable.
- Data warehouse: structured tables holding subsets of data copied from the lake; governance and security via table ACLs; serves business intelligence and SQL analytics. Highly reliable and efficient.
The result:
- Disjointed and duplicative data silos
- Incompatible security and governance models
- Incomplete support for use cases

There Is No Need to Have Two Disparate Platforms
A single system can keep the strengths of both sides:
- Open, reliable data storage that efficiently handles all data types: structured tables and unstructured files
- One security and governance approach for all data assets (files, blobs, and table ACLs) on all clouds
- All ML, SQL, BI, and streaming use cases over one copy of the data
This is the lakehouse paradigm.

This Talk
- Lakehouse systems: what are they and why now?
- Building lakehouse systems
- Ongoing projects
Technologies
- Delta Lake: data reliability and performance
- Unity Catalog: fine-grained governance for data and AI
- Data applications on top: data science & ML, data streaming, business intelligence, SQL analytics

Key Technologies Enabling Lakehouse
1. Metadata layers on data lakes: add transactions, versioning & more
2. Lakehouse engine designs: performant SQL on data lake storage
3. Declarative I/O interfaces for data science & ML
Metadata Layers on Data Lakes
Track which files are part of each table version to offer rich management features such as ACID transactions; clients can then access the underlying files at high speed. (A client application asks the metadata layer "which files are part of table v1?" and gets back, say, f1, f2, f3 out of the files f1-f4 in the data lake.)

Example: Traditional Data Lake
An "events" table is stored as file1.parquet, file2.parquet, file3.parquet. The query "delete all events data about customer #17" rewrites file1.parquet to file1b.parquet and file3.parquet to file3b.parquet, then deletes file1.parquet and file3.parquet. Problem: what if a query reads the table while the delete is running? It can see a half-deleted, inconsistent state.

Example: Delta Lake
The same table carries a _delta_log directory that tracks which files are part of each version of the table (e.g. v2 = file1, file2, file3). The delete still rewrites file1 and file3, but instead of removing files in place, it atomically adds a new log entry: v3 = file1b, file2, file3b. Clients always read a consistent table version! (Armbrust et al., VLDB 2020)
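To make the atomicity argument concrete, here is a minimal sketch of the idea, not Delta's actual implementation: each commit writes one new log entry listing the files added and removed, and the entry becomes visible all at once, so a reader replaying the log only ever sees complete versions. The file names and JSON layout are illustrative.

    import json, os

    def commit(log_dir, version, adds, removes):
        # One log entry = the full set of file changes for this version.
        actions = [{"add": f} for f in adds] + [{"remove": f} for f in removes]
        tmp = os.path.join(log_dir, f".v{version}.tmp")
        final = os.path.join(log_dir, f"v{version}.json")
        with open(tmp, "w") as f:
            json.dump(actions, f)
        # os.link fails if `final` already exists, so two writers can never
        # both claim the same version number: the commit is all-or-nothing.
        os.link(tmp, final)
        os.unlink(tmp)

    # The delete example above would commit:
    # commit("_delta_log", 3, adds=["file1b.parquet", "file3b.parquet"],
    #        removes=["file1.parquet", "file3.parquet"])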
Other Management Features with Delta Lake
- Time travel to old table versions:
    SELECT * FROM my_table TIMESTAMP AS OF "2020-05-01"
- Zero-copy CLONE by forking the log:
    CREATE TABLE my_table_dev SHALLOW CLONE my_table
- DESCRIBE HISTORY
- Schema enforcement & constraints
- Streaming I/O: treat a table as a stream of changes, removing the need for message buses like Kafka:
    spark.readStream.format("delta").table("events")
- Secure cross-organization sharing with Delta Sharing: the provider runs a Delta Sharing server in front of its Delta tables, and clients read the underlying files directly via the Delta Sharing API, using cloud storage signed URLs for fast access.
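On the consumer side, the open-source delta-sharing Python connector makes a shared table look like a local dataset. A hedged example; the profile file and the share/schema/table names here are placeholders, not from the talk:

    import delta_sharing

    # "config.share" is a profile file with the endpoint and credentials
    # that the data provider issues to the recipient.
    table_url = "config.share#my_share.my_schema.events"

    # Fetches signed URLs from the sharing server, then reads the
    # underlying Parquet files directly into a pandas DataFrame.
    df = delta_sharing.load_as_pandas(table_url)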
Key Technologies Enabling Lakehouse
1. Metadata layers on data lakes: add transactions, versioning & more
2. Lakehouse engine designs: performant SQL on data lake storage
3. Declarative I/O interfaces for data science & ML

The Challenge
Most data warehouses have full control over the data storage system and the query engine, so they design the two together. The key idea in a lakehouse is to store data in open storage formats (e.g. Parquet) for direct access from many systems. How can we get great performance with these standard, open formats?

Enabling Lakehouse Performance
Even with a fixed, directly accessible storage format, four optimizations help:
1. Auxiliary data structures such as statistics and indexes (minimize I/O for cold data)
2. Data layout optimizations within files (minimize I/O for cold data)
3. Caching hot data in a fast format (match DW performance on hot data)
4. Execution optimizations such as vectorization (match DW performance on hot data)
New query engines such as the Databricks Photon engine use these ideas.
Optimization 1: Auxiliary Data Structures
Even if the base data is in Parquet, we can build other data structures to speed up queries, and maintain them transactionally (updated along with the Delta table log). Example: min/max zone maps for data skipping.

    file1.parquet   year: min 2018, max 2019   uid: min 12000, max 23000
    file2.parquet   year: min 2018, max 2020   uid: min 12000, max 14000
    file3.parquet   year: min 2020, max 2020   uid: min 23000, max 25000

For the query SELECT * FROM events WHERE year = 2020 AND uid = 24000, the zone maps show that only file3.parquet can contain matching rows, so the other files are skipped entirely.
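A small self-contained sketch of the pruning logic, using the file statistics from the slide (the table layout is made up for illustration):

    # File-level min/max statistics let the planner discard files whose
    # value ranges cannot possibly satisfy the predicate.
    files = [
        {"path": "file1.parquet", "year": (2018, 2019), "uid": (12000, 23000)},
        {"path": "file2.parquet", "year": (2018, 2020), "uid": (12000, 14000)},
        {"path": "file3.parquet", "year": (2020, 2020), "uid": (23000, 25000)},
    ]

    def may_match(stats, col, value):
        lo, hi = stats[col]
        return lo <= value <= hi

    # WHERE year = 2020 AND uid = 24000 -> only file3.parquet survives.
    to_scan = [f["path"] for f in files
               if may_match(f, "year", 2020) and may_match(f, "uid", 24000)]
    print(to_scan)  # ['file3.parquet']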
Optimization 2: Data Layout
Even with a fixed storage format such as Parquet, we can optimize the data layout within tables to minimize I/O. Example: Z-order sorting for multi-dimensional clustering, which keeps rows that are close in several dimensions (dimension 1 and dimension 2) close together in the same files.
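In Delta Lake this is exposed as an OPTIMIZE command; a hedged example, with the table and column names made up to match the zone-map example above. Co-locating rows by (year, uid) lets min/max statistics prune far more files for predicates on both columns.

    from pyspark.sql import SparkSession

    # Assumes a SparkSession with the Delta Lake extensions configured.
    spark = SparkSession.builder.getOrCreate()
    spark.sql("OPTIMIZE events ZORDER BY (year, uid)")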
Delta Lake 2.0: We Are Open-Sourcing All of Delta
Unlock the power of Delta Lake:
- ACID transactions, scalable metadata, time travel, unified batch/streaming
- Schema enforcement and schema evolution, audit history, DML operations
- Compaction, OPTIMIZE and OPTIMIZE ZORDER, data skipping via column stats
- MERGE enhancements, stream enhancements, simplified LogStore, S3 multi-cluster writes
- Change data feed, table restore, multi-part checkpoint writes
- Generated columns (including with partitioning), identity columns, column mapping
- Subqueries in deletes and updates, clones
- Coming soon: Iceberg to Delta converter, fast metadata-only deletes
Optimization 3: Caching
Most data warehouses cache hot data in SSD or RAM. We can do the same in a lakehouse, using the metadata layer for consistency. Example: the SSD cache in the Photon engine.

Optimization 4: Vectorized Execution
Many existing execution ideas can also be applied over open formats like Parquet. Example: the Databricks Photon vectorized engine.
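A toy illustration of why vectorization pays off, using NumPy as a stand-in for an engine's columnar kernels (the data is synthetic):

    import numpy as np

    # A column of one million uid values, processed as a batch.
    uid = np.random.randint(10_000, 30_000, size=1_000_000)

    # Row-at-a-time, interpreted filter: per-element Python overhead.
    slow = [u for u in uid if u >= 24_000]

    # Vectorized filter: one tight kernel call over the whole batch,
    # the style a vectorized engine applies to decoded Parquet columns.
    fast = uid[uid >= 24_000]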
Photon: The Query Engine for Lakehouse Systems
The Photon paper won the SIGMOD 2022 Best Industry Paper Award, given annually to one paper based on "the combination of real-world impact, innovation, and quality of the presentation."

Putting These Ideas Together
Lakehouse engines can match data warehouse performance on both hot and cold data!
Key Technologies Enabling Lakehouse
1. Metadata layers on data lakes: add transactions, versioning & more
2. Lakehouse engine designs: performant SQL on data lake storage
3. Declarative I/O interfaces for data science & ML

ML over a Data Warehouse Is Painful
- Unlike SQL workloads, ML workloads need to process large amounts of data with non-SQL code (e.g. TensorFlow, XGBoost)
- SQL over JDBC/ODBC is too slow for this at scale
- Export data to a data lake? That adds a third ETL step and more staleness!
- Maintain production datasets in both the DW and the lake? Even more complexity!

ML over a Lakehouse
- Direct access to data files without overloading the SQL frontend; ML frameworks already support reading Parquet!
- Declarative APIs such as Spark DataFrames can help optimize queries, as in the example below.
    users = spark.table("users")
    buyers = users[users.kind == "buyer"]
    train_set = buyers[["start_date", "zip", "product"]].fillna(0)
    model.fit(train_set)

The first three lines only build a lazily evaluated query plan (users -> SELECT kind = "buyer" -> PROJECT start_date, zip, ... -> PROJECT NULL -> 0), so the engine can optimize it before any data is read for model.fit.

Data-Integrated ML Goes Much Further
Databricks Machine Learning lets data and ML users collaborate:
- ML model metrics become tables thanks to MLflow Tracking
- The Feature Store runs on Delta for storage and Spark Streaming for pipelines
- Models can be used in SQL or ETL jobs
Much simpler than using separate data and ML platforms.
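To make "model metrics become tables" concrete, here is a hedged sketch using the open-source MLflow Tracking API; the parameter and metric names are made up:

    import mlflow

    # Each run's params and metrics land in the tracking store.
    with mlflow.start_run():
        mlflow.log_param("max_depth", 8)
        mlflow.log_metric("auc", 0.91)

    # Logged runs come back as a queryable pandas DataFrame,
    # i.e. a table of experiments.
    runs = mlflow.search_runs()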
Summary
Lakehouse systems combine the benefits of data warehouses and data lakes:
- Open interfaces for direct access from a wide range of tools
- Management features via metadata layers (transactions, versioning, etc.)
- Performance via new query engines
- Low cost, equal to cloud storage
One system serves streaming analytics, BI, data science, and machine learning over structured, semi-structured & unstructured data. Result: simpler data architectures that improve access, reliability & timeliness.

This Talk
- Lakehouse systems: what are they and why now?
- Building lakehouse systems
- Ongoing projects
We Think There's a Lot More to Do in Data!
Enterprises are just starting to use large-scale data and ML. In five years, there will be 10-100x more users working with these tools, and 10-100x more data and ML applications. Some ongoing projects: declarative data pipelines (Delta Live Tables), centralized governance (Unity Catalog), and next-generation engine designs.

Delta Live Tables: Declarative Data Pipelines
Declarativity was great in SQL, but SQL lives within a larger pipeline (e.g. Airflow tasks). What if we had a data model of the pipeline's operations and tables? We could analyze across tasks, fork pipelines to test, roll back, inject checks, etc. See Michael Armbrust's blog post and demo.
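A hedged sketch of what such a declarative pipeline step looks like with the Delta Live Tables Python API (this runs inside a Databricks DLT pipeline, where `spark` is provided; the table names are made up):

    import dlt
    from pyspark.sql.functions import col

    # Declaring a table as a function gives the system a data model of the
    # pipeline: DLT can infer dependencies between tables, inject quality
    # checks, and recover from failures.
    @dlt.table(comment="events with null user ids dropped")
    def clean_events():
        return spark.table("raw_events").where(col("uid").isNotNull())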
Unity Catalog: Central Governance for Data & ML
Governance requirements for data are rapidly evolving. Unity Catalog provides rich yet efficient access control for millions of data & ML assets, and also gives unified lineage. See Matei's blog post. Example:

    GRANT SELECT, EXEC ON DATABASE iot_data WHERE NOT TAGGED(pii) TO product_managers
New Engine Projects
- Photon: native, vectorized engine for compute operators
- Aether: ongoing effort to revamp the entire scheduling & execution framework
- Streaming: just started a new team to revamp our streaming engine

Conclusion
Databricks tackled one of the key problems organizations have: a simple platform that lets diverse users work with all their data, in use cases from SQL to ML. There's a lot left to do in this space!