Lakehouse Technology as the Future of Data Warehousing
Wenchen Fan

About
Cloud-based data and AI platform for over 7000 customers
Over 10 million VMs processing exabytes of data per day
Exabytes of data under management
800+ engineers
Used for ETL, data science, ML and data warehousing

This Talk
Lakehouse systems: what are they and why now?
Building lakehouse systems
Ongoing projects

What Matters to Data Platform Users?
One might think performance, functions, etc., but these are secondary!
The top problems enterprise data users have are often with the data itself:
Access: can I even get this data in the platform I use?
Reliability: is the data correct?
Timeliness: is the data fresh?
Without great data, you can't do any analysis!

Data Analyst Survey
60% reported data quality as top challenge
86% of analysts had to use stale data, with 41% using data that is 2 months old
90% regularly had unreliable data sources
Getting high-quality, timely data is hard, but it's also a problem with system architectures!

1980s: Data Warehouses
ETL data directly from operational database systems
Rich management and performance features for SQL analytics: schemas, indexes, transactions, etc.
[Diagram labels: Operational Data, ETL, Data Warehouses, BI, Reports]

2010s: New Problems for Data Warehouses
Could not support rapidly growing unstructured and semi-structured data: time series, logs, images, documents, etc.
High cost to store large datasets
No support for data science & ML
[Diagram labels: Operational Data, ETL, Data Warehouses, BI, Reports]

2010s: Data Lakes
Low-cost storage to hold all raw data with a file API (e.g. S3, HDFS)
Open file formats (e.g. Parquet) accessible directly by ML/DS engines
ETL jobs load specific data into warehouses, possibly for further ELT
[Diagram labels: Structured, Semi-structured & Unstructured Data; Data Lake; 90% of enterprise data; ETL; Data Warehouses; BI; Reports; Data Science; Machine Learning]

Two incompatible archit