JUNE 9-12, SAN FRANCISCO

Empowering PySpark with Lance AI Datalake for Multimodal AI Data Pipelines
Lu Qiu / Database Engineer
Allison Wang / Staff Software Engineer
June 11, Wed, 12:20 PM

The multimodal data challenge
It's not just text and vectors anymore
Modern data is multimodal: images, PDFs, videos, audio, sensor data, even embedded links.
AI and LLMs unlock this data.
Traditional data stacks aren't optimized for working with blobs and embeddings at scale.

Lance AI Datalake
Multimodal data storage (text, images, embeddings, videos, etc.)
Fast data access:
Fast scan for analytics / training data loading
O(1) random access for search / shuffle
Scalable disk-based indices

Parquet + Iceberg + Indices for AI
Lance
Lance Function

Data Evolution, more than just old-school schema evolution

Example: image processing
Store images in Lance AI Datalake and incrementally add embeddings and captions
World Labs

Example: video processing
Store extracted images from video in Lance AI Datalake and process
Table 1: Videos & Video-Derived Data
Table 2: Images & Image-Derived Data
Table 3: Audio & Audio-Derived Data

Lance AI Search Engine
Scalable: Compute-Storage Separation
Hybrid: Vector + Full-text + SQL
Local experiment with the Python lancedb library
Massively scalable with LanceDB Cloud/Enterprise offering (Bonus: Same code!)

Lance AI Datalake
Frictionless AI Data Infra

Lance AI Datalake Powers End-to-End AI Workflows with a Rich Ecosystem
Pandas
Arrow
AutoGen

Lance AI Datalake
World Labs
Data Infra
Agents
Powering the future of AI

Bridging Big Data & AI with Spark Lance Connector

The Problem
The