LanceDB: A Complete Search and Analytics Store for Serving Production-Scale AI Applications (PDF, 21 pages)
Lakehouse Architecture for AI Data
Search, Analytics, Processing, Training
Chang She and Zhidong Qu
2025-06-10

Who we are
  Chang She
    CEO/Co-founder, LanceDB
    Co-author of pandas
    2 decades building data tools
    Building LanceDB for all the AI data that doesn't fit neatly into pandas dataframes
  Zhidong (Zero) Qu
    Sr. Software Engineer, Databricks
    Founding engineer: Mosaic AI Vector Search and Feature Store
    Project Lead: Storage Optimized Mosaic AI Vector Search

Multimodal Lakehouse for AI data
  New data infrastructure challenges
  Lance format
  Diverse workloads: analytics, processing, training

What modern architecture for AI data infrastructure looks like

Storage-optimized Mosaic AI Vector Search
Cloud-native Vector Search at Massive Scale using Lakehouse Architecture

Challenges with first-gen Vector DBs
  Coupled Storage & Compute
    Compute nodes and the disks attached to them act as the persistent storage layer
    Stateful system
    Difficult to operate at scale
  Vector indexes are memory resident
    Full precision embeddings are huge!
    Extremely high serving cost
  Scatter-gather queries
    Inherited from traditional search architecture
    Vector indexes are coupled with immutable data fragments
    Query performance drops significantly as the number of data segments scales
    Merging segments involves an expensive operation to rebuild the index

Cloud-native Vector Search
  Decoupled Storage & Compute
    Vector indexes and raw data fragments live in durable cloud object storage
    Query nodes are stateless and only cache data in local SSD/memory when needed
  Decoupled Ingestion & Serving Compute
    Ingestion runs on a fully distributed, in-house vector indexing engine built on Spark
    Query runs on lightweight Rust servers

Cloud-native Vector Search Architecture
  Unparalleled scalability at far lower cost by leveraging cloud