Vector Data Lake
Do you need (more than) a vector database in 2023?

Databricks 2023
PhD student at Stanford University
https:/
Lance format
Co-author pandas
changhiskhan

Generative AI is missing a storage layer
GPT-4, LLaMA, PaLM, Alpaca
LangChain, LlamaIndex, AutoGPT
- Vector databases only deal with vectors
- Pgvector and similar tools do not scale
- No effective solution at all for multi-modal data
LLMs, VectorDB, LangChain, LlamaIndex

Multi-modal data: Images, Point-clouds, Video, Audio
Gen AI data flywheel
- Storage and compute costs
- Training I/O performance
- ML debugging
- ML analytics
Flexible retrieval: Vectors, Keywords, SQL
Model
LLM data: Vectors, Documents, Metadata

State of the world for structured data
OLTP database, ETL, Blob Store, Data Lake, Lakehouse for OLAP
Desired Properties:
- Fast writes
- Strong consistency
- "Operational" SQL: e.g. selecting a row
Desired Properties:
- Efficient bulk updates
- Fast full scans
- Decouple compute/storage

State of the world for unstructured data
.png, .pdf, .txt, .mp3, .mp4
Embeddings

Our observations
TLDR: current cloud vector databases closely resemble classic OLTP stores, optimized for operational workloads.
- Strong focus on write TPS and consistency: Elasticsearch and Milvus both offer strong write consistency. All vector databases focus on write throughput.
- They excel at point updates and offer low latency for point reads.
- Integrated compute-storage: they require heavy indexing that needs to be kept live in always-on RAM/SSD. A "Trino for embeddings" doesn't really exist.

But what about OLAP workloads?
What even are OLAP workloads for embeddings?
- Recommendation models: batch update embeddings, batch update recommendations
- Data analytics: which video genres contain the most inappropriate videos?
- ML training on embeddings

Characteristics of these workloads
Observations:
- (Very large) batched nearest neighbor lookup/range search
- Latency is not a major concern
- Data is usual