No More Runtime Setup! Let's Bundle, Distribute, Deploy, Scale LLMs Seamlessly with Ollama Operator
Fanshi Zhang, Senior Software Engineer, DaoCloud
Kubernetes

The Challenge
Deploying and scaling LLMs is complex.

Model Distributing 101 - Overview of steps
- Train pre-trained models
- Train LoRA
- Merge weights
- Quantize
(a sketch of the merge and quantize steps follows below)
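The last two steps are commonly done with open-source tooling. A minimal sketch, assuming a PEFT-style LoRA adapter and a local llama.cpp checkout; every path and model name here is illustrative, and the llama.cpp script and binary names follow recent checkouts:

    # Merge a LoRA adapter into its base model (illustrative paths).
    python - <<'EOF'
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("./base-model")
    merged = PeftModel.from_pretrained(base, "./lora-adapter").merge_and_unload()
    merged.save_pretrained("./merged-model")
    EOF

    # Convert the merged weights to GGUF, then quantize to 4-bit.
    python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf
    llama.cpp/llama-quantize model-f16.gguf model-q4_K_M.gguf q4_K_M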
Model Distributing 101 - Ways to deploy models
- Mount with volumes
- Bundle into images
(a volume-mount sketch follows below)
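With the volume approach, weights typically live on a shared PersistentVolume that inference pods mount read-only. A minimal sketch, assuming a PersistentVolumeClaim named llm-weights that has already been populated with model files; the image and names are illustrative:

    # Mount pre-downloaded weights into an inference pod.
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: inference
    spec:
      containers:
        - name: server
          image: ghcr.io/example/inference-server:latest  # hypothetical image
          volumeMounts:
            - name: weights
              mountPath: /models
              readOnly: true
      volumes:
        - name: weights
          persistentVolumeClaim:
            claimName: llm-weights  # assumed pre-populated with model files
    EOF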
Model Distributing 101 - Challenges and complexities: weights and nodes
- Weights are large: Llama 2 has roughly 83 GB of weight and parameter files.
- Weights must be distributed across the deploying worker nodes: when the inference server is distributed, each worker node requires its own dedicated copy of the weights.
- Caching and cold boot matter for serverless and edge scenarios: for runtimes like WasmEdge, IoT devices, and Ray, rolling out model updates is a challenge.
Model Serving 101 - Bringing models to production
- Complex dependencies: managing dependencies across environments can be tedious and error-prone.
- Environment setup: setting up environments with Python, CUDA, and more is complex and time-consuming.
- Distribution overhead: distributing large models efficiently remains a significant challenge.

Model Serving 101 - NVIDIA Triton
This is how Triton Inference Server can be used to serve models; the sketch below shows the usual workflow.
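A minimal Triton sketch, assuming an ONNX model and the official container image; the model name, paths, and image tag are illustrative:

    # Triton expects a model repository: one directory per model, each with
    # a config.pbtxt and numbered version subdirectories:
    #
    #   model_repository/
    #   └── my_model/
    #       ├── config.pbtxt
    #       └── 1/
    #           └── model.onnx
    #
    # Serve the repository with the official Triton container.
    docker run --gpus all --rm \
      -p 8000:8000 -p 8001:8001 -p 8002:8002 \
      -v "$PWD/model_repository:/models" \
      nvcr.io/nvidia/tritonserver:24.05-py3 \
      tritonserver --model-repository=/models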
Model Serving 101 - TorchServe
This is how TorchServe can be used to serve models; the sketch below shows the usual workflow.
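A minimal TorchServe sketch, assuming a TorchScript-serialized model and a custom handler file; all names and paths are illustrative:

    # Package the model into a .mar archive, then start the server.
    pip install torchserve torch-model-archiver
    mkdir -p model_store

    torch-model-archiver \
      --model-name my_model \
      --version 1.0 \
      --serialized-file model.pt \
      --handler my_handler.py \
      --export-path model_store

    torchserve --start \
      --model-store model_store \
      --models my_model=my_model.mar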
Ollama
Universal solution to model bundling, distributing, serving, and more.
- Lightweight
- Universal & Compatible

Ollama - Bundling Models
Universal bundling (see the Modelfile sketch at the end of this section).

Ollama - LoRA, Customizing, Prompting
Integrating LoRA for training (covered in the same Modelfile sketch below).

Ollama - Distributing
Just like OCI images: ollama push <model name> -> OCI Registry. OCI ready.
OCI-Compatible Distribution: Ollama uses OCI-compatible formats for easy integration with existing container workflows (push sketch below).

Ollama - Serving
One sim
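Bundling and LoRA customization on the slides above both go through a Modelfile. A minimal sketch; the base model, adapter path, prompt, and parameters are all illustrative:

    # Modelfile: bundle a base model with a LoRA adapter, a system prompt,
    # and sampling parameters (adapter path is illustrative).
    cat > Modelfile <<'EOF'
    FROM llama2
    ADAPTER ./my-lora-adapter
    PARAMETER temperature 0.7
    SYSTEM """You are a concise assistant."""
    EOF

    # Build the bundle locally, then try it out.
    ollama create my-model -f Modelfile
    ollama run my-model "Hello!"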
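Distribution then reuses the OCI-style push from the slide above. A short sketch, assuming an ollama.com registry account named example:

    # Tag the model under a registry namespace, then push it,
    # just like tagging and pushing an OCI image.
    ollama cp my-model example/my-model
    ollama push example/my-model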