Variant Data Type: Making Semi-Structured Data Fast and Simple
Gene Pang, Chenhao Li
2024-06-13
2024 Databricks Inc. All rights reserved

OUTLINE
- Motivation
- Variant Data Type Overview
- Using Variant
- Deep Dive: Variant Binary Format
- Performance

Semi-Structured Data in the Lakehouse
- Semi-structured data is partially structured
  - Doesn't fully adhere to relational table model
  - Schema may be unknown, or incompatible, or evolving
- JSON is very popular semi-structured data format
  - Flexible, and supported in most programming languages
- How do we store and process semi-structured data in the lakehouse?

Option 1: Schema Inference
- On ingestion, read data and infer schema (structs, arrays, scalars, etc.)
- Read queries use the relational schema
- Performance same as structured/relational data

Challenges with Schema Inference: TOO STRICT
- Inference must determine a schema that works with all the data
  - If data is diverse, can produce huge, but sparse schemas
- Schema enforcement is strict
  - Incoming data must be compatible with schema
  - Accessing missing field may produce exceptions

Option 2: Treat Data as String
- On ingestion, data is stored as string
  - No schema enforcement on ingestion
- Read queries parse the string during execution
- Maximum flexibility for any data

- Parsing String in queries is slow
  - Typically, data is read more than it is written, so expensive parsing is repeated for every q