Simplify Data Ingest and Egress with the New Python Data Source API
Craig Lukasik

Agenda
- Origin story
- Concepts & definitions
- Micro view (usage)
- Macro view (enterprise)
- Demo
- ?s

Origin Story: Customer Feedback
"We write a lot of cool REST APIs, including for streaming use cases, and would love to just use them as a data source in Databricks instead of writing all the plumbing code ourselves."

Factors that lead to change include:
- Databricks mission alignment
- Impact/benefit across the broad customer base

Origin Story: Born in Spark 4 (DBR 15.2+)
https://spark.apache.org/news/spark-4.0.0-preview2.html
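To make the shape of the API concrete before the definitions, here is a minimal batch source. Only the base classes DataSource and DataSourceReader and the call spark.dataSource.register come from the actual pyspark.sql.datasource API in Spark 4 / DBR 15.2+; the rest_api name, the url option, and the endpoint are illustrative assumptions standing in for the customer's REST plumbing. A sketch, assuming a Databricks notebook where spark is already defined:

from pyspark.sql.datasource import DataSource, DataSourceReader

class RestApiDataSource(DataSource):
    # Illustrative source exposing a REST endpoint as a batch table.
    @classmethod
    def name(cls):
        return "rest_api"               # short name for spark.read.format(...)

    def schema(self):
        return "id INT, payload STRING"  # default schema as a DDL string

    def reader(self, schema):
        return RestApiReader(self.options)

class RestApiReader(DataSourceReader):
    def __init__(self, options):
        # "url" is a hypothetical option; a real reader would also handle
        # auth, retries, and pagination -- the plumbing that previously had
        # to be rewritten in every pipeline.
        self.url = options.get("url", "https://example.com/api/items")

    def read(self, partition):
        # Stand-in for an HTTP GET against self.url; rows are plain tuples.
        for i in range(3):
            yield (i, f"record {i} from {self.url}")

# Register once per session, then use it like any built-in format.
spark.dataSource.register(RestApiDataSource)
df = spark.read.format("rest_api").option("url", "https://example.com/api/items").load()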
Batch vs. Streaming

Batch processing involves processing a large volume of data at once. Data is collected over a period, stored, and then processed in bulk. This method is suitable for scenarios where real-time processing is not required, and it allows for efficient handling of large datasets.

Streaming processing involves continuously or incrementally processing data as it arrives. This method is suitable for scenarios where near-real-time processing is required. It allows for immediate insights and actions based on the latest data, making it ideal for applications like monitoring, alerting, and real-time analytics.
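In user code the distinction shows up only as read versus readStream. A sketch, assuming the hypothetical rest_api source from the earlier block is registered and also implements a stream reader (one is sketched after the Concepts section below):

# Batch: collect what is there now, process it in bulk, finish.
batch_df = (spark.read.format("rest_api")
            .option("url", "https://example.com/api/items")
            .load())
batch_df.count()

# Streaming: keep polling and process records incrementally as they arrive.
stream_df = (spark.readStream.format("rest_api")
             .option("url", "https://example.com/api/items")
             .load())
query = (stream_df.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())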
Structured Streaming: Concepts

Partitions ("division of labor"): A partition is a division of a data stream into smaller, manageable segments. Benefit: parallelism.

Offsets (a "To-Do" list): For each partition, the offset log contains the end offset that this batch will process up to. The starting offset is implicitly the end offset from the previous batch. Benefit: recovery of failed tasks; helps avoid job failure.

Commits (the "Done Log"): Records the ending offset that was processed upon micro-batch completion. Benefit: recovery of a failed or stopped job.

Micro
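In the micro (usage) view, these three concepts map one-to-one onto methods of DataSourceStreamReader. A minimal sketch, assuming a toy source whose offset is a single integer cursor; CounterDataSource, CounterStreamReader, and RangePartition are illustrative names, while the base classes and method names come from pyspark.sql.datasource:

from pyspark.sql.datasource import (DataSource, DataSourceStreamReader,
                                    InputPartition)

class RangePartition(InputPartition):
    # Partitions: the "division of labor" -- each task gets one slice.
    def __init__(self, start, end):
        self.start = start
        self.end = end

class CounterStreamReader(DataSourceStreamReader):
    def __init__(self):
        self.current = 0

    def initialOffset(self):
        # Offsets: the "To-Do" list. A JSON-serializable dict; a batch's
        # start is implicitly the previous batch's end offset.
        return {"offset": 0}

    def latestOffset(self):
        # End offset that the next micro-batch will process up to.
        self.current += 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # Split the offset range so tasks run in parallel, and so a failed
        # task can be retried from its range without failing the whole job.
        mid = (start["offset"] + end["offset"]) // 2
        return [RangePartition(start["offset"], mid),
                RangePartition(mid, end["offset"])]

    def read(self, partition):
        # Each task reads only its own slice, yielding rows as tuples.
        for i in range(partition.start, partition.end):
            yield (i, f"value-{i}")

    def commit(self, end):
        # Commits: the "Done Log". Called once the micro-batch ending at
        # `end` completes, so a stopped job can resume from this point.
        pass

class CounterDataSource(DataSource):
    @classmethod
    def name(cls):
        return "counter_stream"        # illustrative name

    def schema(self):
        return "id INT, value STRING"

    def streamReader(self, schema):
        return CounterStreamReader()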