Top Mistakes to Avoid in Streaming Applications
Vikas Reddy Aravabhumi, Staff Backline Engineer, Databricks
Databricks 2023

Job Title: Staff Backline Engineer
Employment: Databricks (since Nov 2019)
Areas of Expertise: Streaming, Delta, Spark SQL
Industry Experience: 12 years
Prior Experience: Specialized in deployments of Big Data pipelines
You can follow me (https:/ Myself

Today's Agenda
Common mistake scenarios:
Neglecting best practices during Kafka data retrieval
Poor cluster selection for handling volatile data
Underestimating the significance of checkpoints in query restarts
Problems with Trigger.Once()
Lack of attention to error mitigation measures
Disregarding optimizations for stateful streams
Neglecting precautionary measures during stateful stream restarts
Neglecting best practices with foreachBatch()

Streaming Application
Multiple phases of end-to-end flow

Mistakes at Data Ingestion Layer
How Many of Your Streaming Applications Fetch Data from Kafka?

1. Data Ingestion - Performance
Scenario: Only the essential Kafka configurations were provided. Even with
a higher number of executor cores in the cluster, the job that fetches data from Kafka consistently launches the same number of tasks.

Issue:
1 Kafka partition = 1 task (1 core)
Idle cores = (Total number of cores - Number of Kafka partitions)
The cluster's computing capacity is not fully used.

Recommendation: minPartitions config
When cluster executor cores exceed Kafka partitions, Kafka partitions are split virtually, giving more parallelism and better performance.
.option("minPartitions", sc.defaultParallelism)

When to avoid:
Multiple streams in 1 application: Balance core allocation for