Creating a Custom PySpark Stream Reader with PySpark 4.0
Skyler Myers

The Problem

PySpark natively supports many data sources, such as JDBC, ODBC, Kafka, and Delta. However, many of the more legacy systems, such as those that speak the JMS protocol, are not supported out of the box. This has traditionally required complex workarounds with a lot of bespoke code.

PySpark Custom Data Sources

Enter the new PySpark 4.0 custom data sources feature. On DBR 15.3+ (for streaming), you can implement the DataSource classes in PySpark to create your own custom reader much more easily than before. This lets you connect to systems that use, for instance, the JMS protocol for real-time alerting.
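As a rough illustration, a minimal sketch of a custom streaming reader built on the PySpark 4.0 Python Data Source API (pyspark.sql.datasource) might look like the following. The DataSource and DataSourceStreamReader classes and their methods are the actual API; the "fake_jms" source name, its schema, and the generated messages are hypothetical placeholders, not the implementation discussed in this deck.

```python
from pyspark.sql.datasource import (
    DataSource,
    DataSourceStreamReader,
    InputPartition,
)


class RangePartition(InputPartition):
    # Carries the offset range that one read task is responsible for.
    def __init__(self, start, end):
        self.start = start
        self.end = end


class FakeJmsDataSource(DataSource):
    @classmethod
    def name(cls):
        # The short name used in spark.readStream.format(...).
        return "fake_jms"

    def schema(self):
        return "message string, event_ts string"

    def streamReader(self, schema):
        return FakeJmsStreamReader()


class FakeJmsStreamReader(DataSourceStreamReader):
    def __init__(self):
        self.current = 0

    def initialOffset(self):
        # Offsets are JSON-serializable dicts that Spark checkpoints.
        return {"offset": 0}

    def latestOffset(self):
        # Pretend ten new messages arrived since the last microbatch.
        self.current += 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition spanning the whole range; a real reader could split it.
        return [RangePartition(start["offset"], end["offset"])]

    def read(self, partition):
        # Runs on executors; yields tuples matching the declared schema.
        for i in range(partition.start, partition.end):
            yield (f"message-{i}", "2024-01-01T00:00:00Z")


# Usage:
# spark.dataSource.register(FakeJmsDataSource)
# df = spark.readStream.format("fake_jms").load()
```

Once registered, the source behaves like any built-in streaming format, so the usual readStream/writeStream plumbing and checkpointing apply.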
What is JMS?

A message broker API written in Java (the Java Message Service). There are many implementations of it, with possibly the most popular being Apache ActiveMQ. Normally you would have to write a connector in Java and write to an intermediary source that PySpark can read from.

Connecting to ActiveMQ to Read JMS

ActiveMQ is one of the most popular implementations of the JMS protocol. There are many ways to connect, including using the STOMP protocol from Python. However, PySpark does not support STOMP as a data source.

Connect via Python + UDF

Install stomp.py and override the included ConnectionListener class methods with your own specifications, as in the sketch below. Turn these functions into a UDF.
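A minimal sketch of that listener override, assuming a local ActiveMQ broker on its default STOMP port 61613; the broker address, credentials, and the /queue/alerts destination are placeholders, and the frame-based callback signatures match recent stomp.py releases:

```python
import time

import stomp


class AlertListener(stomp.ConnectionListener):
    """Buffers JMS messages that ActiveMQ relays over STOMP."""

    def __init__(self):
        self.messages = []

    def on_message(self, frame):
        # frame.body is the message payload; frame.headers its metadata.
        self.messages.append(frame.body)

    def on_error(self, frame):
        print(f"Broker error: {frame.body}")


# Placeholder connection details for a local ActiveMQ broker.
conn = stomp.Connection([("localhost", 61613)])
listener = AlertListener()
conn.set_listener("alerts", listener)
conn.connect("admin", "admin", wait=True)
conn.subscribe(destination="/queue/alerts", id="1", ack="auto")

time.sleep(5)  # messages arrive asynchronously on the listener thread
conn.disconnect()
print(listener.messages)
```

Because stomp.py speaks the wire protocol directly, no Java connector or intermediary sink is needed; the listener simply buffers whatever the broker pushes, and those callbacks are what get wrapped in a UDF.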
However, UDFs are not the best fit for low-latency workloads.

I get a DAB-managed job. A larger cluster with Photon is automatically attached, based on the workspace-specific configuration in the DAB configuration YAML file and the anticipated environment workload. As opposed to t