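How to fix Apache Spark Structured Streaming taking a long time to print the output of the word count example
The following program runs a simple word count to test Spark Structured Streaming. I type words in one terminal and run the program in another. After I type a word, it takes 15 to 20 seconds for the output to appear in the second terminal. Is there a way to reduce this delay? It seems very long. Any help would be appreciated.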
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
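As a side note on the split step, here is a plain-Python sketch of how splitting on a single space can produce empty tokens. It assumes that Spark's split behaves like Python's str.split(" ") for these inputs, and the helper names and sample lines are made up for illustration. An empty line or a doubled space yields an empty-string token, which is a likely source of the blank word row in the console output.

```python
from collections import Counter


def split_words(line):
    """Split on a single space, keeping empty tokens."""
    return line.split(" ")


def count_words(lines):
    """Running word count over all lines, empty tokens included."""
    counts = Counter()
    for line in lines:
        counts.update(split_words(line))
    return counts


def count_nonempty_words(lines):
    """Same count, but empty tokens are dropped before counting."""
    return Counter(w for line in lines for w in split_words(line) if w)


# Hypothetical input: the middle entry stands for pressing Enter on an empty line.
sample = ["hello guys guys", "", "hello"]
with_blanks = count_words(sample)
without_blanks = count_nonempty_words(sample)
```

In the streaming job itself, a filter such as words.filter(words.word != "") before the groupBy should drop that row, though this has not been run against the setup above.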
Terminal where I open the port and type the words
C:\Program Files (x86)\Nmap>ncat -lvp 9999
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Listening on :::9999
Ncat: Listening on 0.0.0.0:9999
Ncat: Connection from 127.0.0.1.
Ncat: Connection from 127.0.0.1:44577.
apacheapaop
apache
spark
apache
hadoop
hello
world
hello
hello guys guys
hello
Output terminal where I am counting words
Batch: 2
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|apacheapaop|    1|
|      hello|    1|
|     apache|    2|
|      spark|    1|
|           |    5|
|     hadoop|    1|
+-----------+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|apacheapaop|    1|
|      hello|    2|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    5|
|     hadoop|    1|
+-----------+-----+
-------------------------------------------
Batch: 4
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|       guys|    2|
|apacheapaop|    1|
|      hello|    3|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    6|
|     hadoop|    1|
+-----------+-----+
-------------------------------------------
Batch: 5
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|       guys|    2|
|apacheapaop|    1|
|      hello|    4|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    6|
|     hadoop|    1|
+-----------+-----+
It takes 15-20 seconds to receive each batch of output in the terminal. How can I reduce this delay?
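One thing that may be worth trying, offered as a sketch and not a confirmed diagnosis: groupBy("word").count() is a stateful aggregation, and each micro-batch shuffles into spark.sql.shuffle.partitions partitions, which defaults to 200. On a single local machine that can mean hundreds of tiny tasks per batch. The version below lowers that setting. The value 4 and the helper names build_session and start_word_count are assumptions for illustration, and the pipeline is otherwise the same as in the question.

```python
# Assumption: the default of 200 shuffle partitions dominates per-batch time
# for this tiny local aggregation. The value 4 is an arbitrary illustration.
LOW_LATENCY_CONF = {"spark.sql.shuffle.partitions": "4"}


def build_session(app_name="StructuredNetworkWordCount", conf=LOW_LATENCY_CONF):
    """Create (or reuse) a SparkSession with the given settings applied."""
    # Imported lazily so the settings above can be inspected without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def start_word_count(spark, host="localhost", port=9999):
    """Start the word count query from the question and return its handle."""
    from pyspark.sql.functions import explode, split

    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )
    # Split the lines into words, then keep a running count per word.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    word_counts = words.groupBy("word").count()
    return (
        word_counts.writeStream.outputMode("complete").format("console").start()
    )
```

With ncat already listening on port 9999, this would be used as query = start_word_count(build_session()) followed by query.awaitTermination(). To check whether the batches are actually getting faster, query.lastProgress reports per-batch timings under durationMs.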