How to create a summary table with computed values in PySpark
I have a DataFrame that looks like this:
+--------------------+---------------------+-------------+------------+-----+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|total_amount|isDay|
+--------------------+---------------------+-------------+------------+-----+
| 2019-01-01 09:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 21:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
| 2019-01-01 10:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 22:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
+--------------------+---------------------+-------------+------------+-----+
I want to create a summary table that computes trip_rate for all night trips and for all day trips (the total_amount column divided by the trip_distance column). The final result should look like this:
+------------+-----------+
| day_night  | trip_rate |
+------------+-----------+
| Day        | 1.33      |
| Night      | 1.92      |
+------------+-----------+
Here is what I am trying to do:
from pyspark.sql import functions as F

df2 = spark.createDataFrame(
    [
        ('2019-01-01 09:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 21:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
        ('2019-01-01 10:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 22:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
    ],
    ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance', 'total_amount', 'day_night'],  # add your columns label here
)
day_trip_rate = df2.where(df2.day_night == 'Day').withColumn("trip_rate", F.sum("total_amount") / F.sum("trip_distance"))
night_trip_rate = df2.where(df2.day_night == 'Night').withColumn("trip_rate", F.sum("total_amount") / F.sum("trip_distance"))
I don't think I'm even approaching this the right way, and I get this error:
raise AnalysisException(s.split(': ',1)[1],stackTrace)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and 'tpep_pickup_datetime'
Solution
Group by day_night, aggregate the ratio of the two sums, then map the flag to a label:
from pyspark.sql import functions as F

df2.groupBy("day_night").agg(F.round(F.sum("total_amount") / F.sum("trip_distance"), 2).alias("trip_rate")) \
    .withColumn("day_night", F.when(F.col("day_night") == "true", "Day").otherwise("Night")).show()
+---------+---------+
|day_night|trip_rate|
+---------+---------+
|      Day|     1.33|
|    Night|     1.92|
+---------+---------+
Without rounding:
df2.groupBy("day_night").agg((F.sum("total_amount") / F.sum("trip_distance")).alias("trip_rate")) \
    .withColumn("day_night", F.when(F.col("day_night") == "true", "Day").otherwise("Night")).show()
(You have day_night in the code that constructs df2, but isDay in the table you displayed. Here I am treating the field name as day_night.)
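To sanity-check the expected numbers without a Spark session, the same per-group ratio of sums can be reproduced in plain Python. This is only a sketch of the arithmetic, not part of the original answer: the helper name summarize and the trimmed three-field row layout are assumptions made for illustration, and the values are the sample rows from the question.

```python
from collections import defaultdict

# Sample rows from the question, trimmed to the fields the calculation needs:
# (trip_distance, total_amount, day_night flag), all strings as in df2.
rows = [
    ("1.5", "2.00", "true"),
    ("2.6", "5.00", "false"),
    ("1.5", "2.00", "true"),
    ("2.6", "5.00", "false"),
]


def summarize(rows):
    """Return {label: sum(total_amount) / sum(trip_distance)} per group.

    Hypothetical helper that mirrors groupBy("day_night") followed by
    sum("total_amount") / sum("trip_distance").
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # label -> [amount_sum, distance_sum]
    for distance, amount, flag in rows:
        label = "Day" if flag == "true" else "Night"
        totals[label][0] += float(amount)
        totals[label][1] += float(distance)
    return {label: sums[0] / sums[1] for label, sums in totals.items()}


trip_rates = summarize(rows)
for label, rate in trip_rates.items():
    print(label, round(rate, 2))
```

Note that this is a ratio of sums (4.00 / 3.0 for day trips, 10.00 / 5.2 for night trips), which in general differs from averaging the per-trip rates.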