How to fix a TypeError when working with arrays in PySpark
I am trying to compute the dot product (sum of element-wise products) of 'user_features' and 'movie_features':
+------+-------+--------------------+--------------------+
|userId|movieId|       user_features|      movie_features|
+------+-------+--------------------+--------------------+
|    18|      1|   [0.0,0.5,0.0,0...|             [1,1...|
|    18|      2|            [0.1,0...|                 ...|
|    18|      3|       [0.2,0.3,0...|             [0,1...|
|    18|      4|       [0.0,0.1,1...|                 ...|
+------+-------+--------------------+--------------------+
Data types:
df.printSchema()
_____________________________________________
root
|-- userId: integer (nullable = true)
|-- movieId: integer (nullable = true)
|-- user_features: array (nullable = false)
| |-- element: double (containsNull = true)
|-- movie_features: array (nullable = false)
| |-- element: float (containsNull = true)
None
I am using this:
class Solution:
    """
    Data reading, pre-processing...
    """
    @udf("array<double>")
    def miltiply(self, x, y):
        if x and y:
            return [float(a * b) for a, b in zip(x, y)]

    def get_dot_product(self):
        df = self.user_DF.crossJoin(self.movies_DF)
        output = df.withColumn("zipxy", self.miltiply("user_features", "movie_features")) \
                   .withColumn('sumxy', sum([F.col('zipxy').getItem(i) for i in range(20)]))
which gives the following error:
TypeError: Invalid argument, not a string or column: <__main__.Solution instance at 0x000000000A777EC8> of type <type 'instance'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
What am I missing? I am using a udf for this because I am on Spark 1.6 and therefore cannot use the aggregate or zip_with functions.
Workaround
If you can use numpy, then:
df = spark.createDataFrame([(18, 1, [1, 1], [1, 1])]).toDF('userId', 'movieId', 'user_features', 'movie_features')

import numpy as np
df.rdd.map(lambda x: (x[0], x[1], x[2], x[3], float(np.dot(np.array(x[2]), np.array(x[3]))))) \
  .toDF(df.columns + ['dot']).show()
+------+-------+-------------+--------------+---+
|userId|movieId|user_features|movie_features|dot|
+------+-------+-------------+--------------+---+
| 18| 1| [1,1]| [1,1]|2.0|
+------+-------+-------------+--------------+---+
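As a side note on the TypeError itself: decorating an *instance method* with `@udf` is what triggers it. The wrapped function still binds like a regular method, so Python passes `self` (the `Solution` instance) as the first "column" argument, which Spark rejects. Defining the function at module level avoids the problem. Below is a minimal sketch of that idea; the names `multiply_arrays`, `dot` and `multiply_udf` are mine, not from the original post:

```python
# The asker's multiply logic as a plain module-level function (no `self`):
def multiply_arrays(x, y):
    """Element-wise product of two equal-length lists; None if either is missing."""
    if x and y:
        return [float(a * b) for a, b in zip(x, y)]

# Registering it at module level means no `self` gets bound as a column.
# Intended usage inside a Spark job (requires a running SparkSession):
# from pyspark.sql import functions as F
# multiply_udf = F.udf(multiply_arrays, "array<double>")
# df = df.withColumn("zipxy", multiply_udf("user_features", "movie_features"))

# The dot product is then just the sum of the element-wise products:
def dot(x, y):
    products = multiply_arrays(x, y)
    return float(sum(products)) if products else None
```

With this layout the per-element UDF and the final sum can also be collapsed into a single UDF returning a `double`, avoiding the hard-coded `range(20)` in the original `sumxy` column.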