How to match strings with a PySpark function
I have two tables.
Table 1: (comment_df)
| Date | Comment |
|:---- |:------:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday. |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. |
| 21/01/2020 | AMEX receipts from Euro Account delayed. |
Table 2: (code_df)
| Tag | Comment |
|:---- |:------:|
| EURO | Euro Account to HSBC |
| Natwest | Euro Account to Natwest |
| AMEX | AMEX payment |
The desired output is:
| Date | Comment | Tag |
|:---- |:------:| ----:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday. | EURO |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. | Natwest |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. | AMEX |
| 21/01/2020 | AMEX receipts from Euro Account delayed. | |
I could probably handle a few categories with .contains or a Matcher (nlp.vocab?). But I have 30+ categories, and the list will grow over time, so I'm hoping there is a PySpark function that can do this elegantly.
Cheers!
Solution
A left join may be appropriate:
```python
# Rename to avoid a column-name clash with comment_df.Comment
code_df = code_df.withColumnRenamed('Comment', 'Commentcode')

# Left join: keep every comment, attaching a Tag whenever the
# comment contains the code phrase
result = comment_df.join(
    code_df,
    comment_df.Comment.contains(code_df.Commentcode),
    'left'
).drop('Commentcode')
result.show(truncate=False)
```
```
+----------+---------------------------------------------------------------------+-------+
|Date      |Comment                                                              |Tag    |
+----------+---------------------------------------------------------------------+-------+
|20/01/2020|Transfer from Euro Account to HSBC account done on Monday.           |EURO   |
|20/01/2020|Brian initiated a Transfer from Euro Account to Natwest last Tuesday.|Natwest|
|21/01/2020|AMEX payment to Natwest was delayed for second time in a row.        |AMEX   |
|21/01/2020|AMEX receipts from Euro Account delayed.                             |null   |
+----------+---------------------------------------------------------------------+-------+
```
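To see what the contains-based join does row by row, here is a plain-Python sketch of the same logic (the names `code_map` and `tag_comment` are illustrative, not part of PySpark). One caveat it makes visible: if a comment contained more than one code phrase, the Spark join would emit one row per match, so the result may need a `dropDuplicates` or a group-by; this sketch simply returns the first match.

```python
from typing import Optional

# Illustrative mapping mirroring code_df: code phrase -> tag
code_map = {
    "Euro Account to HSBC": "EURO",
    "Euro Account to Natwest": "Natwest",
    "AMEX payment": "AMEX",
}

def tag_comment(comment: str) -> Optional[str]:
    """Return the tag of the first code phrase contained in the comment,
    or None when nothing matches (the 'null' rows of the left join)."""
    for phrase, tag in code_map.items():
        if phrase in comment:
            return tag
    return None
```

For example, `tag_comment("AMEX receipts from Euro Account delayed.")` returns `None`, matching the null row in the joined output above.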