PySpark: creating a dataset with regexp_extract and col

How to create a dataset in PySpark using regexp_extract and col

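I need help creating a dataset that shows the first and last names of people living in Texas, together with the area code of their phone number (phone1). Below is the code I tried, followed by the dataset I am working with.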

from pyspark.sql.functions import regexp_extract,col
regexp_extract(col('first_name + last_name'),'.by\s+(\w+)',1))


first_name  last_name  company_name  address            city      county      state  zip    phone1
Billy       Thornton   Qdoba         8142 Yougla Road   Dallas    Fort Worth  TX     34218  689-956-0765
Joe         Swanson    Beachfront    9243 Trace Street  Miami     Dade        FL     56432  890-780-9674
Kevin       Knox       MSG           7683 Brooklyn Ave  New York  New York    NY     56987  850-342-1123
Bill        Lamb       AFT           6394 W Beast Dr    Houston   Galveston   TX     32804  407-413-4842
Raylene     Kampa      Hermar Inc    2046 SW Nylin Rd   Elkhart   Elkhart     IN     46514  574-499-1454

Solution

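Now I see what you are after. The phone numbers all follow the same dash-separated format, so they are easy to split: use split on phone1 and take the first element as the area code.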
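As a quick sanity check outside Spark, the same idea can be tried on a single string in plain Python. This is only an illustration: `area_code` is a hypothetical helper name, not part of PySpark. Note that Spark's `split` takes a regular-expression pattern while Python's `str.split` takes a literal separator; for a plain dash the two behave the same.

```python
def area_code(phone: str) -> str:
    """Return the part of a dash-separated phone number before the first dash."""
    return phone.split("-")[0]


# Phone numbers of the two Texas rows in the sample data
phones = ["689-956-0765", "407-413-4842"]
codes = [area_code(p) for p in phones]
print(codes)  # ['689', '407']
```

On the DataFrame itself, the column-based version of this is what follows.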

df.show()

+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
|first_name|last_name|company_name|          address|    city|    county|state|  zip|      phone1|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
|     Billy| Thornton|       Qdoba| 8142 Yougla Road|  Dallas|Fort Worth|   TX|34218|689-956-0765|
|       Joe|  Swanson|  Beachfront|9243 Trace Street|   Miami|      Dade|   FL|56432|890-780-9674|
|     Kevin|     Knox|         MSG|7683 Brooklyn Ave|New York|  New York|   NY|56987|850-342-1123|
|      Bill|     Lamb|         AFT|  6394 W Beast Dr| Houston| Galveston|   TX|32804|407-413-4842|
|   Raylene|    Kampa|  Hermar Inc| 2046 SW Nylin Rd| Elkhart|   Elkhart|   IN|46514|574-499-1454|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+

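from pyspark.sql.functions import split

# Keep only the Texas rows, then take the text before the first dash as the area code
df.filter("state = 'TX'") \
  .withColumn('area_code', split('phone1', '-')[0]) \
  .select('first_name', 'last_name', 'state', 'area_code') \
  .show()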

+----------+---------+-----+---------+
|first_name|last_name|state|area_code|
+----------+---------+-----+---------+
|     Billy| Thornton|   TX|      689|
|      Bill|     Lamb|   TX|      407|
+----------+---------+-----+---------+