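How do I use Python's str.replace() to reduce an address to digits and a few letters?
I am trying to match left and right addresses (from separate tables) on a reference index (coClean), which I created in #Python #JupyterNotebook with the following code: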
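import pandas as pd
import re

df1=pd.read_csv("/content/Addmatchdf1.csv")
df2=pd.read_csv("/content/Addmatchdf2.csv")

def cleanAddress(series):
    # Lower-case, then strip letters, whitespace and commas so only digits remain.
    # regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching.
    return series.str.lower().str.replace(r"[a-z\s\,]","",regex=True)

df1["coClean"]=cleanAddress(df1["Address"])
df2["coClean"]=cleanAddress(df2["Address"])
df = pd.merge(df1,df2,on =['coClean'],how ='inner')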
Address_x | coClean | Address_y |
---|---|---|
7 Pindara Bvd LANGWARRIN VIC 3910 | 73910 | 7 Pindara Blv,Langwarrin,VIC 3910 |
2a Manor St BACCHUS MARSH VIC 3340 | 23340 | 2a Manor Street,BACCHUS MARSH,VIC 3340 |
38 Sommersby Rd POINT COOK VIC 3030 | 383030 | 38 Sommersby Road,Point Cook,VIC 3030 |
17 Moira Avenue,Carnegie,Vic 3163 | 173163 | 17 Moira Avenue,Vic 3163 |
17 Moira Avenue,Vic 3163 | 173163 | 17 Newman Avenue,VIC 3163 |
17 Moira Avenue,Vic 3163 | 173163 | 17 Maroona Rd,Carnegie VIC 3163 |
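Clearly, the problem is that some addresses under the same postcode share the same house number. Because their reference index is then identical, the join cannot tell them apart.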
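The collision described above can be reproduced on plain strings. The sketch below applies the same regular expression as cleanAddress to the three postcode-3163 addresses from the Address_y column, using only the standard re module. The helper name digits_only_key is hypothetical and not part of the original code.

```python
import re

def digits_only_key(address: str) -> str:
    """Lower-case the address, then drop letters, whitespace and commas."""
    return re.sub(r"[a-z\s,]", "", address.lower())

# Three different streets that share a house number and a postcode.
addresses = [
    "17 Moira Avenue,Vic 3163",
    "17 Newman Avenue,VIC 3163",
    "17 Maroona Rd,Carnegie VIC 3163",
]

keys = [digits_only_key(a) for a in addresses]
print(keys)  # ['173163', '173163', '173163']
```

All three addresses collapse to the same key, so an inner join on that key pairs every one of them with every other.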
Is there a way to build the reference index from:
a. the house number
b. the first four letters
c. the postcode
So that the new reference for '23340' (2a manor street bacchus marsh vic 3340) becomes '2aman3340'? The resulting list would look like this:
coClean |
---|
7pind3910 |
2aman3340 |
38somm3030 |
17moir3163 |
17newm3163 |
17maro3163 |
def cleanAddress(series):
    # Keep every letter and digit; drop everything else.
    return series.str.lower().str.replace(r"[^a-z\d]","",regex=True)
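But keeping all the letters does not solve the problem, because the tables abbreviate differently: one has "street" where the other has "st", and "road" where the other has "rd". A better strategy is therefore to rely on the house number, a few leading letters, and the postcode.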
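One possible way to build such a key for a single address string is sketched below with the standard re module. It is an illustration under stated assumptions, not a definitive implementation: it assumes the house number comes first, the street name supplies the first letters, and the only digits after the fourth letter belong to the postcode. The function name make_key is hypothetical.

```python
import re

def make_key(address: str) -> str:
    """House number, first four letters, then the remaining digits (postcode)."""
    # Keep only lower-case letters and digits,
    # e.g. "2amanorstbacchusmarshvic3340".
    cleaned = re.sub(r"[^a-z\d]", "", address.lower())
    # Consume characters up to and including the fourth letter.
    head = re.match(r"(?:\d*[a-z]){4}", cleaned)
    if head is None:
        # Fewer than four letters: nothing to truncate.
        return cleaned
    rest = cleaned[head.end():]
    # After the fourth letter, keep digits only.
    return head.group(0) + re.sub(r"\D", "", rest)

print(make_key("2a Manor St BACCHUS MARSH VIC 3340"))     # 2aman3340
print(make_key("38 Sommersby Rd POINT COOK VIC 3030"))    # 38somm3030
print(make_key("38 Sommersby Road,Point Cook,VIC 3030"))  # 38somm3030
```

Because only the first four letters are kept, "Rd" versus "Road" no longer affects the key. With pandas, a function like this could be applied row by row, for example df1["Address"].map(make_key).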
Thank you for your kind advice.
Update: I replaced
def cleanAddress(series):
    return series.str.lower().str.replace(r"[a-z\s\,]","",regex=True)
df1["coClean"]=cleanAddress(df1["Address"])
with
def cleanAddress(series):
    coclen=""
    number_of_letters=0
    if series:
        for i in range(len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
            elif series[i].isalpha():
                number_of_letters+=1
                coclen+=series[i]
                if number_of_letters==4:
                    break
        for i in range(i,len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
    return coclen
Running it raises an error:
cleanAddress(df1["Address"])
The full error is as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-b653a19f5638> in <module>()
----> 1 df1["coClean"]=cleanAddress(df1["Address"])
1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
1328 def __nonzero__(self):
1329 raise ValueError(
-> 1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty,a.bool(),a.item(),a.any() or a.all()."
1332 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
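Solution
The error occurs because the line "if series:" asks pandas for the truth value of the whole Series, which is ambiguous. The function was written for a single string but receives an entire column. Looping over the addresses inside the function and building one key per address avoids this: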
import pandas as pd

df1 = pd.DataFrame({
    "Address_x":["7 Pindara Bvd LANGWARRIN VIC 3910","2a Manor St BACCHUS MARSH VIC 3340","38 Sommersby Rd POINT COOK VIC 3030","17 Moira Avenue,Carnegie,Vic 3163"],
    # The last Address_y value was truncated in the source; it is taken from the table in the question.
    "Address_y":["7 Pindara Blv,Langwarrin,VIC 3910","2a Manor Street,BACCHUS MARSH,VIC 3340","38 Sommersby Road,Point Cook,VIC 3030","17 Moira Avenue,Vic 3163"],
})
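def cleanAddress(series):
    # Build one key per address: house number, first four letters, postcode digits.
    cocleans=[]
    for address in series:
        number_of_letters=0
        coclean=""
        rest=len(address)  # where the digits-only scan starts
        for i in range(len(address)):
            if address[i].isnumeric():
                coclean+=address[i]
            elif address[i].isalpha():
                number_of_letters+=1
                coclean+=address[i]
                if number_of_letters==4:
                    rest=i+1
                    break
        # After the fourth letter, keep digits only (the postcode).
        # Starting at rest avoids re-reading the last character when
        # an address has fewer than four letters.
        for i in range(rest,len(address)):
            if address[i].isnumeric():
                coclean+=address[i]
        cocleans.append(coclean.lower())
    return cocleans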
df1["coClean"]=cleanAddress(df1["Address_x"])
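Before merging on coClean, it may be worth checking that the new keys are actually unique within each table, since any repeated key still multiplies rows in the join. The sketch below uses collections.Counter from the standard library. The helper duplicate_keys is hypothetical, and the sample keys are modeled on the question's tables.

```python
from collections import Counter

def duplicate_keys(keys):
    """Return the keys that occur more than once, with their counts."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}

# Digits-only keys versus keys with the first four letters included.
old_keys = ["73910", "23340", "383030", "173163", "173163", "173163"]
new_keys = ["7pind3910", "2aman3340", "38somm3030",
            "17moir3163", "17newm3163", "17maro3163"]

print(duplicate_keys(old_keys))  # {'173163': 3}
print(duplicate_keys(new_keys))  # {}
```

On a DataFrame column, df1["coClean"].duplicated() gives a comparable per-row check.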