How can I use Python's str.replace() to reduce addresses to only digits and a few letters?

I am trying to left-join and match addresses (from separate tables) on a reference index (coClean), which I created in Python (Jupyter Notebook) with the following code:

import pandas as pd
df1=pd.read_csv("/content/Addmatchdf1.csv")
df2=pd.read_csv("/content/Addmatchdf2.csv")

def cleanAddress(series):
    # lowercase, then strip letters, whitespace and commas so only digits remain
    return series.str.lower().str.replace(r"[a-z\s\,]","",regex=True)

df1["coClean"]=cleanAddress(df1["Address"])
# assumed: df2 gets the same key so the merge has a coClean column on both sides
df2["coClean"]=cleanAddress(df2["Address"])
df = pd.merge(df1,df2,on =['coClean'],how ='inner')

This produces coClean as the reference index.

Address_x                            coClean  Address_y
7 Pindara Bvd LANGWARRIN VIC 3910    73910    7 Pindara Blv,Langwarrin,VIC 3910
2a Manor St BACCHUS MARSH VIC 3340   23340    2a Manor Street,BACCHUS MARSH,VIC 3340
38 Sommersby Rd POINT COOK VIC 3030  383030   38 Sommersby Road,Point Cook,VIC 3030
17 Moira Avenue,Carnegie,Vic 3163    173163   17 Moira Avenue,Vic 3163
17 Moira Avenue,Vic 3163             173163   17 Newman Avenue,VIC 3163
17 Moira Avenue,Vic 3163             173163   17 Maroona Rd,Carnegie VIC 3163

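Clearly, the problem is that some addresses under the same postcode share the same house number. Because their reference index is identical, the join cannot tell them apart.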
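
The collision is easy to reproduce on the three 3163 addresses from the table. This is a minimal sketch using only the standard library: `digits_only_key` is a hypothetical per-string equivalent of the pandas call in cleanAddress, not the code from the question itself.

```python
import re

def digits_only_key(address):
    # Per-string equivalent of the original cleaner:
    # lowercase, then drop letters, whitespace and commas
    return re.sub(r"[a-z\s,]", "", address.lower())

# Three different streets that share house number 17 and postcode 3163
addresses = [
    "17 Moira Avenue,Carnegie,Vic 3163",
    "17 Newman Avenue,VIC 3163",
    "17 Maroona Rd,Carnegie VIC 3163",
]

keys = [digits_only_key(a) for a in addresses]
print(keys)            # ['173163', '173163', '173163']
print(len(set(keys)))  # 1, so the key cannot distinguish the streets
```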

How can I modify this function so that the reference index contains only

a. the house numbers
b. first four letters
c. postcode

so that the new reference for '23340' (2a manor street bacchus marsh vic 3340) becomes '2aman3340'? The returned list would then look like this:

coClean
7pind3910
2aman3340
38somm3030
17moir3163
17newm3163
17maro3163

I tried modifying the function to keep all letters and digits:

def cleanAddress(series):
    return series.str.lower().str.replace(r"[^a-z\d]","",regex=True)

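But keeping all the letters does not solve the problem, because the tables abbreviate differently: one writes street as st. and road as rd. A better strategy is therefore to rely on the house number, a few leading letters, and the postcode.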
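
A quick check on one pair of addresses from the table shows the mismatch. This is only a sketch: `alnum_key` is a hypothetical per-string stand-in for the `[^a-z\d]` version of cleanAddress.

```python
import re

def alnum_key(address):
    # Keep only lowercase letters and digits, as the [^a-z\d] pattern does
    return re.sub(r"[^a-z\d]", "", address.lower())

left = alnum_key("38 Sommersby Rd POINT COOK VIC 3030")
right = alnum_key("38 Sommersby Road,Point Cook,VIC 3030")

print(left)           # 38sommersbyrdpointcookvic3030
print(right)          # 38sommersbyroadpointcookvic3030
print(left == right)  # False: "rd" and "road" break the match
```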

Thank you for your kind advice.

Update: I replaced

def cleanAddress(series):
    return series.str.lower().str.replace(r"[a-z\s\,]","",regex=True)
df1["coClean"]=cleanAddress(df1["Address"])

with

def cleanAddress(series):
    coclen=""
    number_of_letters=0
    if series:
        for i in range(len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
            elif series[i].isalpha():
                number_of_letters+=1
                coclen+=series[i]
                if number_of_letters==4:
                    break
        for i in range(i,len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
    return coclen

Running it returns an error:

cleanAddress(df1["Address"])

The full error is as follows:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-b653a19f5638> in <module>()
----> 1 df1["coClean"]=cleanAddress(df1["Address"])

1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
   1328     def __nonzero__(self):
   1329         raise ValueError(
-> 1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty,a.bool(),a.item(),a.any() or a.all()."
   1332         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
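
The traceback points at the `if series:` check: the function body is written for a single string, but it receives the whole column, and pandas refuses to reduce a Series to one True/False value. The sketch below reproduces that and shows one way to call a per-string function on every row with Series.apply. `first_four_chars` is a hypothetical helper used only to illustrate the calling pattern.

```python
import pandas as pd

addresses = pd.Series([
    "7 Pindara Bvd LANGWARRIN VIC 3910",
    "2a Manor St BACCHUS MARSH VIC 3340",
])

# `if addresses:` asks for a single truth value for the whole column
raised = False
try:
    if addresses:
        pass
except ValueError:
    raised = True
print(raised)  # True

def first_four_chars(address):
    # Hypothetical per-string helper: here `if address` is an ordinary str check
    return address[:4].lower() if address else ""

# Series.apply calls the helper once per element instead of once per column
result = addresses.apply(first_four_chars).tolist()
print(result)  # ['7 pi', '2a m']
```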

Solution

import pandas as pd
df1 = pd.DataFrame({
    "Address_x":["7 Pindara Bvd LANGWARRIN VIC 3910","2a Manor St BACCHUS MARSH VIC 3340","38 Sommersby Rd POINT COOK VIC 3030","17 Moira Avenue,Carnegie,Vic 3163"],
    "Address_y":["7 Pindara Blv,Langwarrin,VIC 3910","2a Manor Street,BACCHUS MARSH,VIC 3340","38 Sommersby Road,Point Cook,VIC 3030","17 Moira Avenue,Vic 3163"]})

def cleanAddress(series):
    # loop over the addresses one by one instead of testing the whole Series
    cocleans=[]
    for address in series:
        number_of_letters=0
        coclean=""
        stop=len(address)  # where the first pass ended
        # first pass: keep digits and letters until four letters are collected
        for i in range(len(address)):
            if address[i].isnumeric():
                coclean+=address[i]
            elif address[i].isalpha():
                number_of_letters+=1
                coclean+=address[i]
                if number_of_letters==4:
                    stop=i+1
                    break
        # second pass: keep only the digits that follow (the postcode)
        for i in range(stop,len(address)):
            if address[i].isnumeric():
                coclean+=address[i]
        cocleans.append(coclean.lower())
    return cocleans

df1["coClean"]=cleanAddress(df1["Address_x"])
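
The same key can also be built by a small per-string helper and applied to both tables before merging. This is a sketch rather than the answerer's exact code: `address_key`, `left` and `right` are hypothetical names, the two small frames reuse addresses from the question, and `how="left"` is assumed from the question's description of a left join.

```python
import pandas as pd

def address_key(address):
    # Keep every digit, but only the first four letters, in their original order
    key = []
    letters = 0
    for ch in address.lower():
        if ch.isdigit():
            key.append(ch)
        elif ch.isalpha() and letters < 4:
            key.append(ch)
            letters += 1
    return "".join(key)

left = pd.DataFrame({"Address": [
    "17 Moira Avenue,Carnegie,Vic 3163",
    "38 Sommersby Rd POINT COOK VIC 3030",
]})
right = pd.DataFrame({"Address": [
    "38 Sommersby Road,Point Cook,VIC 3030",
    "17 Newman Avenue,VIC 3163",
    "17 Moira Avenue,Vic 3163",
]})

# Build the key on both sides so the merge has a shared column
left["coClean"] = left["Address"].map(address_key)
right["coClean"] = right["Address"].map(address_key)

merged = pd.merge(left, right, on="coClean", how="left", suffixes=("_x", "_y"))
print(merged[["Address_x", "coClean", "Address_y"]])
```

With these sample rows, 17 Moira Avenue pairs only with its counterpart and no longer with 17 Newman Avenue. The key is still a heuristic: extra digits in an address, such as a unit number, would end up in it as well.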
