Python pandas/lists 算法性能设置numba / geopy / 半正弦距离优化时间用法

如何解决Python pandas/lists 算法性能设置numba / geopy / 半正弦距离优化时间用法

我编写了一个简短的脚本，其中包含 2 个 Pandas 数据框，一个包含城市资产的纬度和经度，另一个包含城市内便利设施的纬度和经度。

资产数据框有超过 160,000 行。对于不同的便利设施（全科医生服务、药房、学校等），我的代码最终将遍历一系列 csv 文件，并创建这些便利设施每个位置的经纬度数据框。

我目前正在遍历每个资产并使用 geopy.distance.distance() 创建从每个资产到每个便利设施的距离列表，并汇总到每个便利设施的距离的最小值。例如，对于包含 200 家药店位置的 csv 文件，我想找到最近药店与我的每个资产的距离。

目前我正在使用列表理解（两次）。这是我的代码：

# dataframe of amenity lats and lons
GP_Lats_lons = pd.read_csv('GPHealthCentreClinic.csv',usecols=(['lat','long']))

#dataframe of asset lats and lons (using database connection)
assets_df = pd.read_sql_query(query,conn) 

def get_distance(lat,lon):
    distances = [distance.distance((lat,lon),(x['lat'],x['long'])).km for y,x in GP_Lats_lons.iterrows()]  
    return min(distances)

assets_df['nearest_gp_km'] = [get_distance(x['Latitude'],x['Longitude']) for y,x in assets_df[['Latitude','Longitude']].iterrows()]

有没有人对如何优化上述代码的数据结构和算法有任何想法？列表理解是这样做的好/坏方法吗？ pd.DataFrame.apply() 中的 lambda 函数是不是更好的方法。

谢谢

解决方法

TLDR 的长答案；

使用组合而不是排列来驱动需要计算距离的地点集
使用 numba 和 haversine 函数比 geopy.distance 快 10 倍

设置

import requests
import pandas as pd
searchendpoint = "https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations"
# get all healthcare facilities in Herefordshire
dfhc = pd.concat([pd.json_normalize(requests
                             .get(searchendpoint,params={"PostCode":f"HR{i}","Status":"Active"})
                             .json()["Organisations"]) 
           for i in range(1,10)]).reset_index(drop=True)

# get geo data for all postcodes associated with healthcare facilities
# API batch restriction of 100 post codes per call
dfgeo = (pd.concat([pd.json_normalize(
    requests.post("http://api.postcodes.io/postcodes",json={"postcodes":dfhc.PostCode.unique().tolist()[i:i+100]}).json(),record_path="result")
 for i in range(0,len(dfhc.PostCode.unique()),100)])
 .rename(columns=lambda c: c.replace("result.",""))
 .reset_index(drop=True)
 .assign(coord=lambda dfa: dfa.longitude.combine(dfa.latitude,lambda x,y: (x,y,)))
)
dfgeo_missed = dfgeo.loc[dfgeo.postcode.isna()]
dfgeo = dfgeo.loc[~dfgeo.postcode.isna()]

numba / geopy / 半正弦距离

import geopy.distance
from numba import jit
import numpy as np

# a few ways to calculate distance between two (lon,lat) pairs
@jit(nopython=True)
def haversine_jit(x,y):
    # approximate radius of earth in km
    R = 6373.0

    s_lat = np.deg2rad(x[0])                    
    s_lng = np.deg2rad(x[1])     
    e_lat = np.deg2rad(y[0])                       
    e_lng = np.deg2rad(y[1])

    d = np.sin((e_lat - s_lat)/2)**2 + \
        np.cos(s_lat)*np.cos(e_lat) * \
        np.sin((e_lng - s_lng)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))

def haversine(x,y):
    # approximate radius of earth in km
    R = 6373.0

    s_lat = np.deg2rad(x[0])                    
    s_lng = np.deg2rad(x[1])     
    e_lat = np.deg2rad(y[0])                       
    e_lng = np.deg2rad(y[1])

    d = np.sin((e_lat - s_lat)/2)**2 + \
        np.cos(s_lat)*np.cos(e_lat) * \
        np.sin((e_lng - s_lng)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))


def geopykm(x,y):
    return geopy.distance.distance(x,y).km

优化

import itertools

# optimisation - use just combinations not permuations of locations
dfcombis = (pd.DataFrame(itertools.combinations(dfgeo.postcode.values,2))
 .merge(dfgeo.loc[:,["postcode","coord","longitude","latitude"]],left_on=0,right_on="postcode")
 .merge(dfgeo.loc[:,left_on=1,right_on="postcode")
 .drop(columns=[0,1]))

def testit(df,calc=geopykm,col="km"):
    return df.assign(**{col:df.coord_x.combine(df.coord_y,calc)})

%timeit dfx = testit(dfcombis)
%timeit dfx = testit(dfcombis,calc=haversine)
%timeit dfx = testit(dfcombis,calc=haversine_jit)


dfx = testit(dfcombis,calc=haversine_jit,col="km")

时间

1.77 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs,1 loop each)
280 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs,1 loop each)
125 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs,10 loops each)

用法

# a to b is same as b to a,concat the reverse
# some locations have multiple facilities,include a to a is 0kmn
dfnb = pd.concat([
    dfx.loc[dfx.km.le(10),["postcode_x","postcode_y","km"]],dfx.loc[dfx.km.le(10),["postcode_y","postcode_x","km"]].rename(columns={"postcode_x":"postcode_y","postcode_y":"postcode_x"}),pd.DataFrame({"postcode_x":dfhc.PostCode.unique(),"postcode_y":dfhc.PostCode.unique(),"km":0})
          ],).reset_index(drop=True)

# finally some analysis,find nearest pharmacies to GP surgery

(dfnb.merge(dfhc.loc[dfhc.PrimaryRoleId.isin(["RO180","RO96"]),["Name","PostCode","PrimaryRoleDescription"]],left_on="postcode_x",right_on="PostCode")
 .merge(dfhc.loc[dfhc.PrimaryRoleId.isin(["RO182","RO177"]),left_on="postcode_y",right_on="PostCode")
 .sort_values(["Name_x","km"])
 .groupby(["Name_x"],as_index=False).first()

)

	Name_x	postcode_x	postcode_y	km	PostCode_x	PrimaryRoleDescription_x	Name_y	PostCode_y	PrimaryRoleDescription_y
0	22A KING STREET	HR4 9DA	HR4 9AA	0.213861	HR4 9DA	初级保健信托网站	BOOTS UK LIMITED	HR4 9AA	药房
1	奥尔顿街外科	HR9 5AB	HR9 5AB	0	HR9 5AB	初级保健信托网站	奥尔顿街外科	HR9 5AB	处方成本中心
2	奥布里街	HR4 0BU	HR4 9AA	0.148447	HR4 0BU	初级保健信托网站	BOOTS UK LIMITED	HR4 9AA	药房
3	艾尔斯通山手术	HR1 1HR	HR4 9AA	1.46984	HR1 1HR	分支手术	BOOTS UK LIMITED	HR4 9AA	药房
4	巴尔斯法院学校	HR1 1EQ	HR4 9AA	1.27244	HR1 1EQ	初级保健信托网站	BOOTS UK LIMITED	HR4 9AA	药房
5	贝尔蒙特修道院	HR2 9RP	HR2 9RP	0	HR2 9RP	初级保健信托网站	CYPS - 林登中心	HR2 9RP	处方成本中心
6	贝尔蒙特健康中心	HR2 7XT	HR2 7XT	0	HR2 7XT	初级保健信托网站	贝尔蒙特医疗中心	HR2 7XT	处方成本中心
7	布莱克马斯顿学校	HR2 7NX	HR2 7JE	0.975908	HR2 7NX	初级保健信托网站	阿斯达药房	HR2 7JE	药房
8	BOBBLESTOCK SURGERY	HR4 9LP	HR4 9AA	3.5643	HR4 9LP	分支手术	BOOTS UK LIMITED	HR4 9AA	药房
9	博登汉姆外科	HR1 3JU	HR6 8LR	9.71357	HR1 3JU	初级保健信托网站	BOOTS UK LIMITED	HR6 8LR	药房
10	牙科服务中心/健康促进	HR2 7JE	HR2 7JE	0	HR2 7JE	初级保健信托网站	阿斯达药房	HR2 7JE	药房
11	ETNAM 街心理健康资源中心	HR6 8AN	HR6 8LR	0.557963	HR6 8AN	初级保健信托网站	BOOTS UK LIMITED	HR6 8LR	药房
12	金顿法院健康和社会关怀中心	HR5 3BX	HR5 3BJ	0.649622	HR5 3BX	初级保健信托网站	金顿药房	HR5 3BJ	药房
13	金顿外科	HR5 3EA	HR5 3EA	0	HR5 3EA	初级保健信托网站	金顿医疗实践	HR5 3EA	处方成本中心

Python pandas/lists 算法性能 设置numba / geopy / 半正弦距离优化时间用法

如何解决Python pandas/lists 算法性能 设置numba / geopy / 半正弦距离优化时间用法

解决方法

设置

numba / geopy / 半正弦距离

优化

时间

用法

相关推荐

Python pandas/lists 算法性能设置numba / geopy / 半正弦距离优化时间用法

如何解决Python pandas/lists 算法性能设置numba / geopy / 半正弦距离优化时间用法