Applying a function to match string elements in Python

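I have Twitter data, and I am trying to return all state abbreviations that match each user's self-described location. I created a matching function and applied it to my dataframe, but for some reason I get no matches back (all NaN), even though state abbreviations are present in the raw data.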

My list of states includes all 50 states:

states = ['AL','AK','AZ','AR','CA'...]

A small sample of my dataframe looks like this:

        user_location            text
0   CO                           australia to manufacture covid vaccine and g...
1   Seattle,WA                  coronavirusvaccine coronavaccine covidvaccine ...
2   nan                          deaths due to covid in affected countries re...
3   Atlanta,GA                  subhashree stay safe di amp da

I created the following looping function to try to return matches between my list of states and the user_location column:

def match(user_location):
    for state in states:
        if state in tweets2.user_location:
            return state
        else:
            return np.nan

I then created a new column holding the matches by applying my function:

tweets2['State'] = tweets2['user_location'].apply(match)

However, all I get are NaN values, even though I know the user_location column definitely contains state abbreviations.

I checked this with the following code:

tweets2['State'].notnull().value_counts()

Any help with solving this would be greatly appreciated!

Solution

In your code, you return nan as soon as a single state fails to match, as shown here:

        else:
            return np.nan
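
To see the early return in action, here is a minimal sketch that traces the same control flow on a single plain string. The shortened `states` list and the helper name `match_early_return` are assumptions made for illustration, and `math.nan` stands in for `np.nan` so the snippet has no dependencies.

```python
import math

# Hypothetical shortened list; the question's list holds all 50 abbreviations.
states = ['AL', 'AK', 'AZ', 'WA']

def match_early_return(user_location):
    """Same control flow as the question's loop, applied to one string."""
    for state in states:
        if state in user_location:
            return state
        else:
            # Runs on the first state that does not match, ending the loop
            return math.nan

result = match_early_return('Seattle,WA')
print(result)  # nan, because 'AL' is checked first and is not in the string
```

Only a location containing the very first abbreviation in the list could ever produce a match with this structure.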

You should change it so that nan is returned only after all states have been checked. To do that, you can write it like this:

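def match(user_location):
    # Missing locations arrive as NaN (a float), which cannot be searched
    if not isinstance(user_location, str):
        return np.nan
    for state in states:
        # Test the value passed in, not the whole tweets2.user_location column
        if state in user_location:
            return state
    # Reached only after every state has been checked
    return np.nan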
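
As a rough end-to-end check, the sketch below applies a corrected function to a small frame modeled on the sample rows in the question. The shortened `states` list and the name `match_state` are assumptions for illustration. It also shows why testing against `tweets2.user_location` inside the function cannot work: `in` on a pandas Series checks the index labels, not the stored values.

```python
import numpy as np
import pandas as pd

# Hypothetical shortened list; the question's list holds all 50 abbreviations.
states = ['AL', 'AK', 'CO', 'GA', 'WA']

tweets2 = pd.DataFrame({'user_location': ['CO', 'Seattle,WA', np.nan, 'Atlanta,GA']})

# `in` on a Series looks at the index (0, 1, 2, 3), not at the stored values
print('CO' in tweets2.user_location)  # False
print(0 in tweets2.user_location)     # True

def match_state(user_location):
    # Skip missing values, which are floats rather than strings
    if not isinstance(user_location, str):
        return np.nan
    for state in states:
        if state in user_location:
            return state
    # No abbreviation matched after checking the whole list
    return np.nan

tweets2['State'] = tweets2['user_location'].apply(match_state)
print(tweets2['State'].tolist())  # ['CO', 'WA', nan, 'GA']
```

A plain substring test can over-match (for example, 'IN' is a substring of 'INDIA'), so real location text may need stricter token matching.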

You always return after the first iteration of the loop. Try to avoid using return inside the loop. Let's rebuild your loop:

def match(user_location):
    user_state = np.nan
    for state in states:
        # Guard against missing (non-string) locations before searching
        if isinstance(user_location, str) and state in user_location:
            user_state = state
            break
    return user_state

# Apply to each value of the column, not to the column as a whole
tweets2['State'] = tweets2['user_location'].apply(match)

You could also do something more elegant with sets. If you make states a set, taking its intersection with another set returns the set of items that exist in both.
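
One way the set idea might look in practice is sketched below. The tokenization step (splitting on non-letters with `re.split`), the shortened `states` set, and the name `match_states` are all assumptions made for illustration, not something given in the answer. Because it returns every abbreviation found, it also covers locations that mention more than one state.

```python
import re

# Hypothetical shortened set; the question's list holds all 50 abbreviations.
states = {'AL', 'AK', 'CO', 'GA', 'IN', 'WA'}

def match_states(user_location):
    """Return the sorted state abbreviations that appear as whole tokens."""
    if not isinstance(user_location, str):
        return []
    # Split on anything that is not a letter, so 'Seattle,WA' -> {'Seattle', 'WA'}
    tokens = set(re.split(r'[^A-Za-z]+', user_location))
    # Set intersection keeps only the tokens that are also state abbreviations
    return sorted(states & tokens)

print(match_states('Seattle,WA'))                 # ['WA']
print(match_states('INDIA'))                      # []
print(match_states('Atlanta, GA or Denver, CO'))  # ['CO', 'GA']
```

Matching whole tokens avoids the substring false positives of the loop version, at the cost of missing abbreviations that are glued to other letters.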