pandas: masking rows/columns that are not shared between the indices of two DataFrames

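Question

I have two datasets, each describing temperature at certain depths and certain latitudes in the ocean. The datasets come from two different models and therefore have different resolutions: model 1 has a higher latitude resolution, and the depth dimension differs between the two models. I have converted both datasets into pandas DataFrames, with depth as the row index and latitude as the column labels. I want to mask out the rows (depths) and columns (latitudes) that are not shared between the two DataFrames, because I will be taking a difference and do not want to interpolate the data. I have found how to mask particular values within rows and columns, but I want to mask entire rows and columns.

I used np.intersect1d on the depth lists to find which depths are shared between the models, and I used a conditional statement to build a boolean list that is True for each index whose value is unique to that DataFrame. However, I am not sure how to use this as a mask, or even whether it can be used that way. DataFrame.mask says "Array conditional must be same shape as self", but the condition array is one-dimensional while the DataFrame is two-dimensional. I am not sure how to make the mask refer only to the DataFrame's index. I feel I am on the right track, but since I am still new to pandas, I am not certain. (I tried searching for similar questions, but from what I could see, none quite matched my problem.)

Code (simplified working example)

Note: this was written in a Jupyter notebook environment.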

import numpy as np
import pandas as pd

# Model 1 data
depthmod1 = [5,10,15,20,30,50,60,80,100]  #depth in meters
latmod1 = [50,50.5,51,51.5,52,52.5,53] #latitude in degrees north
tmpumod1 = np.random.randint(273,303,size=(len(depthmod1),len(latmod1))) #temperature
dfmod1 = pd.DataFrame(tmpumod1,index=depthmod1,columns=latmod1)
print(dfmod1)

     50.0  50.5  51.0  51.5  52.0  52.5  53.0
5     299   300   300   293   285   293   273
10    273   288   293   292   290   302   273
15    277   279   284   302   280   294   284
20    291   295   277   276   295   279   274
30    281   284   284   275   295   284   282
50    284   276   291   282   286   295   295
60    298   294   289   294   285   289   288
80    285   284   275   298   287   277   300
100   292   295   294   273   291   276   290

# Model 2 data
depthmod2 = [5,10,15,25,35,50,60,100]  #depth in meters
latmod2 = [50,51,52,53]  #latitude in degrees north
tmpumod2 = np.random.randint(273,303,size=(len(depthmod2),len(latmod2)))  #temperature
dfmod2 = pd.DataFrame(tmpumod2,index=depthmod2,columns=latmod2)
print(dfmod2)

      50   51   52   53
5    297  282  275  292
10   298  286  292  282
15   286  285  288  273
25   292  288  279  299
35   301  295  300  288
50   277  301  281  277
60   276  293  295  297
100  275  279  292  287

# Find shared depths
depthxsect = np.intersect1d(depthmod1,depthmod2)
print("Shared depths: ",depthxsect,depthxsect.shape)

Shared depths:  [  5  10  15  50  60 100] (6,)

# Boolean mask for model 1
depthmask1 = dfmod1.index.isin(depthxsect) == False
print("Bool showing where mod1 index is NOT in mod2: ",depthmask1)

Bool showing where mod1 index is NOT in mod2:  [False False False  True  True False False  True False]

# Mask data
dfmod1masked = dfmod1.mask(depthmask1,np.nan)
print(dfmod1masked)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-fedf013c2200> in <module>
----> 1 dfmod1masked = dfmod1.mask(depthmask1,np.nan)
      2 print(dfmod1masked)
[...]
ValueError: Array conditional must be same shape as self

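Question

How can I mask rows by index so that I am left with only the rows/indices available in both DataFrames, [5 10 15 50 60 100]? I will apply a similar mask to the columns (latitudes), so ideally the solution for rows would also work for columns. I also do not want to merge the DataFrames. Unless a merge is required, they should stay separate.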

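Solution

depthxsect already holds the index values you want, as an np.array. So you can skip creating the boolean array depthmask and simply pass the np.array to the DataFrame with .loc. You should use .mask only if you want to keep all of the rows and return NaN values at the other indices.

Once you have dfmod1 and depthxsect, you can simply use: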

dfmod1.loc[depthxsect]

Full reproducible code:

import pandas as pd
import numpy as np

# Model 1 data
depthmod1 = [5,10,15,20,30,50,60,80,100]  #depth in meters
latmod1 = [50,50.5,51,51.5,52,52.5,53] #latitude in degrees north
tmpumod1 = np.random.randint(273,303,size=(len(depthmod1),len(latmod1))) #temperature
dfmod1 = pd.DataFrame(tmpumod1,index=depthmod1,columns=latmod1)

# Model 2 data
depthmod2 = [5,10,15,25,35,50,60,100]  #depth in meters
latmod2 = [50,51,52,53]  #latitude in degrees north
tmpumod2 = np.random.randint(273,303,size=(len(depthmod2),len(latmod2)))  #temperature
dfmod2 = pd.DataFrame(tmpumod2,index=depthmod2,columns=latmod2)
depthxsect = np.intersect1d(depthmod1,depthmod2)
dfmod1.loc[depthxsect]
Out[2]: 
     50.0  50.5  51.0  51.5  52.0  52.5  53.0
5     284   291   280   287   297   286   277
10    294   279   302   283   284   298   291
15    278   296   286   298   279   275   286
50    284   281   297   290   302   299   280
60    290   301   302   298   283   286   287
100   285   283   297   287   289   282   283
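The same label-based selection can be applied along both axes at once, which also covers the latitude part of the question. The following is only a minimal sketch: the names `latxsect`, `sub1`, `sub2` and `diff` and the fixed seed are illustrative and not from the original post, and the latitudes are written as floats in both models so that the column labels share a dtype.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only so the sketch is repeatable

# Model 1: finer latitude grid
depthmod1 = [5, 10, 15, 20, 30, 50, 60, 80, 100]
latmod1 = [50.0, 50.5, 51.0, 51.5, 52.0, 52.5, 53.0]
dfmod1 = pd.DataFrame(rng.integers(273, 303, size=(len(depthmod1), len(latmod1))),
                      index=depthmod1, columns=latmod1)

# Model 2: coarser latitude grid, different depths
depthmod2 = [5, 10, 15, 25, 35, 50, 60, 100]
latmod2 = [50.0, 51.0, 52.0, 53.0]
dfmod2 = pd.DataFrame(rng.integers(273, 303, size=(len(depthmod2), len(latmod2))),
                      index=depthmod2, columns=latmod2)

# Labels present in both models, found the same way for each axis
depthxsect = np.intersect1d(depthmod1, depthmod2)
latxsect = np.intersect1d(latmod1, latmod2)

# .loc[row_labels, column_labels] selects on both axes; the originals are untouched
sub1 = dfmod1.loc[depthxsect, latxsect]
sub2 = dfmod2.loc[depthxsect, latxsect]

# Both subsets now carry identical labels, so the difference has no NaN
diff = sub1 - sub2
```

Because `.loc` returns new objects, `dfmod1` and `dfmod2` stay separate and unmodified, and no merge is involved.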

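I have also included the approach you were attempting. You did not apply the mask to individual columns; you were applying it to the whole DataFrame at once. Applied column by column, it works: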

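import pandas as pd
import numpy as np
# Model 1 data
depthmod1 = [5,10,15,20,30,50,60,80,100]  #depth in meters
latmod1 = [50,50.5,51,51.5,52,52.5,53] #latitude in degrees north
tmpumod1 = np.random.randint(273,303,size=(len(depthmod1),len(latmod1))) #temperature
dfmod1 = pd.DataFrame(tmpumod1,index=depthmod1,columns=latmod1)
# Model 2 depths
depthmod2 = [5,10,15,25,35,50,60,100]
depthxsect = np.intersect1d(depthmod1,depthmod2)
# True where a model 1 depth is NOT shared with model 2
depthmask = dfmod1.index.isin(depthxsect) == False
# Series.mask accepts a 1-D condition, so apply it one column at a time
for col in dfmod1.columns:
    dfmod1[col] = dfmod1[col].mask(depthmask,np.nan)
dfmod1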
Out[3]: 
      50.0   50.5   51.0   51.5   52.0   52.5   53.0
5    289.0  274.0  297.0  274.0  277.0  278.0  277.0
10   282.0  280.0  277.0  302.0  297.0  289.0  278.0
15   300.0  282.0  297.0  297.0  300.0  279.0  291.0
20     NaN    NaN    NaN    NaN    NaN    NaN    NaN
30     NaN    NaN    NaN    NaN    NaN    NaN    NaN
50   285.0  297.0  292.0  301.0  296.0  289.0  291.0
60   295.0  299.0  278.0  295.0  299.0  293.0  277.0
80     NaN    NaN    NaN    NaN    NaN    NaN    NaN
100  292.0  293.0  289.0  291.0  289.0  276.0  286.0
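If you want to keep the full shape and avoid the loop, one option is to build a two-dimensional condition by broadcasting a row mask against a column mask, so that it has the same shape as the DataFrame. This is only a sketch under a few assumptions: the names `keep_rows`, `keep_cols` and `keep` are illustrative, the shared latitudes are assumed by analogy with the shared depths, and `DataFrame.where` (which keeps values where the condition is True) is used in place of `DataFrame.mask`.

```python
import numpy as np
import pandas as pd

depthmod1 = [5, 10, 15, 20, 30, 50, 60, 80, 100]
latmod1 = [50.0, 50.5, 51.0, 51.5, 52.0, 52.5, 53.0]
dfmod1 = pd.DataFrame(np.random.randint(273, 303, size=(len(depthmod1), len(latmod1))),
                      index=depthmod1, columns=latmod1)

shared_depths = [5, 10, 15, 50, 60, 100]  # result of np.intersect1d in the post
shared_lats = [50.0, 51.0, 52.0, 53.0]    # assumed: latitudes present in both models

keep_rows = dfmod1.index.isin(shared_depths)   # 1-D boolean array, one entry per row
keep_cols = dfmod1.columns.isin(shared_lats)   # 1-D boolean array, one entry per column

# Broadcasting (n_rows, 1) against (1, n_cols) gives an (n_rows, n_cols) array,
# which satisfies the "same shape as self" requirement
keep = keep_rows[:, None] & keep_cols[None, :]

# Values are kept where keep is True and replaced by NaN elsewhere
dfmod1masked = dfmod1.where(keep)
```

Every row and column whose label is not shared ends up entirely NaN, while `dfmod1` itself is left unchanged.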
