How to deal with multi-valued rows of unknown size in Python?

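I'm trying to solve vidya's recent LTFS (Bank Data) hackathon analytics problem, and I've run into an issue that looked unusual at first, though it probably isn't that unusual. Let me explain.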

Problem

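Several columns in the Bureau dataset, namely REPORTED DATE - HIST, CUR BAL - HIST, AMT OVERDUE - HIST and AMT PAID - HIST, hold either empty values or multiple values in a single row, and the number of values differs from row to row.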

Here is a sample of the dataset (not the original data, because the rows are very long):

**Requested Date - Hist**
20180430,20180331,20191231,20191130,20191031,20190930,20190831,20190731,20190630,20190531,20190430,20190331,20121031,20120930,20120831,20120731,20120630,20120531,20120430,

----------------x-----------2nd column------------x-----------------------------------

**AMT OVERDUE**
37873,1452,3064,2972,2802,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866

-----x--other columns are similar---x---------------------

I'm looking for a better option, if there is one.

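The last time I solved this kind of problem, it was the genre column in the Movielens project. I used the dummy-column approach there, and it worked because the genre column did not have many distinct values and some values repeated across many rows, so it was easy. Here it looks much harder, for two reasons:

1st reason: a row may contain a lot of values, or it may contain none at all.

2nd reason: how would I create a column for each unique value or row, the way I did in the Movielens case?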

**genre**
action|adventure|comedy
carton|scifi|action
biopic|adventure|comedy
Thrill|action

# so here I extracted all the unique values and created one column for each

**genre**                 | **action** | **adventure**| **comedy**| **carton**| **scifi** | and so on...
action|adventure|comedy   |   1        |     1        |      1    |     0     |      0    |    
carton|scifi|action       |   1        |     0        |      0    |     1     |      1    |
biopic|adventure|comedy   |   0        |     1        |      1    |     0     |      0    |
Thrill|action             |   1        |     0        |      0    |     0     |      0    |

# but here it's different, and I have no clue how to deal with it
**AMT OVERDUE**
37873,866

Solution

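In recommender systems you usually end up with sparse matrices. Stored densely, these can take up a lot of space (too many zeros or blanks), so it may make sense to move to a SciPy sparse matrix representation. As mentioned, this is very common in recommender systems, and good worked examples are easy to find there.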
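To give a rough feel for the space argument, here is a minimal sketch comparing a dense array with its CSR counterpart. The sizes and the random fill are made up purely for illustration; they are not taken from the bureau data.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
# Made-up sizes: a 1000 x 1000 matrix with at most 10,000 non-zero cells (about 1%)
n_rows, n_cols, n_entries = 1000, 1000, 10_000

rows = rng.integers(0, n_rows, size=n_entries)
cols = rng.integers(0, n_cols, size=n_entries)
# Repeated (row, col) pairs are summed when the matrix is built
X = csr_matrix((np.ones(n_entries), (rows, cols)), shape=(n_rows, n_cols))

# Dense storage pays for every cell; CSR only pays for the non-zero ones
dense_bytes = X.toarray().nbytes
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(dense_bytes, sparse_bytes)
```

The exact numbers depend on the dtypes SciPy picks, but the sparse version should come out far smaller at this fill rate.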

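Unfortunately I can't use the original data; a smaller example in a CSV would have been helpful. So I'll use the recommender example instead, since it is just as common.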

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

df = pd.DataFrame({
    'genres': ["action|adventure|comedy", "carton|scifi|action",
               "biopic|adventure|comedy", "Thrill|action"],
})
print(df)
                    genres
0  action|adventure|comedy
1      carton|scifi|action
2  biopic|adventure|comedy
3            Thrill|action

Let's see what it looks like as a matrix:

# Collect every genre that appears, so we can create our columns
genres = []
for G in df['genres'].unique():
    for i in G.split("|"):
        genres.append(i)
# Remove duplicates (set order is arbitrary, so the column order can vary between runs)
genres = list(set(genres))

# Create a 0/1 column for each genre; split first so we match whole genres, not substrings
for g in genres:
    df[g] = df.genres.map(lambda x: int(g in x.split("|")))

# This is the dense version of the matrix, with many 0s
movie_genres = df.drop(columns=['genres'])
print(movie_genres)
   comedy  carton  adventure  Thrill  biopic  action  scifi
0       1       0          1       0       0       1      0
1       0       1          0       0       0       1      1
2       1       0          1       0       1       0      0
3       0       0          0       1       0       1      0
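For what it's worth, pandas can build the same kind of indicator table in one call with `Series.str.get_dummies`. A minimal sketch on the same toy data (the column order may differ from the output above):

```python
import pandas as pd

df = pd.DataFrame({
    "genres": ["action|adventure|comedy", "carton|scifi|action",
               "biopic|adventure|comedy", "Thrill|action"],
})

# One 0/1 column per distinct genre, splitting each cell on "|"
dummies = df["genres"].str.get_dummies(sep="|")
print(dummies)
```

This still produces a dense table, so it is handy for small data but has the same memory cost as the loop version.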

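We don't actually need to build that dense matrix. In fact, it is better to avoid it, because it can be very resource-intensive.

Instead, we should build a csr_matrix directly, which takes only a fraction of the size: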

from scipy.sparse import csr_matrix

M = len(df.index)   # number of rows
N = len(genres)     # number of distinct genres

# Map row labels and genre names to consecutive integer positions
user_mapper = dict(zip(np.unique(df.index), range(M)))
genres_mapper = dict(zip(genres, range(N)))

# Inverse mappings, to get from matrix positions back to labels
user_inv_mapper = {pos: user for user, pos in user_mapper.items()}
genres_inv_mapper = {pos: genre for genre, pos in genres_mapper.items()}

# Collect the (row, column) coordinates of every non-zero entry
user_index = []
genre_index = []
for user in df.index:
    for genre in df.loc[user, 'genres'].split('|'):
        genre_index.append(genres_mapper[genre])
        user_index.append(user_mapper[user])

X = csr_matrix((np.ones(len(genre_index)), (user_index, genre_index)), shape=(M, N))

Which looks like:

print(X)
  (0,0)    1.0
  (0,2)    1.0
  (0,5)    1.0
  (1,1)    1.0
  (1,5)    1.0
  (1,6)    1.0
  (2,0)    1.0
  (2,2)    1.0
  (2,4)    1.0
  (3,3)    1.0
  (3,5)    1.0
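As a quick sanity check on a matrix like this, you can count the stored entries and map a row's column positions back to genre names with the inverse mapper. This is a self-contained sketch on the same toy data; note that it sorts the genres so the column positions are reproducible, which is my assumption and not what the set-based code above does.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({
    "genres": ["action|adventure|comedy", "carton|scifi|action",
               "biopic|adventure|comedy", "Thrill|action"],
})

# Assumption: sorted genres, so that column positions are stable between runs
genres = sorted({g for cell in df["genres"] for g in cell.split("|")})
genres_mapper = {g: j for j, g in enumerate(genres)}
genres_inv_mapper = {j: g for g, j in genres_mapper.items()}

rows, cols = [], []
for i, cell in enumerate(df["genres"]):
    for g in cell.split("|"):
        rows.append(i)
        cols.append(genres_mapper[g])

X = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(df), len(genres)))

# The number of stored entries should equal the total number of genre tags
print(X.nnz)

# CSR keeps row 1's column positions in indices[indptr[1]:indptr[2]]
row1_cols = X.indices[X.indptr[1]:X.indptr[2]]
row1_genres = sorted(genres_inv_mapper[j] for j in row1_cols)
print(row1_genres)

# Densify only for small checks like this one
dense = X.toarray()
```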

The above walks through the process on a small dataset.
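To tie this back to the history columns in the question, here is a rough sketch of how rows of different lengths, including empty ones, could go into a csr_matrix where column k holds the k-th value of each row's history. The miniature DataFrame and the `parse_history` helper are hypothetical, the blank token in the last row is invented to show the handling, and I am assuming blank or missing cells should simply contribute no entries.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical miniature of the bureau data: comma-separated histories,
# plus an empty string and a missing value standing in for blank rows
bureau = pd.DataFrame({
    "AMT OVERDUE - HIST": ["37873,1452,3064", "", None, "976,,866"],
})

def parse_history(cell):
    """Turn '1,2,,3' into [1.0, 2.0, 3.0]; blank or missing cells give []."""
    if not isinstance(cell, str):
        return []
    return [float(tok) for tok in cell.split(",") if tok.strip()]

hist = [parse_history(cell) for cell in bureau["AMT OVERDUE - HIST"]]
lengths = np.array([len(v) for v in hist])

# CSR layout: row i owns the slice indptr[i]:indptr[i+1] of data and indices
indptr = np.concatenate(([0], np.cumsum(lengths)))
indices = np.concatenate([np.arange(n) for n in lengths])
data = np.concatenate([np.asarray(v, dtype=float) for v in hist])

X = csr_matrix((data, indices, indptr), shape=(len(hist), int(lengths.max())))

# Stored values per row; a genuine 0 amount would still be stored explicitly
counts = X.getnnz(axis=1)
print(X.toarray())
print(counts)
```

Whether history position is a meaningful column for your model is a separate question; this only shows that ragged rows fit the sparse format without padding every row to the longest one.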