Python: how to convert a list of sets of strings to a scipy csr_matrix
Suppose I have the following list of sets:
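db = [{"bread", "butter", "milk"}, {"eggs", "milk", "yogurt"}, {"bread", "cheese", "eggs", "milk"}, {"eggs", "milk", "yogurt"}, {"cheese", "milk", "yogurt"}]
How can I convert it into a sparse csr_matrix? The expected output is as follows:
[[1., 1., 0., 0., 1., 0.],
 [0., 0., 0., 1., 1., 1.],
 [1., 0., 1., 1., 1., 0.],
 [0., 0., 0., 1., 1., 1.],
 [0., 0., 1., 0., 1., 1.]]
I tried hard-coding it so that I could process it further, but I can't seem to get it to work properly. Is there a better way to achieve this?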
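Solution
Setup:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
db = [{"bread", "butter", "milk"}, {"eggs", "milk", "yogurt"}, {"bread", "cheese", "eggs", "milk"}, {"eggs", "milk", "yogurt"}, {"cheese", "milk", "yogurt"}]
# Collect every distinct product, then sort for a stable column order
all_products = set()
for products in db:
    all_products |= products
sorted_products = sorted(all_products)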
Method 2 (without pandas):
First, build a dictionary that translates each product name into a column index
d = dict()
for i, prod in enumerate(sorted_products):
    d[prod] = i
{'bread': 0, 'butter': 1, 'cheese': 2, 'eggs': 3, 'milk': 4, 'yogurt': 5}
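As a small illustration of the mapping step, the same dictionary can be written as a dict comprehension. The sketch below assumes `db` holds the five sets implied by the tables in this answer, and the names `product_to_index` and `index_to_product` are illustrative, not part of the original answer. The inverse lookup is only there in case column indices need to be decoded back into product names later.

```python
# Example data: assumed to be the five sets used throughout this answer.
db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]

# Union of all sets, sorted so the column order is deterministic.
sorted_products = sorted(set().union(*db))

# product name -> column index
product_to_index = {prod: i for i, prod in enumerate(sorted_products)}

# column index -> product name (illustrative inverse lookup)
index_to_product = dict(enumerate(sorted_products))

print(product_to_index)
print(index_to_product[4])
```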
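Then create the full (dense) matrix and fill it in
# One row per set, one column per product
template = np.zeros((len(db), len(all_products)), dtype=int)
for j, line in enumerate(db):
    for prod in line:
        template[j, d[prod]] = 1
array([[1, 1, 0, 0, 1, 0],
       [0, 0, 0, 1, 1, 1],
       [1, 0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1],
       [0, 0, 1, 0, 1, 1]])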
Finally, convert it to a sparse matrix
matrix = csr_matrix(template)
print(matrix)
(0,0) 1
(0,1) 1
(0,4) 1
(1,3) 1
(1,4) 1
(1,5) 1
(2,0) 1
(2,2) 1
(2,3) 1
(2,4) 1
(3,3) 1
(3,4) 1
(3,5) 1
(4,2) 1
(4,4) 1
(4,5) 1
#<5x6 sparse matrix of type '<class 'numpy.longlong'>'
# with 16 stored elements in Compressed Sparse Row format>
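Method 2 goes through a dense array first, which may become costly when there are many sets and products. As a possible alternative (a sketch, not the original answer's method), the sparse matrix can be built directly from row/column coordinates using the `csr_matrix((data, (row_ind, col_ind)), shape=...)` constructor. The data is assumed to be the five sets used in this answer, and the name `col_of` is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Example data: assumed to be the five sets used throughout this answer.
db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]

sorted_products = sorted(set().union(*db))
col_of = {prod: i for i, prod in enumerate(sorted_products)}

# Collect one (row, column) coordinate per product occurrence.
rows, cols = [], []
for row, products in enumerate(db):
    for prod in products:
        rows.append(row)
        cols.append(col_of[prod])

# Every stored entry is 1; no dense intermediate array is created.
data = np.ones(len(rows), dtype=int)
matrix = csr_matrix((data, (rows, cols)), shape=(len(db), len(sorted_products)))

print(matrix.toarray())
```

Because each product appears at most once per set, no coordinate is repeated, so the stored values stay at 1.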
Method 1 (pandas):
df = pd.DataFrame(index=sorted_products,columns=range(len(db)))
print(df)
This gives you an empty DataFrame
          0    1    2    3    4
bread   NaN  NaN  NaN  NaN  NaN
butter  NaN  NaN  NaN  NaN  NaN
cheese  NaN  NaN  NaN  NaN  NaN
eggs    NaN  NaN  NaN  NaN  NaN
milk    NaN  NaN  NaN  NaN  NaN
yogurt  NaN  NaN  NaN  NaN  NaN
Then add the sets
for i in range(len(db)):
    df[i] = pd.Series([1] * len(db[i]), index=list(db[i]))
          0    1    2    3    4
bread   1.0  NaN  1.0  NaN  NaN
butter  1.0  NaN  NaN  NaN  NaN
cheese  NaN  NaN  1.0  NaN  1.0
eggs    NaN  1.0  1.0  1.0  NaN
milk    1.0  1.0  1.0  1.0  1.0
yogurt  NaN  1.0  NaN  1.0  1.0
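Next, fill the NaN values with zeros (and cast to int so the result is an integer matrix)
data = df.fillna(0).astype(int)
Finally, convert it to a sparse matrix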
from scipy.sparse import csr_matrix
matrix = csr_matrix(data)
print(matrix)
Output:
#<6x5 sparse matrix of type '<class 'numpy.longlong'>'
# with 16 stored elements in Compressed Sparse Row format>
(0,0) 1
(0,2) 1
(1,0) 1
(2,2) 1
(2,4) 1
(3,1) 1
(3,2) 1
(3,3) 1
(4,0) 1
(4,1) 1
(4,2) 1
(4,3) 1
(4,4) 1
(5,1) 1
(5,3) 1
(5,4) 1
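Note that Method 1 produces a 6x5 matrix (products as rows, sets as columns), which is the transpose of the 5x6 layout asked for in the question. One possible way to get one row per set with pandas is sketched below; it is an alternative under stated assumptions, not the original answer's code. It assumes `db` is the five-set list used in this answer and relies on `pd.DataFrame` accepting a list of dicts, where keys missing from a given dict become NaN.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Example data: assumed to be the five sets used throughout this answer.
db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]

sorted_products = sorted(set().union(*db))

# One dict per set -> one DataFrame row per set; absent products become NaN.
df = pd.DataFrame([dict.fromkeys(products, 1) for products in db])

# Fix the column order, replace NaN with 0 and cast to integers.
df = df.reindex(columns=sorted_products).fillna(0).astype(int)

# Convert the underlying NumPy array to CSR format.
matrix = csr_matrix(df.to_numpy())

print(df)
print(matrix.shape)
```

Transposing the DataFrame from Method 1 (`data.T`) before the conversion should give the same layout.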