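ValueError: BitVects must be same length (rdkit)

How to resolve "ValueError: BitVects must be same length" in rdkit

I am using rdkit to calculate the structural similarity between pairs of molecules. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7), it works fine.

When I run it on my PC (rdkit=2021.03.2, python=3.8.5), I get an error. The error is a bit strange: the dataframe contains 500 rows, and the code works only for the first 10 rows (0-9). For the later rows I get this error: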

 s = DataStructs.BulkTanimotoSimilarity(fps_2[n],fps_2[n+1:]) 
    ValueError: BitVects must be same length

The code is as follows:

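  import os
  import pandas as pd
  from pandas import DataFrame
  from rdkit import Chem, DataStructs
  from rdkit.Chem.Fingerprints import FingerprintMols

  data = pd.read_csv(os.path.join(os.getcwd(),"dataset","test_ssp.csv"),index_col=None)

  # Proof the SMILES and make a list of SMILES and id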
  c_smiles = []
  count = 0
  for index,row in data.iterrows():
    try:
      cs = Chem.CanonSmiles(row['SMILES'])
      c_smiles.append([row['ID_Name'],cs])
    except:
      count = count + 1
      print('Count Invalid SMILES:',count,row['ID_Name'],row['SMILES'])

  # make a list of id,smiles,and mols
  ms = []
  df = DataFrame(c_smiles,columns=['ID_Name','SMILES'])
  for index,row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'],row['SMILES'],mol])

  # make a list of id,mols,and fingerprints (fp)
  fps = []
  df_fps = DataFrame(ms,columns=['ID_Name','SMILES','mol'])
  df_fps.head()

  for index,row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'],fps_cal])


  fps_2 = DataFrame(fps,columns=['ID_Name','fps'])
  fps_2 = fps_2[fps_2.columns[1]]
  fps_2 = fps_2.values.tolist()


  # compare all fp pairwise without duplicates
  for n in range(len(fps_2)): 
      s = DataStructs.BulkTanimotoSimilarity(fps_2[n],fps_2[n+1:])
      for m in range(len(s)):
          qu.append(c_smiles2[n])
          ta.append(c_smiles2[n+1:][m])
          sim.append(s[m])

Can you tell me why I get this error on my PC when the code runs fine in Google Colab? How can I fix it? Is there any way to install rdkit=2020.09.2?

Reproducible data

DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1

Solution

To first answer how to install a specific version of RDKit, you can run the following command:

conda install -c rdkit rdkit=2020.09.2

Coming back to the original question, the error arises because of the function:

FingerprintMols.FingerprintMol()

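For whatever internal reason, it converts the first 10 SMILES into vectors of length 2048 and the 11th SMILES into a vector of length 1024. The older version was able to handle this mismatch, but the newer version cannot. There are two ways to fix this:

1. Downgrade RDKit to the older version using the command mentioned above.
2. Fix the length of the vector by passing it as an argument. Basically, replace the line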

FingerprintMols.FingerprintMol(row['mol'])

with

FingerprintMols.FingerprintMol(row['mol'],minPath=1,maxPath=7,fpSize=2048,bitsPerHash=2,useHs=True,tgtDensity=0.0,minSize=128)

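In the replacement, all arguments other than fpSize are set to their default values, while fpSize is fixed at 2048. Note that you must pass all of the arguments, not just fpSize.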
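Before changing anything, it may help to confirm that mismatched lengths really are the cause. The following is a minimal sketch for that check. The helper names `fingerprint_lengths` and `first_mismatch` are hypothetical (they are not part of RDKit), and the code assumes only that each fingerprint exposes `GetNumBits()`, as RDKit's explicit bit vectors do.

```python
from collections import Counter


def fingerprint_lengths(fps):
    """Count how many fingerprints have each bit length."""
    return Counter(fp.GetNumBits() for fp in fps)


def first_mismatch(fps):
    """Return (index, length) of the first fingerprint whose length
    differs from that of fps[0], or None if all lengths agree."""
    if not fps:
        return None
    expected = fps[0].GetNumBits()
    for i, fp in enumerate(fps):
        n = fp.GetNumBits()
        if n != expected:
            return i, n
    return None
```

Called on the `fps_2` list from the question, `fingerprint_lengths(fps_2)` would show whether more than one length is present, and `first_mismatch(fps_2)` would point at the first row that differs.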

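Just to expand on mnis's answer: since FingerprintMol defaults to RDKFingerprint, you may find it easier to use that directly, as it is more flexible and you don't have to supply all of the arguments. Tested on version 2021.03.3.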

Chem.RDKFingerprint(row['mol'],fpSize=2048)
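
Putting this together, a minimal sketch of the pairwise comparison with fixed-length fingerprints might look like the following. It uses three of the SMILES from the reproducible data in the question. The variable names (`smiles`, `ids`, `pairs`) are illustrative and do not come from the original code.

```python
from rdkit import Chem, DataStructs

# Three entries taken from the reproducible data in the question.
smiles = {
    "DB01059": "CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1",
    "DB01166": "O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2",
    "DB00813": "CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1",
}

ids = list(smiles)
mols = [Chem.MolFromSmiles(smiles[i]) for i in ids]

# Request the same size for every fingerprint so all bit vectors match.
fps = [Chem.RDKFingerprint(mol, fpSize=2048) for mol in mols]

# Compare each fingerprint with the ones after it (no duplicate pairs).
pairs = []
for n in range(len(fps) - 1):
    sims = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n + 1:])
    for offset, sim in enumerate(sims):
        pairs.append((ids[n], ids[n + 1 + offset], sim))

for query, target, sim in pairs:
    print(query, target, round(sim, 3))
```

Stopping the outer loop at `len(fps) - 1` skips the final iteration, where there would be nothing left to compare against.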