如何使用uproot加载许多非常大的文件?

如何解决如何使用uproot加载许多非常大的文件?

我正在使用CMSSW_11_1_4环境(Python 3.8.4,uproot 3.11.3,尴尬的0.12.20),并且尝试打开一些非常大的.root文件(3个文件,每个9 GB)并使用事件循环遍历其某些分支。遍历第一个文件(前100k个事件)时,该代码似乎按预期工作。在第一个事件中需要花费大量时间,但随后的其他事件将很快得到处理。直到到达10万个已处理事件的第二个文件为止。然后,代码崩溃并打印出AssertionError消息。这是一些代码,可以通过xrootd访问来重现该问题。

import uproot
import uproot_methods
import numpy as np
import matplotlib.pyplot as plt

# Get the file and import using uproot
base = 'root://cmSEOs.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/'
datasets = [
            base + 'Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root',base + 'Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root',base + 'Autumn18.QCD_HT2000toInf_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root',]

# Load on lazy arrays
mycache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(datasets,'TreeMaker2/PreSelection',['HT','CrossSection','Tracks.fCoordinates.fX','Tracks.fCoordinates.fY','Tracks.fCoordinates.fZ','Tracks_fromPV0','Tracks_matchedToPFCandidate'],cache=mycache)

trk_multiplicity = np.zeros(events['Tracks.fCoordinates.fX'].size)
for ievt in range(events['Tracks.fCoordinates.fX'].size):
    if ievt%1000 == 0:
        print("Processing event %d. Progress: %.2f%%"%(ievt,100*ievt/events['Tracks.fCoordinates.fX'].size))
    if events['HT'][ievt] < 1200:
        continue
    tracks_x = events['Tracks.fCoordinates.fX'][ievt]
    tracks_y = events['Tracks.fCoordinates.fY'][ievt]
    tracks_z = events['Tracks.fCoordinates.fZ'][ievt]
    tracks_E = np.sqrt(tracks_x**2+tracks_y**2+tracks_z**2+0.13957**2)
    tracks = uproot_methods.TLorentzVectorArray.from_cartesian(tracks_x,tracks_y,tracks_z,tracks_E)
    tracks_fromPV0 = events['Tracks_fromPV0'][ievt]
    tracks_matchedToPFCandidate = events['Tracks_matchedToPFCandidate'][ievt]
    tracks = tracks[(tracks.pt > 1.) & (tracks.eta < 2.5) & (tracks_fromPV0 >= 2) & 
                    (tracks_matchedToPFCandidate > 0)]
    trk_multiplicity[ievt] = tracks.size

# Plot results
fig = plt.figure(figsize=(8,8))
ax = plt.gca()

CrossSection = events['CrossSection'][events['HT'] > 1200]
trk_multiplicity = trk_multiplicity[events['HT'] > 1200]

ax.hist(trk_multiplicity,bins=100,density=True,weights=CrossSection,histtype='step')

plt.show()

我确信这是一些内存问题,因为我已经设法以这种方式处理了多个文件(尽管文件较小并且本地存储)。我该如何克服呢?我需要以其他方式处理缓存吗?

编辑(2020年10月28日):我尝试切换到uproot4(0.0.27)/ awkward1(0.3.1)并没有成功。我将相同的代码转换为以下代码

import uproot4 as uproot
import uproot_methods
import awkward1 as ak
import numpy as np
import matplotlib.pyplot as plt

# Get the file and import using uproot
base = 'root://cmSEOs.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/'
datasets = {
    base + 'Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',base + 'Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',base + 'Autumn18.QCD_HT2000toInf_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',}


# Load on lazy arrays
events = uproot.lazy(datasets)

trk_multiplicity = np.zeros(len(events['Tracks.fCoordinates.fX']))
for ievt in range(len(events['Tracks.fCoordinates.fX'])):
    if ievt%1000 == 0:
        print("Processing event %d. Progress: %.2f%%"%(ievt,100*ievt/len(events['Tracks.fCoordinates.fX'])))
    if events['HT'][ievt] < 1200:
        continue
    tracks_x = events['Tracks.fCoordinates.fX'][ievt]
    tracks_y = events['Tracks.fCoordinates.fY'][ievt]
    tracks_z = events['Tracks.fCoordinates.fZ'][ievt]
    tracks_E = np.sqrt(tracks_x**2+tracks_y**2+tracks_z**2+0.13957**2)
    tracks = uproot_methods.TLorentzVectorArray.from_cartesian(ak.to_awkward0(tracks_x),ak.to_awkward0(tracks_y),ak.to_awkward0(tracks_z),ak.to_awkward0(tracks_E))
    tracks_fromPV0 = events['Tracks_fromPV0'][ievt]
    tracks_matchedToPFCandidate = events['Tracks_matchedToPFCandidate'][ievt]
    tracks = tracks[(tracks.pt > 1.) & (tracks.eta < 2.5) & (ak.to_awkward0(tracks_fromPV0) >= 2) & 
                    (ak.to_awkward0(tracks_matchedToPFCandidate) > 0)]
    trk_multiplicity[ievt] = tracks.size

# Plot results
fig = plt.figure(figsize=(8,histtype='step')

plt.show()

这一次,代码虽然虽然慢得多(可能是较小的缓存?),却立即开始处理事件。同样,当它到达第二个文件时,代码这次崩溃并显示以下消息:

Traceback (most recent call last):
  File "scripts/plotEventShapes_lazy.py",line 43,in <module>
    tracks_x = events['Tracks.fCoordinates.fX'][ievt]
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/awkward1/highlevel.py",line 974,in __getitem__
    self._layout[where],self._behavior,cache=self._cache
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/awkward1/partition.py",line 366,in __getitem__
    return PartitionedArray.from_ext(self._ext.getitem_at(where))
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py",line 2017,in array
    _ranges_or_baskets_to_arrays(
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py",line 3264,in _ranges_or_baskets_to_arrays
    uproot4.source.futures.delayed_raise(*obj)
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/source/futures.py",line 46,in delayed_raise
    raise exception_value.with_traceback(traceback)
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py",line 3189,in chunk_to_basket
    basket = uproot4.models.TBasket.Model_TBasket.read(
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/model.py",line 730,in read
    self.read_members(chunk,cursor,context,file)
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/models/TBasket.py",line 230,in read_members
    ) = cursor.fields(chunk,_tbasket_format2,context)
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/source/cursor.py",line 195,in fields
    return format.unpack(chunk.get(start,stop,self,context))
  File "/uscms/home/chpapage/.local/lib/python3.8/site-packages/uproot4/source/chunk.py",line 370,in get
    raise uproot4.deserialization.DeserializationError(
uproot4.deserialization.DeserializationError: while reading

    TBasket version None as uproot4.models.TBasket.Model_TBasket (? bytes)
        fNbytes: -1607368158
        fObjlen: -1243566277
        fDatime: 2634931141
        fKeylen: -27664
        fCycle: 21409

attempting to get bytes 483015:483033
outside expected range 510698:538758 for this Chunk
in file root://cmSEOs.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root

最后一条评论,我使用较小的文件(每个信号文件大约1 GB)运行相同的代码,但是这次使用xrootd访问并且循环没有问题。由于文件大小非常大,这可能是一个缓存问题吗?

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。

相关推荐


Selenium Web驱动程序和Java。元素在(x,y)点处不可单击。其他元素将获得点击?
Python-如何使用点“。” 访问字典成员?
Java 字符串是不可变的。到底是什么意思?
Java中的“ final”关键字如何工作?(我仍然可以修改对象。)
“loop:”在Java代码中。这是什么,为什么要编译?
java.lang.ClassNotFoundException:sun.jdbc.odbc.JdbcOdbcDriver发生异常。为什么?
这是用Java进行XML解析的最佳库。
Java的PriorityQueue的内置迭代器不会以任何特定顺序遍历数据结构。为什么?
如何在Java中聆听按键时移动图像。
Java“Program to an interface”。这是什么意思?
Java在半透明框架/面板/组件上重新绘画。
Java“ Class.forName()”和“ Class.forName()。newInstance()”之间有什么区别?
在此环境中不提供编译器。也许是在JRE而不是JDK上运行?
Java用相同的方法在一个类中实现两个接口。哪种接口方法被覆盖?
Java 什么是Runtime.getRuntime()。totalMemory()和freeMemory()?
java.library.path中的java.lang.UnsatisfiedLinkError否*****。dll
JavaFX“位置是必需的。” 即使在同一包装中
Java 导入两个具有相同名称的类。怎么处理?
Java 是否应该在HttpServletResponse.getOutputStream()/。getWriter()上调用.close()?
Java RegEx元字符(。)和普通点?