python – xml.etree.ElementTree vs. lxml.etree: different internal node representation?

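I have been converting some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Luckily, the two have a lot in common. However, I stumbled upon some strange behaviour that I cannot find described in any documentation. It concerns the internal representation of descendant nodes.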

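In ET, iter() is used to iterate over all descendants of an Element, optionally filtered by tag name. Because I could not find any details about this in the documentation, I expected lxmlET to behave the same way. From my tests, though, I conclude that lxmlET uses a different internal representation of the tree.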

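In the example below, I iterate over the nodes in a tree and print each node's children. In addition, I create all the different combinations of those children and print them too. That is, if an element has children ('A', 'B', 'C'), I create the alternative trees [('A'), ('A', 'B'), ('A', 'C'), ('B'), ('B', 'C'), ('C')].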
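As a minimal sketch of which subsets are meant here, plain strings can stand in for the child elements (the names `children` and `subsets` are only illustrative). Note that itertools.combinations groups the results by size, so the order differs from the listing above, and the full set of children is left out on purpose:

```python
from itertools import combinations

children = ("A", "B", "C")

# Sizes 1 .. len(children) - 1: the complete set of children is skipped.
subsets = [
    combo
    for size in range(1, len(children))
    for combo in combinations(children, size)
]

print(subsets)
```

The same size range is what the range(1, len(children)) loop in the full example produces.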

# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            yield new_combo_tree

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            for combo_tree in get_combination_trees(subnode):
                combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))    

    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)

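The expected output here is the ids of each node's children joined by hyphens, followed by all possible combinations of those children (see above), visiting the nodes top-down in document order.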

2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

However, when you use the lxml module instead of xml (uncomment the lxmlET import and comment out the ET import) and run the code, the output is

2-3
2
3
node 2 has no children

So the deeper descendant nodes are never visited. This can be circumvented in either of two ways:

> using deepcopy (comment/uncomment the relevant part in get_combination_trees()), or
> using for subnode in node.xpath('.//node') in parse_xml() instead of iter().

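So I know there is a way around this, but I mainly want to know what is happening. It took me ages to debug this, and I cannot find any documentation on it. What is going on, what is the actual underlying difference between the two modules, and what is the most efficient workaround when working with very large trees?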

Solution:

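While Louis's answer is correct, and I completely agree that modifying a data structure while you traverse it is generally a Bad Idea(tm), you also asked why the code works with xml.etree.ElementTree and not with lxml.etree, and there is a very reasonable explanation for that.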

Implementation of .append in xml.etree.ElementTree

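This library is implemented directly in Python and may vary depending on which Python runtime you are using. Assuming you are using CPython, the implementation you are looking for is written in vanilla Python: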

def append(self, subelement):
    """Add *subelement* to the end of this element.
    The new element will appear in document order after the last existing
    subelement (or directly after the text, if it's the first subelement),
    but before the end tag for this element.
    """
    self._assert_is_element(subelement)
    self._children.append(subelement)

The last line is the only part we are concerned with. As it turns out, self._children is initialized towards the top of that file as:

self._children = []

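So adding a child to a tree is just appending an element to a list. Intuitively, that is exactly what you are looking for (in this case), and the implementation behaves in a completely unsurprising way.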
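A small check with the standard library illustrates the consequence: after appending, the same element object is reachable from both parents. This is only a sketch; the tag names and the variables `parent`, `other` and `child` are made up for illustration.

```python
import xml.etree.ElementTree as ET

parent = ET.fromstring("<a><b/></a>")
child = parent[0]

other = ET.Element("c")
other.append(child)  # stores a second reference; nothing is removed from `parent`

print(len(parent), len(other))   # both parents report one child
print(parent[0] is other[0])     # the very same object sits in both child lists
```

Because nothing is unlinked, an iter() over the original tree still finds every descendant afterwards.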

Implementation of .append in lxml.etree

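lxml is implemented as a mix of Python, non-trivial Cython, and C code, so digging through it is considerably harder than with the pure-Python implementation. First off, .append is implemented as: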

def append(self, _Element element not None):
    u"""append(self, element)
    Adds a subelement to the end of this element.
    """
    _assertValidNode(self)
    _assertValidNode(element)
    _appendChild(self, element)

_appendChild is implemented in apihelper.pxi:

cdef int _appendChild(_Element parent, _Element child) except -1:
    u"""Append a new child to a parent element.
    """
    c_node = child._c_node
    c_source_doc = c_node.doc
    # prevent cycles
    if _isAncestorOrSame(c_node, parent._c_node):
        raise ValueError("cannot append parent to itself")
    # store possible text node
    c_next = c_node.next
    # move node itself
    tree.xmlUnlinkNode(c_node)
    tree.xmlAddChild(parent._c_node, c_node)
    _moveTail(c_next, c_node)
    # uh oh, elements may be pointing to different doc when
    # parent element has moved; change them too..
    moveNodetoDocument(parent._doc, c_source_doc, c_node)
    return 0

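There is definitely more going on here. In particular, lxml explicitly removes the node from its current tree and then adds it elsewhere. This prevents you from accidentally creating a cyclic XML graph while manipulating nodes (which is something you could probably do with the xml.etree version).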
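The difference can be observed from Python without reading any C. The helper below is a hypothetical sketch (the name `children_left_after_append` is not part of either library); it takes the etree module as a parameter and only runs the lxml case if lxml happens to be installed.

```python
import xml.etree.ElementTree as stdlib_etree


def children_left_after_append(etree):
    """Append a parent's only child to another element and
    report how many children the original parent still has."""
    parent = etree.fromstring("<a><b/></a>")
    other = etree.Element("c")
    other.append(parent[0])
    return len(parent)


print("xml.etree:", children_left_after_append(stdlib_etree))

try:
    import lxml.etree as lxml_etree  # third-party, optional
except ImportError:
    lxml_etree = None

if lxml_etree is not None:
    print("lxml.etree:", children_left_after_append(lxml_etree))
```

With xml.etree the parent keeps its child; if lxml moves nodes as described above, the lxml run should report that the parent is left empty.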

Workarounds for lxml

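Now that we know that xml.etree merely adds another reference to the same node when appending (the node also stays in its original parent's list of children), while lxml.etree moves it, why do these workarounds work? Judging by the tree.xmlUnlinkNode function (which is actually defined in C inside libxml2), unlinking just rewires a handful of pointers that are direct fields on the xmlNode struct. So anything that keeps the original node linked into the tree being iterated, or that no longer depends on the live tree while iterating, will do the trick: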

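> copy.deepcopy() definitely works
> node.xpath returns the matching nodes as a list that is built before the loop starts, so moving nodes afterwards no longer disturbs the traversal
> copy.copy() also does the trick
> If you do not need your combinations to actually live in a proper tree, setting new_combo_tree = [] also gives you list appending, just like xml.etree.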
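The deepcopy variant from the list above can be sketched as follows. It is written against the standard library so that it runs anywhere; swapping the import for lxml.etree is the assumption being illustrated, and the sample XML is a cut-down piece of the tree from the question.

```python
import xml.etree.ElementTree as ET  # assumption: lxml.etree can be swapped in here
from copy import deepcopy
from itertools import combinations


def get_combination_trees(element):
    """Yield new elements holding copies of every proper, non-empty
    subset of the children of `element`."""
    children = list(element)
    for size in range(1, len(children)):
        for combination in combinations(children, size):
            combo = ET.Element(element.tag, dict(element.attrib))
            for child in combination:
                # Append a copy so the original child stays where it is.
                combo.append(deepcopy(child))
            yield combo


root = ET.fromstring(
    '<node id="3"><node id="4"><node id="5"/></node><node id="6"/></node>'
)
combos = [[c.get("id") for c in tree] for tree in get_combination_trees(root)]
ids_after = [c.get("id") for c in root]

print(combos)
print(ids_after)
```

Since only copies are appended, the original element keeps all of its children, which is what lets a running iter() continue into the deeper levels.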

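If you really care about performance and large trees, I would probably start with copy.copy(), although you should definitely profile a few different options and see which one works best for you.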
