Python difflib gives poor results

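I'm using Python's difflib to compute the difference between two plain-text English paragraphs.

The paragraphs are very similar: one has an extra leading sentence and an extra trailing sentence. There are also subtle character-level differences.

Unfortunately, I'm getting very poor results. It seems that a single character at the start of the diff throws it off, and random characters end up scattered throughout the output.

Websites such as diffchecker.com have no trouble computing the diff. I've also noticed that if I narrow the input given to difflib so that it skips the first sentence, the diff is computed correctly. Has anyone else run into this?

My code and the sample paragraphs are below. Thanks very much.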

import difflib

s1 = "Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm. Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.[14] See Fourier transform.Envelopes[edit]Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.User interface[edit]Much of Live’s interface comes from being designed for use in live performance, as well as for production.[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box)."
s2 = "Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes. [14] See Fourier transform . Envelopes[ edit ] Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them. User interface[ edit ] Much of Live’s interface comes from being designed for use in live performance, as well as for production."

if __name__ == "__main__":
    res = [d for d in difflib.ndiff(s1, s2)]
    print(res)

Solution

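As the docs say,

Compare a and b (lists of strings) ... return a Differ-style delta (a generator generating the delta lines).

ndiff() is intended for, e.g., comparing two files, given the lists of lines the files contain, much like the common Unixy diff utilities.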
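
To see what that means for string arguments, here is a minimal sketch using two tiny made-up strings (not the paragraphs from the question). A str is itself a sequence of one-character strings, so ndiff() treats every character as if it were a line:

```python
import difflib

# Tiny illustrative inputs (hypothetical, not taken from the question).
a = "abc"
b = "abd"

# Plain strings: every character is compared as if it were a whole line,
# so each delta entry is a two-character prefix plus one character.
char_delta = list(difflib.ndiff(a, b))
print(char_delta)

# Lists of newline-terminated lines: the input ndiff() is documented to take.
line_delta = list(difflib.ndiff([a + "\n"], [b + "\n"]))
print("".join(line_delta), end="")
```

With two long paragraphs, the first form yields one delta entry per character, which matches the kind of scattered output described in the question.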

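You're trying to compare two individual lines. difflib has no built-in "pretty printing" way to do that, but it does supply the comparison tools you can build on to produce whatever format you like. For example,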

# autojunk=False switches off difflib's "popular element" heuristic
d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
for op in d.get_opcodes():
    print(op)

prints

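('delete', 0, 194, 0, 0)
('equal', 194, 446, 0, 252)
('insert', 446, 446, 252, 253)
('equal', 446, 472, 253, 279)
('insert', 472, 472, 279, 280)
('equal', 472, 473, 280, 281)
('insert', 473, 473, 281, 282)
('equal', 473, 483, 282, 292)
('insert', 483, 483, 292, 293)
('equal', 483, 487, 293, 297)
('insert', 487, 487, 297, 298)
('equal', 487, 488, 298, 299)
('insert', 488, 488, 299, 300)
('equal', 488, 1143, 300, 955)
('insert', 1143, 1143, 955, 956)
('equal', 1143, 1158, 956, 971)
('insert', 1158, 1158, 971, 972)
('equal', 1158, 1162, 972, 976)
('insert', 1162, 1162, 976, 977)
('equal', 1162, 1163, 977, 978)
('insert', 1163, 1163, 978, 979)
('equal', 1163, 1269, 979, 1085)
('delete', 1269, 1508, 1085, 1085)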

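See the docs for exactly what these tuples mean. They succinctly describe what is needed to change s1 into s2. The long exact matching block is described by ('equal', 488, 1143, 300, 955), and, indeed,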

>>> s1[488 : 1143] == s2[300 : 955]
True
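
To illustrate the claim that the opcodes describe how to turn one string into the other, here is a small sketch that replays them. The helper name apply_opcodes and the two short strings are made up for this example; the tuple layout (tag, i1, i2, j1, j2) is the one get_opcodes() returns:

```python
import difflib


def apply_opcodes(a, b, opcodes):
    """Rebuild b by replaying the opcodes against a (hypothetical helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            # Unchanged text is copied from the first sequence.
            pieces.append(a[i1:i2])
        elif tag in ("insert", "replace"):
            # New text comes from the second sequence.
            pieces.append(b[j1:j2])
        # "delete" contributes nothing to the result.
    return "".join(pieces)


# Short stand-ins for s1 and s2, with the same kind of spacing differences.
a = "See Fourier transform.Envelopes[edit]"
b = "See Fourier transform . Envelopes[ edit ]"

matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
rebuilt = apply_opcodes(a, b, matcher.get_opcodes())
print(rebuilt == b)
```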

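Suggestion: instead, split your two inputs into sentences, and treat each input as a sequence (such as a list) of newline-terminated sentences. Then you can use ndiff() directly, the way it was intended to be used.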
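
A rough sketch of that suggestion follows. The split_sentences helper and its regular expression are assumptions made for illustration: it only splits after '.', '!' or '?' followed by whitespace, so it would not separate sentences that run together without a space (as some do in s1). The sample strings are made up as well:

```python
import difflib
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption, not part of difflib)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # ndiff() expects newline-terminated lines.
    return [part + "\n" for part in parts if part]


# Made-up inputs: a has an extra leading sentence, b an extra trailing one.
a = "An extra leading sentence. The shared sentence stays put. Another shared one."
b = "The shared sentence stays put. Another shared one. An extra trailing sentence."

delta = list(difflib.ndiff(split_sentences(a), split_sentences(b)))
print("".join(delta), end="")
```

Each delta line starts with "- " for a sentence only in the first input, "+ " for one only in the second, and two spaces for a sentence common to both.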

To make the SequenceMatcher approach more concrete, consider this code:

import difflib

# s1 and s2 are the two paragraphs from the question.
d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
for op, i1, i2, j1, j2 in d.get_opcodes():
    print(">>> ", end="")
    if op == "equal":
        print(f"{i2-i1} characters the same at", f"{i1}:{i2} and {j1}:{j2}")
        print(s1[i1:i2])
    elif op == "delete":
        print(f"delete {i2-i1} characters at {i1}:{i2}")
        print(s1[i1:i2])
    elif op == "insert":
        print(f"insert {j2-j1} characters from {j1}:{j2}")
        print(s2[j1:j2])
    elif op == "replace":
        print(f"replace {i1}:{i2} with {j1}:{j2}")
        print(s1[i1:i2])
        print(s2[j1:j2])
    else:
        assert False, ("unknown op", repr(op))

produces this output:

>>> delete 194 characters at 0:194
Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm. 
>>> 252 characters the same at 194:446 and 0:252
Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.
>>> insert 1 characters from 252:253
 
>>> 26 characters the same at 446:472 and 253:279
[14] See Fourier transform
>>> insert 1 characters from 279:280
 
>>> 1 characters the same at 472:473 and 280:281
.
>>> insert 1 characters from 281:282
 
>>> 10 characters the same at 473:483 and 282:292
Envelopes[
>>> insert 1 characters from 292:293
 
>>> 4 characters the same at 483:487 and 293:297
edit
>>> insert 1 characters from 297:298
 
>>> 1 characters the same at 487:488 and 298:299
]
>>> insert 1 characters from 299:300
 
>>> 655 characters the same at 488:1143 and 300:955
Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.
>>> insert 1 characters from 955:956
 
>>> 15 characters the same at 1143:1158 and 956:971
User interface[
>>> insert 1 characters from 971:972
 
>>> 4 characters the same at 1158:1162 and 972:976
edit
>>> insert 1 characters from 976:977
 
>>> 1 characters the same at 1162:1163 and 977:978
]
>>> insert 1 characters from 978:979
 
>>> 106 characters the same at 1163:1269 and 979:1085
Much of Live’s interface comes from being designed for use in live performance, as well as for production.
>>> delete 239 characters at 1269:1508
[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box).

You can edit that template to display the results in whatever way you like best.
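
A note on the autojunk argument passed to SequenceMatcher: as I understand the difflib documentation, with the default autojunk=True, any element that accounts for more than about 1% of a second sequence at least 200 items long is treated as "popular" and ignored when seeking matches. When paragraphs are compared character by character, nearly every letter is that common, which may be what derails the ndiff() call in the question. A minimal sketch with synthetic strings (made up for illustration, not the question's data):

```python
import difflib

# Synthetic inputs (hypothetical): b has 300 characters, each of which is
# far more frequent than 1%, and a is b with a short unrelated prefix.
b = "ab" * 150
a = "zzz" + b

with_heuristic = difflib.SequenceMatcher(None, a, b)  # autojunk defaults to True
without_heuristic = difflib.SequenceMatcher(None, a, b, autojunk=False)

# ratio() is a similarity score between 0.0 and 1.0.
print("autojunk=True :", with_heuristic.ratio())
print("autojunk=False:", without_heuristic.ratio())
```

The second score should come out much higher than the first, since only the second matcher is allowed to anchor matches on the frequent characters.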
