Python regex for citations in a paper

How to fix a Python regex for citations in a paper

I am adapting this code to extract citations from a text:

#!/usr/bin/env python3
# https://stackoverflow.com/a/16826935

import re
from sys import stdin

text = stdin.read()

author = "(?:[A-Z][A-Za-z'`-]+)"
etal = "(?:et al.?)"
additional = "(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = "(?:,p.? [0-9]+)?"  # Always optional
year = r"(?:,*"+year_num+page_num+r"| *\("+year_num+page_num+r"\))"
regex = "(" + author + additional+"*" + year + ")"

matches = re.findall(regex,text)
matches = list( dict.fromkeys(matches) )
matches.sort()

#print(matches)
print ("\n".join(matches))

However, it recognizes some capitalized words as author names. For example, for the text:

Although James (2020) recognized blablabla,Smith et al. (2020) found mimimi. 
Those inconsistent results are a sign of lalala (Green,2010; Grimm,1990). 
Also James (2020) ...

the output is:

Also James (2020)
Although James (2020)
Green,2010
Grimm,1990
Smith et al. (2020)

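Is it possible to "blacklist" certain words in the code above without dropping the whole match? I would like it to recognize the work by James, but remove "Also" and "Although" from the citation.

Thanks.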

Solution

You can use

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:,p\.? [0-9]+)?"  # Always optional
year = fr"(?:,*{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'
matches = re.findall(regex,text)

See the Python demo and the resulting regex demo.

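The main difference is regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}': the \b(?!(?:Although|Also)\b) part fails the match if the word immediately to the right is Although or Also. The match can then only start at a later word, such as James.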
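
To see the lookahead in isolation, here is a minimal sketch. It uses a simplified pattern (any capitalized word) rather than the full citation regex, and the names word, plain, guarded and sample are made up for this illustration. It also shows why the trailing \b inside the lookahead matters: a longer word such as Alsop is not rejected.

```python
import re

# Simplified stand-in for the author pattern: any capitalized word.
# (Assumption for illustration only; the real pattern is more permissive.)
word = r"[A-Z][a-z]+"

# Without the guard, every capitalized word matches.
plain = re.compile(rf"\b{word}")

# With the guard, a match may not start at the whole word
# "Although" or "Also". The inner \b keeps "Alsop" allowed.
guarded = re.compile(rf"\b(?!(?:Although|Also)\b){word}")

sample = "Although James and Alsop agree, Also Smith"

plain_hits = plain.findall(sample)
guarded_hits = guarded.findall(sample)

print(plain_hits)    # ['Although', 'James', 'Alsop', 'Also', 'Smith']
print(guarded_hits)  # ['James', 'Alsop', 'Smith']
```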

Also, note that I escaped the dots that are meant to match literal dots, and used f-strings to make the code more compact.
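
Putting the pieces together, a self-contained sketch of the whole script might look like the following. The helper names build_citation_regex and extract_citations and the STOPWORDS list are hypothetical, introduced here only so the blacklist can be extended without editing the pattern by hand. Each word is passed through re.escape in case it contains regex metacharacters. The sample text is the one from the question.

```python
import re

# Hypothetical blacklist; extend it with any capitalized word that
# should never be treated as the first author of a citation.
STOPWORDS = ["Although", "Also"]


def build_citation_regex(stopwords):
    """Compile the citation pattern with a negative lookahead guard."""
    author = r"(?:[A-Z][A-Za-z'`-]+)"
    etal = r"(?:et al\.?)"
    additional = rf"(?:,? (?:(?:and |& )?{author}|{etal}))"
    year_num = r"(?:19|20)[0-9][0-9]"
    page_num = r"(?:,p\.? [0-9]+)?"  # Always optional
    year = rf"(?:,*{year_num}{page_num}| *\({year_num}{page_num}\))"
    # Escape each word so it is matched literally.
    blocked = "|".join(re.escape(w) for w in stopwords)
    guard = rf"(?!(?:{blocked})\b)" if stopwords else ""
    return re.compile(rf"\b{guard}{author}{additional}*{year}")


def extract_citations(text, stopwords=STOPWORDS):
    """Return the unique citations found in text, sorted alphabetically."""
    matches = build_citation_regex(stopwords).findall(text)
    # dict.fromkeys drops duplicates while keeping strings hashable-unique.
    return sorted(dict.fromkeys(matches))


text = (
    "Although James (2020) recognized blablabla,Smith et al. (2020) found mimimi. "
    "Those inconsistent results are a sign of lalala (Green,2010; Grimm,1990). "
    "Also James (2020) ..."
)

citations = extract_citations(text)
print("\n".join(citations))
```

With the sample text, both "Although James (2020)" and "Also James (2020)" reduce to "James (2020)", so the duplicate is removed and four citations remain.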
