Python regex to extract citations from a paper

Iur*_*ski 2 python regex text citations

I am adapting this code to extract citations from a text:

#!/usr/bin/env python3
# https://stackoverflow.com/a/16826935

import re
from sys import stdin

text = stdin.read()

author = "(?:[A-Z][A-Za-z'`-]+)"
etal = "(?:et al.?)"
additional = "(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = "(?:, p.? [0-9]+)?"  # Always optional
year = "(?:, *"+year_num+page_num+"| *\("+year_num+page_num+"\))"
regex = "(" + author + additional+"*" + year + ")"

matches = re.findall(regex, text)
matches = list( dict.fromkeys(matches) )
matches.sort()

#print(matches)
print ("\n".join(matches))

However, it picks up some capitalized words as author names. For example, given the text:

Although James (2020) recognized blablabla, Smith et al. (2020) found mimimi. 
Those inconsistent results are a sign of lalala (Green, 2010; Grimm, 1990). 
Also James (2020) ...

the output will be

Also James (2020)
Although James (2020)
Green, 2010
Grimm, 1990
Smith et al. (2020)

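Is there a way to "blacklist" certain words in the code above without discarding the whole match? I want it to still recognize James's work, but with "Also" and "Although" removed from the citation.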

Thanks in advance.

Wik*_*żew 5

You can use

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional; raw string so \. is not an invalid escape
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
# Negative lookahead: do not start a match at the words "Although" or "Also"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'
matches = re.findall(regex, text)

See the Python demo and the resulting regex demo.

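The main difference is regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'. The \b(?!(?:Although|Also)\b) part is a negative lookahead anchored at a word boundary: it makes the match fail at any position where the word immediately to the right is Although or Also, so the regex engine moves on and starts the match at the author name instead.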
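To see the lookahead on its own, here is a minimal sketch that applies the same guard to a simplified capitalized-word pattern. The names guard and word, and the sample sentence, are made up for illustration and are not part of the answer's code. It also shows why the trailing \b inside the lookahead matters: a name such as Alsop that merely begins with a blocked word is still accepted.

```python
import re

# Guard from the answer: refuse to start a match at "Although" or "Also".
guard = r"\b(?!(?:Although|Also)\b)"
# Simplified stand-in for the author pattern (assumption, for illustration only).
word = r"[A-Z][a-z]+"

sample = "Although James and Also Alsop agree"
found = re.findall(guard + word, sample)
print(found)  # ['James', 'Alsop']
```

Without the inner \b, the lookahead would also reject Alsop, because it would only check that the text starts with Also.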

Also, note that I escaped the dots that are meant to match a literal dot, and used f-strings to make the code more compact.
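If the list of unwanted words grows, the alternation can be built from a Python list instead of being typed into the pattern. The following is a sketch under that assumption: BLACKLIST, blocked, and extract_citations are hypothetical names, and the deduplication and sorting steps are carried over from the question's script. It is run here on the question's sample text.

```python
import re

# Hypothetical, user-maintained list of words that must not start a citation.
BLACKLIST = ["Although", "Also"]

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"

# re.escape keeps any regex metacharacters in the listed words literal.
blocked = "|".join(map(re.escape, BLACKLIST))
regex = fr"\b(?!(?:{blocked})\b){author}{additional}*{year}"


def extract_citations(text):
    """Return the unique citations found in text, sorted alphabetically."""
    return sorted(dict.fromkeys(re.findall(regex, text)))


sample = (
    "Although James (2020) recognized blablabla, Smith et al. (2020) found mimimi. \n"
    "Those inconsistent results are a sign of lalala (Green, 2010; Grimm, 1990). \n"
    "Also James (2020) ..."
)
print("\n".join(extract_citations(sample)))
# Green, 2010
# Grimm, 1990
# James (2020)
# Smith et al. (2020)
```

Note that the guard only applies to the first word of a match, so a blocked word appearing later in an author list would not be filtered by this pattern.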