Iur*_*ski 2 python regex text citations
I am adapting this code to extract citations from a text:
#!/usr/bin/env python3
# https://stackoverflow.com/a/16826935
import re
from sys import stdin
text = stdin.read()
author = "(?:[A-Z][A-Za-z'`-]+)"
etal = "(?:et al.?)"
additional = "(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = "(?:, p.? [0-9]+)?" # Always optional
year = "(?:, *"+year_num+page_num+"| *\("+year_num+page_num+"\))"
regex = "(" + author + additional+"*" + year + ")"
matches = re.findall(regex, text)
matches = list( dict.fromkeys(matches) )
matches.sort()
#print(matches)
print ("\n".join(matches))
However, it recognizes some capitalized words as author names. For example, given the text:
Although James (2020) recognized blablabla, Smith et al. (2020) found mimimi.
Those inconsistent results are a sign of lalala (Green, 2010; Grimm, 1990).
Also James (2020) ...
the output will be
Also James (2020)
Although James (2020)
Green, 2010
Grimm, 1990
Smith et al. (2020)
Is there a way to "blacklist" certain words in the code above without discarding the whole match? I want it to recognize James's work, but drop "Also" and "Although" from the citation.
Thanks in advance.
You can use
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?" # Always optional; raw string so \. is not treated as an invalid string escape
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'
matches = re.findall(regex, text)
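See the Python demo and the resulting regex demo.
The main difference is regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'. The \b(?!(?:Although|Also)\b) part is a word boundary followed by a negative lookahead: it fails the match at any position where the word immediately to the right is Although or Also, so a match cannot start at those words and begins at the following author name instead.
Also, note that I escaped the dots that are meant to match literal dots and used f-strings to make the code more compact.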
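If the list of words to exclude is likely to grow, the same lookahead can be built from a Python list. The following is only a sketch of that idea, not part of the original answer: the names STOPWORDS, build_citation_regex and extract_citations are hypothetical, and it assumes the excluded words should be matched literally and case-sensitively.

```python
import re

# Hypothetical list of capitalized words that must never start a citation.
STOPWORDS = ["Although", "Also"]


def build_citation_regex(stopwords):
    """Compile the citation pattern with a negative lookahead for stopwords."""
    author = r"(?:[A-Z][A-Za-z'`-]+)"
    etal = r"(?:et al\.?)"
    additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
    year_num = r"(?:19|20)[0-9][0-9]"
    page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
    year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
    if stopwords:
        # re.escape keeps any regex metacharacters in a stopword literal.
        blocked = "|".join(map(re.escape, stopwords))
        guard = fr"(?!(?:{blocked})\b)"
    else:
        # An empty alternation would reject every position, so skip the guard.
        guard = ""
    return re.compile(fr"\b{guard}{author}{additional}*{year}")


def extract_citations(text, stopwords=STOPWORDS):
    """Return the unique citations found in text, sorted alphabetically."""
    pattern = build_citation_regex(stopwords)
    return sorted(set(pattern.findall(text)))


if __name__ == "__main__":
    sample = (
        "Although James (2020) recognized blablabla, "
        "Smith et al. (2020) found mimimi.\n"
        "Those inconsistent results are a sign of lalala "
        "(Green, 2010; Grimm, 1990).\n"
        "Also James (2020) ..."
    )
    print("\n".join(extract_citations(sample)))
```

With the sample text from the question, this prints Green, 2010, then Grimm, 1990, then James (2020), then Smith et al. (2020), each on its own line. The pattern has no capturing group, so re.findall returns the full matched text.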