I am adapting this code to extract citations from a text:
#!/usr/bin/env python3
# https://stackoverflow.com/a/16826935
import re
from sys import stdin

text = stdin.read()

# A surname: a capital letter followed by letters, apostrophes, backticks or hyphens.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
# Further authors: ", Name", " and Name", " & Name", or " et al."
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # always optional
# The year follows a comma or sits in parentheses.
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

matches = re.findall(regex, text)
matches = list(dict.fromkeys(matches))  # drop duplicates, keeping first-seen order
matches.sort()
print("\n".join(matches))
However, it treats some capitalized words as author names. For example, in the text:
Although James …
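To make the behaviour easy to reproduce, here is a minimal sketch that runs the same pattern (rebuilt with raw strings) against a single sentence. The sentence in `sample` is made up for illustration and is not the text from my document. As far as I can tell, the cause is that `additional` accepts a plain space followed by any capitalized word, so a capitalized word at the start of a sentence is taken as the first author and the real authors are absorbed as "additional" ones.

```python
import re

# Same building blocks as in the script above, written as raw strings.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

# Hypothetical input: "Although" is not an author, but it is capitalized.
sample = "Although Smith and Jones (2010) disagree, the effect is real."

matches = re.findall(regex, sample)
print(matches)  # ['Although Smith and Jones (2010)']
```

The result I would like here is `Smith and Jones (2010)`, without the leading `Although`.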