Posts by Iur*_*ski

Python regex to extract citations from a paper

I am adapting this code to extract citations from a text:

#!/usr/bin/env python3
# https://stackoverflow.com/a/16826935

import re
from sys import stdin

text = stdin.read()

# Raw strings keep regex escapes such as \( and \. intact.
author = r"(?:[A-Z][A-Za-z'`-]+)"  # one capitalized word, treated as a surname
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

matches = re.findall(regex, text)
matches = sorted(dict.fromkeys(matches))  # drop duplicates, then sort

print("\n".join(matches))

However, it recognizes some capitalized words as author names. For example, in the text:

Although James …
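One possible way to cut down on such false positives (a sketch only, not something from the original post) is to put a negative lookahead in front of the first author so that a match cannot begin on a known non-name word. The `STOPWORDS` list, the `not_stopword` pattern, and the `find_citations` helper below are hypothetical names introduced for illustration, and the sample sentences are made up; a real stop-word list would need tuning for the corpus.

```python
import re

# Hypothetical list of capitalized sentence openers that are not surnames.
STOPWORDS = ("Although", "However", "In", "See", "As", "The", "While", "Since")

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"

# Negative lookahead: refuse to start a match on a whole stop word.
# The trailing \b keeps surnames such as "Inoue" from being rejected by "In".
not_stopword = r"(?!(?:" + "|".join(STOPWORDS) + r")\b)"
regex = r"\b(" + not_stopword + author + additional + "*" + year + ")"


def find_citations(text):
    """Return the unique citation strings found in text, sorted."""
    return sorted(set(re.findall(regex, text)))


if __name__ == "__main__":
    sample = "Although James (2020) disagreed, Smith and Jones, 2019 agreed."
    print("\n".join(find_citations(sample)))
```

The lookahead only guards the first word of a match, so a stop word appearing between real author names would still be absorbed by `additional`; the same guard could be added there if that turns out to matter.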

python regex text citations

2 votes · 1 answer · 1678 views
