Regular expression for recognizing in-text citations

Question

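I am trying to write a regular expression that captures in-text citations.

Here are a few example sentences containing in-text citations:

... and the reported results in (Nivre et al., 2007) were not representative ...

... two systems used a Markov chain approach (Sagae and Tsujii 2007).

Nivre (2007) showed that ...

... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).

Currently, my regular expression is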

\(\D*\d\d\d\d\)

which matches examples 1-3, but not example 4. How can I modify it so that it also captures example 4?

Thanks!

Answer 1

tch*_*ist 5

I’ve been using something like this for that purpose lately:

#!/usr/bin/env perl

use 5.010;
use utf8;
use strict;
use autodie;
use warnings qw< FATAL all >;
use open qw< :std IO :utf8 >;

my $citation_rx = qr{
    \( (?:
        \s*

        # optional author list
        (?:
            # has to start capitalized
            \p{Uppercase_Letter}

            # then have a lower case letter, or maybe an apostrophe
            (?=  [\p{Lowercase_Letter}\p{Quotation_Mark}] )

            # before a run of letters and admissible punctuation
            [\p{Alphabetic}\p{Dash_Punctuation}\p{Quotation_Mark}\s,.] +

        ) ?  # hook if and only if you want the authors to be optional!!

        # a reasonable year
        \b (18|19|20) \d\d

        # citation series suffix, up to a six-parter
        [a-f] ?         \b

        # trailing semicolon to separate multiple citations
        ; ?
        \s*
    ) +
    \)
}x;

while (<DATA>) {
    while (/$citation_rx/gp) {
        say ${^MATCH};
    }
}

__END__
... and the reported results in (Nivré et al., 2007) were not representative ...
... two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that ...
... for attaching and labelling dependencies (Chen et al., 2007; Dreǳe et al., 2007).

When run, it produces:

(Nivré et al., 2007)
(Sagae and Tsujii 2007)
(2007)
(Chen et al., 2007; Dreǳe et al., 2007)

Answer 2

orl*_*ade 5

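Building on Tex's answer, I wrote a very simple Python script called Overcite that does this for a friend (end of semester, lazy referencing, you know how it is). It is open source and MIT-licensed on Bitbucket.

It covers a few more cases than Tex's answer, which may be helpful (see the test file), including ampersands and citations with page numbers. The whole script is basically: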

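import re

# Building blocks for an author-year citation such as "Chen et al., 2007"
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"  # the period is literal and optional
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

# `text` holds the document to scan; a sample sentence is used here
text = "Nivre (2007) showed that ..."
matches = re.findall(regex, text)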
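
As a rough illustration of what these patterns accept, the sketch below restates them as raw strings and runs them over the question's sample sentences. The helper name `find_citations` and the "Smith & Jones" sentence are made up for this example and are not taken from Overcite; the periods in "et al." and "p." are escaped here so that they match literally, which is an assumption about the intended behavior.

```python
import re

# Restatement of the patterns above as raw strings (illustrative only).
AUTHOR = r"(?:[A-Z][A-Za-z'`-]+)"
ETAL = r"(?:et al\.?)"
ADDITIONAL = r"(?:,? (?:(?:and |& )?" + AUTHOR + "|" + ETAL + "))"
YEAR_NUM = r"(?:19|20)[0-9][0-9]"
PAGE_NUM = r"(?:, p\.? [0-9]+)?"
YEAR = r"(?:, *" + YEAR_NUM + PAGE_NUM + r"| *\(" + YEAR_NUM + PAGE_NUM + r"\))"
CITATION = re.compile("(" + AUTHOR + ADDITIONAL + "*" + YEAR + ")")


def find_citations(text):
    """Return every author-year citation found in `text` (hypothetical helper)."""
    return CITATION.findall(text)


SAMPLES = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).",
    "... as argued by Smith & Jones, 2010, p. 12 ...",  # invented sentence
]

for sample in SAMPLES:
    print(find_citations(sample))
```

Two things stand out when this is run: the pattern returns each citation separately rather than the whole parenthesized group, and "(Sagae and Tsujii 2007)" is not matched, because the year part requires either a comma before the year or parentheses around it.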

Answer 3

Ign*_*ams 2

/\(\D*\d\d\d\d(?:;\D*\d\d\d\d)*\)/
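
The added group `(?:;\D*\d\d\d\d)*` allows any number of further "; ... year" parts before the closing parenthesis. A minimal Python sketch comparing the question's pattern with this one on the four sample sentences (the helper `first_match` is a name invented for this example):

```python
import re

# The question's original pattern and the revised pattern from this answer.
ORIGINAL = re.compile(r"\(\D*\d\d\d\d\)")
REVISED = re.compile(r"\(\D*\d\d\d\d(?:;\D*\d\d\d\d)*\)")

EXAMPLES = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).",
]


def first_match(pattern, sentence):
    """Return the first matched citation text, or None if nothing matches."""
    found = pattern.search(sentence)
    return found.group(0) if found else None


for sentence in EXAMPLES:
    print(first_match(ORIGINAL, sentence), "|", first_match(REVISED, sentence))
```

Both patterns rely on `\D*`, which matches any run of non-digits, including parentheses. A sentence such as "(as noted) in Smith (2007)" is therefore matched from the first opening parenthesis onward, so this approach works best on text where parentheses are used mainly for citations.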

Archived:	15 years, 4 months ago
Views:	2508
Last activity:	12 years, 10 months ago