Regular expression for recognizing in-text citations

maw*_*dby 9 regex

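I'm trying to create a regular expression to capture in-text citations.

Here are a few example sentences containing in-text citations:

  1. ... and the reported results in (Nivre et al., 2007) were not representative ...

  2. ... two systems used a Markov chain approach (Sagae and Tsujii 2007).

  3. Nivre (2007) showed that ...

  4. ... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).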

Currently, my regular expression is

\(\D*\d\d\d\d\)

which matches examples 1-3 but not example 4. How can I modify it so that it also captures example 4?

Thanks!

tch*_*ist 5

I've been using something like this for that purpose lately:

#!/usr/bin/env perl

use 5.010;
use utf8;
use strict;
use autodie;
use warnings qw< FATAL all >;
use open qw< :std IO :utf8 >;

my $citation_rx = qr{
    \( (?:
        \s*

        # optional author list
        (?:
            # has to start capitalized
            \p{Uppercase_Letter}

            # then have a lower case letter, or maybe an apostrophe
            (?=  [\p{Lowercase_Letter}\p{Quotation_Mark}] )

            # before a run of letters and admissible punctuation
            [\p{Alphabetic}\p{Dash_Punctuation}\p{Quotation_Mark}\s,.] +

        ) ?  # hook if and only if you want the authors to be optional!!

        # a reasonable year
        \b (18|19|20) \d\d

        # citation series suffix, up to a six-parter
        [a-f] ?         \b

        # trailing semicolon to separate multiple citations
        ; ?
        \s*
    ) +
    \)
}x;

while (<DATA>) {
    while (/$citation_rx/gp) {
        say ${^MATCH};
    }
}

__END__
... and the reported results in (Nivré et al., 2007) were not representative ...
... two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that ...
... for attaching and labelling dependencies (Chen et al., 2007; Dreǳe et al., 2007).

When run, it produces:

(Nivré et al., 2007)
(Sagae and Tsujii 2007)
(2007)
(Chen et al., 2007; Dreǳe et al., 2007)
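
For readers who are not using Perl, here is a rough Python approximation of the same idea. It is a sketch under stated assumptions, not a faithful port: Python's built-in `re` module has no `\p{...}` property classes, so `[^\W\d_]` stands in for "any letter", only the ASCII apostrophe and U+2019 are treated as quotation marks, and the requirement that an author list start with an uppercase letter is not enforced.

```python
import re

# Approximation of the Perl pattern above using only the standard library.
citation_rx = re.compile(r"""
    \( (?:
        \s*
        (?:                                   # optional author list
            [^\W\d_]                          # starts with a letter
            (?= [^\W\d_] | ['\u2019] )        # then a letter or an apostrophe
            (?: [^\W\d_] | [-'\u2019\s,.] )+  # letters and admissible punctuation
        )?
        \b (?:18|19|20) \d\d                  # a plausible year
        [a-f]? \b                             # optional series suffix, e.g. 2007a
        ;? \s*                                # separator between citations
    )+
    \)
""", re.VERBOSE)

# Same sample lines as the Perl script; \u00e9 is e-acute, \u01f3 is the dz digraph.
sample = "\n".join([
    "... and the reported results in (Nivr\u00e9 et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labelling dependencies (Chen et al., 2007; Dre\u01f3e et al., 2007).",
])

found = [m.group(0) for m in citation_rx.finditer(sample)]
for citation in found:
    print(citation)
```

Because each repetition of the inner group must contain a year, a parenthesized remark without a year is skipped, while several citations separated by semicolons are returned as one match.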


orl*_*ade 5

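Building on Tex's answer, I wrote a very simple Python script called Overcite to do this for a friend (end of semester, lazy referencing, you know how it is). It's open source and MIT licensed on Bitbucket.

It covers a few more cases than Tex's, which may be helpful (see the test file), including ampersands and references with page numbers. The whole script is basically: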

import re

# Building blocks for an author-year citation pattern
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

text = "Nivre (2007) showed that ..."  # sample input taken from the question
matches = re.findall(regex, text)
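
To illustrate how these pieces behave, here is a small self-contained run. The pattern is rebuilt from the snippet above, with the dots in `et al.` and `p.` escaped so that they match a literal period. The sample text is an assumption made for the demonstration: "Smith & Jones" is a made-up citation that only serves to exercise the ampersand and page-number cases.

```python
import re

# Pattern pieces as in the answer; dots are escaped to match a literal period.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

# Hypothetical sample text: the last citation is invented for illustration.
text = (
    "Nivre (2007) showed that the results in (Chen et al., 2007; "
    "Dredze et al., 2007) differ from Smith & Jones, 2010, p. 12."
)

matches = re.findall(regex, text)
for m in matches:
    print(m)

# A citation with no comma before the year is not picked up by this pattern.
unmatched = re.findall(regex, "(Sagae and Tsujii 2007)")
```

Two things stand out. Unlike the parenthesis-based patterns in this thread, this one returns each citation inside a multi-citation group separately and without the surrounding parentheses. And as written, it expects either a comma before the year or a parenthesized year, so example 2 from the question is not matched.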


Ign*_*ams 2

/\(\D*\d\d\d\d(?:;\D*\d\d\d\d)*\)/
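
A quick way to see the difference is to run both patterns over the English example sentences used in this thread. The sketch below uses Python's `re` module; the helper `first_match` is a name introduced here for illustration only. The original pattern fails on example 4 because `\D*` cannot cross the first year and the next character after it is `;` rather than `)`. The extended pattern allows any number of further `;` plus non-digits plus year groups before the closing parenthesis.

```python
import re

# The four example sentences from the question.
examples = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).",
]

original = re.compile(r"\(\D*\d\d\d\d\)")
extended = re.compile(r"\(\D*\d\d\d\d(?:;\D*\d\d\d\d)*\)")


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


original_hits = [first_match(original, s) for s in examples]
extended_hits = [first_match(extended, s) for s in examples]

for before, after in zip(original_hits, extended_hits):
    print(before, "|", after)
```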