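I'm trying to create a regular expression to capture in-text citations.
Here are a few example sentences containing in-text citations:
... and the reported results in (Nivre et al., 2007) were not representative ...
... two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that ...
... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).
Currently, my regular expression is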
\(\D*\d\d\d\d\)
which matches examples 1-3 but not example 4. How can I modify it to capture example 4 as well?
Thanks!
I've been using something like this for that purpose lately:
#!/usr/bin/env perl

use 5.010;
use utf8;
use strict;
use autodie;
use warnings qw< FATAL all >;
use open qw< :std IO :utf8 >;

my $citation_rx = qr{
    \( (?:
        \s*

        # optional author list
        (?:
            # has to start capitalized
            \p{Uppercase_Letter}

            # then have a lower case letter, or maybe an apostrophe
            (?= [\p{Lowercase_Letter}\p{Quotation_Mark}] )

            # before a run of letters and admissible punctuation
            [\p{Alphabetic}\p{Dash_Punctuation}\p{Quotation_Mark}\s,.] +

        ) ?  # hook if and only if you want the authors to be optional!!

        # a reasonable year
        \b (18|19|20) \d\d

        # citation series suffix, up to a six-parter
        [a-f] ? \b

        # trailing semicolon to separate multiple citations
        ; ?
        \s*
    ) +
    \)
}x;

while (<DATA>) {
    while (/$citation_rx/gp) {
        say ${^MATCH};
    }
}

__END__
... and the reported results in (Nivré et al., 2007) were not representative ...
... two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that ...
... for attaching and labelling dependencies (Chen et al., 2007; Dreǳe et al., 2007).

When run, it produces:
(Nivré et al., 2007)
(Sagae and Tsujii 2007)
(2007)
(Chen et al., 2007; Dreǳe et al., 2007)
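For anyone who wants the same idea in Python rather than Perl, here is a rough sketch using only the standard `re` module. It is an approximation, not a port: `re` has no `\p{...}` property classes, so this assumes `[^\W\d_]` as a stand-in for "any letter", only accepts an ASCII capital as the first letter, uses a small hand-picked set of dash and quote characters, and drops the lookahead from the Perl version. The names `CITATION_RX` and `find_citations` are made up for illustration.

```python
import re

# Assumptions (not from the Perl original): [^\W\d_] approximates
# \p{Alphabetic}, the first letter must be an ASCII capital, and the
# dash/quote characters are a hand-picked subset.
CITATION_RX = re.compile(
    r"""
    \(
    (?:
        \s*
        (?:                              # optional author list
            [A-Z]                        # starts with an ASCII capital
            (?: [^\W\d_] | [-'’\s,.] )+  # letters and admissible punctuation
        )?
        \b (?:18|19|20) \d\d             # a plausible year
        [a-f]? \b                        # series suffix such as 2007a
        ;?                               # separator between citations
        \s*
    )+
    \)
    """,
    re.VERBOSE,
)


def find_citations(text):
    """Return every parenthesised citation group found in text."""
    return [m.group(0) for m in CITATION_RX.finditer(text)]


SAMPLES = [
    "... and the reported results in (Nivré et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labelling dependencies (Chen et al., 2007; Dreǳe et al., 2007).",
]

for line in SAMPLES:
    for citation in find_citations(line):
        print(citation)
```

On the four sample sentences this prints the same four parenthesised groups as the Perl script. If full Unicode property support matters, a regex engine that understands `\p{...}` would be a closer match than this sketch.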
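Building on Tex's answer, I wrote a very simple Python script called Overcite to do this for a friend (end of term, lazy referencing, you know how it is). It's open source and MIT-licensed on Bitbucket.
It covers a few more cases than Tex's, which might help (see the test file), including ampersands and citations with page numbers. The whole script is basically: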
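import re

author = r"(?:[A-Z][A-Za-z'`-]+)"  # capitalised surname
etal = r"(?:et al.?)"
# further authors: ", Name", " and Name", " & Name" or " et al."
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p.? [0-9]+)?"  # Always optional
# the year follows a comma, or sits in its own parentheses
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"
matches = re.findall(regex, text)  # text is the string to search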
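To see what the pattern above actually returns, here is a small self-contained sketch that rebuilds it and runs it over the four sentences from the question. The wrapper name `extract` and the fifth sentence (an ampersand plus page number case) are invented for illustration and are not taken from the Overcite test file.

```python
import re

# Same building blocks as the snippet above, written as raw strings.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p.? [0-9]+)?"
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"


def extract(text):
    """Hypothetical helper: return all citations the pattern finds in text."""
    return re.findall(regex, text)


sentences = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).",
    "... as argued elsewhere (Smith & Jones, 2010, p. 4).",  # invented example
]

for sentence in sentences:
    print(extract(sentence))
```

A few things worth noticing when you run it: the matches do not include the surrounding parentheses, the two citations in sentence 4 come back as separate items, and sentence 2 yields nothing because the pattern expects the year either after a comma or inside its own parentheses. The author class is also ASCII-only, so accented names would need it widened.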