Parsing the BibTeX citation format with Python

gmo*_*evt 1 python regex bibtex

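What is the best way to parse this result in Python? I tried regular expressions but could not get them to work. I am looking for a dictionary with title, author, and so on as keys.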

@article{perry2000epidemiological,
  title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
  author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
  journal={Journal of public health},
  volume={22},
  number={3},
  pages={427--434},
  year={2000},
  publisher={Oxford University Press}
}

Bra*_*mon 5

This looks like a citation format. You could parse it like this:

>>> import re

>>> kv = re.compile(r'\b(?P<key>\w+)={(?P<value>[^}]+)}')

>>> citation = """
... @article{perry2000epidemiological,
...   title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
...   author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
...   journal={Journal of public health},
...   volume={22},
...   number={3},
...   pages={427--434},
...   year={2000},
...   publisher={Oxford University Press}
... }
... """

>>> dict(kv.findall(citation))
{'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others',
 'journal': 'Journal of public health',
 'number': '3',
 'pages': '427--434',
 'publisher': 'Oxford University Press',
 'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study',
 'volume': '22',
 'year': '2000'}

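The regex uses two named capture groups (mainly to show visually which part is which).

  • "key" is one or more Unicode word characters, with a word boundary on the left and a literal equals sign on the right;
  • "value" is whatever sits between the two curly braces. [^}] is convenient as long as you do not expect "nested" curly braces. In other words, the value is simply one or more characters other than a closing brace, enclosed in braces.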
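
If nested braces or spaces around the equals sign do occur (for example `title = {A study of {B}ib{T}e{X} parsing}`), the `[^}]+` pattern above stops at the first closing brace. The following is a minimal sketch of one way around that, not part of the original answer: it finds each `key = {` with a regex and then counts brace depth by hand. The names `FIELD_START` and `parse_fields` and the sample entry are made up for illustration, and the sketch assumes brace-delimited values only (no quoted values, no bare values such as `month = jan`, no escaped braces).

```python
import re

# A field name, optional whitespace, '=', optional whitespace, opening brace.
FIELD_START = re.compile(r'\b(?P<key>\w+)\s*=\s*\{')


def parse_fields(entry):
    """Return {field: value} for brace-delimited fields, allowing nested braces."""
    fields = {}
    pos = 0
    while True:
        match = FIELD_START.search(entry, pos)
        if match is None:
            break
        depth = 1
        i = match.end()
        # Walk forward until the brace opened by this field is closed again.
        while i < len(entry) and depth > 0:
            if entry[i] == '{':
                depth += 1
            elif entry[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop rather than guess
        value = entry[match.end():i - 1]
        # Collapse line breaks and runs of whitespace inside the value.
        fields[match.group('key').lower()] = ' '.join(value.split())
        pos = i
    return fields


# Hypothetical sample entry, used only to exercise the function.
sample = """
@article{doe2020example,
  title = {A study of {B}ib{T}e{X} parsing},
  author={Doe, Jane and Roe, Richard},
  year={2020}
}
"""

fields = parse_fields(sample)
print(fields['title'])                  # A study of {B}ib{T}e{X} parsing
print(fields['author'].split(' and '))  # ['Doe, Jane', 'Roe, Richard']
```

Splitting the author string on `' and '` gives one list item per author, which is usually what is wanted when authors should be individually addressable.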


Pat*_*ner 5

You are probably looking for a BibTeX parser: https://bibtexparser.readthedocs.io/en/master/

Source: https://bibtexparser.readthedocs.io/en/master/tutorial.html#step-0-vocabulary

Create a BibTeX file:

bibtex = """@ARTICLE{Cesar2013,
  author = {Jean César},
  title = {An amazing title},
  year = {2013},
  month = jan,
  volume = {12},
  pages = {12--23},
  journal = {Nice Journal},
  abstract = {This is an abstract. This line should be long enough to test
     multilines...},
  comments = {A comment},
  keywords = {keyword1, keyword2}
}
"""

with open('bibtex.bib', 'w') as bibfile:
    bibfile.write(bibtex)

Parse it:

import bibtexparser

with open('bibtex.bib') as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)

print(bib_database.entries)

Output:

[{'journal': 'Nice Journal',
  'comments': 'A comment',
  'pages': '12--23',
  'month': 'jan',
  'abstract': 'This is an abstract. This line should be long enough to test\nmultilines...',
  'title': 'An amazing title',
  'year': '2013',
  'volume': '12',
  'ID': 'Cesar2013',
  'author': 'Jean César',
  'keyword': 'keyword1, keyword2',
  'ENTRYTYPE': 'article'}]
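
Since the result is a plain list of dictionaries, getting to "author as a key" is ordinary Python from here on. The sketch below does not call bibtexparser at all: the `entries` list is hand-written to mirror the shape of the output above, and `split_authors` and `by_id` are hypothetical names chosen for illustration. It assumes authors are separated by the word "and", as in the question's sample.

```python
# Hand-written sample shaped like the output above (a list of dicts).
entries = [{
    'ID': 'Cesar2013',
    'ENTRYTYPE': 'article',
    'author': 'Jean César',
    'title': 'An amazing title',
    'year': '2013',
    'keyword': 'keyword1, keyword2',
}]


def split_authors(author_field):
    """Split a BibTeX author string on the ' and ' separator."""
    return [name.strip() for name in author_field.split(' and ') if name.strip()]


# Index the entries by citation key for direct lookup.
by_id = {entry['ID']: entry for entry in entries}

record = by_id['Cesar2013']
authors = split_authors(record.get('author', ''))
keywords = [k.strip() for k in record.get('keyword', '').split(',') if k.strip()]

print(authors)          # ['Jean César']
print(keywords)         # ['keyword1', 'keyword2']
print(record['title'])  # An amazing title
```

Using `record.get(..., '')` keeps the lookups from raising `KeyError` when an entry lacks an optional field.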