Extracting records from text with Invisible XML

tat*_*tat 2 xml grammar text-parsing invisible-xml

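I have the OCR text of a periodicals bibliography that contains structured entries. I want to use the Invisible XML standard to extract and parse the entries.

Sample input: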

    1  2  Hype.  1990?- 1993.  Frequency:  Bimonthly.  River  Edge,

    NJ.  Published  by  Word  Up!  Video,  Inc.  Last  issue  66  pages.
    Height  28  cm.  Line  drawings;  Photographs  (some  in  color);
    Commercial  advertising;  Table  of  contents.  Previous  editor(s):
    Marica  A.  Cole.  ISSN  1056-4632.  LC  card  no.  sn91-1965.
    OCLC  no.  23715422.  Subject  focus  and/or  Features:  Hip  hop
    culture,  Music,  Rap  music.

    WHi  v.l,  n.6;  v.2,  n.5  Pam  01-5450  Aug,  1992;  Aug,  1993

    6561  The  Zora  Neale  Hurston  Forum.  1986-.  Frequency:
    Semiannual.  Ruth  T.  Sheffey,  Editor,  The  Zora  Neale  Hurston
    Forum,  P.O.  Box  550,  Morgan  State  University,  Baltimore,

    MD  21239.  $15  for  individuals  and  institutions.  Telephone:
    (301)  444-3435.  Published  by  Zora  Neale  Hurston  Society.

    Last  issue  69  pages.  Last  volume  142  pages.  Height  23  cm.
    Photographs;  Table  of  contents.  ISSN  1051-6867.  LC  card  no.
    90-649339.  OCLC  no.  15610848.  Subject  focus  and/or  Features:  Hurston,  Zora  Neale,  Literature,  Literary  criticism.
    MdBMC  v.l,  n.l-v.8,  n.2  Special  Collections  Fall,  1986-Spring,

    1994

    TxDw  v.l,  n.l;  v.2,  n.l  Woman’s  Collection  Fall,  1986;  Fall,  1987
    WU  v.l,  n.l-  AP/Z893/N345  Fall,  1986
    6562  Zwanna:  Son  of  Zulu.  1993-.  Frequency:  Unknown.
    Nabile  P.  Hage,  Editor,  Zwanna,  P.O.  Box  38261,  Atlanta,  GA
    30334.  Published  by  Dark  Zulu  Lies  Comics,  Inc.  Last  issue  32
    pages.  Height  28  cm.  Line  drawings  (some  in  color);  Commercial  advertising.  OCLC  no.  28389961.  Subject  focus  and/or
    Features:  Comic  books,  strips,  etc.

    WHi  v.l,  n.l  Pam  00-305  Apr/May,  1993

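Each entry begins with an entry number, followed by one or more whitespace characters, and then descriptive text that is broken across lines by newlines.

iXML grammar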

    data: entry+ .
    entry: -#a, entrynum, " "+, content .
    entrynum: -digit+ .
    digit: ["1"-"9"] .
    content: ~[]+; -#a+ .

This first attempt at an iXML grammar produces an ambiguous parse (using the CoffeePot iXML processor).

Output

    <data xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
      <entry>
        <entrynum>1</entrynum>
        <content>2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge, NJ. Published by Word Up! Video,
          Inc. Last issue 66 pages. Height 28 cm. Line drawings; Photographs (some in color); Commercial
          advertising; Table of contents. Previous editor(s): Marica A. Cole. ISSN 1056-4632. LC card
          no. sn91-1965. OCLC no. 23715422. Subject focus and/or Features: Hip hop culture, Music, Rap
          music. WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993 6561 The Zora Neale Hurston
          Forum. 1986-. Frequency: Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston Forum,
          P.O. Box 550, Morgan State University, Baltimore, MD 21239. $15 for individuals and
          institutions. Telephone: (301) 444-3435. Published by Zora Neale Hurston Society. Last issue
          69 pages. Last volume 142 pages. Height 23 cm. Photographs; Table of contents. ISSN 1051-6867.
          LC card no. 90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale,
          Literature, Literary criticism. MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
          1994 TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987 WU v.l, n.l-
          AP/Z893/N345 Fall, 1986</content>
      </entry>
      <entry>
        <entrynum>6562</entrynum>
        <content>Zwanna: Son of Zulu. 1993-. Frequency: Unknown. Nabile P. Hage, Editor, Zwanna, P.O.
          Box 38261, Atlanta, GA 30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32 pages.
          Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961.
          Subject focus and/or Features: Comic books, strips, etc. WHi v.l, n.l Pam 00-305 Apr/May, 1993
        </content>
      </entry>
    </data>

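First, I would like to understand how to chunk the entries, and then start parsing the content: for example, each entry number is followed by one or more spaces, then an alphanumeric title, followed by a period, and so on.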

Nor*_*orm 5

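"Maybe." One of the great strengths of iXML is that it can handle ambiguity. That makes grammars very, very easy to write. It works really well if the ambiguous choices are equally valid, or if you don't care which of them is selected.

For bibliographic data, I suspect that some choices are more valid than others and that you do care which one is selected, which makes things harder. I'd also bet that there is a lot of ambiguity because the OCR is imperfect.

I don't think a single iXML grammar will parse the input and produce exactly the output you want, but it could form a useful part of some broader strategy. I'd start by trying to split the bibliography into individual entries, so that the grammar only has to handle a single entry. Then I might see whether I could work out different classes of entries (books, magazines, journals, etc.) and write a different grammar for each class.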
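To make the splitting step concrete, here is a minimal Python sketch of one way it might be done outside iXML. It is only a heuristic based on the sample in the question: it assumes an entry starts on a line that begins with digits, followed by spaces and more text on the same line. The names `ENTRY_START` and `split_entries` are invented for this example and are not part of any iXML tool.

```python
import re

# Assumption (heuristic, taken from the sample input only): an entry starts on
# a line that begins with digits, then one or more spaces, then more text.
ENTRY_START = re.compile(r"^(\d+) +(?=\S)", re.MULTILINE)


def split_entries(text):
    """Return a list of (entry_number, body) pairs.

    Any text before the first detected entry start is ignored.
    Runs of whitespace (including newlines) in the body are collapsed.
    """
    starts = list(ENTRY_START.finditer(text))
    entries = []
    for i, match in enumerate(starts):
        # The body runs up to the start of the next entry, or to the end.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        body = " ".join(text[match.end():end].split())
        entries.append((match.group(1), body))
    return entries


# Shortened excerpt in the shape of the sample input (hypothetical abridgement).
sample = (
    "6561  The  Zora  Neale  Hurston  Forum.  1986-.  Frequency: \n"
    "Semiannual.  Special  Collections  Fall,  1986-Spring, \n"
    "\n"
    "1994 \n"
    "\n"
    "6562  Zwanna:  Son  of  Zulu.  1993-.  Frequency:  Unknown. \n"
    "30334.  Published  by  Dark  Zulu  Lies  Comics,  Inc. \n"
)

for number, body in split_entries(sample):
    print(number, "|", body)
```

Each body could then be handed to a grammar written for a single entry. The heuristic skips lines such as `1994 ` (nothing follows the number) and `30334.` (the digits are followed by a period), but it would misfire on a continuation line that happens to start with a number followed by spaces and text, so the results need checking against the real data.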

Good luck!