从 pdf 中提取参考文献 - Python

Nul*_*ter 1 python text-extraction

在我的 python 项目中,我需要REFERENCES从 pdf 研究论文中提取。我PyPDF2用来阅读pdf并像这样从中提取文本。

import PyPDF2

pdfFileObj = open('fileName.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageCount = pdfReader.numPages
count = 0
text = ''

while count < pageCount:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
Run Code Online (Sandbox Code Playgroud)

现在这text可以是任何格式,我无法从中识别任何标题。我不能使用,find('References')因为纸也可以在其他任何地方包含这个词。有些论文在标题之前包含 Number 6 REFERENCES,所以我可以为此添加正则表达式

但在前往之前,我被困在没有任何数值的论文中。

这是我目前正在研究的 pdf非投影依赖解析器

这就是如何,我得到了它的参考

References Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358. Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague. Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. Copenhagen. David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525. Hans Jiirgen Heringer. 1993. Dependency syntax - basic ideas and the classical model. In Joachim Jacobs, Arnim von Stechow, Wolfgang Sternefeld, and Thee Venneman, editors, Syntax - An In- ternational Handbook of Contemporary Research, volume 1, chapter 12, pages 298-316. Walter de Gruyter, Berlin - New York. Richard Hudson. 1991. English Word Grammar. Basil Blackwell, Cambridge, MA. Arvi Hurskainen. 1996. Disambiguation of morpho- logical analysis in Bantu languages. In The 16th International Conference on Computational Lin- guistics, pages 568-573. Copenhagen. Time J~rvinen. 1994. Annotating 200 million words: the Bank of English project. In The 15th International Conference on Computational Lin- guistics Proceedings, pages 565-568. Kyoto. Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors. 1995. Constraint Gram- mar: a language-independent system for parsing unrestricted text, volume 4 of Natural Language Processing. Mouton de Gruyter, Berlin and N.Y. Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Hans Karl- gren, editor, Papers presented to the 13th Interna- tional Conference on Computational Linguistics, volume 3, pages 168-173, Helsinki, Finland. Michael McCord. 1990. Slot grammar: A system for simpler construction of practical natural language grammars. In lq, Studer, editor, Natural Language and Logic: International Scientific Symposium, Lecture Notes in Computer Science, pages 118- 145. Springer, Berlin. Igor A. Mel'~uk. 1987. Dependency Syntax: Theory and Practice. State University of New York Press, Albany. Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996. Inducing constraint gram- mars. In Laurent Miclet and Colin de la Higuera, editors, Grammatical Inference: Learning Syntax from Sentences, volume 1147 of Lecture Notes in Artificial Intelligence, pages 146-155, Springer. Daniel Sleator and Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University. Pasi Tapanainen and Time J/irvinen. 1994. Syn- tactic analysis of natural language using linguis- tic rules and corpus-based patterns. In The 15th International Conference on Computational Lin- guistics Proceedings, pages 629-634. Kyoto. Pasi Tapanainen. 1996. The Constraint Grammar Parser CG-2. Number 27 in Publications of the Department of General Linguistics, University of Helsinki. Lucien TesniSre. 1959. l~ldments de syntaxe stvuc- turale, l~ditions Klincksieck, Paris. Atro Voutilainen. 1995. Morphological disambigua- tion. In Karlsson et al., chapter 6, pages 165-284. 71
Run Code Online (Sandbox Code Playgroud)

如何将这些参考字符串解析为 pdf 中提到的多个参考?任何形式的帮助将不胜感激。

fur*_*ras 5

PDF非常复杂,我不是专家,但我使用了extractText() 的源代码来查看它是如何工作的,并且使用print('>>>', operator, operands)我可以看到它在 PDF 中找到的值。

在本文档中,它用于"Tm"将位置移动到新行,因此更改了原始代码extractText(),我曾经"Tm"添加\n并在行中获取文本

Arto Anttila. 1995. How to recognise subjects in 
English. In Karlsson et al., chapt. 9, pp. 315-358. 
Dekang Lin. 1996. Evaluation of Principar with the 
Susanne corpus. In John Carroll, editor, Work- 
shop on Robust Parsing, pages 54-69, Prague. 
Jason M. Eisner. 1996. Three new probabilistic 
models for dependency parsing: An exploration. 
In The 16th International Conference on Compu- 
tational Linguistics, pages 340-345. Copenhagen. 
David G. Hays. 1964. Dependency theory: A 
formalism and some observations. Language, 
40(4):511-525. 
Run Code Online (Sandbox Code Playgroud)

或与---线之间

---
Arto Anttila. 1995. How to recognise subjects in 
---
English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the 
---
Susanne corpus. In John Carroll, editor, Work- 
---
shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic 
---
models for dependency parsing: An exploration. 
---
In The 16th International Conference on Compu- 
---
tational Linguistics, pages 340-345. Copenhagen. 
---
David G. Hays. 1964. Dependency theory: A 
---
formalism and some observations. Language, 
---
40(4):511-525. 
Run Code Online (Sandbox Code Playgroud)

但它仍然不是那么有用,但现在我用来得到这个结果的代码

import PyPDF2
from PyPDF2.pdf import *  # to import function used in origimal `extractText`

# --- functions ---

def myExtractText(self):  
    # code from original `extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645
    
    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        # new code to add `\n` when text moves to new line
        elif operator == b_("Tm"):
            text += '\n'
            
    return text
    
# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]
    
# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')
Run Code Online (Sandbox Code Playgroud)

挖掘后,它似乎Tm也有值,并且有一个新的位置x, y,我用来计算文本行之间的距离,\n当距离大于某个值时,我添加。我测试了不同的值,并从值中17得到了预期的结果

---
Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. Copenhagen. 
---
David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525. 
---
Run Code Online (Sandbox Code Playgroud)

这里代码

import PyPDF2
from PyPDF2.pdf import *  # to import function used in origimal `extractText`

# --- functions ---

def myExtractText2(self):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    prev_x = 0
    prev_y = 0
    
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
            
        elif operator == b_("Tm"):
            x = operands[-2]
            y = operands[-1]

            diff_x = prev_x - x
            diff_y = prev_y - y

            #print('>>>', diff_x, diff_y - y)
            #text += f'| {diff_x}, {diff_y - y} |'
            
            if diff_y > 17 or diff_y < 0:  # (bigger margin) or (move to top in next column)
                text += '\n'
                #text += '\n' # to add empty line between elements
                
            prev_x = x
            prev_y = y
            
    return text
        
# --- main ---
        
pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]
    
# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')
Run Code Online (Sandbox Code Playgroud)

它适用于这个 PDF,但其他文件可能有不同的结构或不同的距离references,它们可能需要其他更改。


编辑:

更通用的版本 - 它有第二个参数

如果你在没有第二个参数的情况下运行

 text += myExtractText(page)
Run Code Online (Sandbox Code Playgroud)

然后它就像原始的一样工作,extractText()并且您将所有内容都放在一个字符串中。

如果第二个参数是 True

 text += myExtractText(page, True)
Run Code Online (Sandbox Code Playgroud)

然后它在每个之后添加新行Tm- 就像我的第一个版本一样。

如果第二个参数是整数 - 即。 17

 text += myExtractText(page, 17)
Run Code Online (Sandbox Code Playgroud)

然后当距离更大时它会添加新行17- 就像在我的第二个版本中一样。

import PyPDF2
from PyPDF2.pdf import *  # to import function used in origimal `extractText`

# --- functions ---

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    prev_x = 0
    prev_y = 0
    
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
            
        if operator == b_("Tm"):
        
            if distance is True: 
                text += '\n'
                
            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y

                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'
                
                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n' # to add empty line between elements
                    
                prev_x = x
                prev_y = y
            
    return text
        
# --- main ---
        
pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    
    #text += myExtractText(page)        # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger then `17`)   

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]
    
# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')
Run Code Online (Sandbox Code Playgroud)

顺便说一句:它不仅对References文本的其余部分有用,而且对文本的其余部分也很有用- 似乎它拆分了段落。

PDF开头的结果

---
A non-projective dependency parser 
---
Pasi Tapanainen and Timo J~irvinen University of Helsinki, Department of General Linguistics Research Unit for Multilingual Language Technology P.O. Box 4, FIN-00014 University of Helsinki, Finland {Pas i. Tapanainen, Timo. Jarvinen}@l ing. Hel s inki. f i 
---
Abstract 
---
We describe a practical parser for unre- stricted dependencies. The parser creates links between words and names the links according to their syntactic functions. We first describe the older Constraint Gram- mar parser where many of the ideas come from. Then we proceed to describe the cen- tral ideas of our new parser. Finally, the parser is evaluated. 
---
1 Introduction 
---
We are concerned with surface-syntactic parsing of running text. Our main goal is to describe syntac- tic analyses of sentences using dependency links that show the he~t-modifier relations between words. In addition, these links have labels that refer to the syntactic function of the modifying word. A simpli- fied example is in Figure 1, where the link between I and see denotes that I is the modifier of see and its syntactic function is that of subject. Similarly, a modifies bird, and it is a determiner. 
---
see bi i ~ d'~b~ bird 
---
Figure 1: Dependencies for sentence I see a bird. 
---
First, in this paper, we explain some central con- cepts of the Constraint Grammar framework from which many of the ideas are derived. Then, we give some linguistic background to the notations we are using, with a brief comparison to other current de- pendency formalisms and systems. New formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. Fi- nally, the real parsing system, with a grammar of some 2 500 rules, is evaluated. 
---
64 
---
The parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing English Constraint Grammar parser (Karls- son et al., 1995). The parsers can be tested via WWW t . 
---
2 Background 
---
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson (1990). A de- tMled description of the English Constraint Gram- mar (ENGCG) is in Karlsson et al. (1995). The basic rule types of the Constraint Grammar (Tapanainen, 1996) 2 are REMOVE and SELECT for discarding and se- lecting an alternative reading of a word. Rules also have contextual tests that describe the condition ac- cording to which they may be applied. For example, the rule 
---
Run Code Online (Sandbox Code Playgroud)