Extract text from a PDF file using JavaScript

Question

Coc*_*lle 24 javascript pdf text-extraction pdf.js

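I want to extract text from a PDF file on the client side using only JavaScript, without a server. I have already found JavaScript code at the following link: Extract text from pdf in Javascript

and then at

http://hublog.hubmed.org/archives/001948.html

and at:

https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext

1) I would like to know which of the files from those links are needed to do the extraction. 2) I don't know exactly how to adapt this code for use in an application rather than on the web.

Any answer is welcome. Thanks.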

Answer 1

All*_*non 16

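Here is a good example of how to use pdf.js to extract text: http://git.macropus.org/2011/11/pdftotext/example/

You will of course have to strip out a lot of the code for your purposes, but it should do the job.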

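Note for future Googlers: since the links above were posted, the official pdf.js project appears to have changed hands several times, but it currently lives on Mozilla's GitHub page - https://github.com/mozilla/pdf.js (6 upvotes)
@Allanon Do you know of any way to extract the text while preserving its semantics? The example just grabs all the text, without regard for line breaks, paragraphs, headings and so on. (2 upvotes)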
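
On the line-break question, here is a minimal sketch; it is not part of the linked example. It assumes that each item in textContent.items carries a transform array whose last entry (transform[5]) is the vertical position of the text, and that items arrive roughly in reading order, which a PDF does not guarantee. groupItemsIntoLines and its tolerance parameter are hypothetical names. The idea is simply to start a new line whenever the vertical position jumps by more than the tolerance.

```javascript
/**
 * Hypothetical helper: rebuilds line breaks from pdf.js-style text items.
 * Assumes item.transform[5] is the vertical position of the item.
 *
 * @param {Array} textItems Items shaped like { str, transform }
 * @param {Number} tolerance Largest vertical difference still treated as the same line
 * @returns {String} The text with "\n" between detected lines
 */
function groupItemsIntoLines(textItems, tolerance) {
    var lines = [];
    var currentLine = [];
    var lastY = null;

    for (var i = 0; i < textItems.length; i++) {
        var item = textItems[i];
        var y = item.transform[5];

        // A vertical jump larger than the tolerance starts a new line
        if (lastY !== null && Math.abs(y - lastY) > tolerance) {
            lines.push(currentLine.join(" "));
            currentLine = [];
        }

        currentLine.push(item.str);
        lastY = y;
    }

    // Flush the last line, if any
    if (currentLine.length > 0) {
        lines.push(currentLine.join(" "));
    }

    return lines.join("\n");
}

// Hand-made items for illustration (not real pdf.js output)
var sampleItems = [
    { str: "Hello", transform: [1, 0, 0, 1, 10, 700] },
    { str: "world", transform: [1, 0, 0, 1, 50, 700] },
    { str: "Second line", transform: [1, 0, 0, 1, 10, 680] }
];

console.log(groupItemsIntoLines(sampleItems, 2));
```

With the sample items this prints "Hello world" and "Second line" on separate lines. Paragraphs and headings would still need extra heuristics on top (font size, spacing between lines), so treat this as a starting point only.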

Answer 2

Car*_*ado 10

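I took a simpler approach that uses pdf.js (latest version) and does not need to post messages between iframes.

The following example extracts all the text from the first page of the PDF only: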

/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {Integer} pageNum Specifies the number of the page (starting at 1)
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 * @returns {Promise} Resolves with the text of the page
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return the promise chain directly, so that a failure in getPage or
    // getTextContent rejects the returned promise instead of leaving it pending
    return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
        // The main trick to obtain the text of the PDF page: use the getTextContent method
        return pdfPage.getTextContent();
    }).then(function (textContent) {
        var textItems = textContent.items;
        var finalString = "";

        // Concatenate the string of each item to the final string
        for (var i = 0; i < textItems.length; i++) {
            var item = textItems[i];

            finalString += item.str + " ";
        }

        // Resolve with the text retrieved from the page
        return finalString;
    });
}

/**
 * Extract the text from the PDF
 */

var PDF_URL = '/path/to/example.pdf';
PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {

    // Page count (newer pdf.js releases expose it as PDFDocumentInstance.numPages)
    var totalPages = PDFDocumentInstance.pdfInfo.numPages;
    var pageNumber = 1;

    // Extract the text
    getPageText(pageNumber, PDFDocumentInstance).then(function (textPage) {
        // Show the text of the page in the console
        console.log(textPage);
    });

}, function (reason) {
    // PDF loading error
    console.error(reason);
});

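Read the article about this solution here. As @xarxziux mentioned, the library has changed since the first solution was posted (that one no longer works with the latest version of pdf.js). This one should work in most cases.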

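@RishabhGarg Keep in mind that a PDF has no notion of text formatting, or even of text order. You are lucky to get the text at all. The exported format may not even be consistent. That is why the original demo replaces every run of whitespace with a single space. That at least keeps the format somewhat consistent. (2 upvotes)
"PDFDocumentInstance.pdfInfo.numPages" should now be "PDFDocumentInstance.numPages" (2 upvotes)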
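
Following up on the last two comments, here is a rough sketch of extracting every page rather than just the first. It reads the page count from numPages, as suggested above, and collapses runs of whitespace into a single space, as the original demo does. extractAllText is a hypothetical name. The function only relies on numPages, getPage() and getTextContent(), so the fake document at the bottom (an assumption made purely so the sketch runs without pdf.js) stands in for a real one.

```javascript
/**
 * Hypothetical helper: resolves with an array holding the text of each page.
 * Expects a pdf.js-like document exposing numPages and getPage(pageNum).
 */
function extractAllText(pdfDocument) {
    var pagePromises = [];

    // pdf.js page numbers start at 1
    for (var pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
        pagePromises.push(
            pdfDocument.getPage(pageNum).then(function (pdfPage) {
                return pdfPage.getTextContent();
            }).then(function (textContent) {
                var text = textContent.items.map(function (item) {
                    return item.str;
                }).join(" ");

                // Collapse any run of whitespace into a single space
                return text.replace(/\s+/g, " ").trim();
            })
        );
    }

    // Promise.all keeps the results in page order
    return Promise.all(pagePromises);
}

// Stand-in for a real document, so the sketch runs without pdf.js
var fakeDocument = {
    numPages: 2,
    getPage: function (pageNum) {
        var pages = [["Hello", "  world"], ["Second", "page\n"]];
        var items = pages[pageNum - 1].map(function (s) {
            return { str: s };
        });

        return Promise.resolve({
            getTextContent: function () {
                return Promise.resolve({ items: items });
            }
        });
    }
};

extractAllText(fakeDocument).then(function (pages) {
    console.log(pages);
});
```

With a real file, recent pdf.js builds return a loading task from getDocument, so the call would look something like pdfjsLib.getDocument(PDF_URL).promise.then(extractAllText); older builds, like the one used in the answer, return a promise directly from PDFJS.getDocument.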

Asked:	12 years, 4 months ago
Viewed:	44846 times
Active:	7 years, 5 months ago