Extracting text from doc and docx

Question

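I'd like to know how to read the contents of a doc or docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.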

Answer 1

This is a .docx-only solution. For .doc or .pdf you will need something else, such as pdf2text.php for PDF.

function docx2text($filename) {
    return readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile) {
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // An unreadable or empty entry cannot be parsed
            if ($data === false || $data === '') {
                return "";
            }
            // Load XML from a string, skipping errors and warnings.
            // LIBXML_NOENT and LIBXML_XINCLUDE are left out: they enable entity
            // substitution and XInclude, which is unsafe for untrusted uploads.
            $xml = new DOMDocument();
            if (!$xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                return "";
            }
            // Return data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }

    // In case of failure return empty string
    return "";
}

echo docx2text("test.docx"); // Save this output to a file

This doesn't work for the .doc extension. The file has no word/document.xml and has _rels/.rels.xml instead. What should be done in this case? (3 upvotes)
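As a rough aid for the situation raised in this comment, the sketch below inspects the first bytes of a file instead of trusting its extension. Legacy .doc files are OLE2 compound files and .docx files are ZIP archives, and each container has a well-known signature. The function name `detectWordContainer` and its return values are hypothetical, chosen only for illustration. A ZIP signature alone does not prove the file is a .docx, so the caller would still need to check that word/document.xml exists.

```php
<?php
// Hypothetical helper: report the container type of a file by its leading bytes.
// Returns 'ole2' (legacy .doc and other OLE2 files), 'zip' (.docx and any other
// ZIP-based format), or 'unknown'.
function detectWordContainer(string $path): string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return 'unknown';
    }
    $header = fread($handle, 8);
    fclose($handle);
    if ($header === false) {
        return 'unknown';
    }

    // OLE2 compound file signature (8 bytes)
    if (strncmp($header, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole2';
    }
    // ZIP local file header signature (4 bytes)
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    return 'unknown';
}
```

A caller could route 'zip' files to the docx2text function above and 'ole2' files to a .doc extractor such as the ones in the other answers.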

Answer 2

M K*_*aid 13

Here I've added solutions for getting text from .doc and .docx Word files.

How to extract text from Word files (.doc, .docx) in PHP

For .doc

private function read_doc() {
    // Read the whole file as raw bytes
    $data = @file_get_contents($this->filename);
    if ($data === false) {
        return "";
    }
    // Split on carriage returns (0x0D)
    $lines = explode(chr(0x0D), $data);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks containing NUL bytes (binary data)
        if ($thisline === "" || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }
    // Keep only a conservative set of ASCII characters
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

For .docx

private function read_docx() {
    // The procedural zip_open()/zip_read() functions are deprecated as of
    // PHP 8.0, so this version uses ZipArchive instead.
    $zip = new ZipArchive();
    if ($zip->open($this->filename) !== true) {
        return false;
    }

    // Read the main document part; the result is false if the entry is missing
    $content = $zip->getFromName("word/document.xml");
    $zip->close();
    if ($content === false) {
        return '';
    }

    // Put a space between table cells and a line break after each paragraph
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    return strip_tags($content);
}

Answer 3

Luk*_*nga 7

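Parsing .docx, .odt, .doc and .rtf documents

I wrote a library that parses docx, odt and rtf documents, based on answers here and elsewhere.

The main improvement I made to .docx and .odt parsing is that the library processes the XML describing the document and tries to map it to HTML tags, i.e. em and strong tags. This means that if you use the library in a CMS, the text formatting is not lost.

You can get it here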
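To illustrate the general idea of mapping run formatting to HTML, here is a minimal sketch. It is not the library's code. It assumes the XML string has already been read from word/document.xml, and the function name `docxRunsToHtml` is hypothetical. It only looks for the `w:b` and `w:i` run properties and ignores cases such as `w:val="false"`, styles, tables and lists.

```php
<?php
// Hypothetical sketch: convert WordprocessingML paragraphs and runs to simple
// HTML, wrapping italic runs in <em> and bold runs in <strong>.
function docxRunsToHtml(string $xml): string
{
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $doc = new DOMDocument();
    if ($xml === '' || !@$doc->loadXML($xml, LIBXML_NONET)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', $ns);

    $html = '';
    foreach ($xpath->query('//w:p') as $paragraph) {
        $html .= '<p>';
        foreach ($xpath->query('.//w:r', $paragraph) as $run) {
            // Collect the text of this run and escape it for HTML
            $text = '';
            foreach ($xpath->query('./w:t', $run) as $t) {
                $text .= $t->textContent;
            }
            $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

            // Presence of w:i or w:b in the run properties is treated as "on"
            if ($xpath->query('./w:rPr/w:i', $run)->length > 0) {
                $text = '<em>' . $text . '</em>';
            }
            if ($xpath->query('./w:rPr/w:b', $run)->length > 0) {
                $text = '<strong>' . $text . '</strong>';
            }
            $html .= $text;
        }
        $html .= '</p>';
    }
    return $html;
}
```

Working on the parsed DOM rather than on raw strings keeps the text escaped correctly and makes it easier to add further mappings later.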

Answer 4

小智 6

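My solution is Antiword for .doc and docx2txt for .docx.

Assuming a Linux server that you control, download each one, extract it, then install it. I installed each one system-wide:

Antiword: make global_install
docx2txt: make install

Then use these tools to extract the text into a string in PHP: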

//for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' . 
    escapeshellarg($docFilePath));

//for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' . 
    escapeshellarg($docxFilePath) . ' -');

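docx2txt requires Perl.

no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most of the files I tested had instances where words that should be separated had no space between them. That is not good when you want full-text search over the documents you are processing.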
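One likely cause of the missing spaces is that strip_tags removes paragraph, line-break and tab markup without leaving any separator behind. The sketch below is a pure-PHP alternative under that assumption: it inserts separators before stripping the tags. It expects the XML string already read from word/document.xml, and the function name `docxXmlToText` is hypothetical.

```php
<?php
// Hypothetical sketch: turn WordprocessingML into plain text while keeping
// paragraph breaks, explicit line breaks and tab characters.
function docxXmlToText(string $xml): string
{
    // Paragraph ends and <w:br/> elements become newlines
    $xml = preg_replace('~</w:p>|<w:br\b[^>]*/>~', "\n", $xml);
    // Tab characters inside runs are written as an attribute-less <w:tab/>;
    // tab stop definitions carry attributes and are deliberately not matched
    $xml = preg_replace('~<w:tab\s*/>~', "\t", $xml);

    // Drop the remaining markup, then decode XML entities such as &amp;
    $text = strip_tags($xml);
    return html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

Text split across several runs inside one paragraph is joined without extra spaces, so words are not broken apart, while separate paragraphs end up on separate lines.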

Answer 5

小智 1

Try Apache POI. It works with Java. I assume you won't have any trouble installing Java on Linux.
