Posts by hch*_*am1

Doing OCR with R

I have been trying to do OCR in R, reading data from PDFs that consist of scanned images. I have been following this post: http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/

It is a very good post.

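It effectively comes down to 3 steps:

Convert the PDF to PPM (an image format)
Convert the PPM to TIF, ready for tesseract (using ImageMagick for the conversion)
Convert the TIF to a text file

Working code for the above 3 steps, based on the linked post: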

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the 
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif …
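
For comparison, here is a minimal sketch of the same three steps using base R's system2(), which takes the program and its arguments separately instead of one pasted command string. The helper names pdftoppm_args and ocr_pdf are hypothetical, and the tool paths are simply the ones from the code above. Because the tesseract line above is cut off, the output base name passed to tesseract here is an assumption, not the original argument.

```r
# Tool locations copied from the question; adjust them for your machine.
pdftoppm  <- "F:/xpdf/bin64/pdftoppm.exe"
convert   <- "F:/ImageMagick-6.9.1-Q16/convert.exe"
tesseract <- "F:/Tesseract-OCR/tesseract.exe"

# Hypothetical helper: argument vector for step 1 (PDF -> PPM pages).
# -f / -l select the first and last page, -r sets the resolution in DPI.
pdftoppm_args <- function(pdf, root = "ocrbook", first = 1, last = 10, dpi = 600) {
  c("-f", first, "-l", last, "-r", dpi, shQuote(pdf), shQuote(root))
}

# Hypothetical helper: run all three steps for one PDF.
ocr_pdf <- function(pdf, root = "ocrbook") {
  # Step 1: render the selected pages to <root>*.ppm
  system2(pdftoppm, pdftoppm_args(pdf, root))

  # Step 2: combine the PPM pages into one TIF for tesseract
  ppm <- Sys.glob(paste0(root, "*.ppm"))
  tif <- paste0(pdf, ".tif")
  system2(convert, c(shQuote(ppm), shQuote(tif)))

  # Step 3: OCR the TIF. Assumption: the PDF name is used as the
  # output base, so tesseract writes <pdf>.txt
  system2(tesseract, c(shQuote(tif), shQuote(pdf)))

  # Remove the intermediate pages so the next PDF does not pick them up
  unlink(ppm)
  paste0(pdf, ".txt")
}
```

Under these assumptions, lapply(myfiles, ocr_pdf) would process each PDF in turn and return the expected text file names. Each file name is quoted individually, so paths containing spaces are passed as single arguments.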


pdf ocr shell tesseract r

r_a*_*ics

2016-09-17

11
votes

2
answers

3989
views