Extracting text from an image in R and exporting it to CSV

Ed_*_*avy 11 tesseract r

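Someone handed me a screenshot of a table, and I have to enter the information into MS-Excel. I am looking for a way to extract the text from that image and export it as a CSV.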

I did come across the tesseract package, but it did not work well.

Is there a way to do this in R?

Sample image (not reproduced here):

The code I tried:

library(tidyverse)
library(tesseract)

eng = tesseract("eng")
text = tesseract::ocr("path/file_name.png", engine = eng)
cat(text)

JBG*_*ber 8

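This actually works quite well. I have not checked the whole result for errors, but at first glance everything looks correct, including "PRISM_tm". The trick is to make the image fairly large, because tesseract seems to ignore small characters: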

library(magick)
#> Linking to ImageMagick 7.1.0.31
#> Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp, x11
#> Disabled features: fftw, ghostscript
#> Using 12 threads
library(tesseract)
input <- image_read("https://i.stack.imgur.com/JxGHc.png") %>% 
  # preprocess image to make it easier to ocr
  image_convert(type = 'Grayscale') %>% 
  image_deskew() %>% 
  image_resize("2000x") %>% 
  ocr()

df <- data.table::fread(text = input)
#> Warning in data.table::fread(text = input): Detected 11 column names but the
#> data has 12 columns (i.e. invalid file). Added 1 extra default column name for
#> the first column which is guessed to be row names or an index. Use setnames()
#> afterwards if this guess is not correct, or fix the file write command that
#> created the file to create a valid file.
df
#>     V1     info    tmax ACREAGE                               GLOBALID
#>  1:  1 PRISM_tm 30.3976  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  2:  2 PRISM_tm 26.0226  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  3:  3 PRISM_tm 27.1775  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  4:  4 PRISM_tm  24,164  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  5:  5 PRISM_tm  24.458  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  6:  6 PRISM_tm  26.118  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  7:  7 PRISM_tm  27.259  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  8:  8 PRISM_tm  30.105  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  9:  9 PRISM_tm  30.697  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 10: 10 PRISM_tm   32949  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 11: 11 PRISM_tm  32,966  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 12: 12 PRISM_tm  32.081  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 13: 13 PRISM_tm  29.847  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 14: 14 PRISM_tm  27.576  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 15: 15 PRISM_tm  24.671  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 16: 16 PRISM_tm  24.382  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 17: 17 PRISM_tm  24.382  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 18: 18 PRISM_tm  26.365  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 19: 19 PRISM_tm  29.246  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 20: 20 PRISM_tm  30.737  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 21: 21 PRISM_tm  31.658  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 22: 22 PRISM_tm  31.386  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 23: 23 PRISM_tm   32457  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 24: 24 PRISM_tm  32.093  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 25: 25 PRISM_tm  30.303  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 26: 26 PRISM_tm  26.231  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 27: 27 PRISM_tm  25.956  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>     V1     info    tmax ACREAGE                               GLOBALID
#>     datasource variable    datatype resolutior    Date year month
#>  1:      PRISM     tmax provisional      4kmM3 2021-10 2021    10
#>  2:      PRISM     tmax provisional      4kmM3 2021-11 2021    11
#>  3:      PRISM     tmax provisional      4kmM3 2021-12 2021    12
#>  4:      PRISM     tmax      stable      4kmM3 2005-01 2005     1
#>  5:      PRISM     tmax      stable      4kmM3 2005-02 2005     2
#>  6:      PRISM     tmax      stable      4kmM3 2005-03 2005     3
#>  7:      PRISM     tmax      stable      4kmM3 2005-04 2005     4
#>  8:      PRISM     tmax      stable      4kmM3 2005-05 2005     5
#>  9:      PRISM     tmax      stable      4kmM3 2005-06 2005     6
#> 10:      PRISM     tmax      stable      4kmM3 2005-07 2005     7
#> 11:      PRISM     tmax      stable      4kmM3 2005-08 2005     8
#> 12:      PRISM     tmax      stable      4kmM3 2005-09 2005     9
#> 13:      PRISM     tmax      stable      4kmM3 2005-10 2005    10
#> 14:      PRISM     tmax      stable      4kmM3 2005-11 2005    11
#> 15:      PRISM     tmax      stable      4kmM3 2005-12 2005    12
#> 16:      PRISM     tmax      stable      4kmM3 2006-01 2006     1
#> 17:      PRISM     tmax      stable      4kmM3 2006-02 2006     2
#> 18:      PRISM     tmax      stable      4kmM3 2006-03 2006     3
#> 19:      PRISM     tmax      stable      4kmM3 2006-04 2006     4
#> 20:      PRISM     tmax      stable      4kmM3 2006-05 2006     5
#> 21:      PRISM     tmax      stable      4kmM3 2006-06 2006     6
#> 22:      PRISM     tmax      stable      4kmM3 2006-07 2006     7
#> 23:      PRISM     tmax      stable      4kmM3 2006-08 2006     8
#> 24:      PRISM     tmax      stable      4kmM3 2006-09 2006     9
#> 25:      PRISM     tmax      stable      4kmM3 2006-10 2006    10
#> 26:      PRISM     tmax      stable      4kmM3 2006-11 2006    11
#> 27:      PRISM     tmax      stable      4kmM3 2006-12 2006    12
#>     datasource variable    datatype resolutior    Date year month

Created on 2022-08-10 by the reprex package (v2.0.1)

The warning from fread can be safely ignored, since it only complains about the missing header for the first column.
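Since the goal in the question is a CSV file, a possible follow-up step is sketched below. It is only an illustration under stated assumptions: `ocr_df` is a small hand-made stand-in for the OCR result (in practice it would be `df` from above), the name `row_id`, the plausibility bounds of -60 to 60, and the output file name are all hypothetical. It renames the auto-generated `V1` column, repairs values where the decimal point was read as a comma (as in "24,164"), flags implausible values such as "32949" for manual review instead of guessing a correction, and writes the result with base R's `write.csv()`.

```r
# Hypothetical stand-in for the OCR-derived table; in practice use `df`.
ocr_df <- data.frame(
  V1   = 1:4,
  info = "PRISM_tm",
  tmax = c("30.3976", "24,164", "32949", "24.458"),
  stringsAsFactors = FALSE
)

# fread named the unnamed index column V1; give it a descriptive name
# (row_id is an assumed name, not taken from the original table).
names(ocr_df)[names(ocr_df) == "V1"] <- "row_id"

# OCR sometimes reads a decimal point as a comma; normalise, then convert.
ocr_df$tmax <- as.numeric(gsub(",", ".", ocr_df$tmax, fixed = TRUE))

# Flag values outside an assumed plausible range for manual review,
# e.g. 32949, where the decimal point was dropped entirely.
ocr_df$needs_review <- is.na(ocr_df$tmax) | ocr_df$tmax < -60 | ocr_df$tmax > 60

# Write the cleaned table to a CSV file that Excel can open.
out_file <- file.path(tempdir(), "ocr_table.csv")
write.csv(ocr_df, out_file, row.names = FALSE)
```

Flagging rather than silently correcting keeps the decision with a human, which seems sensible because OCR errors in numeric columns are easy to miss by eye.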

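  • That's odd. Did you run this exact code? Maybe you are using a different version of ImageMagick? (2 upvotes)
  • I think that version is from 2017. I'm not quite sure how to update it. Which operating system are you using? (2 upvotes)
  • I'm not very familiar with the Fedora package manager, but I believe it should offer a newer version. Maybe check https://imagemagick.org/script/download.php (2 upvotes)
  • You could also consider updating `tesseract`: `sudo yum install tesseract-devel leptonica-devel`. I'm running version 5.2.0. You can check with `tesseract::tesseract_info()`. (2 upvotes)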
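When comparing setups as in the comments above, it can help to print both library versions from within R. The snippet below is a minimal sketch: it assumes the `tesseract` and `magick` packages may or may not be installed, so each lookup is guarded, and the `versions` list is just an illustrative container.

```r
# Collect the versions of the underlying libraries, if the packages exist.
versions <- list()

if (requireNamespace("tesseract", quietly = TRUE)) {
  # tesseract_info() reports the Tesseract engine version among other details.
  versions$tesseract <- tesseract::tesseract_info()$version
}

if (requireNamespace("magick", quietly = TRUE)) {
  # magick_config() reports the ImageMagick version the package was built with.
  versions$imagemagick <- as.character(magick::magick_config()$version)
}

print(versions)
```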