有谁知道如何为Pytesseract设置角色白名单?我希望它只输出Az和0-9.这可能吗?我有以下内容:
img = Image.open('test.jpg')
result = pytesseract.image_to_string(img, config='-psm 6')
Run Code Online (Sandbox Code Playgroud)
我得到其他字符像/为1所以我想限制可能的字符的选项.
我想从图像的特定区域提取文本,例如身份证中的名称和ID号。我要从中提取文本的ID卡是中文(中文ID卡)。我已经尝试过此代码,但是它只是提取了我不需要的地址和出生日期。我只需要姓名和身份证号码。
import cv2
from PIL import Image
import pytesseract
import argparse
import os
image = cv2.imread("E:/face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename,gray)
text = pytesseract.image_to_string(Image.open(filename), lang='chi_sim')
print(text)
os.remove(filename)
Run Code Online (Sandbox Code Playgroud)
我正在尝试在Mac上运行以下代码。
import Image
enter code here`import pytesseract
im = Image.open('test.png')
print pytesseract.image_to_string(im)
Run Code Online (Sandbox Code Playgroud)
从这里提出以下问题:pytesseract-没有此类文件或目录错误, 我需要安装tesseract-ocr
但是,当我尝试点安装tesseract-ocr时,出现以下错误:
creating build/temp.macosx-10.5-x86_64-2.7
gcc -fno-strict-aliasing -I//anaconda/include -arch x86_64 -DNDEBUG -g
-fwrapv -O3 -Wall -Wstrict-prototypes -I//anaconda/include/python2.7 -c
tesseract_ocr.cpp -o build/temp.macosx-10.5-x86_64-2.7/tesseract_ocr.o
tesseract_ocr.cpp:264:10:
fatal error: 'leptonica/allheaders.h' file not found #include "leptonica/allheaders.h"
^
1 error generated.
error: command 'gcc' failed with exit status 1
Run Code Online (Sandbox Code Playgroud)
我不知道该怎么办。
我试图将pytesseract用于OCR(从图像中提取文本)。我已经使用以下命令成功安装了pytessearct-
pip install pytessearct
Run Code Online (Sandbox Code Playgroud)
当我尝试再次安装它时,它会清楚地说-
Requirement already satisfied (use --upgrade to upgrade):
pytesseract in ./site-packages
Run Code Online (Sandbox Code Playgroud)
这意味着pytessearct已成功安装。当我尝试使用-在我的iPython笔记本中导入此软件包时-
import pytessearct
Run Code Online (Sandbox Code Playgroud)
引发错误-
ImportError: No module named pytesseract
Run Code Online (Sandbox Code Playgroud)
为什么会这样呢?
嗨,我是 python 和 tesseract 的新手。我正在使用 anaconda 发行版,并尝试使用 pytesseract-ocr 当我尝试从图像获取数据时,它给出以下错误:
tesseract imageSample1.jpg test.txt digits
// output
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Error opening data file /anaconda/envs/_build/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Run Code Online (Sandbox Code Playgroud)
所以首先没有这样的/anaconda/envs/_build/share/tessdata/目录。我有 anaconda3 文件夹。我从 git 下载了 end.traindata 。但不确定将该数据放在哪里。难道我做错了什么。需要一些帮助。谢谢。
我将 MSS 与 pytesseract 结合使用,尝试在屏幕上读取以确定正在监视的区域中的字符串。我的代码如下:
import Image
import pytesseract
import cv2
import os
import mss
import numpy as np
with mss.mss() as sct:
mon = {'top': 0, 'left': 0, 'width': 150, 'height': 150}
im = sct.grab(mon)
im = np.asarray(im)
im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
#im_gray = plt.imshow(im_gray, interpolation='nearest')
cv2.imwrite("test.png", im_gray)
#cur_dir = os.getcwd()
text = pytesseract.image_to_string(Image.open(im_gray))
print(text)
cv2.imshow("Image", im)
cv2.imshow("Output", im_gray)
cv2.waitKey(0)
Run Code Online (Sandbox Code Playgroud)
我返回以下错误: AttributeError: 'numpy.ndarray' object has no attribute 'read'
我还尝试使用 pyplot 将其转换回图像,如代码示例中的注释行所示。然而,这会打印回错误: TypeError: img is not a numpy array, isn't …
尝试在 python 上运行 tesseract,这是我的代码:
import cv2
import os
import numpy as np
import matplotlib.pyplot as plt
import pytesseract
import Image
# def main():
jpgCounter = 0
for root, dirs, files in os.walk('/home/manel/Desktop/fotografias etiquetas'):
for file in files:
if file.endswith('.jpg'):
jpgCounter += 1
for i in range(1, 2):
name = str(i) + ".jpg"
nameBW = str(i) + "_bw.jpg"
img = cv2.imread(name,0) #zero -> abre em grayscale
# img = cv2.equalizeHist(img)
kernel = np.array([[0,-1,0], [-1,5,-1], [0,-1,0]])
img = cv2.filter2D(img, -1, kernel)
cv2.normalize(img,img,0,255,cv2.NORM_MINMAX) …Run Code Online (Sandbox Code Playgroud) 我已经开始在 python 中使用 pytesseract。当我将它传递给单引号或双引号时
from PIL import Image
import pytesseract
import numpy as np
tesseract_config = r"""-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#'<>(){};:"""
tesseract_language = "eng"
text = pytesseract.image_to_string(Image.open('res/outc001.jpg'), lang=tesseract_language, config=tesseract_config)
print text
Run Code Online (Sandbox Code Playgroud)
它返回
Traceback (most recent call last):
File "main.py", line 15, in <module>
text = pytesseract.image_to_string(Image.open('res/outc001.jpg'), lang=tesseract_language, config=tesseract_config).split('\n')
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 193, in image_to_string
return run_and_get_output(image, 'txt', lang, config, nice)
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 140, in run_and_get_output
run_tesseract(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 106, in run_tesseract
command += shlex.split(config)
File "/usr/lib/python2.7/shlex.py", line 279, in split
return …Run Code Online (Sandbox Code Playgroud) 我收到了赛百味全天的详细销售、工人等收据,需要为管理课程提取数据。
我拍了收据的照片,然后用 pytesseract 将它们处理成由 \n 分隔的字符串,但现在不知道如何使用 pd.read_csv 和 StringIO 将其转换为数据帧。如果这是最好的方法,请不要这样做。也可能需要使用 cv2 编辑图像,以便更好地处理。
import numpy as np
import pytesseract
from PIL import Image
import pandas as pd
path = 'C:\\attachments\\'
monday = pytesseract.image_to_string(Image.open(path+'file1-1.jpeg'),lang='eng')
from StringIO import StringIO
mon = pd.read_csv(StringIO(monday),sep=r'\s',lineterminator=r'\n')
print(mon)
Run Code Online (Sandbox Code Playgroud)
这是当前星期一的一些变量。
"\nTIME HOURS :\nPERIOD SALES UNITS WORKED PROD SPLH\nZhan emmoo «Ct (iti ;:t‘«é‘«‘i CSD\n3A-4A $0.00 0 0 0 $0.00\n44-54 =: $0.00 SssOO 0 0 $0.00\n5A-6A $0.00 0 0 0 $0.00\nbA-7A $0.00 0 0 0 $0.00\n7A-BA =s«$0.00-Sss«OOs«*O0.80 0 $0.00\nBA-9A 60,00 . …Run Code Online (Sandbox Code Playgroud) 我已经开始使用Pytesser,它可以同时使用英语和中文,但是有没有办法同时使用两种语言?我需要制作自己的训练数据文件吗?我的代码是:
import Image
from pytesser import *
print image_to_string(Image.open("chinese_and_english.jpg"), lang="eng")
#also want to have chinese be recognized
Run Code Online (Sandbox Code Playgroud)
我正在使用python 3.x并使用以下代码将图像转换为文本:
from PIL import Image
from pytesseract import image_to_string
image = Image.open('image.png', mode='r')
print(image_to_string(image))
Run Code Online (Sandbox Code Playgroud)
我收到以下错误:
Traceback (most recent call last):
File "C:/Users/hp/Desktop/GII/Image_to_text.py", line 12, in <module>
print(image_to_string(image))
File "C:\Users\hp\Downloads\WinPython-64bit-3.5.1.2\python-3.5.1.amd64\lib\site-packages\pytesseract\pytesseract.py", line 161, in image_to_string
config=config)
File "C:\Users\hp\Downloads\WinPython-64bit-3.5.1.2\python-3.5.1.amd64\lib\site-packages\pytesseract\pytesseract.py", line 94, in run_tesseract
stderr=subprocess.PIPE)
File "C:\Users\hp\Downloads\WinPython-64bit-3.5.1.2\python-3.5.1.amd64\lib\subprocess.py", line 950, in __init__
restore_signals, start_new_session)
File "C:\Users\hp\Downloads\WinPython-64bit-3.5.1.2\python-3.5.1.amd64\lib\subprocess.py", line 1220, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
Run Code Online (Sandbox Code Playgroud)
请注意,我已将图像放在存在python的同一目录中。也不会引发错误,image = Image.open('image.png', mode='r')但会引发错误 print(image_to_string(image))。
知道这里可能有什么问题吗?谢谢
遇到有关 Pytesseract 的以下代码的此错误代码的问题。(Python 3.6.1,Mac OSX)
从 PIL 导入 pytesseract 导入请求 从 PIL 导入图像 从 io 导入 ImageFilter 导入 StringIO, BytesIO
def process_image(url):
image = _get_image(url)
image.filter(ImageFilter.SHARPEN)
return pytesseract.image_to_string(image)
def _get_image(url):
r = requests.get(url)
s = BytesIO(r.content)
img = Image.open(s)
return img
process_image("https://www.prepressure.com/images/fonts_sample_ocra_medium.png")
Run Code Online (Sandbox Code Playgroud)
错误:
/usr/local/Cellar/python3/3.6.0_1/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/g/pyfo/reddit/ocr.py
Traceback (most recent call last):
File "/Users/g/pyfo/reddit/ocr.py", line 20, in <module>
process_image("https://www.prepressure.com/images/fonts_sample_ocra_medium.png")
File "/Users/g/pyfo/reddit/ocr.py", line 10, in process_image
image.filter(ImageFilter.SHARPEN)
File "/usr/local/lib/python3.6/site-packages/PIL/Image.py", line 1094, in filter
return self._new(filter.filter(self.im))
File "/usr/local/lib/python3.6/site-packages/PIL/ImageFilter.py", line 53, in filter
raise ValueError("cannot …Run Code Online (Sandbox Code Playgroud) 我需要使用pytesseract将具有多个页面的image.tif转录为文本。我有下一个代码:
> From PIL import Image
> Import pytesseract
> Pytesseract.pytesseract.tesseract_cmd = 'C: / Program Files (x86) / Tesseract-
> OCR / tesseract '
> Print (pytesseract.image_to_string (Image.open ('CAMARA.tif'), lang = "spa"))
Run Code Online (Sandbox Code Playgroud)
问题在于只能提取冷杉页面。我如何提取所有这些?