SIM*_*SIM 4 python web-scraping python-imaging-library python-3.x python-tesseract
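I have written a Python script that uses pytesseract to extract a word from an image. The image contains only one word, TOOLS, and that is what I'm after. Currently my script below gives me the wrong output, WIS. What can I do to get the text?
Here is my script: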
import requests, io, pytesseract
from PIL import Image
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
img = img.resize([100,100], Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
img = img.convert('L')
img = img.point(lambda x: 0 if x < 170 else 255)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
# img.show()
This is the state of the modified image when I run the above script:
The output I get:
WIS
Expected output:
TOOLS
igr*_*nis 11
The key is to match the image transformations to what tesseract can handle. Your main problem is that the font is not a usual one. All you need is
import requests, io, pytesseract
from PIL import Image, ImageEnhance, ImageFilter
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
# remove texture
enhancer = ImageEnhance.Color(img)
img = enhancer.enhance(0) # decolorize
img = img.point(lambda x: 0 if x < 250 else 255) # set threshold
img = img.resize([300, 100], Image.LANCZOS) # resize to remove noise
img = img.point(lambda x: 0 if x < 250 else 255) # get rid of remains of noise
# adjust font weight
img = img.filter(ImageFilter.MaxFilter(11)) # lighten the font ;)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
Voilà,
TOOLS
is recognized.
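If you want to see what the steps above do without downloading anything or calling tesseract, here is a minimal sketch that wraps them in a function and runs it on a synthetic image. The function name `preprocess_for_ocr`, its parameters, the helper `stroke_width`, and the test image are my own assumptions for illustration; only the Pillow calls come from the snippet above. The extra `convert("RGB")` is also an assumption, added so the function accepts images in other modes.

```python
from PIL import Image, ImageEnhance, ImageFilter


def preprocess_for_ocr(img, size=(300, 100), threshold=250, thin=11):
    """Hypothetical wrapper around the steps above (names are illustrative)."""
    img = ImageEnhance.Color(img.convert("RGB")).enhance(0)  # decolorize
    binarize = lambda x: 0 if x < threshold else 255
    img = img.point(binarize)                  # set threshold
    img = img.resize(size, Image.LANCZOS)      # resize to remove noise
    img = img.point(binarize)                  # get rid of remains of noise
    # MaxFilter takes the brightest pixel in each thin x thin window,
    # so dark strokes on a light background get thinner.
    return img.filter(ImageFilter.MaxFilter(thin))


def stroke_width(img, row):
    """Count dark pixels along one row (illustrative helper)."""
    gray = img.convert("L")
    return sum(1 for x in range(gray.width) if gray.getpixel((x, row)) < 128)


# Synthetic stand-in for a bold glyph: a 40 px wide black bar on white.
sample = Image.new("RGB", (300, 100), (255, 255, 255))
sample.paste((0, 0, 0), (100, 20, 140, 80))  # box is left, upper, right, lower

out = preprocess_for_ocr(sample)
before = stroke_width(sample, 50)
after = stroke_width(out, 50)
print(before, after)
```

With an 11 px window the bar should lose about 5 px on each side, which is the "lighten the font" effect. On the real image you would pass the result to `pytesseract.image_to_string(out)` as before; the threshold and filter size will probably need tuning for other fonts.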