How do I specify the font type Tesseract uses during recognition (not during training)?

Question

Ken*_*but 6 c++ ocr fonts tesseract truetype

For the downloadable English data set, I can run

cat tessdata/eng.* | egrep -o ".*ttf" | sort -u

and get a list of all the fonts used in the English training:

Andale_Mono.ttf
Arial_Black.ttf
Arial_Bold.ttf
Arial.ttf
buttf
Comic_Sans_MS_Bold.ttf
Comic_Sans_MS.ttf
Courier_New_Bold.ttf
Courier_New.ttf
Georgia_Bold.ttf
Georgia.ttf
Gottf
Impact.ttf
Times_New_Roman_Bold.ttf
Times_New_Roman.ttf
Trebuchet_MS_Bold.ttf
Trebuchet_MS.ttf
ttf
Verdana_Bold.ttf
Verdana.ttf

Now I want to recognize text whose font type I already know, so I would like to restrict recognition to that font. I tried:

api.SetVariable("classify_font_name", "Arial_Bold.ttf");

But I don't see any better results. Can someone tell me how to do this, or whether it is even possible?

Answer 1

ngu*_*enq -1

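You can use the LTRResultIterator class and its WordFontAttributes method to get font information for word- or character-level results. Once you have the font attributes, you can filter the output text by a specific font name. See the Tesseract API examples.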
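The filtering step can be sketched roughly as follows. This is only an illustration under some assumptions: `WordWithFont`, `FilterByFont` and `JoinText` are hypothetical helpers, not part of Tesseract, and the records are assumed to have been filled beforehand by walking the result iterator word by word and storing each word's text together with the name returned by WordFontAttributes (an empty string when no font information is returned). The exact spelling of the reported font names is also an assumption, so the sketch matches on a substring. Note that this filters the output after recognition; it does not restrict which fonts the classifier considers.

```cpp
#include <string>
#include <vector>

// Hypothetical record: one recognized word plus the font name that was
// reported for it (empty if no font information was available).
struct WordWithFont {
    std::string text;
    std::string font_name;
};

// Keep only the words whose reported font name contains `wanted`.
// Substring matching is used because the exact name format is an assumption.
std::vector<WordWithFont> FilterByFont(const std::vector<WordWithFont>& words,
                                       const std::string& wanted) {
    std::vector<WordWithFont> kept;
    for (const auto& w : words) {
        if (!w.font_name.empty() &&
            w.font_name.find(wanted) != std::string::npos) {
            kept.push_back(w);
        }
    }
    return kept;
}

// Join the text of the kept words with single spaces.
std::string JoinText(const std::vector<WordWithFont>& words) {
    std::string out;
    for (const auto& w : words) {
        if (!out.empty()) out += ' ';
        out += w.text;
    }
    return out;
}
```

For example, calling `FilterByFont(words, "Arial_Bold")` and then `JoinText` on the result would give only the words attributed to that font, in reading order.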
