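I ran into a problem reading a PDF with PDFBox. Part of my PDF is unreadable: when I copy the unreadable part and paste it into an editor, it shows small box symbols, but when I read the same file with PDFBox those characters are not read at all (I do not expect them to be read correctly). What I do expect is to get at least some symbols or random characters in place of the actual characters. Is there any way to do that? The line can be selected, so it is not an image. Has anyone found a workaround?
There is a PDFBox example in which the writeString method of PDFTextStripper is overridden to obtain some additional font attributes. I am using that approach to get my text together with some font attributes. So my question is why PDFBox does not read every character (it could print garbage for the unreadable ones). In my case I counted the number of times the method is called (each call corresponds to one character): the number of calls matches the number of characters in the output text, but not the total number of characters in the PDF. Here is a sample PDF in which the word "Profit" is unreadable; the output does not even contain garbage for this word, it is skipped completely. Here is the link: https://drive.google.com/file/d/0B_Ke2amBgdpedUNwVTR3RVlRTFE/view?usp=sharing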
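Indeed, the whole line "Statement of Profit and Loss for the year ended March 31, 2014" and much more cannot be extracted. Inspecting the content makes the reason obvious: this text is written using a composite font which has neither an encoding nor a ToUnicode entry that would allow the characters in question to be identified.
The method showGlyph of org.apache.pdfbox.text.PDFTextStreamEngine (from which PDFTextStripper is derived) contains the following code shortly before it calls processTextPosition (which PDFTextStripper implements and from which it retrieves its text information):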
    // use our additional glyph list for Unicode mapping
    unicode = font.toUnicode(code, glyphList);
    // when there is no Unicode mapping available, Acrobat simply coerces the character code
    // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
    // this, which is why we leave it until this point in PDFTextStreamEngine.
    if (unicode == null)
    {
        if (font instanceof PDSimpleFont)
        {
            char c = (char) code;
            unicode = new String(new char[] { c });
        }
        else
        {
            // Acrobat doesn't seem to coerce composite font's character codes, instead it
            // skips them. See the "allah2.pdf" TestTextStripper file.
            return;
        }
    }
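The font in question does not provide any clue for text extraction. Thus, unicode is null here.
Furthermore, the font is composite, not simple. Thus, the else clause is executed and processTextPosition is not even called.
PDFTextStripper, therefore, is never informed that the line "Statement of Profit and Loss for the year ended March 31, 2014" even exists!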
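The three possible outcomes of that fallback can be sketched without PDFBox at all. The class below is only an illustration: GlyphFallbackSketch and resolve are hypothetical names, the boolean simpleFont stands in for the font instanceof PDSimpleFont check, and a null return stands in for the early return that drops the glyph.

```java
public class GlyphFallbackSketch {
    /**
     * Hypothetical stand-in for the decision made in the quoted code.
     * Returns the text reported for a glyph, or null if the glyph is skipped.
     */
    static String resolve(String mappedUnicode, boolean simpleFont, int code) {
        if (mappedUnicode != null) {
            return mappedUnicode;               // the font supplied a Unicode mapping
        }
        if (simpleFont) {
            return String.valueOf((char) code); // simple font: coerce the code into a char
        }
        return null;                            // composite font: the glyph is dropped
    }

    public static void main(String[] args) {
        System.out.println(resolve("P", false, 0x33)); // mapping exists: P
        System.out.println(resolve(null, true, 0x50)); // simple font, coerced: P
        System.out.println(resolve(null, false, 0x33)); // composite font, skipped: null
    }
}
```

The sample PDF falls into the third case, which is why the text never reaches processTextPosition.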
If you replace this
    else
    {
        // Acrobat doesn't seem to coerce composite font's character codes, instead it
        // skips them. See the "allah2.pdf" TestTextStripper file.
        return;
    }
in PDFTextStreamEngine.showGlyph by some code that sets unicode, e.g. to the Unicode replacement character
    else
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }
you get
57
THIRTY SEVENTH ANNUAL REPORT 2013-14
STANDALONE FINANCIAL STATEMENTS
?????????????????????????????????????????????????????????????
As per our report attached. Directors
For Deloitte Haskins & Sells LLP Deepak S. Parekh Nasser Munjee R. S. Tarneja
Chartered Accountants ???????? B. S. Mehta J. J. Irani
D. N. Ghosh Bimal Jalan
Keki M. Mistry S. A. Dave D. M. Sukthankar
Sanjiv V. Pilgaonkar ???????????????
Partner ???????????????????????
Renu Sud Karnad V. Srinivasa Rangan Girish V. Koliyote
??????, May 6, 2014 Managing Director ?????????????????? ?????????????????
Notes Previous Year
? in Crore ? in Crore
INCOME
??????????????????????? 23 23,894.03 20,796.95
???????????????????????????? 24 248.98 315.55
???????????? 25 54.66 35.12
Total Revenue 24,197.67 21,147.62
EXPENSES
Finance Cost 26 16,029.37 13,890.89
?????????????? 27 279.18 246.19
?????????????????????? 28 86.98 75.68
?????????????? 29 230.03 193.43
?????????????????????????????? 11 & 12 31.87 23.59
Provision for Contingencies 100.00 145.00
Total Expenses 16,757.43 14,574.78
PROFIT BEFORE TAX 7,440.24 6,572.84
???????????
????????????? 1,973.00 1,727.68
?????????????? 14 27.00 (3.18)
PROFIT FOR THE YEAR 3 5,440.24 4,848.34
EARNINGS PER SHARE??????????????? 2) 31
- Basic 34.89 31.84
- Diluted 34.62 31.45
?????????????????????????????????????????????????????????????
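Once unknown glyphs are reported as U+FFFD, the extracted text can be post-processed to find the affected lines. The following is a minimal sketch using only the Java standard library; ReplacementCharReport and its methods are hypothetical names, and it assumes the string still contains U+FFFD itself rather than a ? substituted by a console or web page.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplacementCharReport {
    static final char REPLACEMENT = '\uFFFD';

    /** Counts occurrences of U+FFFD in one line of extracted text. */
    static int countUnknown(String line) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == REPLACEMENT) {
                n++;
            }
        }
        return n;
    }

    /** Returns the 1-based numbers of the lines containing at least one U+FFFD. */
    static List<Integer> linesWithUnknown(String text) {
        List<Integer> result = new ArrayList<>();
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (countUnknown(lines[i]) > 0) {
                result.add(i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "INCOME\n\uFFFD\uFFFD\uFFFD 25 54.66 35.12\nTotal Revenue 24,197.67 21,147.62";
        System.out.println(linesWithUnknown(sample)); // [2]
    }
}
```

A report like this at least tells the caller where text exists but could not be mapped, which is what the question asked for.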
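Unfortunately, the method PDFTextStreamEngine.showGlyph uses some private class members. Thus, one cannot simply override it in one's own PDFTextStripper subclass using the original method code with the change above. One has to either replicate nearly all the functionality of PDFTextStreamEngine in one's own class, or resort to Java reflection, or patch the PDFBox classes oneself.
This architecture is not perfect.
The case of the second file is caused by the same piece of PDFBox code quoted above. This time, though, the font is simple and the other code block is executed: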
    if (font instanceof PDSimpleFont)
    {
        char c = (char) code;
        unicode = new String(new char[] { c });
    }
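What happens here is pure guesswork: if there is no information mapping the glyph code to Unicode, PDFBox assumes the mapping to be Latin-1, which trivially embeds into char. As can be seen in the OP's second file, this assumption does not always hold.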
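That the cast amounts to a Latin-1 assumption can be checked with the standard library alone. In the sketch below, Latin1CoercionSketch, coerce, and decodeLatin1 are hypothetical names; coerce mirrors the cast in the quoted block, and the loop compares it with ISO-8859-1 decoding for every single-byte code.

```java
import java.nio.charset.StandardCharsets;

public class Latin1CoercionSketch {
    /** What the simple-font fallback does: cast the character code to a char. */
    static String coerce(int code) {
        return String.valueOf((char) code);
    }

    /** Decodes the same single byte as ISO-8859-1 (Latin-1). */
    static String decodeLatin1(int code) {
        return new String(new byte[] { (byte) code }, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        boolean same = true;
        for (int code = 0; code <= 0xFF; code++) {
            same &= coerce(code).equals(decodeLatin1(code));
        }
        System.out.println(same);         // true
        System.out.println(coerce(0x50)); // P
    }
}
```

If a font's custom encoding draws some other glyph for code 0x50, the fallback still reports "P", and the extracted text is wrong without any visible sign of it.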
If you do not want PDFBox to make assumptions like this here, also replace the if (font instanceof PDSimpleFont) block by
    if (font instanceof PDSimpleFont)
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }
This results in
Aries Agro Care Private Limited
1118th Annual Report 2013-14
Balance Sheet as at 31st March, 2014
Particulars Note
No.
As at
31 March, 2014
Rupees
As at
31 March, 2013
Rupees
I. EQUITY AND LIABILITIES
(1) Shareholder's Funds
(a) ????????????? 3 100,000 100,000
(b) Reserves and Surplus 4 (2,673,971) ????????????
(2,573,971) ????????????
(2) Current Liabilities
(a) Short Term Borrowings 5 5,805,535 ???????????
(b) Trade Payables 6 159,400 ?????????
(c) ????????????????????????? 7 2,500 22,743
5,967,435 5,934,756
TOTAL 3,393,464 ???????????
II. ASSETS
(1) Non-Current Assets
(a) ???????????????????? ? - -
- -
(2) Current Assets
(a) ??????????????????????? 9 39,605 ???????
(b) ????????????????????????????? 10 3,353,859 ??????????
3,393,464 ??????????
TOTAL 3,393,464 ??????????
????????????????????????????????
The Notes to Accounts 1 to 23 form part of these Financial Statements
As per our report of even date For and on behalf of the Board
For Kirti D. Shah & Associates
?????????????????????
?????????????????????????????
Dr. Jimmy Mirchandani
Director
Kirti D. Shah
Proprietor
Membership No 32371
Dr. Rahul Mirchandani
Director
Place : Mumbai.
Date :- 26th May, 2014.