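I have a PDF file with the following security properties: Printing: Allowed; Document Assembly: Not Allowed; Content Copying: Allowed; Content Copying for Accessibility: Allowed; Page Extraction: Not Allowed.
I tried to extract the text using the sample code from the documentation, as follows: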
pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
    progressBar1.Value++;
}
pdftext.Text += text.ToString();
pdfReader.Close();
But the output text consists of lines with values like "???? ???????\n?? ???? ".
It seems that either the file is encrypted or we have an encoding problem...
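To help separate the encoding suspicion from the encryption one, here is a minimal sketch of what the re-encoding line inside the loop does to text that the source encoding cannot represent. It uses `Encoding.ASCII` as a deterministic stand-in for `Encoding.Default` (an assumption: the real default depends on the machine and runtime), and both the `RoundTrip` helper and the sample string are hypothetical, not taken from the PDF. It only shows that such a round-trip can produce `?` characters; it does not prove that this is what happens with this file.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Same shape as the conversion in the extraction loop:
    // string -> bytes in a legacy encoding -> UTF-8 bytes -> string.
    public static string RoundTrip(string s, Encoding legacy)
    {
        // Characters the legacy encoding cannot represent are replaced here
        // (with '?' for Encoding.ASCII), and the originals are lost.
        byte[] legacyBytes = legacy.GetBytes(s);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Hypothetical sample: four Arabic-script letters, none of them in ASCII.
        string nonAscii = "\u0633\u0644\u0627\u0645";

        Console.WriteLine(RoundTrip(nonAscii, Encoding.ASCII)); // ????
        Console.WriteLine(RoundTrip("hello", Encoding.ASCII));  // hello
    }
}
```

Inspecting `currentText` in the debugger before the conversion line would show which side the problem is on: if the string already contains `?` at that point, the round-trip is not the source and the issue lies in the extraction itself.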
Note the results of the following lines:
var f = pdfReader.IsOpenedWithFullPermissions; // -> FALSE
var f1 = pdfReader.IsEncrypted(); // -> FALSE
var f2 = …