Goo*_*uJu 7 .net vb.net acrobat parsing
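I am parsing a .pdf using the acrobat.tlb library.
Hyphenated words are split across successive new lines, with the hyphens removed.
For example, ABC-123-XXX-987
is parsed as: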
ABC
123
XXX
987
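If I parse the text with iTextSharp, it returns the whole string as it appears in the file, which is the behavior I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not place the highlight in the correct location... hence acrobat.tlb.
I am using the code from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf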
    ' filey = "*your full file name including directory here*"
    AcroExchApp = CreateObject("AcroExch.App")
    AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
    ' Open the [strfiley] pdf file
    AcroExchAVDoc.Open(filey, "")
    ' Get the PDDoc associated with the open AVDoc
    AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
    sustext = "accessorizes"
    suktext = "accessorises"
    ' get JavaScript Object
    ' note jso is related to PDDoc of a PDF,
    jso = AcroExchPDDoc.GetJSObject
    ' count
    nCount = 0
    nCount1 = 0
    gbStop = False
    bUSCnt = False
    bUKCnt = False
    ' search for the text
    If Not jso Is Nothing Then
        ' total number of pages
        nPages = jso.numpages
        ' Go through pages
        For i = 0 To nPages - 1
            ' check each word in a page
            nWords = jso.getPageNumWords(i)
            For j = 0 To nWords - 1
                ' get a word
                word = Trim(CStr(jso.getPageNthWord(i, j)))
                'If VarType(word) = VariantType.String Then
                If word <> "" Then
                    ' compare the word with what the user wants
                    If Trim(sustext) <> "" Then
                        result = StrComp(word, sustext, vbTextCompare)
                        ' if same
                        If result = 0 Then
                            nCount = nCount + 1
                            If bUSCnt = False Then
                                iUSCnt = iUSCnt + 1
                                bUSCnt = True
                            End If
                        End If
                    End If
                    If suktext <> "" Then
                        result1 = StrComp(word, suktext, vbTextCompare)
                        ' if same
                        If result1 = 0 Then
                            nCount1 = nCount1 + 1
                            If bUKCnt = False Then
                                iUKCnt = iUKCnt + 1
                                bUKCnt = True
                            End If
                        End If
                    End If
                End If
            Next j
        Next i
        jso = Nothing
    End If
The code does the job of highlighting text, but the For loop that fills the 'word' variable splits the hyphenated string into its component parts.
    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))
Does anyone know how to keep the whole string intact using acrobat.tlb? My fairly extensive searching has come up blank.
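I can see that highlighting text with iTextSharp is a hassle, because you have to draw a rectangle and it gets complicated, but the acrobat.tlb solution has its drawbacks too. It is not free, and few people will be able to use it. For the rest of us, a better solution is Spire.Pdf, which is free and easy to use. You can get it as a NuGet package. The code does the following:
- Opens the .pdf
- Reads the text of each page
- Uses a regular expression to find matches
- Saves the matches to a list of strings, eliminating duplicates
- For each string in this list, searches the page and highlights the word
Code: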
    ' Needs at the top of the file: Imports System.Drawing, System.Linq,
    ' System.Text.RegularExpressions, Spire.Pdf, Spire.Pdf.General.Find
    Dim pdf As New PdfDocument("Path")
    ' Four hyphen-separated groups of three uppercase letters or digits.
    ' A comma inside [...] would match a literal comma, so it is left out.
    Dim pattern As String = "[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}"
    For Each page As PdfPageBase In pdf.Pages
        ' Get the text of the current page only, so matches from earlier pages are not searched again
        Dim content As String = page.ExtractText()
        ' Find matches and eliminate duplicates
        Dim matchList As List(Of String) = Regex.Matches(content, pattern).Cast(Of Match)().Select(Function(m) m.Value).Distinct().ToList()
        ' Find all occurrences of each string on the page and highlight them
        For Each serial As String In matchList
            Dim result As PdfTextFind() = page.FindText(serial).Finds
            For Each find As PdfTextFind In result
                find.ApplyHighLight(Color.BlueViolet) ' you can set your color preference
            Next
        Next 'matchList
    Next 'page
    pdf.SaveToFile("New Path")
    pdf.Close()
    pdf.Dispose()
I'm not very good with regular expressions, so feel free to substitute your own pattern. Anyway, this is my approach.
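Since the pattern is the part most likely to need adjusting, it may help to try it on plain strings before running it against a PDF. The sketch below is only an illustration: the module name `SerialPatternDemo`, the helper `FindSerials`, and the sample text are made up, and it assumes serial numbers are always four hyphen-separated groups of three uppercase letters or digits. It adds `\b` word boundaries so that a longer token such as ABCD-123-XXX-987 is not partially matched.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SerialPatternDemo
    ' Assumed serial format: four groups of three uppercase letters or digits,
    ' separated by hyphens and delimited by word boundaries.
    Private Const SerialPattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3}){3}\b"

    ' Returns each distinct serial found in the input, in order of first appearance.
    Function FindSerials(input As String) As List(Of String)
        Dim found As New List(Of String)
        For Each m As Match In Regex.Matches(input, SerialPattern)
            If Not found.Contains(m.Value) Then found.Add(m.Value)
        Next
        Return found
    End Function

    Sub Main()
        ' Hypothetical sample text: one real serial (twice) and one comma-containing lookalike.
        Dim sample As String = "Unit ABC-123-XXX-987 shipped; A,B-123-XXX-987 is not a serial. ABC-123-XXX-987 again."
        For Each serial As String In FindSerials(sample)
            Console.WriteLine(serial)
        Next
    End Sub
End Module
```

With this sample, only ABC-123-XXX-987 is printed, once. A character class written as `[A-Z,0-9]` would also accept the A,B-123-XXX-987 lookalike, because a comma inside square brackets is matched literally.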