Extracting whole hyphenated words from a .pdf using acrobat.tlb in .NET (VB or C#)

Goo*_*uJu 7 .net vb.net acrobat parsing

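I am parsing a .pdf with the acrobat.tlb library.

Hyphenated strings are being split apart at the hyphens: the hyphens are dropped and each part comes back as a separate word.

For example, ABC-123-XXX-987

is parsed as: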
ABC
123
XXX
987

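If I parse the text with iTextSharp, it returns the whole string as it appears in the file, which is the behaviour I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not put the highlight in the right place... hence acrobat.tlb.

I am using the code from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf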

' filey = "*your full file name including directory here*"
AcroExchApp = CreateObject("AcroExch.App")
AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
' Open the [strfiley] pdf file
AcroExchAVDoc.Open(filey, "")

' Get the PDDoc associated with the open AVDoc
AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
sustext = "accessorizes"
suktext = "accessorises"
' get JavaScript Object
' note jso is related to PDDoc of a PDF,
jso = AcroExchPDDoc.GetJSObject
' count
nCount = 0
nCount1 = 0
gbStop = False
bUSCnt = False
bUKCnt = False
' search for the text
If jso IsNot Nothing Then
    ' total number of pages
    nPages = jso.numpages

    ' Go through pages
    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' get a word

            word = Trim(CStr(jso.getPageNthWord(i, j)))
            'If VarType(word) = VariantType.String Then
            If word <> "" Then
                ' compare the word with what the user wants
                If Trim(sustext) <> "" Then
                    result = StrComp(word, sustext, vbTextCompare)
                    ' if same
                    If result = 0 Then
                        nCount = nCount + 1
                        If bUSCnt = False Then
                            iUSCnt = iUSCnt + 1
                            bUSCnt = True
                        End If
                    End If
                End If
                If suktext <> "" Then
                    result1 = StrComp(word, suktext, vbTextCompare)
                    ' if same
                    If result1 = 0 Then
                        nCount1 = nCount1 + 1
                        If bUKCnt = False Then
                            iUKCnt = iUKCnt + 1
                            bUKCnt = True
                        End If
                    End If
                End If
            End If
        Next j
    Next i
    jso = Nothing
End If

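The code does the job of highlighting the text, but the For loop that fills the 'word' variable splits hyphenated strings into their component parts.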

For i = 0 To nPages - 1
    ' check each word in a page
    nWords = jso.getPageNumWords(i)
    For j = 0 To nWords - 1
        ' get a word

        word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to keep the whole string intact using acrobat.tlb? My fairly extensive searching has come up empty.
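One direction that may be worth trying, offered as a sketch and not something this thread confirms: the Acrobat JavaScript method getPageNthWord takes an optional third argument, bStrip, which defaults to true and strips punctuation and white space from the returned word. The sketch below assumes that calling jso.getPageNthWord(i, j, False) returns the pieces with their trailing hyphen intact (for example "ABC-", "123-", "XXX-", "987 "); that behaviour should be checked against the actual file. The module name HyphenJoiner, the structure JoinedWord and the function JoinHyphenated are hypothetical names introduced for illustration. The function only does the string work, so it can be tried without Acrobat installed.

```vbnet
Imports System
Imports System.Collections.Generic

Module HyphenJoiner

    ' One reassembled token plus the range of word indices it was built from.
    Public Structure JoinedWord
        Public Text As String
        Public FirstIndex As Integer
        Public LastIndex As Integer
    End Structure

    ' rawWords: the words of one page in reading order. ASSUMPTION: they were
    ' fetched with jso.getPageNthWord(page, j, False), so a piece that is
    ' continued by the next word still ends with "-".
    Public Function JoinHyphenated(rawWords As IList(Of String)) As List(Of JoinedWord)
        Dim joined As New List(Of JoinedWord)
        Dim current As String = ""
        Dim first As Integer = -1

        For j As Integer = 0 To rawWords.Count - 1
            Dim piece As String = If(rawWords(j), "")
            If first < 0 Then first = j
            current &= piece.Trim()

            ' A piece ending directly in "-" (no white space after it) is
            ' treated as continued by the next word on the page.
            Dim continues As Boolean =
                piece.EndsWith("-", StringComparison.Ordinal) AndAlso j < rawWords.Count - 1

            If Not continues Then
                If current <> "" Then
                    joined.Add(New JoinedWord With {
                        .Text = current, .FirstIndex = first, .LastIndex = j})
                End If
                current = ""
                first = -1
            End If
        Next

        Return joined
    End Function

End Module
```

Each result keeps FirstIndex and LastIndex, so after comparing Text with the serial number being searched for, the same word indices could be passed one by one to getPageNthWordQuads to build the highlight for every piece of the string.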

γηρ*_*όμε 2

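I can understand that highlighting text with iTextSharp is troublesome, because you have to draw a rectangle yourself and it gets complicated, but the acrobat.tlb solution has its drawbacks too: it is not free, and few people will be able to use it. For the rest of us, a better solution is Spire.Pdf, which is free and easy to use; you can get it as a NuGet package. The code does the following:

  • Opens the .pdf
  • Reads the text of each page
  • Uses a regular expression to find the matches
  • Saves them to a list of strings, eliminating duplicates
  • For each string in this list, searches the page and highlights the word

Code: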

' Imports belong at the top of the file; the rest goes inside a Sub.
Imports System.Collections.Generic
Imports System.Drawing
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Spire.Pdf
Imports Spire.Pdf.General.Find

Dim pdf As PdfDocument = New PdfDocument("Path")
' Four groups of three uppercase letters or digits, separated by hyphens.
' (No commas inside the brackets: a comma there would match a literal comma.)
Dim pattern As String = "[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}"

For Each page As PdfPageBase In pdf.Pages
    'get text from the current page only, so earlier pages are not rescanned
    Dim pageText As String = page.ExtractText()

    'find matches, copy them to a string list and eliminate duplicates
    Dim matchList As List(Of String) = Regex.Matches(pageText, pattern).
        Cast(Of Match)().Select(Function(m) m.Value).Distinct().ToList()

    'for each string in list
    For Each serial As String In matchList
        'find all occurrences of the string in the page and highlight them
        Dim result As PdfTextFind() = page.FindText(serial).Finds

        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
        Next

    Next 'matchList

Next 'page

pdf.SaveToFile("New Path")

pdf.Close()
pdf.Dispose()

I'm not very good at regular expressions, so feel free to substitute your own. In any case, this is my approach.
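For anyone who wants to tune the regular expression before running it against a real file, here is a minimal sketch that exercises a pattern on plain strings with System.Text.RegularExpressions only. It assumes the serial numbers always have the shape of the example in the question, four groups of three uppercase letters or digits joined by hyphens; the names SerialPattern and FindSerials are hypothetical. Note that inside a character class a comma is a literal character, so a class written as [A-Z,0-9] would also accept commas.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module SerialPattern

    ' \b anchors the match at word boundaries so that it cannot start or end
    ' in the middle of a longer run of letters or digits.
    Public Const Pattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3}){3}\b"

    ' Returns the distinct serial numbers found in the text, in order of
    ' first appearance.
    Public Function FindSerials(text As String) As List(Of String)
        Return Regex.Matches(text, Pattern).
            Cast(Of Match)().
            Select(Function(m) m.Value).
            Distinct().
            ToList()
    End Function

End Module
```

Calling FindSerials on the text of a single page gives the list of strings to search for and highlight on that page.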