Extracting tables from PDF (to Excel), preferably with VBA

MeR*_*uud 7 pdf excel vba filesystemobject

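I am trying to extract tables from PDF files with VBA and export them to Excel. If everything works the way it should, the whole process runs automatically. The problem is that the tables are not standardized.

This is what I have so far:

  1. VBA (Excel) runs XPDF and converts every .pdf file found in the current folder to a text file.
  2. VBA (Excel) reads each text file line by line.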
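
For context, step 1 could look roughly like the sketch below. It assumes the Xpdf command-line tool `pdftotext` with its `-layout` option (which tries to keep the column layout of the page); the install path `C:\xpdf\pdftotext.exe` and the procedure names are placeholders, not part of my actual setup.

```vba
' Builds the pdftotext command line for one PDF file.
' Assumption: pdfPath ends in ".pdf"; the text file gets the same base name.
Function BuildPdfToTextCommand(ByVal exePath As String, ByVal pdfPath As String) As String
    Dim txtPath As String
    txtPath = Left$(pdfPath, Len(pdfPath) - 4) & ".txt"
    ' Quote every path so that folders containing spaces still work
    BuildPdfToTextCommand = """" & exePath & """ -layout """ & pdfPath & """ """ & txtPath & """"
End Function

' Converts every .pdf in the workbook's folder to a .txt file.
Sub ConvertPdfsInFolder()
    Const EXE_PATH As String = "C:\xpdf\pdftotext.exe"   ' hypothetical install location
    Dim folderPath As String, fileName As String
    Dim sh As Object

    folderPath = ThisWorkbook.Path & "\"
    Set sh = CreateObject("WScript.Shell")

    fileName = Dir(folderPath & "*.pdf")
    Do While Len(fileName) > 0
        ' 0 = hidden window, True = wait until the conversion has finished
        sh.Run BuildPdfToTextCommand(EXE_PATH, folderPath & fileName), 0, True
        fileName = Dir()
    Loop
End Sub
```

Waiting for each conversion to finish (the third argument of `Run`) matters here, because step 2 opens the text files straight afterwards.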

And the code that reads a text file:

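Dim strLine As String

With New Scripting.FileSystemObject
    ' 1 = ForReading, False = do not create the file, 0 = open as ASCII
    With .OpenTextFile(strFileName, 1, False, 0)

        ' Skip the first line, if there is one
        If Not .AtEndOfStream Then .SkipLine
        Do Until .AtEndOfStream
            strLine = .ReadLine   ' the stream only advances when a line is read
            'do something with strLine
        Loop
        .Close
    End With
End With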

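All of that works fine. But now I am stuck on extracting the tables from the text files. What I want is for VBA to find a string, e.g. "Yearly revenue", and then output the data that follows into columns (until the table ends).

The first part is not very difficult (finding a certain string), but how should I go about the second part? The text file looks like this Pastebin. The problem is that the text is not standardized. For example, some tables have three year columns (2010 2011 2012) while others have only two (or one), some tables have more spaces between the columns, and some tables leave out certain rows (such as Capital Asset, net).

I was thinking of doing something like this, but I am not sure how to go about it in VBA.

  1. Find a user-defined string, e.g. "Table 1: Yearly returns."
  2. a. On the next line, find the years; if there are two we need three output columns (heading + 2x year), if there are three we need four (heading + 3x year), and so on.
    b. Create the heading column plus one column for each year.
  3. When the end of the line is reached, go to the next line.
  4. a. Read the text -> output to column 1.
    b. Recognize spaces (more than 3 spaces?) as the start of column 2. Read the number -> output to column 2.
    c. (If columns = 3) recognize spaces as the start of column 3. Read the number -> output to column 3.
    d. (If columns = 4) recognize spaces as the start of column 4. Read the number -> output to column 4.
  5. For every line, repeat step 4.
  6. The next line does not contain any numbers: end of the table. (Probably the easiest is a user-defined threshold: no number after 15 characters? End the table.)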
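
To make steps 4 and 6 concrete, this is roughly what I have in mind. It is only a sketch: the names `SplitTableLine` and `HasDigit` are made up, and it assumes that columns are always separated by at least three spaces while the row label itself never contains such a gap. A dollar sign that stands apart from its number would end up in a field of its own.

```vba
' Splits one line of the text file into a row label and its value columns.
' Assumption: columns are separated by at least MIN_GAP spaces.
Function SplitTableLine(ByVal sLine As String) As Variant
    Const MIN_GAP As Long = 3
    Dim re As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = " {" & MIN_GAP & ",}"   ' a run of MIN_GAP or more spaces

    ' Collapse every wide gap into a single tab, then split on the tab
    SplitTableLine = Split(re.Replace(Trim$(sLine), vbTab), vbTab)
End Function

' Step 6: a line without any digit is taken as the end of the table.
Function HasDigit(ByVal sLine As String) As Boolean
    HasDigit = sLine Like "*[0-9]*"
End Function
```

For a line such as `Cash and equivalents      1,200     1,350` the result would be a zero-based array holding the label followed by the two figures, which could then be written to consecutive cells of one row.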

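I based my first version on Pdf to excel, but from what I read online people do not recommend OpenFile and suggest FileSystemObject instead (even though it seems to be a lot slower).

Any pointers to get me started, mainly on step 2?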

Cub*_*ase 1

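There are a number of ways to dissect a text file, and depending on how complex the file is you may lean towards one or another. I started on this and it got a bit out of hand... enjoy.

Based on the sample you provided and the additional comments, I noted the following. Some of these approaches work well for simple files but may become unwieldy for larger, more complex ones. There may also be slightly more efficient methods or tricks than the ones I have used here, but this will certainly get you to the desired result. Hopefully it makes sense in conjunction with the code provided:

  • You can use booleans to help you determine which "section" of the text file you are in. I.e. use InStr on the current line to determine that you are in a table by looking for the text "Table"; once you know you are in the "Table" section of the file, start looking for the "Assets" section, and so on.
  • You can use a few methods to determine the number of years (or columns) you have. The Split function and a loop will do the job.
  • If your files always have a constant format, even if only in certain sections, you can take advantage of this. For example, if you know that the figures on a line are always preceded by a dollar sign, then you know that this defines the column widths, and you can reuse them on the subsequent lines of text.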
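
The last two points can be illustrated in isolation. This is a minimal sketch with made-up helper names (`CountYearColumns`, `DollarPositions`), and it assumes that a year heading is any four-digit token, which is stricter than the full routine further down:

```vba
' Counts the year columns in a heading line such as "   2010   2011   2012".
' Assumption: every four-digit token is a year.
Function CountYearColumns(ByVal sHeading As String) As Long
    Dim vTokens As Variant, ii As Long, lCount As Long

    vTokens = Split(Trim$(sHeading), " ")
    For ii = LBound(vTokens) To UBound(vTokens)
        ' Consecutive spaces produce empty tokens; they fail the pattern
        If vTokens(ii) Like "####" Then lCount = lCount + 1
    Next ii
    CountYearColumns = lCount
End Function

' Returns the 1-based position of every "$" on a line, left to right.
Function DollarPositions(ByVal sLine As String) As Collection
    Dim colPos As New Collection, lPos As Long

    lPos = InStr(1, sLine, "$")
    Do While lPos > 0
        colPos.Add lPos
        lPos = InStr(lPos + 1, sLine, "$")   ' continue after the last hit
    Loop
    Set DollarPositions = colPos
End Function
```

Comparing `DollarPositions(sLine).Count` with the number of year columns is a cheap sanity check before the positions are reused as column boundaries.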

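The following code extracts the Assets details from the text file; you can modify it to extract the other sections. It should handle multiple rows. Hopefully I have commented it well enough. Have a look, and I will edit if you need further help.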

 Sub ReadInTextFile()
    Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
    Dim sFileName As String, sLine As String, vYears As Variant
    Dim iNoColumns As Integer, ii As Integer, iCount As Integer
    Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean

    Set fs = CreateObject("Scripting.FileSystemObject")
    sFileName = "G:\Sample.txt"
    Set fsFile = fs.OpenTextFile(sFileName, 1, False)

    'Loop through the file as you've already done
    Do While fsFile.AtEndOfStream <> True
        'Determine flag positions in text file
        sLine = fsFile.Readline

        Debug.Print VBA.Len(sLine)

        'Always skip empty lines (including lines that hold a single space)
        If VBA.Len(sLine) > 1 Then

            'We've found a new table so we can reset the booleans
            If VBA.InStr(1, sLine, "Table") > 0 Then
                bIsTable = True
                bIsAssets = False
                bIsNetAssets = False
                bIsLiabilities = False
                iNoColumns = 0
            End If

            'You may also want some way to flag that a table has finished, like so
            If VBA.InStr(1, sLine, "Some text that designates the end of the table") > 0 Then
                bIsTable = False
            End If

            'If we're in the table section then we want to read in the data
            If bIsTable Then
                'Check for your different sections.  You could make this constant if your text file allowed it.
                If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
                If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
                If VBA.InStr(1, sLine, "Net Assets") > 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = True

                'If we haven't triggered any of these booleans then we're at the column headings
                If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
                    'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
                    vYears = VBA.Split(VBA.Trim$(sLine), " ")
                    For ii = LBound(vYears) To UBound(vYears)
                        If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
                    Next ii

                    'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
                    ReDim sAssets(1 To iNoColumns + 1, 1 To 100) As String
                    ReDim iColumns(1 To iNoColumns) As Integer
                Else
                    If bIsAssets Then
                        'Skip the heading line
                        If Not VBA.Trim$(sLine) = "Assets" Then
                            'Increment the counter
                            iCount = iCount + 1

                            'If iCount reaches its limit you'll have to ReDim Preserve your sAssets array (I'll leave this to you)
                            If iCount > 99 Then
                                'You'll find other posts on stackoverflow to do this
                            End If

                            'This will happen on the first row. It'll happen every time you
                            'hit a $ sign, but you could code it to only do so the first time
                            If VBA.InStr(1, sLine, "$") > 0 Then
                                iColumns(1) = VBA.InStr(1, sLine, "$")
                                For ii = 2 To iNoColumns
                                    'We need to start at the next character across
                                    iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
                                Next ii
                            End If

                            'The first part (the name) is simply up to the $ sign (trimmed of spaces)
                            sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
                            For ii = 2 To iNoColumns
                                'Column ii holds the text between the previous $ sign and the current one
                                sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii - 1) + 1, iColumns(ii) - iColumns(ii - 1) - 1))
                            Next ii

                            'Now do the last column
                            If VBA.Len(sLine) > iColumns(iNoColumns) Then
                                sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
                            End If
                        Else
                            'Reset the counter
                            iCount = 0
                        End If
                    End If
                End If

            End If
        End If
    Loop

    'Clean up
    fsFile.Close
    Set fsFile = Nothing
    Set fs = Nothing
End Sub
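
The routine above collects the values in sAssets but stops short of writing them to a worksheet. A minimal sketch of that last step follows; `WriteTableToSheet` is a hypothetical helper, and it assumes the same layout as sAssets, i.e. a 1-based array indexed as (column, row).

```vba
' Writes a (column, row) string array to a worksheet, starting at cell A1.
' Assumption: both dimensions of sData are 1-based, as in sAssets above.
Sub WriteTableToSheet(ByRef sData() As String, ByVal lRowCount As Long, ByVal ws As Worksheet)
    Dim lCol As Long, lRow As Long

    For lRow = 1 To lRowCount
        For lCol = LBound(sData, 1) To UBound(sData, 1)
            ' The array is (column, row), the sheet is (row, column)
            ws.Cells(lRow, lCol).Value = sData(lCol, lRow)
        Next lCol
    Next lRow
End Sub
```

It could be called just before the clean-up, e.g. `WriteTableToSheet sAssets, iCount, ActiveSheet`. Note that Excel will usually convert numeric-looking text such as "1,200" to a number when it is assigned to a cell, depending on the regional settings.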