Search PDF content with PowerShell and output a file list

Question

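Here is what I am trying to do:

I have a large pile of files (about ten thousand) in various formats. Each file can be assigned a type (for example: product sheet, business plan, quote, presentation, etc.). The files are in no particular order, so you may as well think of them as a list. I want to build a catalog by type.

The idea is that, for a given format and a given type, I know which keywords to look for in the file contents. I want a PowerShell script that essentially runs a series of searches, finds all files of a given format that contain particular keywords, and writes each list to a separate CSV. The key point is that the keywords will be in the content (the body of a PDF, the cells of an Excel workbook, etc.), not in the file name. So far I have tried the following: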

get-childitem -Recurse | where {!$_.PSIsContainer} |
select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file.csv  -encoding default

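This works well and gives me a complete list of files, including their size and extension. I am looking for something similar, but filtered by content. Any ideas?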

Edit: new code based on the Get-Content solution below:

$searchstring = "foo"
$directory = Get-ChildItem -include ('*.pdf') -Path "C:\Users\Uzer\Searchfolder" -Recurse

foreach ($obj in $directory)
{Get-Content $obj.fullname | Where-Object {$_.Contains($searchstring)}| select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file2.csv  -encoding default}

But I get a bunch of these errors:

 An object at the specified path C:[blabla]\filename.pdf does not exist, or has been filtered by the -Include or -Exclude parameter.

Answer 1

roo*_*oot 8

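PowerShell with itextsharp.dll. The script below checks the text of every page of every PDF for the keywords, then exports any matches to a CSV. If a match is found, you can use it to rename the files, move them into category folders, and so on.

Edit: the itextsharp GitHub page says it has been discontinued and links to iText 7, https://github.com/itext/itext7-dotnet (dual-licensed as AGPL/commercial software; it appears to be free for non-commercial use).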

Add-Type -Path "C:\path_to_dll\itextsharp.dll"
$pdfs = gci "C:\path_to_pdfs" *.pdf
$export = "C:\path_to_export\export.csv"
$results = @()
$keywords = @('Keyword1','Keyword2','Keyword3')

foreach($pdf in $pdfs) {

    Write-Host "processing -" $pdf.FullName

    # prepare the pdf
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $pdf.FullName

    # for each page
    for($page = 1; $page -le $reader.NumberOfPages; $page++) {
    
        # set the page text
        $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,$page).Split([char]0x000A)

        # if the page text contains any of the keywords we're evaluating
        foreach($keyword in $keywords) {
            if($pageText -match $keyword) {
                $response = @{
                    keyword = $keyword
                    file = $pdf.FullName
                    page = $page
                }
                $results += New-Object PSObject -Property $response
            }
        }
    }
    $reader.Close()
}

Write-Host ""
Write-Host "done"

$results | epcsv $export -NoTypeInformation

Console output:

processing - C:\path_to_pdfs\1.pdf
processing - C:\path_to_pdfs\2.pdf
processing - C:\path_to_pdfs\3.pdf
processing - C:\path_to_pdfs\4.pdf
processing - C:\path_to_pdfs\5.pdf

done
PS C:\>

CSV output:

keyword    page    file
Keyword2   14      C:\path_to_pdfs\3.pdf
Keyword3   22      C:\path_to_pdfs\3.pdf
Keyword1   6       C:\path_to_pdfs\5.pdf
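
The question asks for each list to go to its own CSV. A minimal sketch of splitting the matches by keyword, assuming the results are objects with the keyword, file, and page properties used in the script above. The sample rows and the output folder name are made up for illustration, and a keyword containing characters that are not valid in a file name would need sanitizing first:

```powershell
# Hypothetical sample rows standing in for the $results array built above.
$results = @(
    [pscustomobject]@{ keyword = 'Keyword2'; page = 14; file = 'C:\path_to_pdfs\3.pdf' }
    [pscustomobject]@{ keyword = 'Keyword2'; page = 15; file = 'C:\path_to_pdfs\3.pdf' }
    [pscustomobject]@{ keyword = 'Keyword3'; page = 22; file = 'C:\path_to_pdfs\3.pdf' }
    [pscustomobject]@{ keyword = 'Keyword1'; page = 6;  file = 'C:\path_to_pdfs\5.pdf' }
)

# Hypothetical output folder (under the temp directory so the sketch runs anywhere).
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) 'keyword_lists'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

# Write one CSV per keyword, listing each matching file only once
# even if the keyword was found on several pages.
foreach ($group in ($results | Group-Object -Property keyword)) {
    $csvPath = Join-Path $outDir ($group.Name + '.csv')
    $group.Group |
        Select-Object -Property file -Unique |
        Export-Csv -Path $csvPath -NoTypeInformation
}
```

Grouping after the scan, rather than scanning once per keyword, means each PDF only has to be opened once.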

Answer 2

Sme*_*ijp -1

You can use Get-Content to look for content inside a file.

Example:

$searchstring = "foo"
$directory = Get-ChildItem -Path C:\temp\ -Recurse

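foreach ($obj in $directory)
{
    # Emit every line of the file that contains the search string
    Get-Content $obj.FullName | Where-Object { $_.Contains($searchstring) } | ForEach-Object {
        $_   # do something with each matching line...
    }
}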

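Use the $searchstring variable to supply the word to search for in the files. The $directory variable holds the files under the directory that will be searched for the search string.

More information about the Get-Content cmdlet can be found here.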
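
For formats that really are plain text (.txt, .csv, .log and so on), a sketch along the following lines may be closer to the CSV listing the question asks for. It swaps Get-Content for Select-String and passes each file through -LiteralPath. One common cause of the "does not exist, or has been filtered by the -Include or -Exclude parameter" error is a path containing wildcard characters such as [ or ], which -Path tries to expand; whether that is what happened in the question is an assumption. The function name, the default extension list, and the paths in the example call are all hypothetical, and this does not make PDFs searchable.

```powershell
# Hypothetical helper: returns one object per plain-text file whose content
# contains $SearchString (a literal, case-insensitive match, not a regex).
function Find-TextFileByContent {
    param(
        [Parameter(Mandatory = $true)][string]$Root,
        [Parameter(Mandatory = $true)][string]$SearchString,
        [string[]]$Extensions = @('.txt', '.csv', '.log')
    )

    Get-ChildItem -LiteralPath $Root -Recurse |
        # Keep files only, and only those with one of the listed extensions.
        Where-Object { -not $_.PSIsContainer -and $Extensions -contains $_.Extension } |
        Where-Object {
            # -LiteralPath avoids wildcard expansion of [ and ] in file names;
            # -Quiet stops at the first match and returns $true.
            Select-String -LiteralPath $_.FullName -Pattern $SearchString -SimpleMatch -Quiet
        } |
        Select-Object FullName, LastWriteTime, Length, Extension
}

# Example call (paths are placeholders):
# Find-TextFileByContent -Root 'C:\temp' -SearchString 'foo' |
#     Export-Csv -Path 'C:\temp\matches.csv' -NoTypeInformation -Delimiter '|'
```

Because the function emits file objects rather than matching lines, the FullName, LastWriteTime, Length, and Extension columns are populated when the result is piped to Export-Csv.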

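`Get-Content` is not well suited to reading the contents of a PDF. (2 upvotes)
You will need itextsharp.dll to parse PDF content with PowerShell. I have started looking into this and will write something up if I can. (2 upvotes)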
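
To illustrate the point about Get-Content: the page text in a PDF is commonly stored in compressed streams (FlateDecode), so the keyword usually does not appear as readable text in the raw file. The sketch below is not a PDF parser. It uses .NET's DeflateStream as a stand-in for that compression, on made-up sample text, to show a raw text search missing a keyword that is found once the data is decompressed.

```powershell
$keyword = 'foo'

# Repeated sample text, standing in for the body text of a PDF page (made up).
$text  = 'The business plan mentions foo in the body text. ' * 20
$bytes = [System.Text.Encoding]::ASCII.GetBytes($text)

# Compress the text, roughly as a PDF writer would before storing a page stream.
$buffer  = New-Object System.IO.MemoryStream
$deflate = New-Object System.IO.Compression.DeflateStream($buffer, [System.IO.Compression.CompressionMode]::Compress)
$deflate.Write($bytes, 0, $bytes.Length)
$deflate.Close()
$compressed = $buffer.ToArray()

# Search the raw bytes as text, which is all a Get-Content based search sees.
$rawText    = [System.Text.Encoding]::GetEncoding(28591).GetString($compressed)
$foundInRaw = $rawText.Contains($keyword)

# Decompress first, which is part of the work a PDF library does, then search.
$source       = New-Object System.IO.MemoryStream(,$compressed)
$inflate      = New-Object System.IO.Compression.DeflateStream($source, [System.IO.Compression.CompressionMode]::Decompress)
$streamReader = New-Object System.IO.StreamReader($inflate)
$decoded      = $streamReader.ReadToEnd()
$streamReader.Close()
$foundInDecoded = $decoded.Contains($keyword)

"found in raw bytes:    $foundInRaw"
"found after inflating: $foundInDecoded"
```

Real PDFs add further layers on top of this (object structure, font encodings, text split across drawing operators), which is why a library such as itextsharp is the more practical route.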

Archived:	8 years, 1 month ago
Views:	26791
Last activity:	5 years, 1 month ago