java.io.IOException:错误:文件结束,预期行与PDFBox问题

Dev*_*Raj 5 java pdfbox selenium-webdriver

我正在尝试从浏览器中打开的PDF中读取PDF文本.

单击"打印"按钮后,下面的URL将在新选项卡中打开.

https://myappurl.com/employees/2Jb_rpRC710XGvs8xHSOmHE9_LGkL97j/details/listprint.pdf?ids%5B%5D=2Jb_rpRC711lmIvMaBdxnzJj_ZfipcXW
Run Code Online (Sandbox Code Playgroud)

我已经使用其他网址执行了相同的程序,并且发现它工作正常.我使用了与此处使用的相同的代码(提取PDF文本).

我正在使用以下版本的PDFBox.

    <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.9</version>
</dependency>
<dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.9</version>
</dependency>
Run Code Online (Sandbox Code Playgroud)

以下是与其他URL一起正常工作的代码:

public boolean verifyPDFContent(String strURL, String reqTextInPDF) {

    boolean flag = false;

    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;

    try {
        URL url = new URL(strURL);
        BufferedInputStream file = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(file);

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);

        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException e2) {
        System.err.println("URL string could not be parsed "+e2.getMessage());
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e1) {
            e.printStackTrace();
        }
    }

    System.out.println("+++++++++++++++++");
    System.out.println(parsedText);
    System.out.println("+++++++++++++++++");

    if(parsedText.contains(reqTextInPDF)) {
        flag=true;
    }

    return flag;
}
Run Code Online (Sandbox Code Playgroud)

以下是我得到的异常的Stacktrace

java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1517)
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
at com.kareo.utils.PDFManager.getPDFContent(PDFManager.java:26)
Run Code Online (Sandbox Code Playgroud)

更新我在URL和文件调试时拍摄的图像. 在此输入图像描述 请帮帮我.这是'https'???

小智 -1

我们都知道文件流就像一个管道。数据一旦流过,就无法再次使用。所以你可以: 1.将输入流转换为文件。

public void useInputStreamTwiceBySaveToDisk(InputStream inputStream) { 
    String desPath = "test001.bin";
    try (BufferedInputStream is = new BufferedInputStream(inputStream);
         BufferedOutputStream os = new BufferedOutputStream(new FileOutputStream(desPath))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            os.write(buffer, 0, len);
        }
    } catch (IOException e) { 
        e.printStackTrace();
    }
    
    File file = new File(desPath);
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}
Run Code Online (Sandbox Code Playgroud)

2.将输入流转换为数据。

public void useInputStreamTwiceSaveToByteArrayOutputStream(InputStream inputStream) { 
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try { 
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inputStream.read(buffer)) != -1) { 
            outputStream.write(buffer, 0, len);
        }
    } catch (IOException e) { 
        e.printStackTrace();
    }
    // first read InputStream
    InputStream inputStream1 = new ByteArrayInputStream(outputStream.toByteArray());
    printInputStreamData(inputStream1);
    // second read InputStream
    InputStream inputStream2 = new ByteArrayInputStream(outputStream.toByteArray());
    printInputStreamData(inputStream2);
}
Run Code Online (Sandbox Code Playgroud)

3.用输入流进行标记和重置。

public void useInputStreamTwiceByUseMarkAndReset(InputStream inputStream) { 
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream, 10)) { 
        byte[] buffer = new byte[1024];
        //Call the mark method to mark
        //The number of bytes allowed to be read by the flag set here after reset is the maximum value of an integer
        bufferedInputStream.mark(bufferedInputStream.available() + 1);
        int len;
        while ((len = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
        // After the first call, explicitly call the reset method to reset the flow
        bufferedInputStream.reset();
        // Read the second stream
        sb = new StringBuilder();
        int len1;
        while ((len1 = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len1));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}
Run Code Online (Sandbox Code Playgroud)

那么您可以多次对同一输入流重复读取操作。