PDFBox: how to get the fields of all types of PDF forms

Bas*_*sad 2 java pdf xfa pdfbox

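I can get the field names of most PDF files with PDFBox, but I cannot get the fields of an income tax form. Is that form restricted in some way?

Although the form contains many fields, only one field is printed.

This is the output:

topmostSubform[0]

My code: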

PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
List fields = acroForm.getFields();

@SuppressWarnings("rawtypes")
java.util.Iterator fieldsIter = fields.iterator();
System.out.println(new Integer(fields.size()).toString());
while( fieldsIter.hasNext())
{
    PDField field = (PDField)fieldsIter.next();
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
}

It is called from:

public static void main(String[] args) throws IOException {
    PDDocument pdDoc = null;
    try {
        pdDoc = PDDocument.load("income.pdf");
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace(); 
    }
    Ggdfgdgdgf feilds = new Ggdfgdgdgf();
    feilds.printFields(pdDoc);
}

mkl*_*mkl 8

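The PDF in question is a hybrid AcroForm/XFA form. That means it contains the form definition in both AcroForm and XFA format.

PDFBox primarily supports AcroForm (the PDF form technology defined in the PDF specification), but because both formats are present, PDFBox can at least inspect the AcroForm form definition.

Your code overlooks that PDAcroForm.getFields() does not return all field definitions, only those of the root fields; see the JavaDoc comment: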

/**
 * This will return all of the documents root fields.
 * 
 * A field might have children that are fields (non-terminal field) or does not
 * have children which are fields (terminal fields).
 * 
 * The fields within an AcroForm are organized in a tree structure. The documents root fields 
 * might either be terminal fields, non-terminal fields or a mixture of both. Non-terminal fields
 * mark branches which contents can be retrieved using {@link PDNonTerminalField#getChildren()}.
 * 
 * @return A list of the documents root fields.
 * 
 */
public List<PDField> getFields()

If you want to access all fields, you have to walk the form field tree, for example:

public void test() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("f2290.pdf");
            PDDocument pdfDocument = PDDocument.load(resource))   // also closes the document
    {
        PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        List<PDField> fields = acroForm.getFields();
        for (PDField field : fields)
        {
            list(field);
        }
    }
}

void list(PDField field)
{
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
    if (field instanceof PDNonTerminalField)
    {
        PDNonTerminalField nonTerminalField = (PDNonTerminalField) field;
        for (PDField child : nonTerminalField.getChildren())
        {
            list(child);
        }
    }
}

This prints a long list of fields for your document.
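To illustrate why the root list has size 1 while the tree holds many fields, here is a minimal, dependency-free sketch. The Node class and the field names Page1[0], f1_1[0] and f1_2[0] are hypothetical stand-ins invented for this example; they are not the PDFBox API and not taken from the actual form. Only the depth-first walk mirrors the list(field) method above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a form field tree; this is NOT the PDFBox API.
public class FieldTreeSketch {

    /** A field with a partial name and zero or more child fields. */
    static final class Node {
        final String partialName;
        final List<Node> children = new ArrayList<>();

        Node(String partialName) {
            this.partialName = partialName;
        }

        /** Adds a child and returns this node so calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Collects the fully qualified names of all fields, depth first. */
    static List<String> collectNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            collect(root, "", names);
        }
        return names;
    }

    private static void collect(Node node, String parentName, List<String> names) {
        // Fully qualified name: the ancestors' partial names joined by '.'
        String fullName = parentName.isEmpty()
                ? node.partialName
                : parentName + "." + node.partialName;
        names.add(fullName);
        for (Node child : node.children) {
            collect(child, fullName, names);
        }
    }

    public static void main(String[] args) {
        // Assumed example tree: one root field with nested children.
        Node root = new Node("topmostSubform[0]")
                .add(new Node("Page1[0]")
                        .add(new Node("f1_1[0]"))
                        .add(new Node("f1_2[0]")));
        List<Node> roots = Collections.singletonList(root);

        // Looking only at the roots shows a single field ...
        System.out.println("root fields: " + roots.size());
        // ... while walking the tree reaches every descendant.
        for (String name : collectNames(roots)) {
            System.out.println(name);
        }
    }
}
```

In this sketch the root list contains a single entry even though the tree holds four fields, which matches the symptom in the question: iterating only the root fields stops at topmostSubform[0].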

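PS: You did not say which PDFBox version you are using. Since PDFBox development has apparently started recommending the current 2.0.0 release candidates, I assumed in this answer that you use that version.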

  • 0.7.3? Wow, ancient. I'm afraid you need 2.0.0 (3 upvotes)
  • For completeness: to iterate over all fields with PDFBox 2.0.0 you can do PDAcroForm form; ... for (PDField field : form.getFieldTree()) { ... (do something) } (2 upvotes)