
Getting the MimeType subtype with Apache Tika

For documents such as odt, ppt, pptx, and xlsx, I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice.

If you look at mimetypes.xml, each mime-type element consists of the iana.org MIME type and a "sub-class-of" element.

  <mime-type type="application/msword">
    <alias type="application/vnd.ms-word"/>
    ............................
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
  </mime-type>
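To make the "sub-class-of" relationship concrete, here is a minimal, dependency-free sketch of it as a child-to-parent map. It is a toy model, not Tika's API: the class name `MimeHierarchySketch` and its methods are hypothetical, and only the application/msword entry comes from the snippet; the other entries are assumed to follow the same pattern in mimetypes.xml.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the sub-class-of links in mimetypes.xml (hypothetical, not Tika's API).
class MimeHierarchySketch {
    // Maps each child type to its declared parent type.
    private static final Map<String, String> PARENTS = new LinkedHashMap<>();
    static {
        // Taken from the mimetypes.xml snippet in the question.
        PARENTS.put("application/msword", "application/x-tika-msoffice");
        // Assumed entries, following the same pattern.
        PARENTS.put("application/vnd.ms-excel", "application/x-tika-msoffice");
        PARENTS.put("application/x-tika-ooxml", "application/zip");
        PARENTS.put("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                    "application/x-tika-ooxml");
    }

    // Returns the type followed by each of its ancestors, most specific first.
    static List<String> chain(String type) {
        List<String> result = new ArrayList<>();
        for (String t = type; t != null; t = PARENTS.get(t)) {
            result.add(t);
        }
        return result;
    }

    // True if 'ancestor' appears anywhere above 'type' in the hierarchy.
    static boolean isSpecializationOf(String type, String ancestor) {
        List<String> c = chain(type);
        return c.subList(1, c.size()).contains(ancestor);
    }

    public static void main(String[] args) {
        System.out.println(chain("application/msword"));
        System.out.println(isSpecializationOf("application/msword", "application/x-tika-msoffice"));
    }
}
```

In this model, a detector that can only recognize the shared container format ends up at the top of the chain (x-tika-msoffice or zip), while the iana.org name sits at the bottom. Tika's own `MediaTypeRegistry` offers comparable lookups (`getSupertype`, `isSpecializationOf`).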

How can I get the iana.org MIME type name instead of the parent type name?

When testing MIME type detection, I do:

MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();

Test results:

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>
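Separately from the detection problem, some of the expected values in these results are full names (application/vnd.ms-excel) while every actual value is a bare subtype, because `getSubtype()` returns only the part after the slash. The sketch below mimics that split with plain string handling so it runs without Tika; `MediaTypeSplitSketch` and `subtypeOf` are hypothetical names, not part of any library.

```java
// Dependency-free stand-in for the type/subtype split (hypothetical helper, not Tika's API).
class MediaTypeSplitSketch {
    // Returns the part after the slash, e.g. "vnd.ms-excel" for "application/vnd.ms-excel".
    static String subtypeOf(String fullType) {
        int slash = fullType.indexOf('/');
        if (slash < 0 || slash == fullType.length() - 1) {
            throw new IllegalArgumentException("Not a type/subtype string: " + fullType);
        }
        return fullType.substring(slash + 1);
    }

    public static void main(String[] args) {
        String expected = "application/vnd.ms-excel";

        // Comparing a full name against a bare subtype can never match,
        // even when detection returns exactly the expected type.
        System.out.println(expected.equals(subtypeOf(expected))); // prints false

        // Comparing like with like: subtype against subtype.
        System.out.println("vnd.ms-excel".equals(subtypeOf(expected))); // prints true
    }
}
```

With Tika's `MediaType`, the full name is available from `toString()`, so an assertion can compare either full names on both sides or subtypes on both sides.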

Is there a way to get the actual subtype from mimetypes.xml, instead of x-tika-msoffice or application/zip?

Also, I never get application/x-tika-ooxml; xlsx, docx, and pptx files are detected as application/zip.

java detection mime-types apache-tika
