Posts by Wil*_*ill

Indexing static HTML Blob storage content with Azure Cognitive Search not working as expected

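I am working on indexing static HTML content held in Blob storage. The documentation states that, when indexing content from this data source, a pre-processing step strips the surrounding HTML tags. However, our content value is always the entire raw HTML document. I am also unable to extract the value of the meta description tag. According to the Indexing Blob Storage documentation, HTML content should automatically produce a metadata_description property, but that value is always null.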
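
To make the expected result concrete, here is a minimal sketch of the two fields I expect for a given HTML blob: the text with tags stripped, and the meta description. It uses only Python's standard html.parser module. The names ExpectedFieldsParser and expected_fields are hypothetical, and this only illustrates the expected output; it is not how Azure Cognitive Search implements its own parsing.

```python
from html.parser import HTMLParser


class ExpectedFieldsParser(HTMLParser):
    """Collects visible text and the meta description from an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.metadata_description = None
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr_map = dict(attrs)
            # <meta name="description" content="..."> carries the description.
            if (attr_map.get("name") or "").lower() == "description":
                self.metadata_description = attr_map.get("content")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def expected_fields(html):
    """Return the tag-free text and meta description for one HTML string."""
    parser = ExpectedFieldsParser()
    parser.feed(html)
    parser.close()
    return {
        "content": " ".join(parser.text_parts),
        "metadata_description": parser.metadata_description,
    }
```

Running expected_fields over a downloaded copy of one of the blobs and comparing the result with the indexed document shows quickly whether the index holds stripped text or the raw markup.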

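I have tried many different indexer configurations, but so far I cannot tell whether something is misconfigured or whether Azure Search is failing to recognize the content type correctly.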

All files in Blob storage have the .html file extension, and the content type column shows text/html.

Here is the indexer configuration (some parts <redacted>):

{
  "@odata.context": "https://<instance>.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"<tag>\"",
  "name": "<name>",
  "description": null,
  "dataSourceName": "<datasource name>",
  "skillsetName": null,
  "targetIndexName": "<target index>",
  "disabled": null,
  "schedule": {
    "interval": "PT2H",
    "startTime": "0001-01-01T00:00:00Z"
  },
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "text",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html" …

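Since the settings that govern parsing all sit under parameters.configuration, a small helper can pull them out of an indexer definition so they are easy to compare between attempts. This is a sketch only: parsing_settings and PARSING_KEYS are hypothetical names, and the example dictionary contains just the values visible in the configuration above. One thing it may be worth checking against the blob indexer documentation is whether "parsingMode": "text" treats each blob as plain text and so skips HTML handling; that is an assumption here, not something confirmed in this post.

```python
# Keys under parameters.configuration that relate to how blobs are parsed.
PARSING_KEYS = (
    "parsingMode",
    "dataToExtract",
    "excludedFileNameExtensions",
    "indexedFileNameExtensions",
)


def parsing_settings(indexer_definition):
    """Return the parsing-related settings from an indexer definition dict."""
    parameters = indexer_definition.get("parameters") or {}
    configuration = parameters.get("configuration") or {}
    # Missing keys are reported as None rather than raising.
    return {key: configuration.get(key) for key in PARSING_KEYS}


# Only the values shown in the post; everything else is omitted.
example = {
    "parameters": {
        "maxFailedItems": -1,
        "configuration": {
            "parsingMode": "text",
            "dataToExtract": "contentAndMetadata",
            "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
            "indexedFileNameExtensions": ".html",
        },
    }
}

for key, value in parsing_settings(example).items():
    print(f"{key}: {value}")
```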

meta search azure azure-cognitive-search
