当我将 JSON 摄取规范发送到 Druid 霸主 API 时,我得到以下响应:
HTTP/1.1 400 Bad Request
Content-Type: application/json
Date: Wed, 25 Sep 2019 11:44:18 GMT
Server: Jetty(9.4.10.v20180503)
Transfer-Encoding: chunked
{
"error": "Instantiation of [simple type, class org.apache.druid.indexing.common.task.IndexTask] value failed: null"
}
Run Code Online (Sandbox Code Playgroud)
如果我将index任务类型更改为index_parallel,那么我会得到:
{
"error": "Instantiation of [simple type, class org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask] value failed: null"
}
Run Code Online (Sandbox Code Playgroud)
通过 Druid 的 Web UI 使用相同的摄取规范可以正常工作。
这是我使用的摄取规范(稍微修改以隐藏敏感数据):
{
"type": "index_parallel",
"dataSchema": {
"dataSource": "daily_xport_test",
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "MONTH",
"queryGranularity": "NONE",
"rollup": false
},
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "dateday",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [
{
"type": "string",
"name": "id",
"createBitmapIndex": true
},
{
"type": "long",
"name": "clicks_count_total"
},
{
"type": "long",
"name": "ctr"
},
"deleted",
"device_type",
"target_url"
]
}
}
}
},
"ioConfig": {
"type": "index_parallel",
"firehose": {
"type": "static-google-blobstore",
"blobs": [
{
"bucket": "data-test",
"path": "/sample_data/daily_export_18092019/000000000000.json.gz"
}
],
"filter": "*.json.gz$"
},
"appendToExisting": false
},
"tuningConfig": {
"type": "index_parallel",
"maxNumSubTasks": 1,
"maxRowsInMemory": 1000000,
"pushTimeout": 0,
"maxRetry": 3,
"taskStatusCheckPeriodMs": 1000,
"chatHandlerTimeout": "PT10S",
"chatHandlerNumRetries": 5
}
}
Run Code Online (Sandbox Code Playgroud)
Overlord API URI 如下所示:
http://host:8081/druid/indexer/v1/task
Run Code Online (Sandbox Code Playgroud)
发送 API 请求的 HTTPie 命令:
HTTP/1.1 400 Bad Request
Content-Type: application/json
Date: Wed, 25 Sep 2019 11:44:18 GMT
Server: Jetty(9.4.10.v20180503)
Transfer-Encoding: chunked
{
"error": "Instantiation of [simple type, class org.apache.druid.indexing.common.task.IndexTask] value failed: null"
}
Run Code Online (Sandbox Code Playgroud)
另外,如果我尝试在 Airflow 中使用 DruidHook 类发送摄取任务,我也会遇到同样的问题
我找到了解决方案。显然,Druid UI 生成的规范与 API 使用的 JSON 格式略有不同。规范中的高级对象(“ioConfig”、“dataSchema”和“tuningConfig”)应该包装在spec对象中,如下所示:
{
"type": "index_parallel",
"spec": {
"dataSchema": {
"dataSource": "daily_xport_test",
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "MONTH",
"queryGranularity": "NONE",
"rollup": false
},
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "dateday",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [{
"type": "string",
"name": "id",
"createBitmapIndex": true
},
{
"type": "long",
"name": "clicks_count_total"
},
{
"type": "long",
"name": "ctr"
},
"deleted",
"device_type",
"target_url"
]
}
}
}
},
"ioConfig": {
"type": "index_parallel",
"firehose": {
"type": "static-google-blobstore",
"blobs": [{
"bucket": "data-test",
"path": "/sample_data/daily_export_18092019/000000000000.json.gz"
}],
"filter": "*.json.gz$"
},
"appendToExisting": false
},
"tuningConfig": {
"type": "index_parallel",
"maxNumSubTasks": 1,
"maxRowsInMemory": 1000000,
"pushTimeout": 0,
"maxRetry": 3,
"taskStatusCheckPeriodMs": 1000,
"chatHandlerTimeout": "PT10S",
"chatHandlerNumRetries": 5
}
}
}
Run Code Online (Sandbox Code Playgroud)