Tags: druid, google-cloud-platform, google-cloud-dataproc
The page describing Dataproc's Druid support has no section on how to load data into the cluster. I have been trying to do this with Google Cloud Storage (GCS), but cannot figure out how to build a working ingestion spec for it. I expected the "firehose" section to have some Google-specific bucket reference, but there are no examples of how to do this.

What is the right way to load data into Druid running directly on GCP Dataproc?
I haven't used the Dataproc flavor of Druid, but I do have a small cluster running on Google Compute VMs. The way I ingest data from GCS is with the Google Cloud Storage Druid extension: https://druid.apache.org/docs/latest/development/extensions-core/google.html

To enable the extension, add it to the extensions list in your Druid common.runtime.properties file:
druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]
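One assumption worth stating: as far as I know, the extension authenticates through Google's application default credentials, so on a GCE or Dataproc VM it normally picks up the VM's service account automatically. If your Druid processes run elsewhere, you can point them at a key file instead (the path below is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json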
To ingest data from GCS, I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task

The POST request body contains a JSON file with the ingestion spec (see the ["ioConfig"]["firehose"] section):
{
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "daily_xport_test",
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "MONTH",
                "queryGranularity": "NONE",
                "rollup": false
            },
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {
                        "column": "dateday",
                        "format": "auto"
                    },
                    "dimensionsSpec": {
                        "dimensions": [
                            {
                                "type": "string",
                                "name": "id",
                                "createBitmapIndex": true
                            },
                            {
                                "type": "long",
                                "name": "clicks_count_total"
                            },
                            {
                                "type": "long",
                                "name": "ctr"
                            },
                            "deleted",
                            "device_type",
                            "target_url"
                        ]
                    }
                }
            }
        },
        "ioConfig": {
            "type": "index_parallel",
            "firehose": {
                "type": "static-google-blobstore",
                "blobs": [
                    {
                        "bucket": "data-test",
                        "path": "/sample_data/daily_export_18092019/000000000000.json.gz"
                    }
                ],
                "filter": "*.json.gz$"
            },
            "appendToExisting": false
        },
        "tuningConfig": {
            "type": "index_parallel",
            "maxNumSubTasks": 1,
            "maxRowsInMemory": 1000000,
            "pushTimeout": 0,
            "maxRetry": 3,
            "taskStatusCheckPeriodMs": 1000,
            "chatHandlerTimeout": "PT10S",
            "chatHandlerNumRetries": 5
        }
    }
}
Example cURL command to start the ingestion task in Druid (spec.json contains the JSON from the previous section):
curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
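If you also want to follow the task from the command line, the overlord's response to that POST contains the task ID, which you can feed back into the task status endpoint. A minimal sketch, assuming jq is installed and the same overlord host/port as above:

TASK_ID=$(curl -s -X POST -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task | jq -r '.task')
curl -s http://druid-overlord-host:8081/druid/indexer/v1/task/$TASK_ID/status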