Druid is available as alpha on GCP Dataproc. How do I load segments?

rad*_*ind 6 druid google-cloud-platform google-cloud-dataproc

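The Dataproc page that describes Druid support has no section on how to load data into the cluster. I have been trying to do this with Google Cloud Storage, but I don't know how to set up a working spec for it. I expected the "firehose" section to contain some Google-specific reference to a bucket, but there are no examples of how to do this.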

What is the way to load data into Druid running on GCP Dataproc straight out of the box?

Art*_*sia 6

I haven't used the Dataproc version of Druid, but I do have a small cluster running on Google Compute VMs. The way I ingest data from GCS is with the Google Cloud Storage Druid extension: https://druid.apache.org/docs/latest/development/extensions-core/google.html

To enable the extension, add it to the list of extensions in your Druid common.runtime.properties file:

druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]

To ingest data from GCS, I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task

The POST request body contains a JSON file with the ingestion spec (see the ["ioConfig"]["firehose"] section):

{
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "daily_xport_test",
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "MONTH",
                "queryGranularity": "NONE",
                "rollup": false
            },
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {
                        "column": "dateday",
                        "format": "auto"
                    },
                    "dimensionsSpec": {
                        "dimensions": [{
                                "type": "string",
                                "name": "id",
                                "createBitmapIndex": true
                            },
                            {
                                "type": "long",
                                "name": "clicks_count_total"
                            },
                            {
                                "type": "long",
                                "name": "ctr"
                            },
                            "deleted",
                            "device_type",
                            "target_url"
                        ]
                    }
                }
            }
        },
        "ioConfig": {
            "type": "index_parallel",
            "firehose": {
                "type": "static-google-blobstore",
                "blobs": [{
                    "bucket": "data-test",
                    "path": "/sample_data/daily_export_18092019/000000000000.json.gz"
                }],
                "filter": "*.json.gz$"
            },
            "appendToExisting": false
        },
        "tuningConfig": {
            "type": "index_parallel",
            "maxNumSubTasks": 1,
            "maxRowsInMemory": 1000000,
            "pushTimeout": 0,
            "maxRetry": 3,
            "taskStatusCheckPeriodMs": 1000,
            "chatHandlerTimeout": "PT10S",
            "chatHandlerNumRetries": 5
        }
    }
}
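
If several objects need to be ingested, the "blobs" list in the firehose can be generated instead of typed by hand. A minimal Python sketch, assuming the same bucket as in the spec; the helper name `make_firehose` and the second object path are made up for illustration and are not part of the original answer:

```python
import json


def make_firehose(bucket, paths):
    """Build a static-google-blobstore firehose dict for the given objects."""
    return {
        "type": "static-google-blobstore",
        # One {"bucket", "path"} entry per object, same shape as in the spec.
        "blobs": [{"bucket": bucket, "path": p} for p in paths],
    }


if __name__ == "__main__":
    firehose = make_firehose(
        "data-test",
        [
            "/sample_data/daily_export_18092019/000000000000.json.gz",
            # Hypothetical second shard, not from the original answer.
            "/sample_data/daily_export_18092019/000000000001.json.gz",
        ],
    )
    print(json.dumps(firehose, indent=4))
```

The resulting dict can be placed under ["spec"]["ioConfig"]["firehose"] before the spec is serialized and posted.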

Example cURL command to start the ingestion task in Druid (spec.json contains the JSON from the previous section):

curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
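
The same request can also be sent from Python using only the standard library. This is a sketch rather than part of the original answer: `druid-overlord-host:8081` is the placeholder host used above, `build_task_request` and `submit_task` are illustrative names, and it assumes the Overlord answers with a JSON body.

```python
import json
import urllib.request

# Placeholder host and port, copied from the cURL example above.
OVERLORD_TASK_URL = "http://druid-overlord-host:8081/druid/indexer/v1/task"


def build_task_request(spec, url=OVERLORD_TASK_URL):
    """Prepare the POST request carrying the ingestion spec as JSON."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec, url=OVERLORD_TASK_URL):
    """Send the spec to the Overlord and return the parsed JSON reply."""
    with urllib.request.urlopen(build_task_request(spec, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (needs a reachable Overlord):
#   with open("spec.json") as f:
#       print(submit_task(json.load(f)))
```

Keeping the request construction separate from the network call makes it possible to inspect the URL, headers and body before anything is sent to the cluster.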