Adr*_*eno 17 csv elasticsearch logstash
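We have an existing search function that involves data across multiple tables in SQL Server. This puts a heavy load on our database, so I'm trying to find a better way to search this data (it doesn't change very often). I've been working with Logstash and Elasticsearch for about a week, using an import that contains 1.2 million records. My question is essentially: "How do I update existing documents using my 'primary key'?"
The CSV data file (pipe-delimited) looks like this: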
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My Logstash config looks like this:
input {
  stdin {
    type => "stdin-type"
  }
  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}
filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}
output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}
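To make the effect of the csv filter concrete, here is a rough Python sketch of the same split: each pipe-delimited line becomes a set of named fields. The function name `parse_lines` is made up for illustration, and Python's `csv` module only stands in for the Logstash filter; this is not how Logstash implements it.

```python
import csv
import io

# Column names copied from the csv filter in the config above.
COLUMNS = ["property_id", "postal_code", "address_1", "city", "state_code"]


def parse_lines(text):
    """Split pipe-delimited lines into dicts keyed by column name.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(text), delimiter="|")
    # Skip blank lines, which csv.reader yields as empty lists.
    return [dict(zip(COLUMNS, row)) for row in reader if row]


sample = "369|90045|123 ABC ST|LOS ANGELES|CA\n368|90045|PVKA0010|LA|CA\n"
events = parse_lines(sample)
print(events[0]["property_id"], events[0]["address_1"])  # prints: 369 123 ABC ST
```

Every value comes out as a string, which matches the `"property_id": "369"` seen in the indexed document.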
The documents in Elasticsearch look like this:
{
  "_index": "samples4",
  "_type": "sample",
  "_id": "64Dc0_1eQ3uSln_k-4X26A",
  "_score": 1.4054651,
  "_source": {
    "message": [
      "369|90045|123 ABC ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-11T22:58:38.365Z",
    "host": "[host]",
    "path": "C:/Data/sample/sample.csv",
    "property_id": "369",
    "postal_code": "90045",
    "address_1": "123 ABC ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
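I would like the unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files will contain updates. I don't need to keep previous versions, and we will never add or remove keys from a document.
The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just puts in the literal text "property_id", so only one document is ever stored and updated). I know I'm missing something here. Am I just taking the wrong approach?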
EDIT: WORKING!
Using the suggestion from @rutter, I've updated the output config to this:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}
Now documents are updated as expected when new files are dropped into the data folder. The _id and property_id values are the same.
{
  "_index": "samples6",
  "_type": "sample",
  "_id": "351",
  "_score": 1,
  "_source": {
    "message": [
      "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-12T16:12:52.102Z",
    "host": "TXDFWL3474",
    "path": "C:/Data/sample/sample_update_3.csv",
    "property_id": "351",
    "postal_code": "90045",
    "address_1": "Easy as 123 ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
rut*_*ter 13
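Converting from a comment:
You can overwrite a document by sending another document with the same ID... but that might be tricky with your existing data, since you get randomized IDs by default.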
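As a toy illustration of that point, the sketch below uses a plain Python dict in place of an index. The helper `index_document` is hypothetical, and `uuid4` only imitates "some random ID"; it is not the algorithm Elasticsearch uses to generate IDs.

```python
import uuid


def index_document(index, doc, doc_id=None):
    """Store doc under doc_id, replacing any document already stored there.

    Hypothetical helper: when no ID is given, a random one is generated,
    mimicking the default behaviour described above.
    """
    if doc_id is None:
        doc_id = uuid.uuid4().hex
    index[doc_id] = doc
    return doc_id


# Random IDs: sending the same property twice leaves two documents behind.
random_ids = {}
index_document(random_ids, {"property_id": "369", "address_1": "123 ABC ST"})
index_document(random_ids, {"property_id": "369", "address_1": "ABC ST 123"})

# Explicit IDs: the second send overwrites the first.
keyed = {}
index_document(keyed, {"property_id": "369", "address_1": "123 ABC ST"}, doc_id="369")
index_document(keyed, {"property_id": "369", "address_1": "ABC ST 123"}, doc_id="369")

print(len(random_ids), len(keyed))  # prints: 2 1
```

This is also why documents already indexed under random IDs are awkward to update: nothing ties them back to property_id, so they would likely need to be re-imported under the new ID scheme.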
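You can set the ID with the output plugin's document_id setting, but it takes a literal string, not a field name. To use the contents of a field, you can use an sprintf format string such as %{property_id}.
Something like this, for example: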
output {
  elasticsearch {
    # ... other settings ...
    document_id => "%{property_id}"
  }
}
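To show the difference between a literal string and a %{...} field reference, here is a small Python sketch of that substitution. The names `resolve` and `FIELD_REF` are made up for illustration, and leaving unknown fields untouched is a choice made for this sketch; it is not Logstash's actual implementation.

```python
import re

# Matches %{name} and captures the field name.
FIELD_REF = re.compile(r"%\{([^}]+)\}")


def resolve(template, event):
    """Replace each %{name} in template with the event's value for that field.

    Hypothetical helper: references to fields the event does not have
    are left as-is in this sketch.
    """
    def substitute(match):
        name = match.group(1)
        return str(event[name]) if name in event else match.group(0)

    return FIELD_REF.sub(substitute, template)


event = {"property_id": "351", "city": "LOS ANGELES"}

print(resolve("property_id", event))     # prints: property_id
print(resolve("%{property_id}", event))  # prints: 351
```

With the bare string, every event resolves to the same ID, which is consistent with only one document being stored and repeatedly updated, as described in the question.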