Remove duplicate documents based on a field

K20*_*0GH 6 mongodb mongodb-query aggregation-framework

I have seen many solutions, but they all target Mongo v2 and do not work with v3.

My documents look like this:

    { 
    "_id" : ObjectId("582c98667d81e1d0270cb3e9"), 
    "asin" : "B01MTKPJT1", 
    "url" : "https://www.amazon.com/Trump-President-Presidential-Victory-T-Shirt/dp/B01MTKPJT1%3FSubscriptionId%3DAKIAIVCW62S7NTZ2U2AQ%26tag%3Dselfbalancingscooters-21%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB01MTKPJT1", 
    "image" : "http://ecx.images-amazon.com/images/I/41RvN8ud6UL.jpg", 
    "salesRank" : NumberInt(442137), 
    "title" : "Trump Wins 45th President Presidential Victory T-Shirt", 
    "brand" : "\"Getting Political On Me\"", 
    "favourite" : false, 
    "createdAt" : ISODate("2016-11-16T17:33:26.763+0000"), 
    "updatedAt" : ISODate("2016-11-16T17:33:26.763+0000")
    }

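My collection contains about 500k documents. I want to remove all duplicate documents that share the same ASIN, keeping one of each.

How can I do this?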

sty*_*ane 11

This is something we can actually do with the aggregation framework, with no client-side processing.

MongoDB 3.4

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$replaceRoot": { "newRoot": "$doc" } },
        { "$out": "collection" }
    ]
)
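Because `$out` replaces the target collection with whatever the pipeline produces, a more cautious variant is to write the result to a separate collection and inspect it first. The following is only a sketch under assumptions: `dedupeInto` and the target name `collection_deduped` are hypothetical, `db` is the handle the mongo shell provides, and `allowDiskUse` is added on the assumption that 500k documents may exceed the in-memory limit for `$sort` and `$group`.

```javascript
// Sketch only: dedupeInto and "collection_deduped" are hypothetical names.
// `db` is the database handle provided by the mongo shell.
function dedupeInto(db, sourceName, targetName, field) {
    var pipeline = [
        // Lowest _id first, so $first keeps that document for each key.
        { $sort: { _id: 1 } },
        // One group per distinct value of `field`, keeping the whole first document.
        { $group: { _id: "$" + field, doc: { $first: "$$ROOT" } } },
        // Promote the kept document back to the top level (MongoDB 3.4+).
        { $replaceRoot: { newRoot: "$doc" } },
        // Write to a separate collection; the source is left untouched.
        { $out: targetName }
    ];
    // allowDiskUse lets large stages spill to temporary files on disk.
    db.getCollection(sourceName).aggregate(pipeline, { allowDiskUse: true });
    return pipeline;
}

// Usage in the shell (names are placeholders):
// dedupeInto(db, "collection", "collection_deduped", "asin");
// db.collection_deduped.count();
```

Once the new collection looks right, it could be swapped in with something like `db.collection_deduped.renameCollection("collection", true)`, where the second argument drops the existing target.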

MongoDB version <= 3.2:

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$project": { 
            "asin": "$doc.asin", 
            "url": "$doc.url", 
            "image": "$doc.image", 
            "salesRank": "$doc.salesRank", 
            "title": "$doc.title", 
            "brand": "$doc.brand", 
            "favourite": "$doc.favourite", 
            "createdAt": "$doc.createdAt", 
            "updatedAt": "$doc.updatedAt" 
        }},
        { "$out": "collection" }
    ]
)
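Typing every field mapping of the `$project` stage by hand is error-prone, so one option is to generate the stage from a list of field names. This is a minimal sketch, not part of the original answer: `buildProjectStage` is a hypothetical helper, and the field list is taken from the sample document in the question.

```javascript
// Sketch only: buildProjectStage is a hypothetical helper, not a MongoDB API.
// It maps each listed field name to "$doc.<name>" for the $project stage.
function buildProjectStage(fields) {
    var spec = {};
    fields.forEach(function (name) {
        spec[name] = "$doc." + name;
    });
    return { $project: spec };
}

// Field names taken from the sample document in the question.
var projectStage = buildProjectStage([
    "asin", "url", "image", "salesRank", "title",
    "brand", "favourite", "createdAt", "updatedAt"
]);

// Usage in the shell: place projectStage between the $group and $out stages.
```

Note that after `$group` the `_id` of each result is the `asin` value. If the original `_id` should be kept, adding `"_id"` to the field list maps it to `"$doc._id"`.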

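  • Thanks for your answer. I think I did something wrong in my query: I ran the first one and my whole collection disappeared :/ Mentioning this in case someone runs it without a backup (4 upvotes)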
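Given the experience in the comment above, it may help to run a read-only check before any pipeline that ends in `$out`. The sketch below assumes a hypothetical helper named `duplicateReportPipeline`; it only counts duplicates and writes nothing.

```javascript
// Sketch only: duplicateReportPipeline is a hypothetical helper name.
function duplicateReportPipeline(field) {
    return [
        // Count documents per distinct value of `field`.
        { $group: { _id: "$" + field, count: { $sum: 1 } } },
        // Keep only values that occur more than once.
        { $match: { count: { $gt: 1 } } },
        // Summarise: how many values are duplicated, and how many
        // documents would be removed if one per value were kept.
        { $group: {
            _id: null,
            duplicatedValues: { $sum: 1 },
            surplusDocs: { $sum: { $subtract: ["$count", 1] } }
        } }
    ];
}

// Usage in the shell (read-only):
// db.collection.aggregate(duplicateReportPipeline("asin"), { allowDiskUse: true });
```

If `surplusDocs` is far from what you expect, for example because the field name is misspelled and every document falls into a single group, that is a sign to stop before running the destructive version.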