Robomongo:超出$ group的内存限制

kad*_*dzu 18 out-of-memory duplicates mongodb

我使用脚本来删除mongo上的重复项,它在我用作测试的10个项目的集合中工作但是当我使用600万个文档的真实集合时,我收到错误.

这是我在Robomongo(现在称为Robo 3T)中运行的脚本:

var bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();
var count = 0;

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();
Run Code Online (Sandbox Code Playgroud)

这是错误消息:

Error: command failed: {
    "errmsg" : "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.",
    "code" : 16945,
    "ok" : 0
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:23:13
doassert@src/mongo/shell/assert.js:13:14
assert.commandWorked@src/mongo/shell/assert.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1215:5
@(shell):1:1
Run Code Online (Sandbox Code Playgroud)

所以我需要开始allowDiskUse:true工作?我在哪里在脚本中这样做,这样做有什么问题吗?

Ati*_*ish 40

{ allowDiskUse: true } 
Run Code Online (Sandbox Code Playgroud)

应该在聚合管道之后放置.

在你的代码中,这应该是这样的:

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
], { allowDiskUse: true } )
Run Code Online (Sandbox Code Playgroud)

  • 这给了我一个错误“AttributeError:‘dict’对象没有属性‘_txn_read_preference’”。这是因为它不应该是一个字典。它应该是这样的:```findings = list(collection.aggregate(aggr, allowDiskUse=True))``` (2认同)