I'm running into a problem exporting to CSV with a Python script. Some array data needs to be exported from MongoDB to CSV, but the script below does not export it correctly: the data of the three sub-fields gets dumped into a single column. I want to split the three fields under the answers field (order, text, answerId) into three separate columns in the CSV.
Sample from MongoDB:
"answers": [
{
"order": 0,
"text": {
"en": "Yes"
},
"answerId": "527d65de7563dd0fb98fa28c"
},
{
"order": 1,
"text": {
"en": "No"
},
"answerId": "527d65de7563dd0fb98fa28b"
}
]
Python script:
import csv

# 'db' is an existing pymongo database handle (connection code omitted in the question)
cursor = db.questions.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})
cursor = list(cursor)

with open('answer_2.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for x in cursor:
        for y, v in x.iteritems():
            if y == 'answers':
                print(y, v)
                write.writerow(v)
        write.writerow(x)
So... the problem is that the csv writer does not understand the concept of "sub-dictionaries" the way Mongo returns them.
If I understand correctly, when you query Mongo you get a dictionary like this:
{
    "_id": "a hex ID that corresponds to the record that contains several answers",
    "answers": [ ... a list with a bunch of dicts in it ... ]
}
So when csv.DictWriter tries to write it, it only writes one dictionary (the topmost one). It doesn't know (or care) that answers is a list containing dictionaries whose values also need to go into columns (dotted notation for reaching into a sub-dictionary, such as answers.order, is understood only by Mongo, not by the csv writer).
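To make the failure concrete, here is a minimal sketch (the record is shaped like the sample above; the exact error text may vary between Python versions) of what happens if that top-level dictionary is handed to DictWriter as-is:

import csv

# One record shaped like what the query returns
record = {
    '_id': '580f9aa82de54705a2520833',
    'answers': [
        {'order': 0, 'text': {'en': 'Yes'}, 'answerId': '527d65de7563dd0fb98fa28c'},
        {'order': 1, 'text': {'en': 'No'}, 'answerId': '527d65de7563dd0fb98fa28b'},
    ],
}

fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']

with open('demo.csv', 'w') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    try:
        # 'answers' is not one of the fieldnames, and the dotted names do not
        # exist as keys of the record, so DictWriter refuses to write the row.
        writer.writerow(record)
    except ValueError as err:
        print(err)  # dict contains fields not in fieldnames: 'answers'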
What I think you should do is "walk" the answers list and build one dictionary out of every record (every dict) in that list. Once you have a list of "flattened" dictionaries, you can iterate through them and write them to your csv file:
import csv

from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB server running locally

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

# Step 1: Create the list of dictionaries (one dictionary per entry in the `answers` list)
flattened_records = []
for answers_record in cursor:
    answers_record_id = answers_record['_id']
    for answer_record in answers_record['answers']:
        flattened_record = {
            '_id': answers_record_id,
            'answers.order': answer_record['order'],
            'answers.text': answer_record['text'],
            'answers.answerId': answer_record['answerId']
        }
        flattened_records.append(flattened_record)

# Step 2: Iterate through the list of flattened records and write them to the csv file
with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for flattened_record in flattened_records:
        write.writerow(flattened_record)
Note the use of the plural: answers_record (one document, with its list of answers) is different from answer_record (a single entry in that list).
This creates a file like this:
$ cat ./stack_039.csv
_id,answers.order,answers.text,answers.answerId
580f9aa82de54705a2520833,0,{u'en': u'Yes'},527d65de7563dd0fb98fa28c
580f9aa82de54705a2520833,1,{u'en': u'No'},527d65de7563dd0fb98fa28b
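Note that the answers.text column still holds the raw sub-dictionary ({u'en': u'Yes'}). If all you want in that column is the English string, one option (a sketch assuming every answer carries an 'en' key, as the sample suggests) is to unwrap it while building the flattened record:

# Inside the inner loop: pull the English string out of the nested 'text' dict.
# .get() keeps the row writable even if some answer is missing the 'en' key.
flattened_record = {
    '_id': answers_record_id,
    'answers.order': answer_record['order'],
    'answers.text': answer_record['text'].get('en', ''),
    'answers.answerId': answer_record['answerId']
}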
EDIT:
Your query (the cursor produced by cursor = db.questions.find({}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})) will return all the entries in the questions collection. If that collection is very large, you may want to use the cursor as an iterator.
As you may have noticed, the first for loop in the code above puts all the records into one list (the flattened_records list). You can load lazily by iterating over the cursor instead (rather than loading every item into memory: fetch one, do something with it, fetch the next, do something with it...).
It is slightly slower, but more memory-efficient.
cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for answers_record in cursor:  # Here we are using 'cursor' as an iterator
        answers_record_id = answers_record['_id']
        for answer_record in answers_record['answers']:
            flattened_record = {
                '_id': answers_record_id,
                'answers.order': answer_record['order'],
                'answers.text': answer_record['text'],
                'answers.answerId': answer_record['answerId']
            }
            write.writerow(flattened_record)
It will generate the same .csv file as shown above.
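As a final sketch, the same flattening can also be pushed to the MongoDB server with an aggregation pipeline using $unwind; this is an alternative to the Python-side loop above (the collection name follows the test setup used here, and the output file name is made up), with the csv writing essentially unchanged:

import csv

from pymongo import MongoClient

client = MongoClient()

# $unwind emits one document per element of the 'answers' array, so the
# flattening happens server-side; $project lifts the sub-fields to the top level.
pipeline = [
    {'$unwind': '$answers'},
    {'$project': {'order': '$answers.order',
                  'text': '$answers.text',
                  'answerId': '$answers.answerId'}},
]

with open('stack_039_unwind.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for doc in client.stack_overflow.stack_039.aggregate(pipeline):
        write.writerow({
            '_id': doc['_id'],
            'answers.order': doc['order'],
            'answers.text': doc['text'],
            'answers.answerId': doc['answerId'],
        })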