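I have a 300 MB CSV with 3 million rows of city information from Geonames.org. I am trying to convert this CSV to JSON so I can import it into MongoDB with mongoimport. The reason I want JSON is that it lets me specify the "loc" field as an array rather than a string, for use with a geospatial index. The CSV is encoded in UTF-8.
A snippet of my CSV looks like this: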
"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zam?n S?khteh","Zamin Sukhteh","Zamin Sukhteh,Zam?n S?khteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yek?h?","Yekahi","Yekahi,Yek?h?","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarv?? ‘Ad??","Tarvih `Adai","Tarvih `Adai,Tarv?? ‘Ad??","[48.2,32.1]","P","PPL","IR",,"15",,,
The desired JSON output for use with mongoimport (charset aside) is as follows:
{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
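I have tried every online CSV-to-JSON converter I could find, but none of them can handle a file of this size. The closest I got was Mr. Data Converter (which produced the output above); its output imports into MongoDB once the opening and closing brackets and the commas between documents are removed. Unfortunately, that tool does not work on a 300 MB file.
The JSON above is set to be encoded in UTF-8 but still has charset problems, most likely due to a conversion error?
I have spent the last three days learning Python, trying Python csvkit, trying every CSV-to-JSON script on Stack Overflow, importing the CSV into MongoDB and changing the "loc" string to an array (which unfortunately keeps the quotes), and even copying and pasting 30,000 records at a time by hand. Lots of reverse engineering, trial and error, and so on.
Does anyone know how to produce the JSON above while keeping the encoding intact, as in the CSV above? I am completely stuck.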
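One possible approach, sketched here under assumptions rather than offered as a definitive answer: stream the CSV row by row with Python 3's standard csv and json modules and write one JSON document per line, which is the layout mongoimport reads by default. The function names convert_row, csv_to_ndjson and convert_file are made up for this illustration. The sketch assumes the header row shown above, turns empty fields into null, parses "loc" into a real array, and leaves admin1_code as a string (wrap it in int() if numbers are wanted, as in the desired output).

```python
import csv
import json


def convert_row(row):
    """Turn one csv.DictReader row into a MongoDB-ready document."""
    # Empty CSV fields become None, which json.dumps writes as null.
    doc = {key: (value if value != "" else None) for key, value in row.items()}
    doc["geonameid"] = int(doc["geonameid"])
    # "[48.91667,32.48333]" is already valid JSON, so parse it into a list.
    doc["loc"] = json.loads(doc["loc"])
    return doc


def csv_to_ndjson(src, dst):
    """Stream rows from src to dst, one JSON document per line."""
    count = 0
    for row in csv.DictReader(src):
        # ensure_ascii=False keeps non-ASCII characters as-is instead of \u escapes.
        dst.write(json.dumps(convert_row(row), ensure_ascii=False) + "\n")
        count += 1
    return count


def convert_file(csv_path, json_path):
    """Convert a UTF-8 CSV file to a UTF-8 file of line-delimited JSON."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(json_path, "w", encoding="utf-8") as dst:
        return csv_to_ndjson(src, dst)
```

Because only one row is held in memory at a time, file size should not be the limiting factor. Calling convert_file("cities.csv", "cities.json") (hypothetical file names) would produce a file that can then be passed to mongoimport with its --file option, together with --db and --collection values of your choosing.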
Here is my models.py
from django.db import models

class School(models.Model):
    school = models.CharField(max_length=300)

    def __unicode__(self):
        return self.school

class Lawyer(models.Model):
    firm_url = models.URLField('Bio', max_length=200)
    firm_name = models.CharField('Firm', max_length=100)
    first = models.CharField('First Name', max_length=50)
    last = models.CharField('Last Name', max_length=50)
    year_graduated = models.IntegerField('Year graduated')
    school = models.CharField(max_length=300)
    school = models.ForeignKey(School)

    class Meta:
        ordering = ('?',)

    def __unicode__(self):
        return self.first
Two sample rows from the csv file:
"http://www.graychase.com/aabbas,Gray & Chase LLP, Amr A ,Abbas,The George Washington University Law School, 2005"
"http://www.graychase.com/kadam,Gray & Chase LLP, Karin ,Adam,Ernst Moritz Arndt University Greifswald, 2004"
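As a rough illustration of how the sample rows line up with the Lawyer fields, here is a minimal sketch in plain Python, with no Django involved. It assumes every row is wrapped in a single pair of quotes, as in the two samples, and that no individual value contains a comma; the FIELDS tuple and the parse_row name are invented for this example.

```python
import csv
import io

# Column order as it appears in the two sample rows (an assumption).
FIELDS = ("firm_url", "firm_name", "first", "last", "school", "year_graduated")


def parse_row(line):
    """Split one fully quoted sample line into a dict keyed by FIELDS."""
    # The whole record sits inside one pair of quotes, so csv.reader
    # returns a single field that still has to be split on commas.
    (record,) = next(csv.reader(io.StringIO(line)))
    values = [part.strip() for part in record.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d values, got %d" % (len(FIELDS), len(values)))
    row = dict(zip(FIELDS, values))
    row["year_graduated"] = int(row["year_graduated"])
    return row
```

Each resulting dict holds the school as a plain name, so mapping it onto the School foreign key would be a separate step.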
Thanks.
Edit …