ali*_*ali 7 python json pandas
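I have a ".csv" file containing movie data, and I am trying to reformat it as a JSON file for use in MongoDB. I loaded the csv file into a pandas DataFrame and then wrote it back out with the to_json method. One row of the DataFrame looks like this: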
In [43]: result.iloc[0]
Out[43]:
title Avatar
release_date 2009
cast [{"cast_id": 242, "character": "Jake Sully", "...
crew [{"credit_id": "52fe48009251416c750aca23", "de...
Name: 0, dtype: object
But when pandas writes it back out, it becomes this:
{ "title":"Avatar",
"release_date":"2009",
"cast":"[{\"cast_id\": 242, \"character\": \"Jake Sully\", \"credit_id\": \"5602a8a7c3a3685532001c9a\", \"gender\": 2,...]",
"crew":"[{\"credit_id\": \"52fe48009251416c750aca23\", \"department\": \"Editing\", \"gender\": 0, \"id\": 1721,...]"
}
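As you can see, "cast" and "crew" are lists, and they contain a lot of redundant backslashes. These backslashes show up in the MongoDB collection and make it impossible to extract data from these two fields.
How can I solve this other than by replacing \" with "?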
PS1: This is how I save the DataFrame as JSON:
result.to_json('result.json', orient='records', lines=True)
Update 1: Apparently pandas is doing its job correctly, and the problem is caused by the original csv file. Its contents look like this:
movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3},...]"
I tried replacing "" with " (and I would really like to avoid this hack):
sed -i 's/\"\"/\"/g'
Of course, when the file is read as csv again, this causes problems for some rows:
ParserError: Error tokenizing data. C error: Expected 1501 fields in line 4, saw 1513
So we can conclude that this blind replacement is not safe. Any ideas?
PS2: I am using Kaggle's 5000 movie dataset: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
小智 14
I had the same problem. The solution takes three steps:
1- Build the DataFrame from a csv or, in my case, from an xlsx:
excel_df = pd.read_excel(dataset, sheet_name=my_sheet_name)
2- Convert to JSON (date_format='iso' is useful if your data contains dates):
json_str = excel_df.to_json(orient='records', date_format='iso')
3- The most important step: json.loads. That's it!
parsed = json.loads(json_str)
4- (Optional) You can write the JSON to a file or send it somewhere. For example, writing locally:
with open(out, 'w') as json_file:
    json_file.write(json.dumps({"data": parsed}, indent=4))
More info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
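The steps above can be sketched end to end as follows. This is a minimal illustration, not the answerer's exact code: it assumes a small in-memory CSV (the hypothetical `csv_text`) in place of the xlsx file, and it builds the output string instead of writing a file. The point of step 3 is that `to_json` already returns JSON text, so passing that text straight to `json.dumps` would encode it a second time and escape every quote.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the spreadsheet; the answer reads an xlsx instead.
csv_text = "title,release_date\nAvatar,2009-12-10\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

# Step 2: serialize to a JSON string, with dates written as ISO 8601 text.
json_str = df.to_json(orient="records", date_format="iso")

# Step 3: parse that string back into plain Python lists and dicts.
parsed = json.loads(json_str)

# Step 4 (optional): wrap and dump once, so quotes are escaped only once.
payload = json.dumps({"data": parsed}, indent=4)

# For contrast: dumping the JSON string itself encodes it twice.
double_encoded = json.dumps({"data": json_str})
```

Note that a column whose cells hold JSON text, such as "cast" in the question, still comes back as a plain string after this round trip; such a column needs its own json.loads pass.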
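Pandas escapes the " character because it treats the values in the json column as text. To get the desired behavior, simply parse the values in the json column as JSON.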
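A small comparison may make the difference concrete. The sketch below assumes a shortened, made-up cast value (`raw`) and builds two one-row frames: one that stores the value as text and one that stores the parsed list. Only the first produces backslashes in the `to_json` output.

```python
import json

import pandas as pd

# Shortened, illustrative cell value; the real dataset has many more keys.
raw = '[{"cast_id": 242, "character": "Jake Sully"}]'

# Same data stored two ways: as a str, and as a parsed list of dicts.
as_text = pd.DataFrame({"cast": [raw]})
as_objects = pd.DataFrame({"cast": [json.loads(raw)]})

text_out = as_text.to_json(orient="records")        # quotes get escaped
objects_out = as_objects.to_json(orient="records")  # nested JSON, no escapes

print(text_out)
print(objects_out)
```

In the first output the whole list is one JSON string, which is why MongoDB cannot query inside it; in the second it is a real array of objects.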
Suppose the file data.csv has the following content (the quotes are escaped).
# data.csv
movie_id,title,cast
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}]"
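The doubled quotes in this file are the standard CSV way of escaping a quote inside a quoted field, so the file is valid as it is and a CSV parser undoes the escaping on its own. The sketch below uses only the standard library and a shortened, made-up row (`csv_text`) to show this; it also hints at why a blind text replacement of "" breaks parsing, since the commas inside the field are protected only by the surrounding quotes.

```python
import csv
import io
import json

# Shortened, illustrative version of the row above.
csv_text = (
    'movie_id,title,cast\n'
    '19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully""}]"\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# The parser has already turned each "" back into a single ".
cast_text = rows[0]["cast"]

# What remains is ordinary JSON text that can be parsed directly.
cast = json.loads(cast_text)
```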
Read it into a DataFrame, apply the json.loads function to the column, and write it to a file as JSON.
import json
import pandas as pd
df = pd.read_csv('data.csv')
df['cast'] = df['cast'].apply(json.loads)  # parse the JSON text into lists/dicts
df.to_json('data.json', orient='records', lines=True)
The output is well-formed JSON (I added extra line breaks):
# data.json
{"movie_id":19995,
"title":"Avatar",
"cast":[{"cast_id":242,"character":"Jake Sully","credit_id":"5602a8a7c3a3685532001c9a","gender":2,"id":65731,"name":"Sam Worthington","order":0},
{"cast_id":3,"character":"Neytiri","credit_id":"52fe48009251416c750ac9cb","gender":1,"id":8691,"name":"Zoe Saldana","order":1},
{"cast_id":25,"character":"Dr. Grace Augustine","credit_id":"52fe48009251416c750aca39","gender":1,"id":10205,"name":"Sigourney Weaver","order":2},
{"cast_id":4,"character":"Col. Quaritch","credit_id":"52fe48009251416c750ac9cf","gender":2,"id":32747,"name":"Stephen Lang","order":3}]
}
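When several columns hold JSON text, as with both "cast" and "crew" in the question, one possible variation is to parse them while reading by passing the `converters` argument of `pd.read_csv`. The sketch below is an assumption-laden illustration rather than part of the original answer: the rows in `csv_text` are shortened and made up, and the result is kept in a string instead of being written to a file.

```python
import io
import json

import pandas as pd

# Shortened, illustrative data with two JSON-bearing columns.
csv_text = (
    'movie_id,title,cast,crew\n'
    '19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully""}]",'
    '"[{""department"": ""Editing"", ""id"": 1721}]"\n'
)

# Each converter receives the raw cell text and returns the parsed value.
df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"cast": json.loads, "crew": json.loads},
)

# With no path given, to_json returns the JSON Lines text as a string.
out = df.to_json(orient="records", lines=True)
record = json.loads(out.splitlines()[0])
```

A cell that is empty or not valid JSON would make json.loads raise an error here, so a real file may need a small wrapper function that handles those cases.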