Split a JSON file into equal/smaller parts with Python

Question

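I am currently working on a project in which I run sentiment analysis on Twitter posts. I classify the tweets with Sentiment140. With that tool I can classify up to 1,000,000 tweets per day, and I have collected about 750,000 tweets, so that should be fine. The only problem is that I can send at most 15,000 tweets at a time to the JSON bulk classification.

My whole code is set up and running. The only problem is that my JSON file currently contains all 750,000 tweets.

So my question is: what is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.

I have thought about iterating over the file. But how do I specify in code that it should start a new file after, for example, 5,000 elements?

I would like to know what the most sensible approach is. Thanks!

Edit: here is my current code.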

import itertools
import json
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# Open JSON file
values = open('Tweets.json').read()
#print values

# Adjust formatting of JSON file
values = values.replace('\n', '')    # do your cleanup here
#print values

v = values.encode('utf-8')
#print v

# Load JSON file
v = json.loads(v)
print type(v)

for i, group in enumerate(grouper(v, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)


The output is:

["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]


in a file named "outputbatch_0.json".

Edit 2: this is the structure of the JSON.

{
"data": [
{
"text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
"id": "1"
},
{
"text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u00e2\u20ac\u00a6",
"id": "2"
},
{
"text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
"id": "3"}
]
}


Answer 1

Mar*_*ers 6

Use an iterating grouper; the recipes list of the itertools module includes the following:

from itertools import izip_longest  # Python 2 name; on Python 3 it is itertools.zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
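
A small sketch of how the recipe behaves may help here. It is written for Python 3 (so it assumes `zip_longest` in place of `izip_longest`), and the sample inputs `range(7)` and `doc` are made up for illustration. It shows two things: the last group is padded with the fill value, and passing a dict iterates over its keys, which would explain the `["data", null, null, ...]` output in the question.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # n references to the SAME iterator, so each tuple consumes n items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# The final group is padded with the fill value (None by default).
groups = [list(g) for g in grouper(range(7), 3)]
print(groups)  # [[0, 1, 2], [3, 4, 5], [6, None, None]]

# Iterating a dict yields its keys, not the list stored under "data".
doc = {"data": [{"id": "1"}, {"id": "2"}]}
print(list(grouper(doc, 3)))  # [('data', None, None)]
```

So the grouper should probably be given the list under the "data" key rather than the whole loaded document.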


This lets you iterate over the tweets in groups of 5000:

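# input_tweets is the list of tweets, e.g. v['data'] in the question's code
for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        # drop the None padding that grouper adds to the final group
        json.dump([tweet for tweet in group if tweet is not None], outputfile)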
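
If each output file should keep the same `{"data": [...]}` structure as the input, one possible variant is sketched below. It is a Python 3 sketch and not the method from the answer: it uses plain list slicing instead of the grouper recipe, so no padding has to be filtered out. The function names `split_tweets` and `write_batches` are hypothetical, and it assumes the whole file fits in memory, as it does in the question's code.

```python
import json


def split_tweets(document, batch_size):
    """Yield {"data": [...]} documents holding at most batch_size tweets each."""
    tweets = document["data"]
    for start in range(0, len(tweets), batch_size):
        # Slicing never pads, so the last batch is simply shorter.
        yield {"data": tweets[start:start + batch_size]}


def write_batches(input_path, batch_size, output_pattern="outputbatch_{}.json"):
    """Split the JSON file at input_path and return the paths that were written."""
    with open(input_path, encoding="utf-8") as infile:
        document = json.load(infile)
    paths = []
    for i, batch in enumerate(split_tweets(document, batch_size)):
        path = output_pattern.format(i)
        with open(path, "w", encoding="utf-8") as outfile:
            json.dump(batch, outfile)
        paths.append(path)
    return paths
```

Calling `write_batches('Tweets.json', 5000)` on the roughly 750,000 tweets from the question would be expected to produce about 150 files, each well under the 15,000-tweet limit.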

