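I have a multi-gigabyte JSON file. It consists of JSON objects, each no more than a few thousand characters long, but there are no line breaks between the records.
Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?
The data is in a plain-text file. Here is an example of a similar record. The actual records contain many nested dictionaries and lists.
The record in readable format: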
{
    "results": {
        "__metadata": {
            "type": "DataServiceProviderDemo.Address"
        },
        "Street": "NE 228th",
        "City": "Sammamish",
        "State": "WA",
        "ZipCode": "98074",
        "Country": "USA"
    }
}
The actual format. New records start one right after another, without any break in between.
{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }
Mar*_*ers 25
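Generally speaking, putting more than one JSON object into a file makes that file invalid JSON. That said, you can still parse the data in chunks using the JSONDecoder.raw_decode() method.
The following will yield complete objects as the parser finds them: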
from json import JSONDecoder
from functools import partial


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break
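This function reads chunks of buffersize characters from the given file object and has the decoder object parse whole JSON objects from the buffer. Each parsed object is yielded to the caller.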
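To make the role of raw_decode() more concrete, here is a minimal sketch of what it returns for a string holding two back-to-back objects. The sample string and variable names are made up for illustration; only JSONDecoder.raw_decode() itself comes from the answer.

```python
from json import JSONDecoder

decoder = JSONDecoder()
text = '{"a": 1}{"b": [2, 3]}'  # illustrative data, not from the question

# raw_decode() parses one JSON value starting at the given index and
# returns the value together with the index where parsing stopped.
first, end = decoder.raw_decode(text)
second, end2 = decoder.raw_decode(text, end)

# An incomplete object raises ValueError (json.JSONDecodeError is a
# subclass), which is the signal json_parse() uses to read more data.
try:
    decoder.raw_decode('{"a": ')
    incomplete_raised = False
except ValueError:
    incomplete_raised = True
```

Unlike json.loads(), raw_decode() tolerates trailing text after the first value but rejects leading whitespace, which is why the generator strips whitespace between objects.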
Use the function like this:
with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        # process object
        ...
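Only use json_parse() if your JSON objects are written to the file back to back, with no newlines in between. If you do have newlines, and each JSON object is limited to a single line, you have a JSON Lines document instead, in which case you can use the approach from "Loading and parsing a JSON file with multiple JSON objects".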
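For the JSON Lines case, a minimal sketch could look like the following. It assumes exactly one complete JSON object per non-empty line; the helper name iter_json_lines and the sample data are hypothetical and not part of the linked question.

```python
import io
import json


def iter_json_lines(fileobj):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    for line in fileobj:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# In-memory stand-in for a file opened in text mode.
sample_lines = io.StringIO('{"id": 1}\n\n{"id": 2}\n')
line_records = list(iter_json_lines(sample_lines))
```

Because file objects iterate line by line, only one record is held in memory at a time and no manual buffering is needed.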
This is a slight modification of Martijn Pieters' solution that handles JSON strings separated by whitespace.
import functools
import json


def json_parse(fileobj, decoder=json.JSONDecoder(), buffersize=2048,
               delimiters=None):
    remainder = ''
    for chunk in iter(functools.partial(fileobj.read, buffersize), ''):
        remainder += chunk
        while remainder:
            try:
                # Strip only leading delimiters: trailing whitespace may
                # belong to a string that continues in the next chunk.
                stripped = remainder.lstrip(delimiters)
                result, index = decoder.raw_decode(stripped)
                yield result
                remainder = stripped[index:]
            except ValueError:
                # Not enough data to decode, read more
                break
For example, if data.txt contains JSON strings separated by whitespace:
{"business_id": "1", "Accepts Credit Cards": true, "Price Range": 1, "type": "food"} {"business_id": "2", "Accepts Credit Cards": true, "Price Range": 2, "type": "cloth"} {"business_id": "3", "Accepts Credit Cards": false, "Price Range": 3, "type": "sports"}
then (output from a Python 2 session, hence the u'' prefixes):
In [47]: list(json_parse(open('data.txt')))
Out[47]:
[{u'Accepts Credit Cards': True,
  u'Price Range': 1,
  u'business_id': u'1',
  u'type': u'food'},
 {u'Accepts Credit Cards': True,
  u'Price Range': 2,
  u'business_id': u'2',
  u'type': u'cloth'},
 {u'Accepts Credit Cards': False,
  u'Price Range': 3,
  u'business_id': u'3',
  u'type': u'sports'}]
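One behaviour worth knowing: if the file contains a malformed record (for instance a stray closing brace), both generators above keep buffering until end of file and then stop without reporting anything. The following is a sketch of a stricter variant that raises once the input is exhausted and undecodable text remains. The name json_parse_strict, the error message, and the sample data are assumptions made for illustration, not part of either answer.

```python
import io
import json
from functools import partial


def json_parse_strict(fileobj, decoder=json.JSONDecoder(), buffersize=2048):
    """Yield JSON objects; raise ValueError on leftover text at end of file."""
    remainder = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        remainder += chunk
        while True:
            # raw_decode() rejects leading whitespace, so drop it first.
            remainder = remainder.lstrip()
            if not remainder:
                break
            try:
                result, index = decoder.raw_decode(remainder)
            except ValueError:
                break  # incomplete object so far: read another chunk
            yield result
            remainder = remainder[index:]
    if remainder.strip():
        raise ValueError('undecodable trailing data: %r' % remainder[:40])


# A tiny buffersize forces objects to straddle chunk boundaries.
sample = io.StringIO('{"id": 1, "name": "a b"} {"id": 2}\n{"id": 3}')
records = list(json_parse_strict(sample, buffersize=5))
```

The memory cost of a malformed record is unchanged (the rest of the file is still buffered before the error is raised), so this only makes the failure visible rather than cheaper.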