How to read a large JSON file in pandas?

Question

How do I read a large JSON file in pandas?

My code is: data_review = pd.read_json('review.json'). My review data looks as follows:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}


But I get the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument


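My JSON file does not contain any comments, and it is 3.8 GB! I just downloaded the file from here (link) to practice.

When I use the following code, it throws the same error: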

import json
with open('review.json') as json_file:
    data = json.load(json_file)


Answer 1

Sha*_*tal 6

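Perhaps the file you are reading contains multiple JSON objects rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. Called this way, both methods are meant to read a file that holds a single JSON document.

From what I have seen of the Yelp dataset, your file must contain something like this: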

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.


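So it is important to realize that this is not a single JSON document but multiple JSON objects in one file, one per line.

To read this data into a pandas DataFrame, the following solution should work: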
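
The parsing issue described here can be reproduced on a tiny in-memory sample. This is only a sketch: the two records are made up, and it shows the "multiple objects" problem this answer suspects, which is not necessarily the exact OSError in the question's traceback.

```python
import json

# Two JSON objects, one per line. The records are made-up sample data.
text = '{"stars": 5, "text": "good"}\n{"stars": 3, "text": "ok"}\n'

# Parsing the whole text as ONE JSON document fails after the first object
# (the json module reports "Extra data").
try:
    json.loads(text)
    whole_text_parsed = True
except json.JSONDecodeError as exc:
    whole_text_parsed = False
    error_message = exc.msg

# Parsing each non-empty line separately works.
records = [json.loads(line) for line in text.splitlines() if line.strip()]
```

Here `whole_text_parsed` ends up `False`, while `records` holds one dict per line.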

import json

import pandas as pd

with open('review.json') as json_file:
    data = json_file.readlines()
    # Convert every line (a JSON string) into a Python dict.
    # This may take at least 8-10 minutes for 4-5 million rows.
    data = list(map(json.loads, data))

pd.DataFrame(data)


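Given how large the data is, I think your machine will take a considerable amount of time to load it into a DataFrame.

Is there any solution in pandas for a file with one JSON object per line, without a for loop? (2 upvotes)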
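
One way to reduce the memory cost, sketched under assumptions: iterate over the file lazily instead of calling readlines(), and keep only the fields you need. The helper name read_json_lines, the wanted_keys parameter, and the small temporary sample file are all hypothetical and not part of the original answer.

```python
import json
import os
import tempfile

import pandas as pd


def read_json_lines(path, wanted_keys):
    """Parse a one-object-per-line JSON file, keeping only wanted_keys."""
    rows = []
    with open(path, encoding="utf-8") as json_file:
        for line in json_file:  # reads one line at a time, not the whole file
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            rows.append({key: record.get(key) for key in wanted_keys})
    return pd.DataFrame(rows, columns=wanted_keys)


# Tiny made-up stand-in for review.json, written to a temporary file.
sample = (
    '{"review_id": "a1", "stars": 5, "text": "long review text"}\n'
    '{"review_id": "b2", "stars": 3, "text": "another long review"}\n'
)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "review_sample.json")
    with open(path, "w", encoding="utf-8") as out:
        out.write(sample)
    df = read_json_lines(path, ["review_id", "stars"])
```

Dropping the long "text" field before building the DataFrame is the design choice here; whether that is acceptable depends on what you want to analyze.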

Answer 2

小智 6

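Passing the arguments lines=True and chunksize=X creates a reader that fetches a specific number of lines at a time.

You then have to write a loop to go through each chunk.

Here is a piece of code to illustrate: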

import pandas as pd
import json
chunks = pd.read_json('../input/data.json', lines=True, chunksize = 10000)
for chunk in chunks:
    print(chunk)
    break


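chunksize splits the JSON into multiple chunks according to its length (counted in lines). For example, if I have a 100,000-line JSON file containing X objects and I set chunksize = 10000, I will get 10 chunks.

In the code I gave, I added a break so that only the first chunk is printed, but if you remove it, you will get the 10 chunks one after another.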
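
Printing is only for inspection. A common follow-up, shown here as a sketch, is to reduce each chunk and then combine the results. The six sample records and the "stars >= 4" filter are made up for illustration, and io.StringIO stands in for the real file path.

```python
import io

import pandas as pd

# Six made-up reviews, one JSON object per line, standing in for the real file.
sample = "\n".join(
    '{"review_id": "r%d", "stars": %d}' % (i, i % 5 + 1) for i in range(6)
)

# With lines=True and chunksize, read_json returns an iterator of DataFrames.
reader = pd.read_json(io.StringIO(sample), lines=True, chunksize=2)

kept = []
n_chunks = 0
for chunk in reader:
    n_chunks += 1
    # Keep only the rows of interest so each chunk shrinks before it is stored.
    kept.append(chunk[chunk["stars"] >= 4])

df = pd.concat(kept, ignore_index=True)
```

Six lines with chunksize=2 give three chunks, and only the filtered rows stay in memory after the loop.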

Answer 3

小智 5

If you do not want to use a for loop, you can use the following:

import pandas as pd

df = pd.read_json("foo.json", lines=True)


This handles the case where your JSON file looks similar to this:

{"foo": "bar"}
{"foo": "baz"}
{"foo": "qux"}


It turns the file into a DataFrame with a single column, foo, and three rows.

You can read more in the pandas documentation.
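
A self-contained way to check this, as a minimal sketch: the same three lines are fed through io.StringIO instead of a file named foo.json, so nothing needs to exist on disk.

```python
import io

import pandas as pd

# The three sample lines from this answer, held in memory instead of foo.json.
json_lines = '{"foo": "bar"}\n{"foo": "baz"}\n{"foo": "qux"}\n'

# lines=True tells pandas to parse each line as its own JSON object.
df = pd.read_json(io.StringIO(json_lines), lines=True)
```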

Asked:	8 years, 4 months ago
Viewed:	5545 times
Last active:	7 years, 4 months ago