Deserializing a huge JSON string into Python objects

pra*_*nya 8 python performance json simplejson python-2.7

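I am using simplejson to deserialize a JSON string into Python objects. I have a custom-written object_hook that takes care of deserializing the JSON back into my domain objects.

The problem is that when my JSON string is huge (i.e. the server returns around 800K domain objects as a JSON string), my Python deserializer takes almost 10 minutes to deserialize them.

I drilled down a bit further, and it looks like simplejson itself is not doing much work; rather, it delegates everything to the object_hook. I tried optimizing my object_hook, but that did not improve performance much either (I gained barely 1 minute).

My question is: is there any other standard framework that is optimized to handle huge data sets, or is there a way to use the framework's own capabilities rather than doing everything at the object_hook level?

I see that without the object_hook, the framework just returns a list of dictionaries, not a list of domain objects.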
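To make the delegation concrete, here is a minimal sketch using the standard library's json module as a stand-in (simplejson.loads accepts the same object_hook argument). The payload and the counting_hook helper are made up for illustration. It shows that the hook runs once for every JSON object, nested ones included, so a payload with 800K nested records triggers far more than 800K Python-level calls.

```python
import json  # stand-in for simplejson; both accept object_hook in loads()

# Made-up payload shaped like the server response: one object with a nested object.
payload = ('[{"@CLASS": "sample.counter", "val1": "ABC", '
           '"value4": {"@CLASS": "sample.nestedClass", "id": 1}}]')

# Without a hook, every JSON object becomes a plain dict.
plain = json.loads(payload)

calls = []


def counting_hook(dct):
    # Record the class marker of each object the decoder hands over.
    calls.append(dct.get("@CLASS"))
    return dct


# With a hook, the decoder calls it for each object, innermost first.
hooked = json.loads(payload, object_hook=counting_hook)
```

After this runs, calls holds the nested class name first and the outer one second, which is why any per-call cost in the hook is multiplied by the total number of objects, not just the top-level ones.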

Any pointers here would be useful.

FYI, I am using simplejson version 3.7.2.

Here is my sample _object_hook:

def _object_hook(dct):
    if '@CLASS' in dct:  # server sends domain objects with this @CLASS
        clsname = dct['@CLASS']
        # This is like Class.forName (it imports the module and gives the class)
        cls = get_class(clsname)
        # As my server is in Java, I convert the attribute names to the Python naming convention.
        dct = dict((convert_java_name_to_python(k), dct[k]) for k in dct.keys())
        if cls != None:
            obj_key = None
            if "@uuid" in dct:
                obj_key = dct["@uuid"]
                del(dct["@uuid"])
            else:
                info("Class missing uuid: " + clsname)
            dct.pop("@CLASS", None)

            # This I found to be the most time-consuming step. In my domain object's
            # __init__ method I have the logic to set all attributes based on the kwargs passed.
            obj = cls(**dct)
            if obj_key is not None:
                # I keep all uuids along with the objects in the shared_objs dictionary.
                # shared_objs will be used later to replace references.
                shared_objs[obj_key] = obj
        else:
            warning("class not found: " + clsname)
            obj = dct

        return obj
    else:
        return dct

Sample response:

    {"@CLASS":"sample.counter","@UUID":"86f26a0a-1a58-4429-a762-9b1778a99c82","val1":"ABC","val2":1131,"val3":1754095,"value4":{"@CLASS":"sample.nestedClass","@UUID":"f7bb298c-fd0b-4d87-bed8-74d5eb1d6517","id":1754095,"name":"XYZ","abbreviation":"ABC"}}

I have many levels of nesting, and the number of records I receive from the server is more than 800K.

Mos*_*oye 6

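I don't know of any framework that offers this out of the box, but you can apply a few optimizations to the way your class instances are set up.

Since unpacking the dictionary into keyword arguments and applying them to your instance attributes is taking the bulk of the time, you may consider passing dct directly to your class's __init__ and setting the instance dictionary, self.__dict__, to dct:

Trial 1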

In [1]: data = {"name": "yolanda", "age": 4}

In [2]: class Person:
   ...:     def __init__(self, name, age):
   ...:         self.name = name
   ...:         self.age = age
   ...:
In [3]: %%timeit
   ...: Person(**data)
   ...:
1000000 loops, best of 3: 926 ns per loop

Trial 2

In [4]: data = {"name": "yolanda", "age": 4}

In [5]: class Person2:
   ....:     def __init__(self, data):
   ....:         self.__dict__ = data
   ....:
In [6]: %%timeit
   ....: Person2(data)
   ....:
1000000 loops, best of 3: 541 ns per loop

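There is no need to worry about self.__dict__ being modified through another reference, since the reference to dct is lost by the time _object_hook returns.

This of course means changing how your __init__ is set up, with the attributes of your class depending strictly on the items in dct. It's up to you.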
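The aliasing point can be seen in a small sketch. Person2 is the class from Trial 2; Person3 is a hypothetical variant added here for comparison, not something from the original post. Assigning the dict shares it with the caller, while update() copies the items into the instance's own dict at the cost of some extra work per object.

```python
class Person2(object):
    def __init__(self, data):
        # Adopts the caller's dict as the instance namespace (no copy).
        self.__dict__ = data


class Person3(object):
    def __init__(self, data):
        # Copies the items into the instance's own dict instead of sharing.
        self.__dict__.update(data)


data = {"name": "yolanda", "age": 4}

shared = Person2(data)
copied = Person3(data)

# Mutating the original dict is visible through the sharing instance only.
data["age"] = 5
shared_age = shared.age   # follows the change made to data
copied_age = copied.age   # keeps the value captured at construction
```

Inside an object_hook the shared form is safe for the reason given above: nothing else holds on to the dict once the hook returns. Note that with either form no validation happens, and keys that are not valid identifiers (such as ones starting with "@") end up in the instance dict but are not reachable with dot syntax, so they are best removed first.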


You may also replace cls != None with cls is not None (there is only one None object, so an identity check is more pythonic):

Trial 1

In [38]: cls = 5
In [39]: %%timeit
   ....: cls != None
   ....:
10000000 loops, best of 3: 85.8 ns per loop

Trial 2

In [40]: %%timeit
   ....: cls is not None
   ....:
10000000 loops, best of 3: 57.8 ns per loop

And you can replace these two lines with one:

obj_key = dct["@uuid"]
del(dct["@uuid"])

becomes:

obj_key = dct.pop('@uuid') # Not an optimization; this is equivalent to the two lines above
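One detail worth noting: dict.pop with a single argument raises KeyError when the key is absent, whereas the code in the question guards the lookup with an `in` check. A small sketch with made-up sample dicts shows that passing a default keeps the one-liner and still lets the "missing uuid" branch run:

```python
dct = {"@uuid": "86f26a0a-1a58-4429-a762-9b1778a99c82", "val1": "ABC"}

# Returns the value and removes the key in a single dictionary operation.
obj_key = dct.pop("@uuid", None)

no_uuid = {"val1": "ABC"}

# With a default, a missing key yields None instead of raising KeyError.
missing_key = no_uuid.pop("@uuid", None)
```

The hook can then test `obj_key is None` to decide whether to log the missing uuid.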

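At a scale of 800K domain objects, these changes should save you a good amount of time by letting the object_hook create your objects more quickly.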
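Putting the suggestions together, a hook might look roughly like the sketch below. Several things here are assumptions made so the example is self-contained: CLASS_REGISTRY stands in for the asker's get_class, the Counter and NestedClass classes are hypothetical, logging replaces the info/warning helpers, the Java-to-Python name conversion is left out, and the key is spelled "@UUID" as in the sample response.

```python
import json
import logging

log = logging.getLogger(__name__)
shared_objs = {}  # uuid -> object, used later to resolve references


class Counter(object):
    """Hypothetical domain class that adopts the decoded dict directly."""

    def __init__(self, data):
        self.__dict__ = data


class NestedClass(Counter):
    """Second hypothetical domain class, same construction strategy."""


# Assumed stand-in for get_class(): a plain name -> class mapping.
CLASS_REGISTRY = {"sample.counter": Counter, "sample.nestedClass": NestedClass}


def object_hook(dct):
    clsname = dct.get("@CLASS")
    if clsname is None:
        return dct  # ordinary JSON object, leave it as a dict
    cls = CLASS_REGISTRY.get(clsname)
    if cls is None:  # identity check instead of != None
        log.warning("class not found: %s", clsname)
        return dct
    del dct["@CLASS"]
    obj_key = dct.pop("@UUID", None)  # lookup and removal in one step
    obj = cls(dct)  # no ** unpacking, no per-attribute assignment
    if obj_key is None:
        log.info("Class missing uuid: %s", clsname)
    else:
        shared_objs[obj_key] = obj
    return obj


sample = ('{"@CLASS":"sample.counter","@UUID":"86f26a0a-1a58-4429-a762-9b1778a99c82",'
          '"val1":"ABC","val2":1131,"value4":{"@CLASS":"sample.nestedClass",'
          '"@UUID":"f7bb298c-fd0b-4d87-bed8-74d5eb1d6517","id":1754095,"name":"XYZ"}}')
root = json.loads(sample, object_hook=object_hook)
```

How much this gains on the real payload depends on what the domain classes do in __init__ and on the cost of the name conversion, so it is worth timing against actual data before committing to it.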