Trouble finding memory leak (tracemalloc/objgraph/gc not helping)

ixj*_*xje 6 memory-leaks python-3.x

I have a network process that communicates over TCP collecting data, deserialising it and finally storing parts in a LevelDB key/value store (via Plyvel).

Slowly over time it will consume all available memory to the point that my whole system locks up (Ubuntu 18.04). I'm trying to diagnose the cause but I'm running out of ideas how to investigate further.

My main suspect is the data streams we operate on to deserialise the objects. The general pattern there is: receive data from an asyncio.StreamReader and call deserialize_from_bytes (shown next)

    @classmethod
    def deserialize_from_bytes(cls, data_stream: Union[bytes, bytearray]):
        """ Deserialize object from a byte array. """
        br = BinaryReader(stream=data_stream)
        inv_payload = cls()
        try:
            inv_payload.deserialize(br)
        except ValueError:
            return None
        finally:
            br.cleanup()
        return inv_payload

where BinaryReader initialises like this

    def __init__(self, stream: Union[io.BytesIO, bytes, bytearray]) -> None:
        super(BinaryReader, self).__init__()

        if isinstance(stream, (bytearray, bytes)):
            self._stream = io.BytesIO(stream)
        else:
            self._stream = stream

and cleanup() is a convenience wrapper for self._stream.close()
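To rule this path in or out in isolation, it can be exercised in a tight loop while tracemalloc is running. The following is only a sketch: the `BinaryReader` here is a minimal stand-in for the real class, and `Payload`, `read_byte` and `traced_growth` are hypothetical names introduced for illustration.

```python
import gc
import io
import tracemalloc
from typing import Optional, Union


class BinaryReader:
    """Minimal stand-in for the real reader (assumption: it only wraps a stream)."""

    def __init__(self, stream: Union[io.BytesIO, bytes, bytearray]) -> None:
        if isinstance(stream, (bytearray, bytes)):
            self._stream = io.BytesIO(stream)
        else:
            self._stream = stream

    def read_byte(self) -> int:
        # Hypothetical helper: read one byte or fail like a truncated message would.
        data = self._stream.read(1)
        if not data:
            raise ValueError("unexpected end of stream")
        return data[0]

    def cleanup(self) -> None:
        self._stream.close()


class Payload:
    """Hypothetical payload type holding a single byte value."""

    def __init__(self) -> None:
        self.value: Optional[int] = None

    def deserialize(self, reader: BinaryReader) -> None:
        self.value = reader.read_byte()

    @classmethod
    def deserialize_from_bytes(cls, data_stream):
        br = BinaryReader(stream=data_stream)
        payload = cls()
        try:
            payload.deserialize(br)
        except ValueError:
            return None
        finally:
            br.cleanup()  # runs on both the success and the error path
        return payload


def traced_growth(iterations: int = 10_000) -> int:
    """Return the change in tracemalloc-traced bytes across many deserialisations."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        Payload.deserialize_from_bytes(b"\x01")  # success path
        Payload.deserialize_from_bytes(b"")      # ValueError path
    gc.collect()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

If `traced_growth()` stays roughly flat as `iterations` increases, this pattern by itself is unlikely to be what accumulates Python-level memory. It says nothing about memory allocated outside Python's allocators.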

What I've tried

  1. I started with the "display the top 10" snippet from the tracemalloc documentation. At a point in time where /proc/$mypid/status shows a memory usage of 2 GB (by VmSize), the largest item reported by tracemalloc is a mere 38 MB, followed by the second and third at 3 MB and 360 KB.

    This already raises a question: where is the other ~1.958 GB?

  2. Since the above output didn't help, I tried objgraph. I break into the process with pdb and use

    import objgraph
    objgraph.show_most_common_types(limit=20)

to get

    function                   16451
    tuple                      11456
    dict                       10371
    weakref                    3058
    list                       2893
    cell                       2446
    Traceback                  2277
    Statistic                  2277
    _Binding                   2109
    getset_descriptor          1814
    type                       1680
    builtin_function_or_method 1469
    wrapper_descriptor         1311
    method_descriptor          1284
    frozenset                  992
    property                   983
    module                     810
    ModuleSpec                 808
    SourceFileLoader           738
    Attrs                      593

I don't see any objects specific to my program in that list (which could have indicated references not being released). Compared with other objgraph examples I found, the above counts don't seem out of the ordinary. I inspected a couple of function objects and always found something similar to this (asyncio related), which doesn't suggest any memory is leaking there

    (Pdb) objgraph.at(0x10f309b70)
    <function _run_coroutine.<locals>.step_next.<locals>.continue_ at 0x10f309b70>

Please correct me if I'm missing something here. Again, this got me no further.

  3. As a final attempt to force memory to be released, I manually ran gc.collect(). Calling it repeatedly reports between 25 and 500 unreachable objects. That doesn't strike me as an issue (should it?). I also tried running the program with gc.set_debug(gc.DEBUG_LEAK), but that generates so much output that I can't make anything of it.
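For reference, the three kinds of measurement above can be gathered side by side. This is a sketch assuming Linux (it reads /proc/self/status); `read_proc_status` and `memory_report` are hypothetical helper names. Two facts make the comparison useful: VmSize is the virtual address space while VmRSS is the resident part, and tracemalloc only traces allocations made through Python's memory allocators, so memory allocated directly by native code in an extension module would not appear in its totals.

```python
import gc
import tracemalloc


def read_proc_status(fields=("VmSize", "VmRSS")) -> dict:
    """Return selected /proc/self/status fields in kB (empty dict if unavailable)."""
    result = {}
    try:
        with open("/proc/self/status") as fh:
            for line in fh:
                name, _, rest = line.partition(":")
                if name in fields:
                    result[name] = int(rest.split()[0])  # the kernel reports kB
    except OSError:
        pass  # not on Linux, or /proc is not mounted
    return result


def memory_report() -> dict:
    """Collect process-level, tracemalloc-level and gc-level numbers in one dict."""
    report = dict(read_proc_status())
    if tracemalloc.is_tracing():
        current, peak = tracemalloc.get_traced_memory()
        report["traced_kB"] = current // 1024
        report["traced_peak_kB"] = peak // 1024
    # gc.collect() returns the number of unreachable objects it found.
    report["gc_unreachable"] = gc.collect()
    # gc.garbage holds objects the collector found but could not free.
    report["gc_garbage"] = len(gc.garbage)
    return report


if __name__ == "__main__":
    tracemalloc.start()
    print(memory_report())
```

Logging this report periodically would show whether VmRSS keeps growing while the traced total stays flat, which would point away from Python objects and towards memory that tracemalloc cannot see.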

Any tips what I can try from here on?