Memory leaks are difficult to track down. A nice description of finding one is "A server memory leak", a story about tracing a memory leak in a Django application.

I'm trying to work out why a server process is slowly using more and more memory. The server parses a stream of text, and shoves stuff into a database. It's written in Python, and about the only external code it uses is SQLAlchemy.

I found a combination of Meliae and objgraph useful. Pympler also seems useful (the tracker module particularly), and Dowser looks like it might have made things a bit easier..
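I didn't end up using Pympler here, but as a rough sketch of what its tracker module offers, a SummaryTracker can print what has been allocated between two points in time:

from pympler import tracker

tr = tracker.SummaryTracker()

# ... exercise the suspect code for a while ...

# Prints a summary of the objects (and their sizes) created since the tracker
# was instantiated (or since the previous diff)
tr.print_diff()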

So, first we need to get the memory-analysing code into the server somehow. You could use pyrasite, which injects code into an existing process using gdb (it even has helpers for Meliae).

However, it's usually easier and more reliable to modify the code a bit.

The simplest and tidiest way is to use a signal handler which starts pdb. To steal an idea from Bazaar, SIGQUIT is perfect for this..

In the code's main() function, add something like this:

def start_pdb(signum, frame):
    # Drop into an interactive debugger in the context of the running process
    import pdb
    pdb.set_trace()

import signal
signal.signal(signal.SIGQUIT, start_pdb)

Then, when the code is running in a terminal, pressing ctrl+\ will start pdb. You can then run any Python code you feel like. Afterwards, you can either let the program continue running with "c" (or "continue"), or kill it with "q" ("quit" or ctrl+d):

$ python sigquit_test.py
^\--Return--
> sigquit_test.py(10)start_pdb()->None
-> pdb.set_trace()
(Pdb) c
^\--Return--
> sigquit_test.py(10)start_pdb()->None
-> pdb.set_trace()
(Pdb) q
Traceback (most recent call last):
  [...]
bdb.BdbQuit
$

Now, we can run the server, and press ctrl+\ at any time to start up a Python shell in that environment.
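You can also trigger the handler without touching that terminal by sending SIGQUIT from another process (the pdb prompt will still appear on the server's own stdin/stdout). A minimal sketch, where 12345 stands in for the server's PID:

import os
import signal

# Send SIGQUIT to the server process, triggering its start_pdb handler
os.kill(12345, signal.SIGQUIT)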

First, we can use objgraph to see what the most common types of object in memory are:

(Pdb) import objgraph
(Pdb) objgraph.show_most_common_types()
dict                150951
tuple               82757
InstanceState       65303
[...]

Perhaps unsurprisingly, there are a lot of dict and tuple objects. More interestingly, there are a lot of InstanceState objects (a class from the SQLAlchemy ORM).
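objgraph can also report which types are growing between calls, which is handy for spotting a leak in a running process; a quick sketch using its show_growth function:

import objgraph

# Each call prints the types whose peak object counts have increased since the
# previous call (the first call effectively just records a baseline)
objgraph.show_growth()

# ... continue from pdb, let the server run for a while, re-trigger SIGQUIT ...

objgraph.show_growth()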

Next, we can use Meliae to show how much memory each data-type is using:

(Pdb) import meliae.scanner
(Pdb) meliae.scanner.dump_all_objects("/tmp/memorydump.json")
(Pdb) 

Then, in another Python shell we can analyse this dump:

>>> import meliae.loader
>>> md = meliae.loader.load("/tmp/memorydump.json")
loaded line 775946, 775947 objs, 118.7 / 118.7 MiB read in 21.3s        
checked   775946 /   775947 collapsed    74972    
set parents   700974 /   700975            
collapsed in 11.1s
>>> md.summarize()
Total 700975 objects, 874 types, Total size = 276.9MiB (290327578 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0  183823  26 176767720  60  60  786712 dict
     1   65303   9  72616936  25  85    1112 InstanceState
     2  151981  21  10535744   3  89     520 tuple

So those InstanceState objects are definitely not helping the memory usage.. Do they relate to those dictionaries?

>>> md.get_all('InstanceState')[0].p
[WeakInstanceDict(4333095200 6291752B 3856refs 1par), dict(4760925744 3352B 2refs 1par), instancemethod(4763880816 80B 3refs 1par), dict(4759467200 3352B 90refs 1par)]

The get_all method returns a list of all the InstanceState objects, ordered by size, so the first index is the largest one. The .p attribute lists its parents: it is referenced by two dicts, a weak-ref dictionary and an instance method..

For a better explanation, I recommend reading "Memory debugging with Meliae" and "Step-by-step Meliae", written by the author of Meliae.

This info is useful, but maybe a little unintuitive. Luckily, objgraph exists.. So, back to the pdb prompt in our "live" server:

(Pdb) import objgraph
(Pdb) objgraph.count("InstanceState")
65303
(Pdb) objgraph.show_backrefs(objgraph.by_type("InstanceState")[0:50])
Graph written to objects.dot (316 nodes)
Image generated as objects.png

objgraph is nice in that it is pure Python, and produces pretty reference graphs (by writing .dot files, which are rendered using graphviz):

[Image: InstanceState reference loop]
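objgraph can also pull out a single chain of references that keeps a particular object alive, which is often easier to read than the full back-reference graph. A sketch using its find_backref_chain and show_chain functions, with is_proper_module as the predicate so the search stops at module globals:

(Pdb) obj = objgraph.by_type("InstanceState")[0]
(Pdb) chain = objgraph.find_backref_chain(obj, objgraph.is_proper_module)
(Pdb) objgraph.show_chain(chain)

Like show_backrefs, show_chain renders the result as a graph via graphviz.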

I'd recommend reading "Hunting memory leaks in Python" and "Python object graphs", written by the author of objgraph.

Now, the description so far makes the problem seem a bit.. easy and linear. Instrument the code, look at what is using the most memory, and... there's the problem! Not quite..

After a bit of prodding around, the InstanceStates seemed to be mostly referenced by a Least Recently Used (LRU) cache in the code. I wasn't sure if this memory usage was legitimate caching, or if the reference cycles pictured above were causing a leak..

The code commits to the database approximately every 30 seconds, so I added some memory-usage logging at that point.

First, I created a function based on objgraph's show_most_common_types, which writes the most common objects to a log file:

def debug_log_common_objects():
    print "Logging common objects"
    import gc
    print "garbage collector got", gc.collect()

    import objgraph
    import operator
    import time

    # Reimplement show_most_common_types, but write to a file instead of stdout
    stats = sorted(objgraph.typestats().items(), key=operator.itemgetter(1),
                   reverse=True)

    limit = 30
    if limit:
        stats = stats[:limit]

    width = max(len(name) for name, count in stats)

    out = ["#START %s" % time.time()] + \
          ["%s %s" % (name.ljust(width), count) for name, count in stats]
    f = open("most_common_object.txt", "a")
    f.write("\n".join(out) + "\n")
    f.close()
    print "Logging common objects - done"

Then I left the process running for a while. This function does exactly what we first did at the pdb prompt (calling "objgraph.show_most_common_types()"), but with a timestamp:

#START 1320397336.95
tuple                      30340
dict                       16511
function                   7662
list                       4792
set                        4174
[...]
#START 1320397377.14
tuple                      32176
dict                       20643
function                   7662
list                       4901
set                        4185
[...]

Then, I wrote a simple (horribly crappy) script to parse and graph these, so I could see which object counts were increasing over time:

#!/usr/bin/env python
 
f = open("most_common_object.txt")
 
alltimings = {}
things = {}
 
start = None
for line in f.readlines():
    if line.startswith("#START"):
        # Store the previous block (if any) before starting a new one
        if start is not None:
            alltimings[start] = things

        _, _, thetime = line.strip().partition(" ")
        start = float(thetime)
        things = {}

    else:
        key, _, value = line.partition(" ")
        things[key] = int(value.strip())

# Don't lose the final block of counts
if start is not None:
    alltimings[start] = things
 
 
times = {}
values = {}
 
import datetime
 
for time, things in sorted(alltimings.items()):
    for t, howmany in things.items():
        times.setdefault(t, []).append(datetime.datetime.fromtimestamp(time))
        values.setdefault(t, []).append(howmany)
 
 
import matplotlib.pyplot as plt
 
thing = plt.figure(figsize = (15, 200))
thing.subplots_adjust(hspace = .2)
 
for i, key in enumerate(times):
    ct = times[key]
    cv = values[key]
 
    ax = thing.add_subplot(len(times.keys()), 1, i+1)
    ax.set_title("%s" % key)
    ax.set_ylabel("Number of objects")
 
    ax.plot(ct, cv)
 
plt.savefig("mem.pdf", figsize = (200, 200), ymargin = 100)

This produces a PDF full of graphs, some of which look like these:

[Graph: dict object count over time]

So while there are a lot of dict objects, the number fluctuates within a reasonable range. The same goes for InstanceState, so that wasn't the leak (if there even is one).

I later confirmed this suspicion on StackOverflow - the reference cycle pictured above will not leak, as the garbage collector gained cycle detection as of Python 2.2. There is one important caveat: if any object in the cycle defines a __del__ method, the cycle will NOT be garbage collected.
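To make that caveat concrete, here is a small demonstration (on Python 2, like the rest of the code here) that a cycle whose objects define __del__ ends up stuck in gc.garbage rather than being freed:

import gc

class Leaky(object):
    def __del__(self):
        pass

a = Leaky()
b = Leaky()
a.other = b
b.other = a    # a and b now form a reference cycle
del a, b

gc.collect()
print gc.garbage   # the cycle is found but not freed - both objects end up here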

However, the number of CacheEntities was growing suspiciously, as were a few other types:

[Graph: CacheEntities object count over time]

..so I think that's what I'll investigate next.

As I said, memory leaks are difficult to track down - I haven't got to the bottom of this one yet, but hopefully these techniques will help others.