A primer on pymalloc
September 30, 2023 - 7 min read
TL;DR
Python uses
pymallocfor small (< 512 bytes) objects. Pymalloc requests memory in chunks called Arenas, which are 256KB each. An Arena is only freed only if all objects that are allocated on it are deleted. So allocating 1.000.000 objects of 512 bytes, totaling 512MB and deleting half of them does not mean that half of the memory (256MB) will be freed and returned to the OS.
Intro
Say you have a big tree. Suddenly you realize that some of the nodes can be pruned so as to reduce memory. You prune the nodes, but the memory usage remains largely the same, if not unaffected. What gives?
Background
The tree in question is a Trie Tree. A Trie Tree, also known as a Prefix Tree, is a data structure for string lookups, which can be also used for prefix searches.
The words dart, dance, dancer and danger are stored in a Trie Tree as follows:
root ─ d ─ a ─ r ─ t (dart)
│
└─ n ─ c ─ e (dance)
│ └─ r (dancer)
└─ g ─ e ─ r (danger)
Each letter is a node, and it points to the next letter. All leaf nodes are in fact words, and there exists a single path from the root to them. This can be also true for non-leaf nodes, in case a word is a prefix of another (e.g. dance and dancer).
This kind of tree supports use cases where we want to fetch all keys that start with a certain prefix. For example, say we want to search for all words that begin with 'dan'. This query would be executed as follows:
node = root
for char in query:
# find child node for given character
next_node = node.get_child(char)
if next_node is None:
# If we can't find the character, collect and return all words
# under the current node with a traversal (e.g. depth-first search)
return collect_words_under_node(node)
else:
# else just step over to the next node and continue the loop
node = next_node
For dan, we would stop at node n and return the words dance, dancer and danger. For a different query (e.g. dab), we can stop at node a and return
the words dart, dance, dancer and danger (or even return an error if we don't want to allow searches using prefixes that are not in our tree).
Now if we wanted to reduce the memory footprint of such a tree, we could try compressing the paths. There are certain nodes that are only part of paths and have no extra information. These can be bundled together, so as to avoid the memory overhead of each node.
The initial idea was to do this as a post-processing step. So the tree is generated as above, and after it is generated, we:
- Start from the root node
- For each node
a- If
ahas a single child namedb, replaceaandbwith a single nodeab - Repeat for
ab
- If
Compressing the tree above would yield the following:
root ─ da ─ rt (dart)
│
└─ n ─ ce (dance)
│ └─ r (dancer)
└─ ger (danger)
In this case, we went from 11 nodes in the original tree down to 7 nodes. Not bad - we removed ~42% of the original nodes. But did the memory usage also reduce by a proportional amount?
The problem
Well, no. That's where Python's pymalloc comes into play.
Python uses its own memory allocator called pymalloc for small (< 512 bytes) objects. It works by reserving chunks of 256KB of memory called Arenas, and fitting these small objects in those chunks [1] (oversimplification - Arenas are organized into pools which are organized into blocks, but for the sake of this post, we'll consider an Arena to be just a chunk of memory).
The catch here is that an Arena is freed and given back to the OS only when it's emptied. So even if we delete some of these small objects, we cannot be sure if their memory is given back to the OS, as we don't know if we cleared an Arena.
Here's an over-simplified example - let's say we have 12 nodes, which made it into 3 Arenas, like so:
|=========|=========|=========|
| Node 1 | Node 6 | Node 11 |
| Node 2 | Node 7 | Node 12 |
| Node 3 | Node 8 | |
| Node 4 | Node 9 | |
| Node 5 | Node 10 | |
|---------|---------|---------|
| Arena 1 | Arena 2 | Arena 3 |
|=========|=========|=========|
Let's say that after compression, we've removed nodes 2, 3, 4, 7, 8, 11 and 12. We end up with the following memory allocation:
|=========|=========|=========|
| Node 1 | Node 6 | |
| | | |
| | | |
| | Node 9 | |
| Node 5 | Node 10 | |
|---------|---------|---------|
| Arena 1 | Arena 2 | Arena 3 |
|=========|=========|=========|
In this example, we've removed 7 nodes out of 12, which is about 58%. We would naively expect that 58% of our memory is freed, but since only Arena 3 is emptied, we'll only see a 33% decrease in memory usage. If a single node was left in Arena 3, we would see no difference in memory usage.
How to spot this issue
We can use sys._debugmallocstats() to get some info about the memory usage. This function prints various memory usage metrics, and among other things the number of currently allocated arenas:
...
# arenas allocated total = 106
# arenas reclaimed = 73
# arenas highwater mark = 106
# arenas allocated current = 33
...
This tells us that:
- 106 arenas have been allocated so far
- 73 of them have been freed (reclaimed)
- 106 arenas were used at any point in time
- 33 arenas are used now
Here's a script that performs the following experiment:
- Print memory stats
- Store 1.000.000 strings and print memory stats
- Remove strings based on probability and print memory stats
import random
import sys
import gc
print("Startup memory stats")
sys._debugmallocstats()
strings = []
for i in range(1000000):
strings.append(str(random.random()))
print(f"{len(strings)} strings allocated memory stats")
sys._debugmallocstats()
p = 0.5
# Keep some of the strings based on a given probability
strings = [s for s in strings if random.random() < p]
# Try to force cleanup of these strings, since they are not referenced anywhere else
gc.collect()
print(f"{len(strings)} strings allocated memory stats")
sys._debugmallocstats()
Here's what was printed after running this script with Python3.11.
Startup stats:
# arenas allocated total = 2
# arenas reclaimed = 0
# arenas highwater mark = 2
# arenas allocated current = 2
1 million strings allocated stats:
# arenas allocated total = 80
# arenas reclaimed = 0
# arenas highwater mark = 80
# arenas allocated current = 80
~500k strings allocated stats (after removing ~500k strings randomly):
# arenas allocated total = 80
# arenas reclaimed = 0
# arenas highwater mark = 80
# arenas allocated current = 80
No arenas have been reclaimed. That's expected - sampling at random would yield something close to picking the same amount of strings out of every arena, so we don't expect any of them to be empty.
And if we reduce p (probability with which we keep a string) to 0.00005, where we end up with 50 strings, we get the following allocation stats:
# arenas allocated total = 80
# arenas reclaimed = 41
# arenas highwater mark = 80
# arenas allocated current = 39
Again, since we're using random sampling to remove strings, it makes sense that most of the arenas won't be emptied. Same behavior as above, but since we cleared many more strings, we do see a drop in currently allocated arenas.
On the other hand, if we just keep the first 500k strings (strings = strings[:500000]), we end up cleaning almost half of the arenas:
# arenas allocated total = 80
# arenas reclaimed = 37
# arenas highwater mark = 80
# arenas allocated current = 43
This is expected, as the last-allocated arenas most probably contain the last-generated strings. In other words, we're assuming that the first allocated Arena will contain the first X strings, while the last allocated Arena will contain the last X strings (where X is the number of strings that can fit in an Arena).
So deleting the last 500k strings means that we're probably clearing whole arenas, as they were probably stored next to each other.
Is this a Python problem?
Well, pymalloc is a Python construct, but generally speaking, memory fragmentation is not something that happens only in Python. It's something unavoidable, but manageable when you can control how memory is allocated. It's the kind of problem that makes perfect sense as to why it happens, but you only think of it when you stumble upon it.
Is there a solution?
There's not a single solution per se; in fact, the only way to make sure that we don't have half-empty arenas is to never allocate so many of them in the first place.
In this example, this means that we cannot create X nodes and delete Y of them down the road - we either have to create only Y nodes, or find a different solution like creating all X nodes, deleting some of them, writing the tree to disk in some format and re-reading it in a way that we only create Y nodes.