Comparison of sorted dictionary implementations in Python

First off, you may want to see my prior comparisons here.

Relative to my previous, related comparisons, this one uses larger data set sizes, includes sorteddict, and is heavy on writes (one workload adds keys sequentially, the other adds them in random order) and light on random reads.

Some notes about the datastructures:

AA tree: a simplified form of a red-black tree.

AVL tree: commonly used on disk, but here it is in virtual memory.

binary_tree_dict: an unbalanced binary tree, meaning it's nice if your keys are in random order, and awful if they are sorted.

B tree: also commonly used on disk, but used in virtual memory here.

dict: the usual Python hash table. It is not ordered, and is included only for the sake of comparison. Interestingly, in CPython 2.7 it is more subject to MemoryError exceptions than the other datastructures here - despite being run on a 64-bit system. CPython 3.4, PyPy and PyPy3 did not exhibit the problem. Also, the random workload did not exhibit the problem, but the sequential workload did.

Red-black tree: commonly believed to be the performance champion, but that's not what happened here.

Scapegoat tree: believed to give a time tradeoff: time spent self-balancing against time spent searching a deep tree, using a user-specified constant (alpha, also written α) to determine how much time goes to each. Here we see that there is indeed a tradeoff, but it's not a very good one. In fact, at about 67108864 (2^26) elements, some operations started taking much longer than they should.

Sorteddict: has a pretty impressive constant factor.

Splay trees: reorganize themselves on writes like normal, but also on reads. They're doing well in the sequential workload, but that's probably because of the infrequent random reads; I believe this because using a splay tree in key order reduces it to a linked list, much like an unbalanced binary tree.

Treap: sometimes said to be the fastest (even compared to a red-black tree), but to have a greater standard deviation in operation times. Here we see it performing poorly compared to sorteddict.
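To illustrate the note above about unbalanced binary trees, here is a minimal sketch that builds a plain binary search tree from sorted keys and from shuffled keys, then compares the heights. It is not the binary_tree_dict code used in the tests; the names Node, insert, height and build are made up for this example.

```python
import random


class Node:
    """One node of a plain, unbalanced binary search tree (illustrative only)."""
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key with no rebalancing at all, and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # duplicate key: nothing to do


def height(root):
    """Height in nodes, computed iteratively so very deep trees don't hit the recursion limit."""
    best = 0
    stack = [(root, 1)] if root is not None else []
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return best


def build(keys):
    """Build a tree by inserting the keys in the order given."""
    root = None
    for key in keys:
        root = insert(root, key)
    return root


n = 2000
sorted_height = height(build(range(n)))  # every key goes to the right: a linked list

shuffled = list(range(n))
random.Random(0).shuffle(shuffled)       # fixed seed, so the run is repeatable
random_height = height(build(shuffled))

print("sorted keys:", sorted_height, "shuffled keys:", random_height)
```

With sorted keys the height equals the number of keys, so each add walks the whole chain; with shuffled keys the tree stays shallow, which is why the same structure can look fine in the random workload and awful in the sequential one.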

Methodology:

The datastructures are each tested five times for a given workload and interpreter. A mean and standard deviation are calculated from the five durations.
Test computer load/quiescence
- If the load on the machine used for testing exceeds 1.0 just before a test run is to be initiated, the test script pauses 20 seconds and checks the load again.
- This was done on a system with 3 CPU cores, so with a load average under 1.0, only one of the 3 cores should be (partially) busy. This should give ample CPU time, but may still have an impact on CPU caches.
- Load spikes
  - If the load spikes during a test due to another process, it does impact that test, but it most likely won't impact the next.
  - Also if there's a load spike from another process during a test, the corresponding error bar gains size in the corresponding graph.
  - The system was otherwise lightly loaded - there was pretty much nothing else running but a couple of sshd's and cronjobs. Backups were disabled.
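The load check described above could look something like the sketch below. It is not the actual test script: the function name wait_for_quiet and its parameters are made up here, and using the 1-minute load average is an assumption, since the text only says "the load". os.getloadavg() is only available on Unix-like systems.

```python
import os
import time


def wait_for_quiet(max_load=1.0, pause_seconds=20,
                   getloadavg=os.getloadavg, sleep=time.sleep):
    """Pause until the load is at or below max_load; return the number of pauses.

    getloadavg and sleep are parameters only so the loop can be exercised
    without touching the real machine load or really sleeping.
    """
    pauses = 0
    # os.getloadavg() returns the (1, 5, 15) minute averages; we assume the
    # 1-minute figure is the one that matters just before a test run.
    while getloadavg()[0] > max_load:
        sleep(pause_seconds)
        pauses += 1
    return pauses
```

Calling wait_for_quiet() with no arguments right before each test run would give the behaviour described: check, pause 20 seconds if the load exceeds 1.0, and check again.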
A fresh interpreter is used for each combination of interpreter, workload and test. I toyed with preexhausting the heap with a linked list for a while, but it was giving uninteresting results and the prefragmentation took a long time, so it was scrapped.
The datastructure operations were 95% adds and 5% retrieves. A retrieve against an empty datastructure is treated as though nothing happened.
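As a rough sketch of that operation mix (not the code actually used for the tests), the function below drives any dict-like object with about 95% adds and 5% retrieves. The name run_workload, the seed, the 64-bit random keys, and the choice to retrieve a key that was added earlier are all assumptions made for the example.

```python
import random


def run_workload(store, size, sequential, seed=0, read_fraction=0.05):
    """Apply `size` operations to a dict-like store; return (adds, reads)."""
    rng = random.Random(seed)
    added = []          # keys added so far, so a retrieve can pick a real key
    adds = reads = 0
    for i in range(size):
        if rng.random() < read_fraction:
            if added:
                _ = store[rng.choice(added)]   # random read of an existing key
                reads += 1
            # A retrieve against an empty store is treated as though
            # nothing happened.
            continue
        # Sequential workload: keys arrive in sorted order.
        # Random workload: keys are random 64-bit integers (an assumption).
        key = i if sequential else rng.getrandbits(64)
        store[key] = i
        added.append(key)
        adds += 1
    return adds, reads


adds, reads = run_workload({}, 10000, sequential=True)
print(adds, reads)
```

A plain dict is used here only because it is always available; any of the sorted dictionaries with the usual mapping interface could be passed in its place.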
As soon as one interpreter/workload test takes more than 30 minutes, it is cut from the subsequent, larger tests. This allows doing things like testing a binary tree with a sequential workload, without it taking a very long time at upper sizes.
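The five repetitions, the mean and standard deviation, and the 30 minute cutoff can be sketched together as follows. This is an illustration under several assumptions, not the real harness: the names time_repeated and sweep are made up, the cutoff is applied when any single run exceeds the limit, the sample standard deviation is used, and everything runs in one process, whereas the real tests start a fresh interpreter each time.

```python
import statistics
import time


def time_repeated(run, size, repetitions=5, clock=time.perf_counter):
    """Call run(size) `repetitions` times; return the durations in seconds."""
    durations = []
    for _ in range(repetitions):
        start = clock()
        run(size)
        durations.append(clock() - start)
    return durations


def sweep(candidates, sizes, time_limit=30 * 60, timer=time_repeated):
    """Test each candidate at increasing sizes.

    candidates maps a name to a function taking a size. Returns a dict
    {(name, size): (mean, standard deviation)}. A candidate that goes over
    time_limit is still recorded at that size, but is skipped at all the
    larger sizes that follow.
    """
    active = dict(candidates)
    results = {}
    for size in sizes:
        for name, run in list(active.items()):
            durations = timer(run, size)
            results[name, size] = (statistics.mean(durations),
                                   statistics.stdev(durations))
            if max(durations) > time_limit:
                del active[name]   # cut from the subsequent, larger tests
    return results
```

The sizes are expected in increasing order, so that a structure which is already too slow at one size is never tried at a bigger one.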

Here's the comparison in graph form

Here's the comparison again as a collapsible HTML table

Timestamp: 2025-07-22 07:03:43 PDT

Back to Dan's tech tidbits

