Here's what I'm seeing (under CPython 3.3), and what it means:

dict completed 1048576 operations (in 75.3 seconds) before hitting the 120-second limit; the next run, at 1048576*2 operations, took more than 120 seconds to complete and was thrown out.

AVL_tree completed 262144 operations (in 66.1 seconds) before hitting the limit; the run at 262144*2 operations took more than 120 seconds and was thrown out.

blist.sorteddict completed 65536 operations (in 77.3 seconds) before hitting the limit; the run at 65536*2 operations took more than 120 seconds and was thrown out.

For the 50/50 workload, the "operations" were half additions of key+value pairs and half lookups of values by keys we know are in the dictionary.
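A minimal sketch of such a 50/50 workload might look like the following (the function name, value scheme, and key distribution are hypothetical; the real benchmark may choose keys differently):

```python
import random

def run_5050_workload(dictionary, num_operations):
    """Perform half insertions and half lookups on a dict-like object.

    Hypothetical sketch of a 50/50 workload: each loop iteration does
    one insertion and one lookup of a key known to be present.
    """
    inserted_keys = []
    for i in range(num_operations // 2):
        # Half of the operations: add a key+value pair.
        dictionary[i] = i * 2
        inserted_keys.append(i)
        # The other half: look up a value by a key we know is present.
        key = random.choice(inserted_keys)
        assert dictionary[key] == key * 2
    return dictionary

d = run_5050_workload({}, 1000)
```

Because the same loop body works on anything supporting `__setitem__` and `__getitem__`, the identical workload can be pointed at a dict, an AVL tree, or a `blist.sorteddict` instance.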

I used to run all the dictionaries for as long as it took to do 4 million operations, but for unbalanced binary trees (for example), that would take far too long in the ordered tests. So I changed the code to keep testing a given tree type until the time for a workload+interpreter at a given size became prohibitive. I didn't want to simply remove unbalanced binary trees from the tests altogether, though, because they're strong performers on unordered values.
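The doubling scheme described above can be sketched as follows (names and the exact timing mechanism are assumptions, not the benchmark's actual code): each run doubles the operation count, and the first size to exceed the time limit is discarded.

```python
import time

def benchmark_until_prohibitive(workload, time_limit=120.0, start_size=1024):
    """Double the operation count until a run exceeds time_limit seconds.

    Sketch of the doubling scheme: the last size that completes within
    the limit is kept; the first size to exceed it is thrown out.
    """
    results = {}
    size = start_size
    while True:
        started = time.perf_counter()
        workload(size)
        duration = time.perf_counter() - started
        if duration > time_limit:
            # This size took too long; discard it and stop doubling.
            break
        results[size] = duration
        size *= 2
    return results
```

Under this scheme, a result like dict's above would show up as the entry `{..., 1048576: 75.3}`, with the 1048576*2 run discarded because it blew past the limit.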

If you look at the graphs (I have to admit they've become a little cluttered), you can see the slower trees "escaping" rapidly (exceeding the 120 second threshold), while the better performing trees grow more slowly and are allowed to continue proving themselves longer. Inspecting these graphs may help in developing an intuition for how the tests were conducted.

The code implementing this method of testing is here.