The SortedContainers internal implementation is based on a couple observations. The first is that Python lists are fast, really fast. They have great characteristics for memory management and random access. The second is that bisect.insort is fast. This is somewhat counter-intuitive since it involves shifting a series of items in a list. But modern processors do this really well. A lot of time has been spent optimizing mem-copy/mem-move-like operations both in hardware and software.
But using only one list and bisect.insort would produce sluggish behavior for lengths exceeding ten thousand. So the implementation of SortedList uses a list of lists to store elements. In this way, inserting or deleting is most often performed on a short list. Only rarely does a new list need to be added or deleted.
SortedList maintains three internal variables:
_index. The first is simply the list of lists; each member
is a sorted sublist of elements. The second contains the maximum element in
each of the sublists. This is used for fast binary-search. The last maintains a
tree of pair-wise sums of the lengths of the lists.
Lists are kept balanced using the load factor. If a sublist’s length exceeds double the load then it is split in two. Likewise at half the load it is combined with its neighbor. By default this factor is 1,000 which seems to work well for lengths up to ten million. Lengths above that are recommended a load factor that is the square root to cube root of the average length. (Although you will probably exhaust the memory of your machine before that point.) Experimentation is also recommended. A load factor performance comparison is also provided. For more in-depth analysis, read Performance at Scale which benchmarks SortedContainers with ten billion elements.
Finding an element is a two step process. First the
_maxes list, also known
as the “maxes” index, is bisected which yields the position of a sorted
sublist. Then that sublist is bisected for the location of the element.
Compared to tree-based implementations, using lists of lists has a few advantages based on memory usage.
- Most insertion/deletion doesn’t require allocating or freeing memory. This can be a big win as it takes a lot of strain off the garbage collector and memory system.
- Pointers to elements are packed densely. A traditional tree-based implementation would require two pointers (left/right) to child nodes. Lists have no such overhead. This benefits the hardware’s memory architecture and more efficiently utilizies caches.
- The memory overhead per item is effectively a pointer to the item. Binary tree implementations must add at least two more pointers per item.
- Iteration is extremely fast as sequentially indexing lists is a strength of modern processors.
Traditional tree-based designs have better big-O notation but that ignores the realities of today’s software and hardware. For a more in-depth analysis, read Performance at Scale.
Indexing uses the
_index list which operates as a tree of pair-wise sums of
the lengths of the lists. The tree is maintained as a dense binary tree. It’s
easiest to explain with an example. Suppose
_lists contains sublists with
these lengths (in this example, we assume the load factor is 4):
map(len, _lists) -> [3, 5, 4, 5, 6]
Given these lengths, the first row in the index is the pair-wise sums:
[8, 9, 6, 0]
We pad the first row with zeros to make its length a power of 2. The next rows of sums work similarly:
[17, 6] 
Then all the rows are concatenated in reverse order so that the index is finally:
[23, 17, 6, 8, 9, 6, 0, 3, 5, 4, 5, 6]
With this list, we can efficiently compute the index of an item in a sublist
and, vice-versa, find an item given an index. Details of the algorithms to do
so are contained in the docstring for
For example, indexing requires traversing the tree to a leaf node. Each node
has two children which are easily computable. Given an index,
left-child is at
pos * 2 + 1 and the right-child is at
pos * 2 + 2.
When the index is less than the left-child, traversal moves to the left sub-tree. Otherwise, the index is decremented by the left-child and traversal moves to the right sub-tree.
At a leaf node, the indexing pair is computed from the relative position of the node as compared with the offset and the remaining index.
For example, given the following index:
_index = 14 5 9 3 2 4 5 _offset = 3 Tree: 14 5 9 3 2 4 5
Indexing position 8 involves iterating like so:
- Starting at the root, position 0, 8 is compared with the left-child node (5) which it is greater than. When greater, the index is decremented and the position is updated to the right child node.
- At node 9 with index 3, we again compare the index to the left-child node with value 4. Because the index is the less than the left-child node, we simply traverse to the left.
- At node 4 with index 3, we recognize that we are at a leaf node and stop iterating.
- To compute the sublist index, we subtract the offset from the index of the
5 - 3 = 2. To compute the index in the sublist, we simply use the index remaining from iteration. In this case, 3.
The final index pair from our example is (2, 3) which corresponds to index 8 in the sorted list.
Maintaining the position index in this way has several advantages:
- It’s easy to traverse to children/parent. The children of a position in the
(pos * 2) + 1and
(pos * 2) + 2. The parent is at
(pos - 1) // 2. We can even identify left/right-children easily. Each left-child is at an odd index and each right-child is at an even index.
- It’s not built unless needed. If no indexing occurs, the memory and time accounting for position is skipped.
- It’s fast to build. Calculating sums pair-wise and concatenating lists can all be done within C-routines in the Python interpreter.
- It’s space efficient. The whole index is no more than twice the size of the
length of the
_listsand contains only integers.
- It’s easy to update. Adding or removing an item involves incrementing or
log2(len(_index))items in the index. The only caveat to this is when a new sublist is created/deleted. In those scenarios the entire index is deleted and not rebuilt until needed.
The construction and maintenance of the positional index is unusual compared to other traditional designs. Whether the design is novel, I (Grant Jenks) do not know. Until shown otherwise, I would like to refer to it as the “Jenks” index.
Each sorted container has a function named
_check for verifying
consistency. This function details the data-type invariants.