During my Haskell tutorial, I used balanced binary search trees as an example. I’ve had people write to me asking me to write about them in a non-functional-programming post, because the Haskell one got too tied up in how to do things without assignments, and with tail recursion.
Binary search trees (BSTs) are another one of the really fundamental simple data structures. They’re conceptually similar to heaps – they’re also a kind of value-ordered binary tree with O(lg n) performance – but they’ve got a different set of basic invariants, reflecting a different set of performance goals.
The goal of a structure like a BST is to have a variable-size structure where you can cheaply and easily increase or decrease the size, and where insert, delete, and search are all inexpensive. BSTs are a good choice for implementing things like dictionaries, where you have a key associated with a value, or for anything where you want to be able to search for an arbitrary member quickly.
BSTs are very widely used – for example, the default ordered collections, sets, and maps in the C++ standard libraries are all implemented as a kind of BST (in fact, the very kind that I’m going to describe below).
If you have a meaningful ordering operation, then a BST is a good choice: expanding and contracting the structure happen automatically every time you do an insert or remove; and insert, delete, and search are all bounded by lg(n), where n is the number of values in the tree.
The basic structure is really simple. A BST is a tree where each node has at most two children (called left and right), and where, for any node N, every value in N’s left subtree is less than N’s value, and every value in N’s right subtree is greater than or equal to it.
For example, the diagram to the right is an example of a binary search tree.
It seems easy, initially, to do something like insert a value into a binary search tree. To sketch it out quickly:
class BinarySearchTree(object):
    def __init__(self, value):
        self._left = None
        self._right = None
        self._value = value

    def Insert(self, value):
        if self._value > value:
            if self._left is None:
                self._left = BinarySearchTree(value)
            else:
                self._left.Insert(value)
        else:
            if self._right is None:
                self._right = BinarySearchTree(value)
            else:
                self._right.Insert(value)

    def Print(self, indent):
        for i in range(indent):
            print " ",
        print self._value
        if self._left is None:
            for i in range(indent+1):
                print " ",
            print "None"
        else:
            self._left.Print(indent+1)
        if self._right is None:
            for i in range(indent+1):
                print " ",
            print "None"
        else:
            self._right.Print(indent+1)
In prose: to insert a value, you look to see if it’s smaller than the root node. If it is, you insert it into the left subtree, otherwise, you insert it into the right subtree. If the subtree is empty, then you create a new subtree with the new value.
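Searching uses exactly the same comparisons as the insert. Just as a sketch – this Contains method is hypothetical, and not part of the class as written above – it would look something like this:

    def Contains(self, value):
        # Follow the same path that an insert of this value would take.
        if value == self._value:
            return True
        if value < self._value:
            return self._left is not None and self._left.Contains(value)
        return self._right is not None and self._right.Contains(value)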
The diagram above shows what we’d get if we did the inserts in the order 5, then 6, then 3, then 2, then 4, then 1.
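Concretely, here’s that insertion sequence, using the class sketched above:

b = BinarySearchTree(5)
for value in [6, 3, 2, 4, 1]:
    b.Insert(value)
b.Print(0)
# 5 ends up at the root, with 3 and 6 as its children;
# 2 and 4 are the children of 3, and 1 is the left child of 2.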
Unfortunately, it’s not that simple. What happens if we insert the numbers 1 through 6 in order? Each new value is greater than everything already in the tree, so every new node ends up as a right child, and we get a degenerate tree that’s nothing but a chain of right children.
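You can watch that happen with the same class:

b = BinarySearchTree(1)
for value in [2, 3, 4, 5, 6]:
    b.Insert(value)
# Every new value is >= everything already in the tree, so each
# node's left child stays None, and the "tree" is just the chain
# 1 -> 2 -> 3 -> 4 -> 5 -> 6 through right pointers.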
That’s basically just a very inefficient way of storing a list. Instead of being a fast tree structure, where we can do things in O(lg n) time, it’s a degenerate form, which ends up doing things as if it were a list – that is, linear O(n) time.
The problem is, the binary search tree only works well if the tree is balanced – that is, if for any node in the tree, the node has roughly the same number of children in its left subtree as it does in its right subtree.
So to get the performance that we’d want, we need to find a way of keeping the tree balanced. There are a bunch of ways of doing that; the earliest that I know of was something called an AVL tree. For the most part, people have settled on something called a red-black tree. A red-black tree is a modified version of a binary tree that keeps the tree in approximate balance, and does it in amortized O(lg n) time for all operations. (You can’t beat O(lg n) in a binary search tree, and red-black trees do less work (by a constant factor) maintaining balance than AVL trees.)
Approximate balance means that the tree is kept in a state where, for any node, the heights of its left and right subtrees differ by no more than a factor of two.
Amortized time is a harder notion to explain. Amortized O(lg n) time means that if you do a long series of operations, the average cost per operation will remain O(lg n). Every so often, there will be an operation that is significantly more expensive; but by knowing how much it costs, and how often it happens, you can spread its cost out. The idea is, if you can show that an O(n) operation happens only once every n times you invoke insert, then you can spread that cost out by counting it as one extra step in each regular invocation. Then each invocation has an amortized cost of O(lg n + 1), which is the same as O(lg n).
In the red-black tree, we will periodically need to re-balance the tree, and rebalancing is expensive. But we can show that the rebalance operation happens infrequently – at a rate which can be amortized. So if you observed the performance of 1,000 inserts into the tree, you wouldn’t be able to pick out the expensive ones: the total time would be about 1,000 times the cost of a typical insert, and each insert would appear to be O(lg n).
A red-black tree is a BST where each node is assigned a color – either red or black – and the nodes have the following color-related properties:
- The root of the tree is always black.
- All branches of the tree end with empty leaves, which are counted as black.
- All children of red nodes are black.
- For all nodes in the tree, all downward paths from the node to a leaf contain the same number of black nodes.
What this means is that the longest possible path from the root to a leaf is a path which alternates red and black; the shortest path from root to leaf is a path which is all black. Since all downward paths contain the same number of black nodes, in the worst possible case the longest path is twice the length of the shortest. (And since a tree in which every root-to-leaf path contains h black nodes must have at least 2^h – 1 nodes in total, that keeps the height of the whole tree down around 2 lg n – which is where the O(lg n) bound comes from.)
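To make those properties concrete, here’s a quick sanity-checker sketch. It’s illustrative, not part of the implementation below; it assumes nodes with the GetColor(), GetLeft(), and GetRight() accessors that the code later in this post uses. (It doesn’t bother checking that the root is black.)

def CheckRedBlack(node):
    # Returns the black-height of the subtree rooted at node, raising
    # ValueError if any red-black property is violated. Empty leaves
    # (None) count as black.
    if node is None:
        return 1
    if node.GetColor() == "Red":
        for child in (node.GetLeft(), node.GetRight()):
            if child is not None and child.GetColor() == "Red":
                raise ValueError("red node with a red child")
    left_black = CheckRedBlack(node.GetLeft())
    right_black = CheckRedBlack(node.GetRight())
    if left_black != right_black:
        raise ValueError("black-heights differ between subtrees")
    if node.GetColor() == "Black":
        return left_black + 1
    return left_black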
To make this work, we insert new nodes as leaves exactly the way that we did in the simple BST written above, and new nodes are always red. After the initial insert, we may need to re-balance the tree, by doing a series of operations called pivots (also known as rotations) to make sure that the red-black tree properties are maintained.
A pivot rebalances the tree, by rotating the tree around a node and one of its children. For example, in the example diagrams above, you can see a right pivot. The node being pivoted upward is the one containing 4; the original root of the subtree is the node containing 8. The right pivot moves the left child counter-clockwise, pushing the node containing 4 up to become the root of the subtree. The former root, 8, becomes the right child of 4, and the former right child of 4 (the node containing 5) becomes the left child of 8. Now we’ve got a valid BST again, and the height of the left subtree has been reduced by one, while the height of the right subtree has increased by one.
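Stripped of colors and parent pointers, a pivot is just a couple of pointer swaps. Here’s a minimal sketch, using a bare, hypothetical Node class rather than the real node type below (which also has to maintain parent links):

class Node(object):
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def PivotRight(root):
    # The old root's left child becomes the new root of the subtree;
    # the new root's old right subtree becomes the old root's new
    # left subtree.
    new_root = root.left
    root.left = new_root.right
    new_root.right = root
    return new_root

# The pivot described above: 4 is 8's left child, 5 is 4's right child.
# (The values 2 and 9 are just filler for the other children.)
tree = Node(8, left=Node(4, left=Node(2), right=Node(5)), right=Node(9))
tree = PivotRight(tree)
# Now 4 is the root, 8 is its right child, and 5 is 8's left child.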
Now, what we need to do is to think about how inserts could cause the tree to become invalid, and how we can use pivots to correct it. To do that, we need to consider a set of up to four nodes: the new node; its parent; the sibling of its parent, which we’ll call its uncle; and the parent of its parent, which we’ll call its grandparent. For example, look at the red-black tree below. If we look at the node “8”, its parent is “7”, its grandparent is “5”, and its uncle is “4”.
All possible violations of the tree balance focus on red nodes: newly inserted nodes are always red, and as we rebalance, we’ll always push the imbalance upwards, with the focus on a red node. If we look at a red node, N, to rebalance:
- If the node we’re re-balancing is the root of the tree (i.e., it has no parent), then we just change its color to black.
- If we’re re-balancing around the red node N, and N’s parent is black, then everything is fine. The tree remains valid, with no invariants violated.
- If the node’s parent is red, then we have a red node with a red parent, and we need to do something to fix it:
- If the uncle is also red, then we can change the colors of both the parent and the uncle to black, and the color of the grandparent to red. Then we need to re-balance around the newly red grandparent.
- If the uncle is black, then we’ll need to do a pivot.
If we need to do a pivot, that means that the tree really is out of balance. It’s not just a matter of book-keeping – we’ve got a real path in the tree from root to leaf that’s more than twice the length of the shortest path.
If the focal node, N, is the left child of its parent, and its parent is the right child of its grandparent, then do a right pivot around N and its parent, and continue re-balancing around the former parent (which is now N’s right child, and still red).
If N is a left child, and N’s parent is the left child of its grandparent, then we don’t pivot N; we do a right pivot of N’s parent and grandparent. In this case, we also do some re-coloring: we swap the colors of N’s parent and grandparent, so that the parent is black and the grandparent is red, and we’re in balance.
If N is a right child, then we do the mirror image: if N’s parent is a left child, then we pivot N and its parent left, and then focus on the former parent. If N’s parent is also a right child, we pivot N’s parent and grandparent left, recolor, and we’re done.
That sounds scary, but it really isn’t that complicated. Below, there’s a quick Python implementation of a red-black tree, with the pivot annotated. (There were originally a couple of errors in this code, resulting from me changing the code from explicit member access to accessor methods at the last minute. Thanks to commenter Sasa for catching them.)
class RedBlackTree(object):
    def __init__(self):
        self._tree = None

    def Insert(self, n):
        if self._tree is None:
            self._tree = RedBlackTreeNode(n)
            self._tree.SetColor("Black")
        else:
            self._tree = self._tree.Insert(n)

    def Print(self):
        if self._tree is None:
            print "Empty"
        else:
            self._tree.Print(1)


class RedBlackTreeNode(object):
    def __init__(self, value):
        self._left = None
        self._right = None
        self._value = value
        self.SetColor("Red")
        self._parent = None

    def GetParent(self):
        return self._parent

    def SetParent(self, parent):
        self._parent = parent

    def GetColor(self):
        return self._color

    def SetColor(self, color):
        self._color = color

    def GetLeft(self):
        return self._left

    def SetLeft(self, left):
        self._left = left

    def GetRight(self):
        return self._right

    def SetRight(self, right):
        self._right = right

    def GetGrandParent(self):
        if self.GetParent() is not None:
            return self.GetParent().GetParent()
        else:
            return None

    def GetUncle(self):
        grand = self.GetGrandParent()
        if grand is not None:
            if grand.GetLeft() == self.GetParent():
                return grand.GetRight()
            else:
                return grand.GetLeft()
        else:
            return None

    def Rebalance(self):
        # WP case 1: tree root.
        if self.GetParent() is None:
            self.SetColor("Black")
            return self
        # WP case 2: the parent of the target node is black,
        # so the tree is in fine balance shape; just return the
        # tree root.
        if self.GetParent().GetColor() == "Black":
            return self.GetRoot()
        # From here on, we know that the parent is red.
        # WP case 3: self, parent, and uncle are all red.
        if self.GetUncle() is not None and self.GetUncle().GetColor() == "Red":
            self.GetUncle().SetColor("Black")
            self.GetParent().SetColor("Black")
            self.GetGrandParent().SetColor("Red")
            return self.GetGrandParent().Rebalance()
        # By now, we know that self and parent are red, and the uncle is black.
        # We also know that the grandparent is not None, because if it were, the
        # parent would be root, which must be black. So this means that we
        # need to do a pivot on the parent.
        return self.PivotAndRebalance()

    def GetRoot(self):
        if self.GetParent() is None:
            return self
        else:
            return self.GetParent().GetRoot()

    def PivotAndRebalance(self):
        # First, distinguish between the case where my parent is
        # a left child or a right child.
        if self.GetGrandParent().GetLeft() == self.GetParent():
            if self.GetParent().GetRight() == self:
                # WP case 4: I'm the right child of my parent, and my parent is the
                # left child of my grandparent. Pivot left around me.
                return self.PivotLeft(False)
            else:
                # WP case 5:
                # If I'm the left child, and my parent is the left child, then
                # pivot right around my parent.
                return self.GetParent().PivotRight(True)
        else:
            # My parent is the right child.
            if self.GetParent().GetLeft() == self:
                # WP case 4, reversed: pivot right around me.
                return self.PivotRight(False)
            else:
                # WP case 5, reversed: pivot left around my parent.
                return self.GetParent().PivotLeft(True)

    def PivotRight(self, recolor):
        # Hurrah, I'm going to be the new root of the subtree!
        right = self.GetRight()
        parent = self.GetParent()
        grand = self.GetGrandParent()
        # Move my right child to be the left child of my soon-to-be
        # former parent.
        parent.SetLeft(right)
        if right is not None:
            right.SetParent(parent)
        # Move up, and make my old parent my right child.
        self.SetParent(grand)
        if grand is not None:
            if grand.GetRight() == parent:
                grand.SetRight(self)
            else:
                grand.SetLeft(self)
        self.SetRight(parent)
        parent.SetParent(self)
        if recolor is True:
            parent.SetColor("Red")
            self.SetColor("Black")
            return self.GetRoot()
        else:
            # Rebalance from the new position of my former parent.
            return parent.Rebalance()

    def PivotLeft(self, recolor):
        # Hurrah, I'm going to be the new root of the subtree!
        left = self.GetLeft()
        parent = self.GetParent()
        grand = self.GetGrandParent()
        # Move my left child to be the right child of my soon-to-be
        # former parent.
        parent.SetRight(left)
        if left is not None:
            left.SetParent(parent)
        # Move up, and make my old parent my left child.
        self.SetParent(grand)
        if grand is not None:
            if grand.GetRight() == parent:
                grand.SetRight(self)
            else:
                grand.SetLeft(self)
        self.SetLeft(parent)
        parent.SetParent(self)
        if recolor is True:
            parent.SetColor("Red")
            self.SetColor("Black")
            return self.GetRoot()
        else:
            # Rebalance from the new position of my former parent.
            return parent.Rebalance()

    def Insert(self, value):
        if self._value > value:
            if self.GetLeft() is None:
                self.SetLeft(RedBlackTreeNode(value))
                self.GetLeft().SetParent(self)
                return self.GetLeft().Rebalance()
            else:
                return self.GetLeft().Insert(value)
        else:
            if self.GetRight() is None:
                self.SetRight(RedBlackTreeNode(value))
                self.GetRight().SetParent(self)
                return self.GetRight().Rebalance()
            else:
                return self.GetRight().Insert(value)

    def Print(self, indent):
        for i in range(indent):
            print " ",
        print "%s (%s)" % (self._value, self.GetColor())
        if self.GetLeft() is None:
            for i in range(indent+1):
                print " ",
            print "None(Black)"
        else:
            self.GetLeft().Print(indent+1)
        if self.GetRight() is None:
            for i in range(indent+1):
                print " ",
            print "None(Black)"
        else:
            self.GetRight().Print(indent+1)
For a quick example of creating and populating a tree, here’s the code that generates the red-black tree used as an example above.
b = RedBlackTree()
for i in range(10):
    b.Insert(i)
b.Print()
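If you want to convince yourself that the result really satisfies the invariants, you can run the checker sketched earlier over it (reaching into the private _tree member, since the wrapper class doesn’t expose its root):

CheckRedBlack(b._tree)   # raises ValueError if any property is violated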
Red-Black trees are now used as the basic data structure for active tasks in the new Linux Completely Fair Scheduler (CFS). The ordering key is a value representing the run-time deficit of a task, in nanoseconds, relative to its CPU allocation (as determined by its priority). The next task to be run is the leftmost task in the tree, which is available in O(1) time by keeping a pointer to it (in addition to the root of the R-B tree).
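Finding that leftmost node in a tree like the one in this post is just a walk down the left spine. This is a hypothetical helper written in the style of the code above, not part of CFS itself – and CFS avoids even this O(lg n) walk by caching the pointer:

def Leftmost(node):
    # Keep following left children; in a BST, the node at the end of
    # that walk holds the smallest key.
    while node.GetLeft() is not None:
        node = node.GetLeft()
    return node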
The previous scheduler used 40 (2 x 20) linked lists; it was called the O(1) scheduler.
I think you have an error in PivotLeft. Instead of
if grand.GetRight(parent):
you should have
if grand.GetRight() == parent:
Otherwise, great post, really readable code (this is why I love Python)
I meant to say the mistake is in PivotRight 🙂
Also, in Insert instead of
return self.GetLeft.Insert(value)
should be
return self.GetLeft().Insert(value)
(This still doesn’t mean the program will work as I didn’t actually try to run it:) )
PEP8 states that in Python class methods must start with lowercase (http://www.python.org/dev/peps/pep-0008/ )
Sasa:
Thanks for catching those. That’s what I get for trying to do a last-minute cleanup :-).
-Mark
Anonymous:
When I’m coding, I tend to stick to the style that we use at work. It’s not standard for Python, but it’s what I’m used to – method names uppercase, private stuff prefixed with “_”, two-space indent, etc.
Aren’t red-black trees log(n) time (not amortized)? I’m pretty sure splay trees are log(n) amortized but that AVL and red-black trees are both plain log(n).
The only reason I even feel qualified to ask this is because I just finished my data structures class about 3 weeks ago and it’s still fresh in my memory.
Eric is correct. The worst-case time to insert a node, delete a node, or search in a red-black tree is O(log n). (It’s also true that the amortized time is O(log n), but only in the trivial sense that all worst-case time bounds are also amortized time bounds.)
You should also check out Bob Sedgewick’s recent refinement, which he calls left-leaning red-black trees. (Sedgewick was one of the authors of the original red-black tree paper.)
“# WY case 5 reverse” so you use dvorak it seems? 😉
This brings back memories. I’ve worked with the Mumps/M/Cache programming language for over a decade now, and it stores its data using a B+Tree (I believe) that is very similar to your Red-Black tree. When I first started working in Mumps the database would occasionally get corrupted (they called it a database degradation), and then you would need to go in and manually rebuild the nodes to make sure everything pointed to the correct place. I haven’t had to do that in years.
~~shudder~~ ~~shudder some more~~
Mark, I notice that in the method RedBlackTreeNode.PivotAndRebalance(), the comment:
# WP case 4: I'm the right child of my parent, and my parent is the
# left child of my grandparent. Pivot right around me.
(emphasis added) is followed by a call to self.PivotLeft(False). Is the comment wrong or the code (I’m inclined to think the code is, since the comment follows your original description)?
This is a question about your Python coding style, not the concept (which you’ve explained clearly. Thank you very much!), so I’ll understand if you don’t want to address it. Or perhaps there’s a previous blog entry that explains your position on this…?
So: Why all the setters and getters? That’s appropriate in C++ or Java, but since nothing in a Python object is actually private (by deliberate design), the only time you’d need to use that is if you’re using the setter/getter functions as a ‘property’ of the object (using the built-in property() function).
Python members are meant to be accessible, as the Python language designer decided to assume that Python programmers would behave like adults and treat member data responsibly; using GetLeft() and GetRight() instead of just directly accessing _left and _right makes the code less clear (not to mention a lot longer!), which is the reason for that language design decision in the first place (I believe).
Just curious.
Wilson:
I’m in a bit of a rush right now, so I don’t have time to check the problem you pointed out. I’ll look at it later tonight, and correct it if I messed up, which is likely.
WRT style: Lately, I’m doing most of my programming in C++ (ick! it’s worse than I remembered!), and that’s definitely invaded the way that I’m writing code.
As a general thing, I don’t agree with a lot of the decisions of Python’s designers, and so I program in the way that I’d prefer things to work. In the case of the accessor functions, I was thinking of the code as something specifically for use by other code in a delegating style. So I definitely imagine other code coming in and munging the tree. I don’t like exposing instance variables to clients; I think it’s better to do it with a controlled API call.
An OT request:
There is currently a discussion on this thread at Pharyngula
http://scienceblogs.com/pharyngula/2008/05/we_happy_hooligans.php
concerning the dimensionality of circles and spheres. If you have some time, might it be possible for you to weigh in? Thank you.
There are probably times when that’s necessary, although I feel that if there’s some case where exposing instance variables is going to be a problem because a client might wreck ’em, then the answer is better documentation of correct usage of the instance variable.
Not only will it help clients to understand how to use the class, it will help six months down the road when the original programmer comes back to the code and needs to fix a bug or add a feature.
Clients, presumably, don’t want to wreck the objects they’re using (i.e. if someone is importing (or linking to) your code), they want it to work right, so they’re going to try to use it correctly.
So, if the usage and purpose of instance variables are correctly and clearly documented, clients should be able to avoid causing problems.
It may be that it’s easier simply to write
def GetInstanceVar(): return _var
than to explain what _var is for and how to avoid problems with it, but it won’t help down the road.
However, conceding that there might be occasions where it’s desirable and a net benefit to control (or, with Python, only appear to control) access to instance variables, I don’t think that doing so should be an automatic reflex.
There are times when doing that gets in the way, if not of functionality, then certainly of readability. In those cases, I always bring to mind a maxim of Martin Fowler: “Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” (Refactoring: Improving the Design of Existing Code)
Providing an accessor function is good for readability if it hides some complexity (such as updating or validating a value before returning it). Otherwise, it degrades readability. (Although, that may not be the case for people who are used to reading that style.)
Finally, even if there’s some argument to be made (for instance, enforcement of the concept of Encapsulation) that all instance variables should only be accessed by clients through accessor methods, all the time, why is the object itself treated as its own client?
My (probably very shaky) understanding of objects includes the idea that one of their advantages is that they carry their own state (in the form of their member data). Why should the object have to go through its own accessor functions to get at its own data?
In the example above, why should the PivotRight() and PivotLeft() functions have to call self.GetLeft() and self.GetRight()? I’m really not trying to be snarky here (and I apologize if I come off that way); I really don’t understand why an object has its own data, but doesn’t access it directly. Just curious.
“As a general thing, I don’t agree with a lot of the decisions of Python’s designers, and so I program in the way that I’d prefer things to work. In the case of the accessor functions, I was thinking of the code as something specifically for use by other code in a delegating style. So I definitely imagine other code coming in and munging the tree. I don’t like exposing instance variables to clients; I think it’s better to do it with a controlled API call.”
Oh, jeez. Whatever. Solution for programmers that “mungle” data structures. Fire them. (or kill them with fire, whichever suits…)
Lord, save us from the people trying to save us from ourselves.
I, like, sooooo have to mention this:
http://www.eecs.usma.edu/webs/people/okasaki/jfp99.ps
One of the most elegant and simple Red-Black Tree implementations I’ve ever seen… and purely functional.
Mark, I know I’m a few days late, but as I was reading the article I had a thought.
Red-black balancing is a pretty ingenious way to avoid a degenerate linked-list. I’m sure the original inventors spent a lot of time thinking about how to solve this (mentioned above is the name of one of the original authors: Sedgewick).
When you describe such interesting data structures and algorithms, would you mind giving us a little insight into how the mathematical mind solves these problems? What possibilities do you think the inventors considered, what branch of math they used to analyze it, maybe some of the types of research that were probably done? Heck, even suggest what questions I should be asking you about this sort of thing.
As a wannabe mathematician, I’d like to learn some of the deeper mathematical insights, and I think you’d be able to share them in an informative (and approachable!) manner …