Category Archives: Programming

Finger Tree Update: I forgot something

As an alert commenter pointed out, I left out one really important thing in
my earlier post about finger trees. That’s what I get for trying to write when
I’m sick :-). Seriously, this is a point that’s implied by the post as it stands, but never explicitly stated – and it’s important enough that it should be stated explicitly.

The monoid-annotated tree structure doesn’t replace the original
data structure: it’s superimposed on it.

So, as I said: cons-cell style lists are ubiquitous and beloved by
functional and non-functional programmers. In finger trees, you’re
not getting rid of them. The point of finger trees is to let
you keep the convenient data structure, with its basic operations
and properties intact, and to augment it with a tree that lets you search it efficiently.

To illustrate: here’s the number list example from yesterday’s post. The
original list is at the bottom, with green pointers representing the
original cons-list next-pointers. The monoid-annotated tree is on top,
with red pointers. The combination of the original list with the
monoid annotated tree is a finger tree.

[Figure: the counted finger-tree over the number list]

The point of this is that you’ve still got your cons list. All of the
beautiful recursive/iterative algorithms that walk the list continue to
work exactly the way that they would in the traditional cons-list: for code that walks the list, the fact that there are finger-tree pointers
sitting on top of the list is irrelevant – and, in fact, completely invisible. For algorithms that want to search the list, the tree structure is there,
and allows the searches to be performed much more quickly than they could be on
the traditional list. The superposition of those two structures is the
genius of the finger tree.
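
To make that concrete, here’s a minimal Haskell sketch of the idea – my own illustration, not a real finger-tree implementation. The cons list stays an ordinary list; the annotation tree caches element counts (the monoid here is just integers under addition), so indexing can descend the tree instead of chasing list pointers. (In the figure, each leaf would point at a cons cell; in this sketch, the leaf simply holds the element.)

    -- An ordinary cons list: nothing about it changes.
    data List a = Nil | Cons a (List a)

    -- The annotation tree layered on top. Each node caches the number
    -- of list elements beneath it; each leaf corresponds to one cell.
    data Counted a
      = Leaf a
      | Node Int (Counted a) (Counted a)

    count :: Counted a -> Int
    count (Leaf _)     = 1
    count (Node n _ _) = n

    -- Indexing descends the tree by counts: O(log n) instead of the
    -- O(n) pointer-chase through the list itself.
    index :: Counted a -> Int -> a
    index (Leaf x)     _ = x
    index (Node _ l r) i
      | i < count l = index l i
      | otherwise   = index r (i - count l)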

Finally: Finger Trees!

For ages, I’ve been promising to write about finger trees. Finger trees are
an incredibly elegant and simple structure for implementing sequence-based
data structures. They’re primarily used in functional languages, but there’s nothing
stopping an imperative-language programmer from using them as well.

In functional programming languages, lists are ubiquitous: they’re so easy to use for recursive iteration that they show up everywhere. I can’t think of any non-trivial program that I’ve written in Haskell, Lisp, or OCaml that doesn’t use lists. (Heck, I’ve wound up using cons-cell libraries a couple of times in C++.)

Of course, lists aren’t perfect. In fact, for some things, they absolutely suck. Suppose I’ve got a really long list – like a list of names and phone numbers from a telephone book. My local phone book has 600 pages of names, and the pages are incredibly dense – there’ve got to be at least a couple of hundred names per page. So in a phone book for a suburban region of New York City, the list of phone numbers will have 120,000 names. Now imagine that I want to look up one name in that list – on average, I’m going to have to chase through 60,000 pointers. 60,000 isn’t a huge number for a computer, but still, chasing 60,000 pointers is a pretty big deal. If I stored the names in a sorted array, it would be a lot less convenient for some tasks, but I could use binary search to find names, and I could find any name in no more than 17 comparisons (since ⌈log2 120,000⌉ = 17) – and the whole shebang would be contiguous, so I wouldn’t even be chasing pointers.
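
For the curious, the array-based lookup is just textbook binary search – here’s a quick Haskell sketch over a generic sorted array (not actual phone-book data):

    import Data.Array (Array, bounds, (!))

    -- Each probe halves the remaining range, so 120,000 entries need
    -- at most 17 probes.
    bsearch :: Ord k => Array Int k -> k -> Maybe Int
    bsearch arr key = go lo0 hi0
      where
        (lo0, hi0) = bounds arr
        go lo hi
          | lo > hi   = Nothing
          | otherwise = case compare key (arr ! mid) of
              LT -> go lo (mid - 1)
              GT -> go (mid + 1) hi
              EQ -> Just mid
          where
            mid = (lo + hi) `div` 2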

What finger trees do is give me a way of representing a list that has both the convenience
of the traditional cons list, and the search efficiency of the array based method.

Continue reading

Cloud Computing

In general, I try to keep the content of this blog away from my work. I don’t do
that because it would get me in trouble, but rather because I spend enough time on work, and blogging is my hobby. But sometimes there’s an overlap.

One thing that’s come up in a lot of conversations and a lot of emails is the idea of cloud computing. A lot of people are interested in it, but they’re not really sure of what it is, or what it means.

So what do we mean when we talk about “cloud computing”? What’s the cloud? How’s it different from good old-fashioned client/server computing?

Continue reading

Can simulations replace animal testing? Alas, no.

My good friend and blogfather, Orac, posted something yesterday about animal testing
in medical laboratories. I’ve been meaning to write something about that for a while; now
seems like a good time.

I’m not someone who thinks that being cruel to animals is no big deal. I have known
some people like that, but thankfully they’re very rare, and none of them
were scientists whose work involves doing animal testing in real laboratories.

But animal testing isn’t about pointless cruelty. It’s about understanding things that we
simply cannot learn about in any other way. It’s extremely important to minimize the
cruelty we inflict on the subjects of animal tests: there’s no benefit in torturing an animal;
there’s no good reason to inflict unnecessary pain on any living thing. When we
can study something without using animals, we should. When we must use animals
in scientific study, we should be as kind to them as we possibly can. But the fact remains that in many cases, animal studies are necessary.

I don’t want to get into a long discussion of the ethics of it here; that’s a discussion
which has been had hundreds of times in plenty of other places, and there’s really no sense
repeating it yet again. But there is one thing I can contribute to this discussion. One of the
constant refrains of animal-rights protesters arguing against animal testing is: “Animal
testing isn’t necessary. We can use computer simulations instead.”

As a computer scientist who’s spent some time studying simulation, I feel qualified
to comment on that aspect of this argument.

The simplest answer to that is the old programmer’s mantra: “Garbage in, garbage out”.

Continue reading

Two-Three Trees: a different approach to balance

This post is very delayed, but things have been busy.

I’m working my way up to finger trees, which are a wonderful
functional data structure. They’re based on trees – and like many
tree-based structures, their performance relies heavily on the balance
of the tree. The more balanced the tree is, the better they perform.

In general, my preference when I need a balanced tree is a red-black
tree, which I wrote about before. But there are other kinds of balanced trees,
and for finger trees, many people prefer a kind of tree called a 2/3 tree.

A two-three tree is basically a B-tree with a maximum of two values per node.
If you don’t know what a B-tree is, don’t worry. I’m going to explain all the details.

A two-three tree is a tree where each node contains either one or two values. If the node isn’t a leaf, then its number of children is one more than its number of values. The basic value constraints are the same as in a binary search tree: smaller values to the left, larger values to the right – and in a two-value node, the middle subtree holds the values that fall between the node’s two values.

The big difference in the 2/3 tree is balance: it’s got a much stronger
balance requirement. In a 2/3 tree, every path from the tree root to the leaves
is exactly the same length.
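
As a sketch of the shape of the thing, here’s what a 2/3 tree looks like as a Haskell type, along with a search function. (The equal-depth invariant isn’t expressible in the type; it has to be maintained by the insertion and deletion operations.)

    -- One value with two children, or two values with three children.
    -- Invariant (not enforced by the type): every Leaf is at the
    -- same depth.
    data TwoThree a
      = Leaf
      | Node2 (TwoThree a) a (TwoThree a)
      | Node3 (TwoThree a) a (TwoThree a) a (TwoThree a)

    member :: Ord a => a -> TwoThree a -> Bool
    member _ Leaf = False
    member x (Node2 l v r)
      | x < v     = member x l
      | x > v     = member x r
      | otherwise = True
    member x (Node3 l v1 m v2 r)
      | x < v1             = member x l
      | x > v2             = member x r
      | x == v1 || x == v2 = True
      | otherwise          = member x m    -- v1 < x < v2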

Continue reading

Gap Buffers, or, Don't Get Tied Up With Ropes?

Last week, I promised to continue my discussion of ropes. I’m going to
break that promise. But it’s in a good cause.

If you’re a decent engineer, one of the basic principles you should
follow is keeping things as simple as possible. One of the essential skills
for keeping things simple is realizing when you’re making something
overly complicated – even if it’s cool, and fun, and interesting.

Working out the rope code for more complex operations, I found myself
thinking that it really seemed complex for what I was doing. What I’m
doing is writing myself a text editor. I’m not writing a string
processing framework for managing gigabytes of data; I’m just writing a
text editor.

So I decided to try, for the purposes of testing, to look at one of
the simpler data structures. A classic structure for implementing editor
buffers is called a gap buffer. I’ll explain the details below.
The drawback of a gap buffer is that large cursor motions are relatively
expensive – the cost of moving the cursor is proportional to the distance
that you move it.
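
The classic gap buffer is a flat array with an unused “gap” at the cursor position. Purely for illustration, here’s a functional analogue – a pair of lists split at the cursor (a zipper) – which has the same cost profile: inserting at the cursor is O(1), and moving the cursor costs one step per character moved.

    -- (reversed text before the cursor, text after the cursor)
    type GapBuffer = (String, String)

    -- Insertion at the cursor is constant time.
    insertChar :: Char -> GapBuffer -> GapBuffer
    insertChar c (before, after) = (c : before, after)

    -- Cursor motion pays one step per character moved: cheap locally,
    -- expensive for large jumps, exactly the tradeoff described above.
    moveLeft, moveRight :: GapBuffer -> GapBuffer
    moveLeft (c : before, after)  = (before, c : after)
    moveLeft buf                  = buf    -- already at the start
    moveRight (before, c : after) = (c : before, after)
    moveRight buf                 = buf    -- already at the end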

So I wrote a test. I implemented a basic gap buffer. Then I built a
test that did 300,000 inserts of five characters; then slid the cursor
from the end, to the beginning, back to the end, back to the beginning,
and back to the end again. Total execution time? 150 milliseconds. And
that includes resizing the buffer 12 times, because of a deliberately
stupid buffer sizing and copying strategy.

And with that, out the window went the ropes. Because in Java, using a slapped-together, sloppy piece of test code that does things in a dumb brute-force way, on a workload much heavier than anything an editor will encounter in routine use, the gap buffer achieves performance that is more than adequately fast. (Based on some experiments I did back at IBM, delays of 1/10th of a second are roughly when people start to notice that an editor is slow. If you can respond in less than 1/10th of a second, people don’t perceive a troublesome delay.) If the gap buffer can achieve the needed performance that easily, then there’s absolutely no reason to do anything more complicated.

Continue reading

Ropes: Twining Together Strings for Editors

It’s been a while since I’ve written about any data structures. But it just so happens that I’m currently working on implementing a really interesting and broadly useful data structure, called a Rope.

A bit of background, to lead in. I’ve got this love-hate relationship with some of the development tools that Rob Pike has built. (Rob is one of the Unix guys from Bell Labs, and was one of the principal people involved in both the Plan9 and Inferno operating systems.) Rob has implemented some amazing development tools. The two that fascinate me are called Sam and Acme. The best and worst features of both come from a sort of extreme, elegant minimalism. There’s no bloat in Rob’s tools, no eye-candy, no redundancy. They’re built to do a job, and do it well – but not to do any more than their intended job. (This can be contrasted against Emacs, which is a text editor that’s grown into a virtual operating system.)

The positive side of this is that they’re incredibly effective, and they demonstrate just how simple a programmer’s text editor should be. I’ve never used another tool that is more effective than Acme or Sam. In all seriousness, I can do more of my work more easily in Sam than I can in Emacs (which is my everyday editor). But on the other hand, that extreme minimalist aesthetic strictly eliminates any overlap: there’s one way to do things, and if you don’t like it, tough. In the case of Acme and Sam, that meant that you used the mouse for damn-near everything. You couldn’t even use the up and down arrows to move the cursor!

This is a non-starter for me. As I’ve mentioned once or twice, I’ve got terrible RSI in my wrists. I can’t use the mouse that much. I like to keep my hands on my keyboard. I don’t mind using the mouse when it’s appropriate, but moving my hand from the keyboard to the mouse every time I want to move the cursor? No. No damned way. Just writing this much of this post, I would have had to go back and forth between the keyboard and mouse over 50 times. (I was counting, but gave up when I hit 50.) A full day of that, and I’d be in serious pain.

I recently got reminded of Acme, because my new project at work involves using a programming language developed by Rob Pike. And Acme would really be incredibly useful for my new project. But I can’t use it. So I decided to bite the bullet, and use my free time to put together an Acme-like tool. (Most of the pieces that you need for a prototype of a tool like that are available as open-source components, so it’s just a matter of assembling them. Still a very non-trivial task, but a possible one.)

Which, finally, leads us back to today’s data structure. The fundamental piece of a text editor is the data structure that you use to represent the text that you’re editing. For simplicity, I’m going to use Emacs terminology, and refer to the editable contents of a file as a Buffer.

How do you represent a buffer?

As usual with data structures, you start by asking: What do I need it to do? What performance characteristics are important?

In the case of a text buffer, you can get by with a fairly small set of basic operations:

  • Fast concatenation: concatenating blocks of text needs to be really fast.
  • Fast insert: given a point in a block of text, you need to be able to quickly insert text at that point.
  • Fast delete: given two points in a block of text, you need to be able to quickly remove the text between those points.
  • Reasonably fast traversal: Lots of algorithms, ranging from printing out the text to searching it, are based on linear traversals of the contents. This doesn’t have to be incredibly fast – it’s an intrinsically linear process, and it’s usually done in the context of something with a non-trivial cost (I/O, regular-expression scanning). But you can’t afford to make it expensive.
  • Size: you need to be able to store effectively unlimited amounts of text, without significant performance degradation in the operations described above.
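To give a flavor of how a rope meets those requirements, here’s a minimal Haskell sketch of the standard formulation – a binary tree whose leaves hold short strings, and whose internal nodes cache the length (the “weight”) of their left subtree. (A real rope also needs a balancing scheme, which I’m ignoring here.)

    data Rope
      = RLeaf String
      | RNode Int Rope Rope   -- Int = total length of the left subtree

    ropeLength :: Rope -> Int
    ropeLength (RLeaf s)     = length s
    ropeLength (RNode w _ r) = w + ropeLength r

    -- Concatenation just builds one new node: no text is copied.
    rcat :: Rope -> Rope -> Rope
    rcat l r = RNode (ropeLength l) l r

    -- Finding a position descends by weights instead of scanning text;
    -- insert and delete are built from this plus splitting and rcat.
    rindex :: Rope -> Int -> Char
    rindex (RLeaf s)     i = s !! i
    rindex (RNode w l r) i
      | i < w     = rindex l i
      | otherwise = rindex r (i - w)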

Continue reading

B-Trees – Balanced Search Trees for Slow Storage

Another cool, but frequently overlooked, data structure in the tree family is called the B-tree. A B-tree is a search tree, very similar to a BST in concept, but optimized differently.

BSTs provide logarithmic time operations, where the performance is fundamentally bounded by the number of comparisons. B-trees also provide logarithmic performance with a logarithmic number of comparisons – but the constant factor is worse. The difference is that B-trees are designed around a different tradeoff: the B-tree is designed to minimize the number of tree nodes that need to be examined, even if that comes at the cost of doing significantly more comparisons.

Why would you design it that way? It’s a different performance tradeoff.
The B-tree is a data structure designed not for use in memory, but instead for use as a structure in hard storage, like a disk drive. B-trees are the basic structure underlying most filesystems and databases: they provide an efficient way of building rapidly searchable stored structures. But
retrieving nodes from disk is very expensive. In comparison to
retrieving from disk, doing comparisons is very nearly free. So the design
goal for performance in a B-tree tries to minimize disk access; and when disk access is necessary, it tries to localize it as much as possible – to minimize the number of retrievals, and even more importantly, to minimize the number of nodes on disk that need to be updated when something is inserted.
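
As a rough sketch of the shape (in Haskell, ignoring the on-disk layout and the node-size tuning that make B-trees worthwhile in practice): each node packs many sorted keys, so a search does several comparisons per node but touches only a handful of nodes.

    data BTree k
      = BLeaf [k]                -- sorted keys
      | BNode [k] [BTree k]      -- n sorted keys, n+1 children

    search :: Ord k => k -> BTree k -> Bool
    search x (BLeaf ks) = x `elem` ks
    search x (BNode ks cs)
      | i < length ks && ks !! i == x = True    -- found in this node
      | otherwise                     = search x (cs !! i)
      where
        -- Several comparisons here – but only one node "fetched".
        i = length (takeWhile (< x) ks)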

Continue reading

The Basic Balanced Search Tree: Red-Black trees.

During my Haskell tutorial, I used balanced binary search trees as an example. I’ve had people write to me asking me to write about that in a non-functional programming post, because the Haskell one got too tied up in how to do things without assignments, and with tail recursion.

Binary search trees (BST) are another one of the really fundamental simple data structures. They’re conceptually similar to heaps – they’re also a kind of ordered binary tree with O(lg n) performance – but they’ve got a different set of basic invariants to represent the different performance goals.

The goal of a structure like a BST is to have a variable-size structure where
you can cheaply and easily increase or decrease the size of the structure, and where insert, delete, and searching for values are all inexpensive. BSTs are a good choice for implementing things like dictionaries, where you have a key associated with a value, or where you want to be able to search for an arbitrary member quickly.

BSTs are very widely used – for example, the default ordered collections, sets, and maps in the C++ standard libraries are all implemented as a kind of BST (in fact, the very kind that I’m going to describe below.)

If you have a meaningful ordering operation, then a BST is a good choice: expanding and contracting the structure happen automatically every time you do an insert or remove; and insert, delete, and search are all bounded by lg(n), where n is the number of values in the tree.
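
For reference, here’s the plain, unbalanced version in Haskell – just the ordering invariant, none of the balancing. Without the red-black machinery, inserting already-sorted values degrades these operations from lg(n) to O(n), which is exactly the problem the red-black rules solve.

    -- Smaller values to the left, larger to the right.
    data BST a = Tip | Node (BST a) a (BST a)

    insert :: Ord a => a -> BST a -> BST a
    insert x Tip = Node Tip x Tip
    insert x t@(Node l v r)
      | x < v     = Node (insert x l) v r
      | x > v     = Node l v (insert x r)
      | otherwise = t                    -- already present

    member :: Ord a => a -> BST a -> Bool
    member _ Tip = False
    member x (Node l v r)
      | x < v     = member x l
      | x > v     = member x r
      | otherwise = True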

Continue reading

Implementing Compact Binary Heaps

Last post, I described the basic idea of the binary heap data
structure. But I didn’t talk about how to implement it – and there’s
a very cool way of implementing it – optimally space efficient,
very fast, and with respectable cache locality for moderate-sized datasets.
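
The core of the trick (presumably what the full post walks through) is the standard array layout: store the heap’s nodes in a flat array, breadth-first, so the tree structure is implicit in the indices and no pointers are stored at all.

    -- For a node stored at 0-based index i in the array:
    parent, leftChild, rightChild :: Int -> Int
    parent i     = (i - 1) `div` 2
    leftChild i  = 2 * i + 1
    rightChild i = 2 * i + 2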

Continue reading