The Perils of Premature Optimization

A reader who’s known me for a while just sent me a note asking me to say something about premature optimization. The reason that I mention that he’s known me for a while (in fact, someone who used to be a coworker) is because this is a topic that I’m prone to rant about in person, but I’ve never mentioned on the blog. So he knows that it’s something that I’ve got a strong opinion about. Basically, he’s having trouble dealing with an annoying coworker doing this, and he wants a rant-by-proxy. I’m OK with that. :-).

When you’re writing a new piece of software, particularly on a modern computer, one of the unintuitive things that frequently happens is that the performance of your system is really quite different from what you’d expect. And even when everything is as expected, most people don’t have a particularly good sense of tradeoffs – if I make this thing faster, what effect will it have on that? And if it does really improve the performance, how much will it improve it, and at what cost? If you can’t answer that – and answer it precisely, with supporting evidence – then you have no business optimizing.

So when you sit down to write some new code, what you should really do is write code that’s algorithmically efficient – that is, you pick an algorithm with good asymptotic performance – and implement it in a straightforward way.

But what many people do – in fact, what pretty much all of us do at least some of the time – is try to be clever. We try to find opportunities to change the code to make it faster. Doing that before you understand where the computer actually spends its time when it’s running the program is what we call premature optimization.

Why not?

There are a bunch of problems with premature optimization. The two big ones are that it can make the program slower rather than faster; and it can make the program more complex.

How can it make things slower?

  1. Compilers are really smart, and they can perform transformations of a program that a human wouldn’t think of. The catch is that in order to perform an optimizing transformation, the compiler needs to be able to prove that the transformation won’t change the results of the program. If you’re very clever about writing something in a cool, subtle way, you’re making it harder for the compiler to figure out if it can do an optimization. For example, a lot of common tricks for supposedly optimizing code in C or C++ do clever things with pointers. By doing that, they save what looks like a bit of time. But by mucking with pointers, they make the program completely inscrutable to the compiler – and higher-level optimizations that would have had a great impact on the actual performance of the system can’t be done, because the compiler can’t prove that the transformation would be correct.
  2. Hardware issues are often subtle. Modern CPUs are astonishingly complex, and performance tuning can be thoroughly counterintuitive. For a simple example, a hand-optimization of code might add a couple of instructions in order to swap an expensive operation for a couple of simple ones. The simple ones might be faster in theory – but by making the instruction sequence longer, it no longer fits perfectly into the processor’s instruction cache. And that can have a dramatic impact on performance. (The cache is a special region of very fast memory in the processor. When you access memory, the CPU copies a chunk of memory called a block into the cache so that it can access it quickly. If you access something that isn’t in the cache, the CPU will discard something in the cache to make room. The cache operates in terms of blocks; if you can make either the program code or a particular data structure fit into a single cache block, the performance of that block/data structure will be really good. If you make it a tiny bit larger so that it no longer fits in a block, you get a dramatic change in performance.)
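
If you want to see just how big the cache effect can be, here’s a tiny experiment you can compile and run yourself. This is just an illustration I’m sketching here, not a benchmark of any particular machine: both loops do exactly the same number of additions, and the only difference is the order in which they touch memory. The row-by-row walk streams through the array sequentially, so almost every access hits the cache; the column-by-column walk jumps N integers between accesses, so it misses constantly. On most machines the difference is dramatic.

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int N = 4096;
    // One big square grid, stored row by row in a single contiguous block.
    std::vector<int> grid(static_cast<size_t>(N) * N, 1);

    auto time_sum = [&](bool row_major) {
        long long sum = 0;
        auto start = std::chrono::steady_clock::now();
        for (int a = 0; a < N; a++) {
            for (int b = 0; b < N; b++) {
                // Same arithmetic either way; only the access pattern differs.
                sum += row_major ? grid[static_cast<size_t>(a) * N + b]
                                 : grid[static_cast<size_t>(b) * N + a];
            }
        }
        auto stop = std::chrono::steady_clock::now();
        std::cout << (row_major ? "row-major:    " : "column-major: ")
                  << std::chrono::duration<double>(stop - start).count()
                  << " seconds (sum = " << sum << ")\n";
    };

    time_sum(true);   // sequential, cache-friendly
    time_sum(false);  // strided, cache-hostile
    return 0;
}

The exact numbers don’t matter; the point is that you can’t predict them by counting instructions. You have to measure.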

Complexity is a more obvious issue. Changing your code to make it faster is often complicated. It makes the code harder to understand, harder to debug, and harder to maintain. It’s frequently better to let the code be a little bit slower, in order to make it easier to be sure that it’s correct, and easier to understand when someone comes back to look at it six months, or a year, or a decade after it was written.

There’s nothing wrong with hand-optimizing your code. It’s a good thing, and every good engineer spends time making sure that the crucial code of a system is as fast as it can be, without having an adverse impact on maintainability. But the catch – and it’s a biggie – is that you need to be sure that you’re optimizing the right thing, and that the impact of that optimization is actually a net positive.

Premature optimization is optimizing your code before you really know its performance characteristics. When you try to optimize too soon, you don’t know what parts of the code are actually too slow. Just because you think that something is going to be a performance bottleneck doesn’t mean that it really is. For reasons described above (and a variety of others), what you think is crucial to performance might not be. And until you know where the bottlenecks are, there’s a damn good chance that you’re wasting your time, making the code harder to understand, harder to maintain, and possibly even slower, by tweaking the wrong things.

The bit of this that I personally find most frustrating is that there’s a huge amount of “common knowledge” about performance that people use to make decisions about how to write their code – and that common knowledge is just plain wrong.

For example, think of the common loop in C++:


for (int i = 0; i < 1000; i++) {
    // do something
}

A huge number of C++ programmers will swear to you that using ++i is faster than using i++.
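
Before arguing about speed, it’s worth pinning down what the two expressions actually do. Here’s a two-line example (nothing controversial, this is just what the language defines):

int i = 5;
int a = i++;   // a == 5, i == 6: the expression yields the old value of i
int b = ++i;   // b == 7, i == 7: the expression yields the new value of i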

++i means “add one to i, and then return the new value of i”. i++ means “add one to i, and then return the old value of i”. So in the i++ case, you need to save the old value; in ++i, you don’t. So ++i must be faster, right?

Not necessarily. It depends on the machine you’re going to run the code on. For example, let’s see what the code would look like on some imaginary stack machine.

// i++
1: PUSH i    // push the address of i onto the stack.
2: FETCH     // fetch the value stored in the address on top of the stack.
3: DUP       // duplicate the value on top of the stack.
4: PUSH 1    // push the value 1 onto the stack.
5: ADD       // add the top two values of the stack.
6: PUSH i    // push the address of i onto the stack.
7: STORE     // store the value in the second position of the stack into
             // the address on top of the stack.

// ++i
1: PUSH i
2: FETCH
3: PUSH 1
4: ADD
5: DUP
6: PUSH i
7: STORE

They’re exactly the same length. In fact, they’re exactly the same instructions – there’s just a tiny change to the order of the instructions. On the other hand, if we were using a register machine, it would look like:

// i++; "result" stored in R0.
1: LOAD R0, @I    // load R0 with the contents of address I.
2: LOAD R1, R0    // load R1 with the contents of R0.
3: ADD R1, 1      // add 1 to R1.
4: STORE R1, @I   // store R1 to address I.

// ++i
1: LOAD R0, @I
2: ADD R0, 1
3: STORE R0, @I

In the register machine, ++i is one instruction shorter than i++. So on a register machine (which is, really, a more accurate model of a real computer than a stack machine), ++i really is faster.

Except that it’s usually not.

Because if the value of R0 (the saved copy of i before incrementing) is never referenced, then any compiler that isn’t a total piece of garbage will merge the R0 and R1 registers, dropping the extra save of the pre-increment value. So they’ll end up being exactly the same. In reality, the only place where the performance of ++i is really different from i++ is when it’s embedded in a complex context – and in a case like that, using the “++” operation is almost always the wrong thing to do, because any performance benefit is outweighed by the loss of clarity caused by a side-effecting arithmetic operator embedded in an expression.
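
To show what I mean by “embedded in a complex context”, here’s the classic example – not from any particular codebase, just the idiom everybody has seen. Both functions copy a C string (assuming dst points at a buffer big enough to hold it); the terse one buries two side-effecting increments inside the loop condition:

// The terse version: everything happens inside the loop condition.
void copy_terse(char* dst, const char* src) {
    while ((*dst++ = *src++) != '\0') {
        // nothing to do here; the condition does all the work
    }
}

// The spelled-out version: same behavior, much easier to read and debug.
void copy_clear(char* dst, const char* src) {
    while (*src != '\0') {
        *dst = *src;
        ++dst;
        ++src;
    }
    *dst = '\0';
}

With optimization turned on, there’s no reason to expect the terse version to be any faster – but there’s every reason to expect the next person who reads it to take three times as long to convince themselves it’s right.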

There are a lot of places like that, where “obvious” savings don’t save anything. Another really common example is inlining. Normally, when you’re programming, if you’ve got some piece of code that’s going to be used in several different places, you turn it into a simple function, and call it from each of those places. Inlining is taking the calls to a function, and replacing them with copies of the code that would be called.

A trivial example:

#include <iostream>
using std::cout;

// Not inlined:
void mult2(int& v) {
    v = 2 * v;
}

int main() {
    for (int i = 0; i < 100; i++) {
        int x = i;
        mult2(x);
        cout << x;
    }
    return 0;
}

// Inlined: the body of mult2 is copied into the loop in place of the call.
for (int i = 0; i < 100; i++) {
    int x = i;
    x = 2 * x;
    cout << x;
}

In theory, inlining saves you the cost of actually doing a function call. Normally, to call a function, you need to push the parameters to the call onto the stack, store the return address onto the stack, and then do a jump to the start of the function. That's not a huge cost, but it's not trivial either. If you get rid of the call, you can get rid of the jump into the function and the jump back; and you can get rid of the pushes and pops of the parameters. So if you've got code in, say, a loop, where it's going to get called a bunch of times, it's faster if the code is inlined.

Inlining makes code harder to read. If you call a function from multiple places, then you instantly know that the program is doing the same thing inside of all of those calls. But if it's inlined, you need to look at the duplicate code in each of those places, and figure out that they're the same. Inlining makes code harder to maintain. If you find a bug in some code that's called as a function, then if you fix the function, you've fixed every place that calls the code. But if you inlined it, then you need to fix every copy separately - and it's all too easy to miss a copy.

The particularly frustrating thing about inlining is that it's exactly the kind of optimization that compilers are really good at. It's really easy to show that a function is inlinable. And it's easy to transform the code to do an automated inlining. And it's even easy to tell if a function should be inlined: if you call the function many times from a loop, or if the function is only ever called from one place, then you inline it. And yet, people will dedicate huge amounts of time to figuring out if they should inline a function by hand.
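
Letting the compiler do it doesn’t even require changing the code much. Here’s the mult2 example again, as a sketch: the function stays a function, and you build with optimization turned on (say -O2). The inline keyword is at most a hint – these days it’s mostly about linkage – and a modern gcc or clang will typically inline a call this small whether you ask or not. If you actually care, check the generated code rather than taking my word for it.

#include <iostream>
using std::cout;

// Keep the abstraction; let the compiler decide whether to paste the body in.
inline void mult2(int& v) {
    v = 2 * v;
}

int main() {
    for (int i = 0; i < 100; i++) {
        int x = i;
        mult2(x);    // with optimization on, this call typically disappears
        cout << x;
    }
    return 0;
}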

This kind of thing isn't just frustrating. It's counterproductive. You end up wasting time doing things that the computer could do for you, and that it's better at than you are. That time isn't free - it either delays your project, or it gets done in the place of doing something more valuable - like writing tests to make sure that your code is correct. Or like optimizing the code that really needs to be optimized.

The reason that we call it premature optimization is simply because it's done too soon. You hand-optimize code when you know that it's a problem. You don't optimize because you think that this probably will be a problem. You wait until you're sure that it's a problem. If you can't prove that there's going to be a performance problem, then you should be focused only on algorithmic complexity, correctness, and clarity.

46 thoughts on “The Perils of Premature Optimization”

  1. Andrew Cooper

    Mark,

    An awesome post. Thanks. I’ve read many (much shorter) rants against premature optimisation on StackOverflow, but this post clarifies the issues for me really well.

  2. Bob

    Any savings in clock cycles is offset by the amount of effort spent fixing and rerunning the code.

    After almost two decades away from it, I’m back working with Fortran. The choice of language is an optimization of sorts; most people would recoil from using Fortran as anything but a bogeyman to scare young programmers but the sad fact is that it’s still fast and effective for numerical programming. F77 lent itself to horrid optimization, faux-dynamic memory allocation via COMMON blocks, miserable pointerish hacks with EQUIVALENCE, and let us never forget the bizarre mangling of flow-of-control with GOTO and ENTRY statements. But those antics were functions of low-memory CPU-bound machines in the days when human time was cheaper than machine time.

    Contrast to today where the latest Fortran language spec has added constructs specifically to hint to the compiler what operations can be transparently parallelized or vectorized. There’s far less incentive to write ‘clever’ (brittle unmaintainable) code now. But even if there’s a motivation to squeeze more performance out of a piece of code, it’s still irresponsible to start optimizing without profiling metrics. Judicious choice of algorithms at the outset will give you the biggest win but most people don’t have the luxury of starting a project from scratch. It’s difficult to resist the urge to clean up & optimize all the little things you believe are eating performance (guilty as charged); data-driven optimization means your time is spent more effectively and you’re less likely to add bugs to working code that doesn’t really need optimizing.

    I found a bigger payoff in programming for clarity than for speed. Once the code is easier to understand, it’s easier to change without breaking, allowing for more ambitious refactorings. And even without a performance increase, the code is more maintainable and easier to review which is important in some environments.

    Great post; thanks for carrying the ranty torch for the rest of us. 🙂

  3. Gerald

    Funny, I recently read another article about the for(;;++i) “optimization”. (Can’t find it, sorry)

    Profiling indeed showed that both i++ and ++i resulted in the same code and same speed… for the release version! But the non-optimized debug version was much faster with ++i.

    So the conclusion was that you might as well type ++i because as a programmer you will spend a lot of time running the debug version.

    1. Mark C. Chu-Carroll

      In that case, the problem isn’t that one of “i++” and “++i” is slower – it’s that you’ve got your compiler set up wrong.

      If you’re going to meaningfully test while developing, you should have your compiler set up so that your test runs are as close as possible to your production runs. Doing your development and testing with all optimization off and then deploying with full-blown optimization is sloppy at best.

      And I’m not just being an anti-optimization snot. The difference between no optimization with debugging symbols and full optimization is not trivial. It can change the behavior of the program in significant ways. In particular if your program involves threading (and most programs for modern systems do), the difference between optimized and non-optimized programs can alter the execution of the program in ways that change things like synchronization behaviors – which means that running in opt mode may cause bugs that you’d never observe in the non-optimized build.

      1. Richard

        All very well – but all you’re really saying is that debugging the optimised system is hard. It’s still best to establish the correctness of the program with a version that will allow you to set breakpoints in the source and read variable values anywhere. You may need to test the release version – but I hope you’re not saying that you should go for the optimised version before your original is working correctly.

  4. Albert Bufo

    Using “++i” instead of “i++” is one kind of micro-optimization that really does not hurt. It is true that the postfix and prefix operators are equivalent for intrinsic types, but C++ allows you to use them on non-intrinsic types as well (e.g.: iterators). In such cases, using “++i” instead of “i++” saves a temporary variable copy. This advantage, anyway, is really appreciable iff the application is extremely CPU-bound, which is almost never the case. It is worth noting that sometimes, some compilers can avoid the temporary variable copy implied by using the postfix “++” operator.

    I’d like to add another note about function inlining: inlining a function is a good optimization if shaving off the function call procedure (the stack setup time) really improves the overall running time. But inlining a function means bloating the resulting assembly code, which could force the CPU to evict other code from the cache, basically killing the huge advantage of cache locality. Balancing the cost of a lot of function calls against the cost of killing the instruction cache is a difficult tradeoff, which is another reason to let the compiler deal with it.

  5. Jason Dick

    One thing I don’t understand:
    Do people not understand the “inline” keyword in C++? Or the #define functionality in C?

  6. David Leinhäuser

    The reason people should IMHO get into the habit of writing ++i instead of i++ is that in C++ you are quite often dealing with iterators instead of integers in a loop.

    Iterators can be complex things and i++ usually involves making a copy of the iterator (which admittedly the compiler can often optimize away as well) which can be costly and might even throw an exception where ++i would not.

    This is especially important when you are writing generic code, where you don’t quite know what you will be dealing with.
    Since ++i is at least as fast as i++ (unless people get creative with their overloaded operators) and arguably just as readable, it is just a good habit to pick up.

    There is really no downside to it that I can think of.

  7. Roger Witte

    I always prefer using pre-increment to post-increment when coding in c++. My reasoning is that if at some future date the program is changed so that the variable has class type, with overloaded increment operators, then the difference in cost may become noticeable. However the cost of using preincrement in the first place is zero – it is not harder to read than postincrement and it is never slower than postincrement.

    Similarly if I think it might be useful for a function to be inlined, I don’t physically inline the code, but I do take care to make it easier for the compiler to optimise if it wants – I will avoid making it a virtual method unless I think it may need to be overridden.

    That said, I agree with Mark about premature optimisation, even if I avoid coding the examples.

    1. Mark C. Chu-Carroll

      I don’t want to get too bogged down in the “i++” versus “++i” stuff. It’s really not an important thing – in typical use, there’s no significant difference in readability, comprehensibility, or debuggability, so the only real distinction between the two is personal preference. Some people like to see the affected variable first; some don’t.

      But the whole discussion defending it is showing exactly the kind of reasoning that I was ranting about.

      Suppose that you’re programming in C++, and “i” is an iterator.

      If you’ve implemented “i++” and “++i” so that they’re semantically very different, then you’ve created a bug. A ton of library code is built on the assumption that pre- and post-increment are equivalent except for the return value.

      If you’re incrementing an iterator with simple behavior, then you should be writing efficient, careful iterators – which means that you should be implementing them in an inlineable way. If you’re doing that, then once the code is inlined, the difference between pre- and post-increment is essentially equivalent to the difference for integer increment and decrement – you’ll have an inlined copy statement to save the old value of the iterator, and the target of that copy will never be referenced, so it can be stripped as dead code.
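
      For concreteness, the canonical pattern looks something like this – just a sketch of a trivial iterator, not code from any particular library:

      struct IntIter {
          int* p;

          // Pre-increment: advance in place and return a reference. No copy.
          IntIter& operator++() {
              ++p;
              return *this;
          }

          // Post-increment: copy the old state, advance, return the copy.
          // If the call is inlined and the returned copy is never used, the
          // compiler can strip the copy out as dead code.
          IntIter operator++(int) {
              IntIter old = *this;
              ++p;
              return old;
          }
      };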

      If you’re incrementing an iterator with behavior complex enough that it can’t be inlined, then the difference in time between saving a copy of the old value and not saving it is dwarfed by the complexity of the operation as a whole.

      In almost every case, the difference between the two isn’t important.

      To be clear, I’m not arguing that everyone should always use “i++”. In fact, I actually agree that it’s better to assume that the implementor of iterators wasn’t being as careful as they should, and since there’s no cost to using “++i”, it is preferable. But the amount of attention that people pay to an issue this trivial is astonishing. If the difference between pre- and post- increments in a loop is really the most important performance issue in your code, then odds are your code is good enough that you don’t need to worry about it. Most of the time, your optimization efforts are better spent focusing on real issues.

      In fact, in my experience, people focus on trivial things like pre- versus post-increment to the exclusion of important stuff. For example, when I was at IBM, I was working on an open-source SCM system called Stellation. One of my coworkers implemented some code that was looping over a collection of versions looking for a change that caused a bug. For each version, the system would run a complete test suite. The idea was to start at a version where the tests pass, and find the earliest version where the test failed. He carefully optimized the heck out of the loops – getting rid of every unessential copy, delaying every computation that wasn’t strictly necessary until it was actually needed. But he implemented it as a sequential search.

      The problem with that:

      The amount of time spent on running the test suite is so huge that all of the time he spent on optimization was absolutely wasted. It was statistical noise. He could save, in the best possible case, about a millisecond of computation per loop iteration; but running the test suite took a minimum of 4 seconds (that’s the time it took Java to launch a shell, run something in it, read back the buffered IO stream, and check the result; it doesn’t include any time to actually run the test!). Saving a millisecond of computation from a process that takes four seconds is a waste of effort.

      You can do that kind of search using binary search. You can’t do as many clever loop optimizations in a binary search – but who cares? You’re reducing the number of times you need to run the tests! Even assuming a small number of changes between the known pass and the known failure – like, say, 10 changes – the binary search needs to run four tests to pin down the failure, versus an average of five in the linear search case.
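
      If it helps, the shape of the search is trivial – something like this sketch (written in C++ to match the rest of the post, even though Stellation was Java, and with made-up names; runTests stands in for the expensive launch-a-shell-and-run-the-suite step):

      // Find the first failing version between a known-good and a known-bad
      // version, calling runTests as few times as possible.
      template <typename RunTests>
      int firstFailingVersion(int goodVersion, int badVersion, RunTests runTests) {
          // Precondition: the tests pass at goodVersion and fail at badVersion.
          while (badVersion - goodVersion > 1) {
              int mid = goodVersion + (badVersion - goodVersion) / 2;
              if (runTests(mid)) {
                  goodVersion = mid;    // the break happened after mid
              } else {
                  badVersion = mid;     // the break happened at or before mid
              }
          }
          return badVersion;
      }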

      1. Dave Tweed

        It’s an interesting example of how people may hear what their mental model says people are saying rather than what they are actually saying. I’d be very surprised if any C++ programmer would say that preincrement on integers is faster than postincrement, but what most of them are saying is that preincrement may be faster on some unknown type, particularly if you bear in mind that in templated code there may not even be one single type for a given variable in all the function calls, only the operations the type supports. It’s also slightly misleading to say that class preincrement and postincrement operator implementations shouldn’t have different performance. The fact is that the specifications of those operators have different semantics, it’s just that on modern compilers and processors that difference has eroded. (I’m not old enough to know how and why they got added to C: Were they “just” inherited from BCPL? In any case, I’d be surprised if, at whatever time they were introduced into some language, the difference between the two was made for some significant reason, whether efficiency or some correctness issue with volatile registers or some such stuff… Now with C++ the difference has again become significant in some cases.) It’s certainly not worth going through a code base doing these optimisations, but they’re perfectly reasonable “default habits” to use when you’re writing code.

        The reason for going on about this is that, whilst I entirely agree about the dangers of performing optimizations prematurely, people have a tendency to condemn some caricature of what they think people do rather than an actual incident they’ve seen.

  8. Dave W.

    One caution about profiling: profiling will catch problems where you have a hot spot in a single routine or loop. It won’t, however, catch performance problems that cause performance to degrade more or less evenly across the board.

    One example, from a performance enthusiast I once worked for several years ago: he made sure that as many of his data structures as possible in C were a power of 2 bytes long, so that indexing into an array of such structures could be done with shifts instead of multiplies. He had done microbenchmarking on the target architectures of the day, and found that it made a difference of something like 20% on the speed of his array accesses.

    (Note: I would not assume that this is still true on today’s architectures without redoing the microbenchmarking. Also note that it is a pain to preserve this property when maintaining the code, especially if some data types have different lengths on different architectures you support. And finally, it really only makes a difference if you primarily access these arrays using indices, or pointer plus offset, rather than direct pointer references. This is an example, not programming advice.)

    This is an example of a performance difference that won’t (generally) show up in profiling – the performance hit is spread out over every array access throughout the code, rather than being localized to a single routine. Other examples include things like reorganizing data structures to reduce cache contention, or reduce paging behavior – the benefits tend to be diffuse rather than local.

    That said, there are certainly a lot of times where premature optimization is counterproductive.

    1. Simon Farnsworth

      That’s more an argument for better profiling tools; something really heavy-duty like Valgrind should be able to spot that you’re spending much of your time accessing a data structure through an expensive access method, and that aligning it to a power of two boundary would let you use a cheaper access method.

      Similarly, the tool can spot that you’re choosing a cache-unfriendly access pattern. Can’t necessarily fix it, but can tell you you’re doing it.

  9. Deen

    Completely agree.

    Minor nitpick about the ++i and i++ in C++ loops though: in many cases, “i” may not be an integer, but a class, like an iterator into a list or a map. In that case, “++i” might indeed be faster, but the compiler can no longer assume that “++i” and “i++” operators do the exact same thing in this context. For that reason, “++i” should be used when “i” is an iterator type. And for the sake of consistency, I would recommend always writing loops with “++i”, even if it’s just an integer.

    1. Joseph

      This! I was going to post the exact same thing!

      ++i and i++ *don’t* do the same thing; that’s why ++i is faster than i++; ++i is increment-and-return-value; i++ is return-value-and-increment. The latter requires you to return a whole fresh copy of the i before incrementing and the former lets you get away with just sending a reference to the current i.

      Perhaps a compiler will be (or already is?) smart enough to realize that the value is never used and choose ++i over i++ for the STL, but in general, for some bizarre use case, i++ may be a wholly different beast from ++i and using ++i when i++ was written could be incorrect.

      1. Joseph

        I just read a response above; I’d not thought about inlining. Yes, that should remove the save-a-copy-and-return as dead code in this situation.

        OTOH, ++i versus i++ is a simple habit to get into and seems perfectly healthy. If it doesn’t otherwise matter and might in some corner cases be better to do ++i, may as well just train your fingers to do ++i.

        I swear I read this in one of the Effective C++ or C++ Style books, or perhaps even Stroustrup, though.

        1. MarkCC Post author

          First – like I said, i++ versus ++i isn’t an important case. There’s absolutely no advantage to “i++” versus “++i”. But it’s an easy to understand example about the kinds of unimportant things that people focus on. Which one of those you use almost never actually matters; and in the rare case where it does, it’s often a sign of something else being wrong with your program.

          Second – just because something is in “Effective C++” doesn’t mean it’s right. Book authors are just as prone to this sort of error as anyone else.

          Third – the fact that Stroustrup says anything doesn’t carry any weight with me. I’ve met him several times, and he didn’t exactly make a favorable impression on me. Every time I’ve seen him give a talk, he’s said something utterly boneheadedly wrong. He may have learned something in the decade since I last saw him speak, but as of then, he really didn’t have much of a grasp on what a compiler can or can’t do. He made numerous assertions about compilers being unable to do things that they obviously can (common subexpression elimination), and about compilers being able to do things that they can’t (restructuring optimizations without affecting stack unwind for exceptions).

          1. John Fringe

            I was going to say that the problem is that people tend to think they are better at doing things than other people who have worked harder on them. It’s a form of illusory superiority bias.

            This explains the problem with premature optimization. People usually spend very little time optimizing, and they are not experts at all. But they believe they’re better at it than the compiler, which is the result of several generations of experts working full-time on it.

            It also explains the problem with not using external libraries, reinventing the wheel, and a lot of engineering problems.

            But then you made this post. I think you’ve fallen into the same trap.

            For example, about the “common subexpression elimination”. I don’t know the context, but I believe Stroustrup may be right. If you have

            i = (2+3) * (2+3)

            you can eliminate the second evaluation of (2+3), yes. But you’ll agree that Stroustrup knows that, won’t you? It’s too naive.

            But if I think a bit about it, I believe he has a point. In general, a C++ compiler can not eliminate a lot of common subexpressions. If you have non-trivial types, say

            i = (a+b) * (a+b)

            can you eliminate the common subexpression? In general, no. Only if the operations are immutable and the compiler can prove it.

            You’ll say operator+ should be immutable. Well, maybe it should be, but to eliminate the common subexpression the compiler has to prove it.

            Maybe a and b are geometric types, and operator+ represents the Minkowski sum, which is a costly operation, and the programmer writes a log entry to keep track of resource usage. Then operator+ is not immutable any more, and the compiler can not eliminate the subexpression.

            Say “a” is a group of people attending a meeting, operator+ adds a person, and the programmer wants to send an email for each person included. Well, not a very good example (for the structure of the example), but you get the point. If there are side effects, the compiler can not do common subexpression elimination.

            How frequently can the compiler prove an operation is immutable? If the operations are compiled into a lib, almost never. If I’m right, a lib file has no information about this. (If someone is thinking about const, think about const_cast).

            If you add pointers and memory aliasing, even an apparently immutable operation can have side effects.

            So there are a lot of missed opportunities for eliminating common subexpressions, where the compiler can not do anything (because it could change the behavior) but the programmer can, knowing the intention of the program.

            So, I’m against premature optimization too. And I’m with you, Mark, on the essentials. But I believe Stroustrup was right about this one.

          2. MarkCC Post author

            It’s tempting to think that famous people know their stuff. But with Stroustrup, in this case, that just doesn’t work.

            This particular matter was in a discussion of optimization of floating-point arithmetic. Not complex datatypes, not overloaded arithmetic operators. Good old-fashioned floating point arithmetic using numbers in the native representation used by the compiler.

            Remember: Stroustrup isn’t a PL or compiler guy by background. He specialized in simulation, and got into the language design business completely by accident. Most of the strange things he says about what compilers can and can’t do are related to his understanding of his implementation of the cfront C++ compiler. If he didn’t see how cfront’s implementation could compute or maintain the information needed for an optimization, he assumed that a compiler couldn’t do it.

          3. John Fringe

            Just to be clear, I believe the problem is Stroustrup was talking about real classical compilers (C++ compilers, I guess), which have limited knowledge, while Mark is thinking about omniscient compilers (for example, a JIT compiler). It seems to me that invalidates his criticism.

          4. John Fringe

            > Good old-fashioned floating point arithmetic using numbers

            Well, certainly floating point is not very amenable to optimizations.

            I know almost nothing about compilers, and I’m not an expert in numerical programming, but I have developed enough numerical code to know how difficult it is to write, and how much you have to hand-tune the code not to spoil the condition of a system and ruin the solution.

            I don’t like premature optimization, but it is necessary for numerical code. Not for speed, but for accuracy and stability. I would love the compilers to be smarter in this sense.

            Of course, this has nothing to do with the discussion about common subexpression elimination 🙂

      2. Richard

        Let me set the cat amongst the pigeons…

        I beg to differ on this point – and my reasons are exactly the same as the motivation of the original post.

        When they were invented, ++i and i++ mapped directly onto the pre- and post-increment/decrement addressing modes of contemporary machines (e.g. the PDP-11).

        Typically i++ might exist as a mode but ++i might not (e.g. the original 68000, where --i and i++ existed but not i-- or ++i). In that case clearly you know which one to use.

        Obviously here I’m talking about constructs like a[i++] – where there is a real benefit from using the addressing mode in the assembly language code that comes out. In that case also there is a real difference between a[++i] and a[i++] because you get a different element of a!

        In most general terms (and on modern processors even with non-primitive constructs) i++ is always preferable to ++i in the case where you are using the returned value – because the machine can optimise its simultaneous multi-instruction execution so that the ++ and the usage of i happen at the same time – not the case with ++i.

        If you are not using the returned value then the compiler should optimise it away anyway!

  10. NotCompletelyUseless

    I have to say I’m a fan of optimization, but not of the kind you’re talking about here. I don’t like to compute something twice if I can do it once and then cache the result, especially if I have to, say, read multiple files or make 1000 DB calls to get the base information needed for computing it.

    The time saved by compiler optimizations normally vanishes when compared with the time saved by caching results. And at that point, either you’ve uncovered a performance bottleneck in some unrelated part of the code or there’s some other more important bug or feature to be worked on. At least in my work, “good enough” generally deals in seconds – the millisecond shaving will have to wait for another lifetime.

    1. MarkCC Post author

      I disagree about the value of compiler optimization. In real applications, the things that the compiler does for you can have a huge impact.

      For example, part of what I’m currently working on is an interpreter for a little domain specific language. To test it before releases, I gather up all of the production code that uses the interpreter, and run it twice – once with the old release, and once with the new, to verify that my changes won’t break anything in production.

      If I build my interpreter without optimization, and also without any debugging or profiling information, it takes about 45 minutes to get a full run over all of the inputs. With optimization on, that goes down to under 10 minutes.

      There’s not an algorithmic bottleneck – the code is doing everything using sensible, performant code. And it’s not something that could be improved by caching – it’s not re-computing things that could have been cached.

      So where’s the dramatic performance difference coming from?

      (1) Lots of inlining.
      (2) Instruction-level parallelism enabled by instruction re-ordering.
      (3) Heap optimization (essentially converting heap-allocation to stack-allocation)
      (4) Locality optimization (re-arranging and clustering related data so that it resides in the same cache line).
      (5) Register allocation.

  11. Michael

    In my experience, the biggest problem arising from premature optimization is then having to apologize to the compiler for an incomplete experience.

  12. keithb

    While you alluded to it, don’t forget that premature optimizations can make it harder to debug. As Kernighan and Plauger put it:
    “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you are as clever as you can be when you write it, how will you ever debug it?”

  13. keithb

    After I wrote the quote above, I came across another in the same book:

    “Keep code clean and straightforward – don’t try to make it fast while coding. Premature optimization is the root of all evil”

    Looks like K&P agree with MarkCC. 8^)

    1. MarkCC Post author

      Let’s reverse that, and say MarkCC agrees with K&P. They’re so much smarter, more experienced, and just generally cooler than I am that it’s almost criminal to give me precedence :-).

  14. Matt Skalecki

    All this obsession over ++i vs. i++ reminds me of another similar issue: prioritizing source code brevity over source code clarity.

    Perhaps I’m biased by my love for pure, side-effect-free code, but I can’t think of one example where embedding ++i or i++ as an expression is a good choice, which means I can’t think of any reasonable example where it’s semantically important to distinguish between the two. Eliminate one, disallow the other as an expression, and get rid of the temptation to combine two clear statements into one confusing one.

    1. Michael

      I, too, prefer very clear code and don’t like the idea of mixing ++i and i++ in the same class. That said, eliminating either as an option would be wrong. While subtle, there is a difference in functionality between the two. Some prefer ++i. I personally prefer i++, although admittedly it’s probably simply because that is how I learned it many moons ago.

      Another great example of subtle differences in code are pre and post check loops. Arguably no language needs both pre and post options for loops yet some people prefer one over the other. While both options should exist, I think it’s up to each programmer to select one or the other for any given class as mixing the two hurts readability.

      1. Matt Skalecki

        I was actually arguing that BOTH operations should be removed from the language(s) as they currently exist, and that the syntax of one of them should be repurposed as a simple increment statement (that can’t be embedded in an expression). The distinction between the two increment operators is only relevant when they are embedded in an expression, and that is often enough sufficiently ambiguous that I don’t think it should be done.

  15. Rehno Lindeque

    Perhaps a small thing but I like to tell people that “if you’re writing C++ code then I really hope that it’s a high performance library… a codec or a ray tracer perhaps” (in which case I assume you already have a pretty decent understanding of low-level optimization)

  16. Brendan Miller

    Probably the worst premature optimizations I see in c++ are:
    1. Using pointers where you can just copy by value. Many values are not significantly larger than an 8 byte pointer.
    2. Using out parameters, where you can just return by value. RVO usually does this behind the scenes anyway.
    3. Returning error codes rather than throwing an exception. For operations that rarely fail, it’s actually faster to throw an exception in G++, but most people assume the opposite.

    95% of C++ programmers never actually look at their assembly, or time their code. So most things done for efficiency’s sake are just old wives’ tales. Some optimizations are based on timings with older compilers – Visual Studio 6, or g++ < 3.4 – which work very differently from g++ 4.6.

    More and more I want to move away from C++, just because I feel like C++ programmers worry about the wrong things, whereas javascript/python/ruby programmers would never dink around with these kind of optimizations.

  17. Richard

    RE – your code example

    On the other hand, if we were using a register machine, it would look like:

    // i++; “result” stored in R0.
    1: LOAD R0, @I // load R0 with the contents of address I.
    2: LOAD R1, R0 // load R1 with the contents of R0
    3: ADD R1, 1 // add 1 to R1
    4: STORE R1, @I // store R1 to address I

    // ++i
    1: LOAD R0, @I
    2: ADD R0, 1
    3: STORE R0, @I

    In the register machine, ++i is one instruction shorter that i++. So on a register machine (which is, really, a more accurate model of a real computer than a stack machine), ++i really is faster.

    Every register machine I’ve seen since the year dot has had an auto-post-increment instruction so your example is not representative of the real world – also you assume that the optimising compiler will not be able to keep i in a register for long enough. Underlying hardware will always make i++ faster in principle because of the implicit parallelism of the operation compared to ++i.

  18. Richard

    Cache

    One thing I strongly agree with is your comment about the cache. In my recent experience, memory hierarchy issues (registers, multiple levels of cache, main memory (with row/column/sequential access considerations), disk backup) are the major remaining determinant of performance once the compiler has done its optimisations.

    The other reason for not optimising manually (beyond choosing the right algorithm) is that the compiler knows the ins and outs of your machine much better than you do. Modern machines are way beyond the state where most of us can say with any certainty “X” is faster than “Y”. The fact that we disagree so much about these details shows that all of us are wrong at least some of the time!

  19. John Maxwell

    A slightly different perspective on the same point: Every time I’ve ever been involved in a serious optimization effort, the profiling data has shown that 1) The things my coworkers and I have expected to be bottlenecks aren’t, and 2) There were significant bottlenecks we hadn’t thought of.

    *Every single time*, our intuitions were seriously wrong.

  20. Tom Purl

    Great post! It often amazes me how so many companies still don’t believe in profiling code. Like you said, if you don’t have ways of measuring code performance, then how can you actually make things better?

    As a sysadmin, I see this sort of “optimization by intuition” done very frequently in the name of overall system performance improvement. It seems to make sense to add a web server or increase JVM heap when a system isn’t performing well enough. But without decent metrics, these changes usually do very little at best and at worst, they can be expensive and actually *hurt* performance.

    Thanks again!

  21. Chris Cogan

    With few exceptions, I/O (especially screen output, or I/O from/to files or databases on a server) will affect performance more than almost anything else, so reducing I/O calls will also improve performance far more than optimizing loops or expressions usually will. I annoy myself from time to time by realizing that I’m optimizing some trivial expression in the midst of code that is already taking thousands of times longer than an equivalent-sized chunk of normal code just because it’s mostly moving data to and from disk. I have even occasionally caught myself trying to optimize a bit of code which, it finally occurred to me, might only be executed a few times in several years, and that the optimizations would save at most a few milliseconds (total) in all of that time. This means that even the mere act of moving the mouse cursor to the code in preparing to change it was taking up far more person-time than the total of all of the time that could ever be saved by the change (though sometimes such changes can improve the readability of the code).

    Mark’s point about choosing a generally efficient algorithm is really the important one. That and large reductions in I/O are where most of the gains will come from. I’d bet that even in the world of fast-action games and other real-time applications, few of the optimizations people might make at the expression or statement level will make a noticeable difference. Better to just code the program in a straightforward way using good algorithms and then use the nit-picking optimizations only where profiling and comparison testing shows them to be advantageous.

  22. Pingback: Tab Sweep – An Embarrassment Of Riches « Tom Purl’s Web Site

  23. hijodelachingada

    Your post most probably has compiled languages in mind. Interpreted ones are much trickier in that aspect, as a single function call in a loop can severely impair a programme’s performance.

    Concerning the increment operator in C++: For integers who cares, indeed? But for other (complex) types it may be more efficient, as it avoids duplication. Anyway, I don’t think programmers lose much time on optimising ++’s location.

  24. Stephen

    Since the early 80’s, i’ve run a couple benchmarks across the new machines and languages i’ve gotten access to. One is a matrix multiply. It’s not a particularly efficient matrix multiply, as the algorithm used is n^3. I understand that one can do better than that. In C, i maintain two versions. One uses array notation, and the other uses pointers. The pointer version isn’t written to do anything that the array version doesn’t clearly do – that is, it’s not optimized in any way. It’s just a translation.

    In the early 80’s, the pointer version was always faster. In the very late 80’s, the first machine/compiler that performed faster with arrays appeared. The machine in question had very powerful primitives for indexing, and i attributed the array version’s speed to these. But by the mid 90’s, there were array versions on much more “normal” architectures that were quicker. The oddest thing is that gcc on x86 architectures isn’t consistent in the last decade. Intel CPUs used to favor arrays, but most recently favor pointers. AMD CPUs have favored arrays of late. It might be noted that since the 486 days, the entire benchmark and data fit into on-chip cache easily.

    In the early 80’s on the PDP-11 (and other systems), the “spell” program could be speeded up by more than a factor of two by adding #include <ctype.h> to the top of the file. That’s because it called isupper() and tolower() for every character in the input file. Without ctype.h, the linker bound libc.a functions to the code. With ctype.h, macros which performed the same function were inlined. With no function to call, these very short functions were quite fast. Subroutine call overhead was maybe six instructions, compared with maybe two for the function. The lesson was that inlining is good, and macros were the way to do it for short functions.

    However, modern C compilers can perform inlining. When gcc is allowed to do enough optimization, it will attempt inlining. If a function is declared static (meaning file scope), and references to the function come after the body of the function, the compiler can know that 1) there are no external references to the function, and 2) whether there are any pointers to the function, and can therefore omit the function itself from the binary.

    Modern C compilers can do loop unrolling. But in the old days, i’d do loop unrolling by hand, and even use Duff’s device (which is pretty ugly).

    I still use C (not C++) if what i really want is speed. I can’t say that i’m often surprised by what i get. If i want to really be surprised by performance, i’ll use Perl or Scheme. For example, my benchmarks in the Perl language often show it to be 300 times slower than C. But my real life applications show it to be amazingly fast. But then, doing nine dimensional hashes is such a pain in C, that i don’t really have a good comparison. “Adequate” is the gold standard for program performance. I do not rewrite “adequate” programs, except for academic exercises.

  25. AJS

    I think this is applicable in any “design” field, not just computer programming.

    In electronics design, where margins are so tight that saving a fraction of a penny on each of a production run can make or break a company, there is a similar tendency to optimise too early. When I worked in electronics R&D, I would hear things like “You can get rid of R4, there’s a DC path via R3, R9 and the base-emitter junction of T2 which will discharge C2 to 0.7V while power is off.” Of course, several design iterations (including at least one because the customer changed their mind after giving us the specification and one to compensate for another unintended consequence of the removal of R4) later, this path is gone; and C2 is now only getting discharged by its own leakage. So the circuit works fine as long as it’s been powered off for awhile, but goes flaky between tests. And the conclusion reached is ….. we need to fit a resistor to discharge C2.

    The proper time to start optimising is when you actually have something that at least does the job it’s actually supposed to, even if it does it sub-optimally. (Though you can legitimately “optimise” ahead of time to make the inevitable “customer changed mind” situation easier for you to deal with.)

  26. Paul B

    Myself, I find that there is always an *enormous* performance difference between writing in C++ or in ++C.

