The "C is Efficient" Language Fallacy

I came across an article yesterday about programming languages, which hit on one of my major peeves, so I can’t resist responding. The article is at greythumb.org, and it’s called Programmer’s rant: what should and should not be added to C/C++. It’s a variation on the extremely common belief that C and C++ are the best languages to use when you need code to run fast. They’re not. They’re good at things that need to get very close to the hardware – not in the efficiency sense, but in the sense of needing to be able to fairly directly munge the stack, address specific hardware registers, etc. But they are *dreadful* languages for writing real scientific and/or numerical code.

To quote the part of the article that set me off:

First of all, these fears are nonsense. C and C++ are never going to disappear. Why? Because there
are classes of programming problems that are still and will always be CPU bound and there is still
no language as fast as C or C++ for these problems. I highly doubt that there ever will be.

I’m talking about things like: scientific number crunching, game/simulation physics, raytracing,
real-time 3d graphics, audio processing, codec implementation, high-speed network packet routing,
evolutionary computation (my personal favorite :), and of course implementing all these high-level
languages’ runtimes. There are also problems like OS and hardware driver implementation where you
need something “close to the metal” that can interact closely with and even embed assembly language.
C is basically shorthand for assembler, which is why it’s the preferred language for that kind of
thing.

For these tasks, premature optimization at the level of language and framework choice is not evil.
In some cases it’s a requirement. I predict that at least some of these tasks will still be done in
C, C++, or some language with similar characteristics 50 years from now. To give you an idea of just
how much faster C can be for tasks like this, I have found that evolvable instruction set based
evolutionary computation is almost twice as fast when competently implemented in C than a similar
competent implementation in Java.

Here’s the problem. C and C++ suck rocks as languages for numerical computing. They are not the fastest, not by a longshot. In fact, the fundamental design of them makes it pretty much impossible to make really good, efficient code in C/C++. There’s a good reason that Fortran is still the language of choice for real, intense scientific applications that require the absolute best performance that can be drawn out of our machines – applications like computational fluid dynamics.

Making real applications run really fast is something that’s done with the help of a compiler. Modern architectures have reached the point where people can’t code effectively in assembler anymore – switching the order of two independent instructions can have a dramatic impact on performance in a modern machine, and the constraints that you need to optimize for are just more complicated than people can generally deal with.

So for modern systems, writing an efficient program is sort of a partnership. The human needs to carefully choose algorithms – the machine can’t possibly do that. And the machine needs to carefully compute instruction ordering, pipeline constraints, memory fetch delays, etc. The two together can build really fast systems. But the two parts aren’t independent: the human needs to express the algorithm in a way that allows the compiler to understand it well enough to be able to really optimize it.

And that’s where C and C++ fall down. C and C++ are strongly pointer-based languages. The real semantics of almost anything interesting end up involving pretty much unrestricted pointers. In C and C++, there’s no such thing as an array – there’s just pointers, which you can subscript and a shorthand for pointer arithmetic and indirection(`x[n]` in C/C++ is the same thing as `*(x+n)`.)

That pointer based nature means that in a C or C++ program, it’s very hard for a compiler to figure out what things are independent. It comes down to a problem called alias detection. Alias detection is identifying when two variables *might* be referencing the same location. Alias detection becomes a horrific mess in the presence of unrestricted pointers. Let me show you an example:

for (int i=0; i < 20000) {
  for (int j=0; j < 20000) {
    x[i][j] = y[i-2][j+1] * y[i+1][j-2];
  }
}

If you look at that loop, it can be parallelized or vectorized without any problem if and only if the array pointed to by `x` and the array pointed to by `y` are completely distinct with no overlap. But there’s no way to write code in C or C++ that guarantees that. If it were Fortran-77, you could easily check if they were distinct. If it were Fortran-98, you could check if `x` or `y` were declared as possible pointer targets, and the programmer could make it obvious that they didn’t overlap if they wanted to. But you can’t do that in C or C++. (And Fortran isn’t even the best – an experimental language called Sisal from Lawrence Livermore labs used to be able to beat Fortran by around 20% on typical code!)

That example involves parallelization of code, but alias related problems aren't just an issue for parallelism; it’s just easiest to show an example for parallelism. The aliasing issues in C and C++ have a very direct impact on real code. Let me tell you about a concrete example of this, and then I’ll stop ranting. About six years ago, I was working on a project where I needed to implement a rather messy algorithm to compute something called the "longest common subsequence" (LCS) of two arrays. The standard algorithm for computing LCS is using something called dynamic programming; it's **O***(n3) time, and **O**(n2) space. There’s an algorithm that was designed by people doing computational biology that can do it in the same time, but using on average **O**(n) space.

I didn’t know what language to use for this project, so I decided to do an experiment. I wrote the LCS algorithm in a bunch of different languages, to compare how complex the code was, and how fast it ran. I wrote the comp bio algorithm in C, C++, OCaml, Java, and Python, and recorded the results. What I got timing-wise for running the programs on arrays of 2000 elements each was:

  • C: 0.8 seconds.
  • C++: 2.3 seconds.
  • OCaml: 0.6 seconds interpreted, 0.3 seconds fully compiled.
  • Java: 1 minute 20 seconds.
  • Python: over 5 minutes.

About a year later, testing a new JIT for Java, the Java time was down to 0.7 seconds to run the code, plus about 1 second for the JVM to start up. (The startup times for C, C++, and Ocaml weren’t really measurable – they were smaller than the margin of error for the measurements.)The Objective-Caml bytecode interpreter was faster than the carefully hand-optimized C program! Why? Because the OCaml compiler could recognize that the arrays were completely independent – it didn’t need to worry about one iteration of the loop stepping on the values used by another. The C compiler couldn’t apply a lot of useful optimizations, because it couldn’t be sure that they were valid.

And it’s not just non-assignment based functional languages where you can see supposedly less-efficient high level languages crushing the performance of C/C++. CMU CommonLisp can beat C/C++ on numeric code. There was a paper a few years back documenting it: using a Sun SPARC workstation, if you use the optional type declarations, and write scientific/numeric code in Lisp, using vectors (Lisp arrays) and assignments to implement exactly the same algorithm as C, the CMU CommonLisp code will perform better than C code generated by either the Solaris C compiler or GCC with maximum optimization.

168 thoughts on “The "C is Efficient" Language Fallacy

  1. Keith Sader

    Thanks for the breakdown. I *used* to be a C/C++ purist, but I got over it.
    All languages have their place in the Tao, but do not use COBOL if you can avoid it. 🙂

    Reply
  2. Walker

    I work with the ACM programming competition team here at Cornell. Lots of CPU bound programs in those competitions, and your program has to finish in 2 minutes to be accepted.
    Except for the start-up time, I have never seen any difference between Java/C/C++ that couldn’t be accounted for runtime initialization. GUI and graphics code is a whole ‘nother deal, for obvious reasons.
    OCAML is damn fast. I have seen some reall nice network stack code written in it. But I wouldn’t want to be doing applications in it.

    Reply
  3. Mark Chu-Carroll

    Gaurav:
    I haven’t personally used MATLAB, Maple, etc., so I don’t know much about their performance characteristics. I know my father used MATLAB for data analysis a few years ago, but the stuff he was doing wasn’t supercomputer complexity, but statistical
    analysis of semiconductor production runs, so expressiveness was more important than efficiency.

    Reply
  4. Mark Chu-Carroll

    Walker:
    In recent Java VMs, it seems that the performance of most things written in Java do end up being comparable to, or even better than, C/C++. The main thing in Java is the highly expensive startup – the JVM takes quite a bit of time to start up, *and* the performance ramps up gradually as the JIT compiler optimizes the code. Once the JIT stabilizes, the performance is often quite excellent.
    OCaml is definitely screamingly fast, but I don’t see why you wouldn’t want to write apps in it. I’ve done a *lot* of programming in OCaml, including writing several rather substantial applications. (The open-source Stellation project which I led had its first complete prototype implemented in OCaml. It needed to be rewritten in Java because we wanted to submit it to Eclipse.org. The Java version was about six times longer for equivalent function, and it was much slower.)

    Reply
  5. dileffante

    Thanks for the post, Mark, and specially for the experiment; I usually program sequence stuff, and discrete dynamics (automata, boolean networks), and only rarely truly numerical things. I’m a bit lazy about changing languages, though, but I’ll keep this in mind.
    …and, I have an (only slightly OT) question, which I guess some fellow commenter could answer (?). I usually program in C++, and I like the Visual Studio IDE. Since I had to give up illegal software, I’m abandoning VS6 and moving to VS Express. What I don’t know is what they mean with the “managed” code of VC++ Express… Does it somehow go through MS’ virtual machine, like VB and C#?? Will my little hard-working console apps run as fast now as they did with VC6?

    Reply
  6. Mark Chu-Carroll

    dileffante:
    The managed code is code that compiles to the CLR, which is MSs virtual machine. But it optimizes through a JIT just like Java – the startup time will be slightly higher, but the JIT will rapidly make up for it. Long running applications should perform *better* than the old-fashioned C++ compiler, because the JIT compiler has more information available – it can use dynamic information to guide the optimization.

    Reply
  7. KeithB

    While I know that C compilers are really conservative about this, doesn’t it matter as to how x and y are declared? After all if they are dclared as:
    int x[1000], y[1000];
    then the compiler can know that they are independent.
    However, if they are declared as part of the function definition, then yes, they could be overlapping.
    In other languages though, couldn’t you have the same problem with sub arrays passed to functions?

    Reply
  8. billb

    Horseshit! Bollocks! 🙂
    It’s entirely possible to tell the C or C++ compiler that you never alias or that you alias in a function-by-function basis. That, and alias analysis gets better in these compilers all the time. You’ve really overstated your case by a long shot on the aliasing front, Mark. Send me your LCS code, and I’ll take a crack at it.
    Also, if you have core routines that are really important and take up a large fraction of your code’s run time, you should probably write a hand-coded in assembly routine for each architecture that is important to you. In fact, you should probably have an expert from the CPU vendor do the coding for you, unless you happen to have an assembly expert laying around (e.g., K. Goto’s work on the BLAS). At that point, the glue language comes down to personal preference and expressiveness.
    BTW, I’ve seen a number of these comparisions where an expert in one language does an implementation in several languages and, lo and behold, discovers that their favorite language wins out. It’s all the same computer underneath, so if you aren’t getting fast code in one language (that’s sufficiently expressive) IT’S YOUR FAULT (whether because you aren’t a good enough programmer in that language, don’t know your compiler well enough, or aren’t using the right compiler (the old Programming Language Shootout used GCC, which often gets its butt kicked by the Intel compiler on Intel machines)) not the language’s.
    Finally, I’ve got a wicked fast parallel, unstructured (mesh), 3D, time-dependent CFD code that’s written in C++. It was originally a C/Fortan code (Fortran for the tight, important loops), and then it became a C-only and got about 5% faster. When we refactored it into C++ it gained another 5%. When we refactored the main matrix formation routines to be more compiler-like, we gained 5% in that part of the code.

    Reply
  9. Mark Chu-Carroll

    KeithB:
    `int x[1000]` in C or C++ doesn’t have much meaning. It’s semantically equivalent to declaring “x” as `int *x`. You can reassign X any time you want, to make it point to a different location.
    And your point WRT function definitions is dead on: the compiler can’t do much optimization if the parameters to the function are pointers.
    The reason why it’s often better in other languages is because they make it much harder to create situation where there’s ambigous aliasing. You *can* create cases where you can’t prove independence in other languages – but in C++, you can almost *never* prove independence; in Fortran, you can *usually* prove independence.

    Reply
  10. Jake Ham

    @ dileffante
    The “managed” C++ code means that it is managed by the .Net VM. This also means that it will be compiled into MSIL (the .net intermediate language). There are benefits for doing this, the VM will watch for buffer overflows, and it can also talk with non-managed code. How fast will this be? I am not sure, but I think that it is more directed for developing safe code than it is for speed.

    Reply
  11. Anthony Arkles

    To Mark Chu-Carroll:
    I think you’re mistaken about int x[1000] and int *x being equivalent. I just tried two different things with GCC, and they both failed to compile:
    int x[1000];
    int y[1000];
    x = y; /* fails with error: incompatible types in assignment */
    x++; /* fails with error: wrong type argument to increment */

    Reply
  12. KeithB

    I hope I am not teaching my grandmother to suck eggs, but the C faq has a whole section on pointers and arrays:
    http://c-faq.com/aryptr/index.html
    6.1:
    I’m still mystified. Is a pointer a kind of array, or is an array a kind of pointer?
    Answer:
    An array is not a pointer, nor vice versa. An array reference (that is, any mention of an array in a value context), turns into a pointer (see questions 6.2 and 6.3).
    There are perhaps three ways to think about the situation:
    Pointers can simulate arrays (though that’s not all; see question 4.1).
    There’s hardly such a thing as an array (it is, after all, a “second-class citizen”); the subscripting operator [] is in fact a pointer operator.
    At a higher level of abstraction, a pointer to a block of memory is effectively the same as an array (though this says nothing about other uses of pointers).
    But, to reiterate, here are two ways not to think about it:
    4. “They’re completely the same.” (False; see question 6.2.)
    5. “Arrays are constant pointers.” (False; see question 6.9.)
    See also question 6.8.

    Reply
  13. Mark Chu-Carroll

    billb:
    How do you tell a C++ compiler that there’s no aliasing between two three dimensional arrays of floats?
    I can’t give you the code I did the LCS tests on – as I said, it was six years ago! It’s somewhere on a backup tape, and I’d need to get permission from my employer to release it.
    At the time, I had just come off of a project where I was part of a team implementing a C++ compiler. C++ was definitely my strongest language at the time. The code in all of the languages was as carefully hand-optimized as I could make it. (The 0.8 seconds for C code for the kind of LCS I was doing was amazing performance, much better than I expected from *any* of the implementations.)
    I went into the experiment expecting C and C++ to both kick the butts of everything else, and I was trying to find out just how much penalty I’d pay by using one of the “higher level” languages. One of the big questions that I went into the experiment with was how much penalty would I pay for using C++ instead of C? (That is, I viewed C++ as higher level than C, because of things like templates, better typing, cleaner memory allocation/management, etc; and I wanted to know how much penalty I’d get from it compared to
    plain C, and where it fit into the scale of performance from what I expected to be fastest (C) to what I expected to be slowest (Python).
    And as I said, I did my best to optimize the C, because I wanted the numbers to be *accurate*. I didn’t want to spend two years writing a system, and then get beaten up by my management because of poor performance relative to what I could have gotten from C. I seriously wanted to make sure that I wasn’t going to screw myself over, so I really wanted to compare the best that each language had to offer.
    As I said above, at the time, I was a serious C++ coder; the thing I coded immediately before the Stellation experiments was a very hairy template analysis for a C++ compiler – roughly 10,000 lines of code fitting into a 1.5million line codebase.
    And at the time, I had *never* used OCaml; I had used its predecessor, Caml-Light for some experiments about 4 years earlier. But I was *far* from a highly skilled Caml hacker.
    The results were extremely surprising to me, and I did spend some time profiling to try to figure out just why the OCaml was so much faster. The specific reason was that the Caml code did some really clever stuff – it basically did something like local constant propagation that were based on
    be able to identify relations between subscripts used to access different arrays, and having done that, it could do some dramatic code rewriting that made it possible to merge loops, and hoist some local constants out of the restructured merged loop.

    Reply
  14. cleek

    if a loop contains code that can be broken into parallel paths, but you don’t notice it, or you notice it but don’t change your algorithm to make use of that knowledge, it’s not the language’s fault – it’s your fault.

    Reply
  15. Mark

    Nice to read this. All the youngsters around here make fun of me because I use Fortran; most of them use C++. I am not a programmer, but anyone who works in my field has to be able to program to do the work and to make sense of the results of programs others write and provide.

    Reply
  16. Blake Stacey

    @Guarav:
    Mathematica was legendary among MIT physics undergrads for being impossible to operate. I don’t think I got a single thing to work with it that I didn’t redo more quickly in another language. Of course, my sampling was undoubtedly biased, and your actual mileage may vary.
    MATLAB struck me as being the wrong tool for every problem. I was able to do just about everything I wanted, but laboriously. Handling strings in MATLAB code was downright painful: for example, IIRC, an array of strings must have all members the same length.

    Reply
  17. billb

    I dunno, “-fno-alias” on the command line, perhaps?
    BTW, C++ compiler writing in C++ may make you an expert in the parts of C++ that are useful for , but it doesn’t make you and expert in numerically (floating-point/scientific) intensive C++ coding. But I didn’t really intend to question your bona fides.
    My basic point was that you didn’t really get the best that the languages had to offer, but you did get some of the best that the languages had to offer to a few days (or weeks maybe) of optimization work.

    Reply
  18. Steinn Sigurdsson

    There is a well known way to write efficient C code for numerically intensive applications;
    write C wrappers to handle I/O and user interfacing, and have them call fortran subroutines…
    Works.

    Reply
  19. weapon

    > for (int i=0; i for (int j=0; j x[i][j] = y[i-2][j+1] *
    > y[i+1][j-2];
    > }
    > }
    I can’t see why the above code can’t be optimized wrt parallel computing. Now with global optimization, a good C++ compiler could easily figure out that whether x and y points to the same array, or whether they overlap. Doesn’t it?

    Reply
  20. KeithB

    “How do you tell a C++ compiler that there’s no aliasing between two three dimensional arrays of floats?”
    I don’t know about C++, but in C99 you have the “restrict” keyword:
    “…it serves as a ‘no alias’ hint to the C compiler. This means that the pointer is, at the moment, the only way to access the object to which it points.”
    This is from Harbison and Steele, 5th edition.

    Reply
  21. billb

    Blake: That’s why MATLAB eventually got cell arrays (a kind of meta-array for those who don’t know MATLAB). MATLAB is a great language for prototyping small problems that involve linear algebra or fourier transforms where you don’t care to spend the time implementing these core routines yourself of call a more complicated library. In my experience anything outside of 1-D is best done somewhere else.
    However, I still do all my line and scatter plotting in MATLAB. I’ve just gotten used to the way they look.

    Reply
  22. Mark Chu-Carroll

    billb:
    -fnoalias isn’t the kind of thing I was talking about. That’s an assertion that you should *always* assume that things are independent unless you can prove that they aren’t (ie, the opposite of normal). The big problem with a lot of programming in C/C++ is that there are *some* loops where all of the values are independent; and there are some where *some* of the values are independent, and there are *some* where nothing is independent. In Fortran, you can often recognize which variables are definitely independent; in SISAL, you can always recognize which variables are definitely independent; in C/C++ it’s very hard to recognize which variables are independent.
    In Fortran-98, there’s now support for pointers. But the way that they work, when you declare or allocate a value, you have to declare that it can be assigned in a way that produces aliases. There’s no equivalent for that in C/C++. (I know there was an attempt to get a “noalias” keywork into C++, but it failed.)

    Reply
  23. Astrochicken

    Mark Chu-Carroll:
    Did your OCaml implementation have problems with parallelism on an SMP machine? From what I read (and understand), OCaml does not allow multi-threading within the same process regarding accessing shared-memory. So, a numerical analysis algorithm written in OCaml might be limited to a single-processor machine if efficiency is desired.
    Source:
    http://caml.inria.fr/pub/ml-archives/caml-list/2002/11/64c14acb90cb14bedb2cacb73338fb15.en.html
    Sorry for the long URL.

    Reply
  24. Mark Chu-Carroll

    weapon:
    Sometimes, it’s easy to recognize whether two arrays are distinct; sometimes it isn’t. If you’ve got code where the
    declarations are right there, then it’s easy. But there’s lots of code where it’s very hard to tell.

    double a[325][17];
    double b[] = munge(a, x);
    double c[] = mungemore(b,a[13]);
    mungeb(b,a,c);

    When you start to see code like there – where the arrays are getting passed around to lots of different functions, and some of the arrays are return values from functions, and those functions are called from multiple call points… Tractable global analysis just doesn’t hold up well. There’s just too much potential aliasing going on.

    Reply
  25. Stephen Wells

    Your choice of language probably depends a lot on whether the most time-consuming part of the operation is the part that happens in the computer or the part that happens in your head, trying to a) implement your algorithm and b) understand what it did afterwards. I have simulation codes in C++ that could probably be made faster in Fortran, and parsers in Python that could probably be faster in C++, but in both cases it’s not the computer that’s the rate-determining step in my workflow, it’s me. Also if I want to write code so that someone else understands what it’s doing… I’ll trade a little speed for some clarity.

    Reply
  26. Mark Chu-Carroll

    Astrochicken:
    I have no idea how the code would have run on an SMP. The project was building a fast, lightweight SCM systems for developing software on workstations. So the tests were run on
    an IBM desktop PC. It’s long enough ago that I don’t recall which pentium generation it was, or what the processor clock speed was. But it was a single processor x86-family machine.

    Reply
  27. Astrochicken

    Mark Chu-Carroll:
    “But it was a single processor x86-family machine.”
    That was your first mistake. But, that is a religious war for another thread.
    Regarding SMP and OCaml, I think we can conclude that the hardware platform constrains the development language decision as much as the problem set.

    Reply
  28. billb

    Mark, no one disagrees that there are aliasing problems in C/C++, but I think that you’d find that compilers have come a long way in the last 6 years. A very long way. So much so that I think you would reevaluate your claim that they are dreadful languages for writing scientific code in if you took another look.

    Reply
  29. Mark Chu-Carroll

    Astrochicken:
    The machine wasn’t my choice. In industry, we’re often constrained by all sorts of external realities – things like hardware and operating systems (and even programming languages) are often dictated by the requirements of whoever’s paying for the research.

    Reply
  30. bigTom

    Well, the de-facto aliasing rules in Fortran are convienient for optimizing compilers. Two parameters to a subroutine are de-facto assumed to be independent, so no fancy dependence checking algorithms are needed. Of course if you know enough of the ins and outs of a specific implementation of (say C), you can probably find ways to change the compilers aliasing (or other) assumption, either via compile line options, or pragmas.
    The more serious problems in wringing performance out of systems has to do with the hidden, but very strongly non-uniform performance of memory. In general the languages aren’t designed to allow programmers to control things like bytes/words/cache-lines/ and memory pages, yet
    the hardware uses or abuses these constructs based upon the low level details, largely the sequence of memory addresses that comes from the program, and whether it makes efficient -or sometimes very inefficient usage of the hidden memory management layers of the hardware.
    Especially if your goal is to use SSE/Altivec, whose instructions can handle multiple ops/instruction, things like address alignment
    with respect to natural memory boundaries, and/or the compilers ability to know if two arrays, have the same alignment wrt these boundaries can be crucial to its ability to exploit such features.

    Reply
  31. jonmoore

    billb’s right about dealing with aliasing, although it obviously depends on the machine and compiler you’re using since it is outside the language spec.
    For example, I used SGI’s C/C++ compilers back in the 90s and even then you had the choice of
    (i) at the function level, declaring variables to be non-aliasing
    (ii) for individual loops, adding a pragma to say it was safe to parallelize
    It’s a pain to have to do this, but it’s not such a pain as to rule out C/C++ for numerical programming, not by a long shot.

    Reply
  32. Rob Knop

    But there’s no way to write code in C or C++ that guarantees that.
    Sure there is — use the Standard Template Library, and use a vector or some such.
    I have no clue if the compilers are smart enough to recognize that and do things that wouldn’t be safe to use with pure pointers, but in principle it could be done.
    I program mostly in C++, and I *almost never* use pointers any more. C++ still has its own quirks and problems, but the argument that “C++ is entirely a pointer-based language” is obselete in the face of the STL.
    -Rob

    Reply
  33. Mark Chu-Carroll

    Rob:
    The STL doesn’t get rid of pointers – it just disguises them. STL is implemented using pointers! And in fact, much of the STL code is some of the most hideous gorp that I’ve ever had the misfortune to have to read. Might be fine from a users perspective (although I don’t even like it from *that* point o view), but it’s a mess internally. Optimizing for alias elimination is *not* a reason to use STL.

    Reply
  34. Mark Chu-Carroll

    Gang:
    Please try to remember that my point isn’t that C and C++ suck; it’s that the following statement from the linked article is *incorrect*:

    First of all, these fears are nonsense. C and C++ are never going to disappear. Why? Because there are classes of programming problems that are still and will always be CPU bound and there is still no language as fast as C or C++ for these problems. I highly doubt that there ever will be.

    My point is that not only is it not true that there is no language as fast as C and C++, but that there are languages that are often *faster*. The fallacy in the title of this post is that using C/C++ is *the only way* to write code if speed is your primary concern.

    Reply
  35. Simon

    For non-real time systems, how much a performance hit is ok? If my app is 10% slower in language X than C++, but I can implement it twice as fast, then I get more time down the pub.
    I do quite a bit of simulation work, and have given up on fighting compilers and languages and avoiding stupid bugs arising from trying to do clever things. Its not worth the hassle. Although, I guess your mileage will vary.
    For me, writing bug free code is very important. C++ (and similar languages) offer too many places to make mistakes. In fiddly scientific computing this is a big issue. I wonder how many papers contain duff results because someone was accessing beyond array bounds in C?

    Reply
  36. Scott

    Every C/C++ compiler that’s used by people who care about performance has a version of restrict, __restrict, or __restrict__. The aliasing argument is certainly true if you’ve never written performant code and don’t really know what you’re talking about, but otherwise it’s pretty bogus.
    Second, you’ve got to be joking about OCaml. It’s capped at 25% of my current PCs CPU resources, and that doesn’t seem about to change, sadly: http://caml.inria.fr/pub/ml-archives/caml-list/2005/04/3fe33497071e1d9a065533d825ba848b.en.html
    Distribution across multiple CPU resources will be the most important optimization consideration within 2-3 years, if it isn’t already.

    Reply
  37. Mark C. Chu-Carroll

    cleek:
    Sometimes rewriting a loop in a way that makes it faster is the wrong thing to do.
    The example that I mentioned with OCaml is a good one. The original code had two separate loops; OCaml merged them and then extracted local constants.
    Could I have done the same thing by hand in the C code? Yes. But it would have been a stupid thing to do. The code was ultimately going to be used as a part of a large complicated system. The algorithm was based on taking two passes over the input arrays. The results of the first pass are used as inputs to the second pass. Semantically, the two are *really* very different.
    It happens that the compiler can figure out that on the n+1th iteration of loop one, you have enough information to run the nth iteration of loop two, provided you play a few tricks – which is what the compiler did.
    If I were to write the code as the merged loop, it would make it significantly harder to understand, and significantly harder to debug if there were a problem.
    In writing real code, that’s a common tradeoff. Even if the compiler *can’t* do that for you, swapping a 20% increase in execution time for a significant improvement in readability and maintainability is a *very* good exchange. There are even common cases where a 100% performance penalty is acceptable in exchange for better code clarity. (For example, in that prototype of Stellation, there were some parts of the code that were highly IO bound. Processing a line of input takes *much* less time – on the order of 1/100th the time – than *reading* a line of input. So making the code to process a line take twice as long in order to make what’s being done to it is fine, since the main effect of it is to reduce the amount of idle time the code spends waiting for the next line of input.
    You don’t want to deliberately *waste* cycles. But if you’ve got a choice between writing comprehensible, maintainable code, and writing the most perfectly optimal fast code, often you want to pick the clean stuff.
    So going back to that original loop example: merging the two passes into a single pass that plays pointer tricks would be a terrible thing to do. It would take the code from a very clear two-pass algorithm into a very hard-to-follow one-pass.

    Reply
  38. anupam

    wrt to aliasing why not just use the restrict keyword or some variant thereof that should do the trick. no ? also, i guess, compilers in general are conservative while doing code generation. it is _much_ better to have _slow_ but correct code, instead of fast but incorrect one. if you can nudge the compiler onto the right path, it should go a long way in improving code performance. but, when everthing is exhausted, you stil have assembly !

    Reply
  39. BMurray

    Any optimizations your C compiler is incapable of identifying you can always rectify with some inline assembly code!
    Seriously, there’s a reason we like compilers to optimize for us, and any C programmer should be well versed in what C can’t do for you rather than just assert anything can be done. Of course anything can be done — you can inline assembler.

    Reply
  40. cleek

    Sometimes rewriting a loop in a way that makes it faster is the wrong thing to do.
    true. but only if speed is not your primary concern.
    i write 2D image processing software professionally, and aside from correct results, speed is all my customers care about. they don’t care if i have to use MMX/SSE or twist the logic into knots, they just want it to run faster than my competitors. they regularly send me benchmarks of my stuff vs. Lead and Pegasus – if i’m slow, they’re unhappy. they don’t care if the code is unmaintainable – they’re not maintaining it, i am. so i do my best to keep it commented, so i’ll at least know why i did what i did.
    clarity is great. at my day job, i primarily use C#, and i go out of my way to simplify and clarify. but speed’s not so high on the priority list, there. we can always buy a faster web server.
    so… i agree, code clarity is important, but not always.

    Reply
  41. Mark C. Chu-Carroll

    BMurray:
    Actually, writing inline assembly code really isn’t a good thing on modern architectures. On modern architectures,
    making code run fast is incredibly hard – probably beyond the ability of almost any human being in a reasonable amount of time. You need to work out the memory access times, to figure out when to issue fetches and stores; you need to arrange instructions to keep the pipelines full without blocking, etc. A few years ago, we just routinely said we could throw in some hand-optimized assembler when we needed to; that’s really not true anymore. Now, most hand-written assembler will be *worse* than what the compiler generates, because the compiler can optimize for for cache, pipeline, and memory and the tradeoffs between them.

    Reply
  42. billb

    Mark, you’ve switched arguments. You declared C and C++ unusable for scientific purposes, but now you’re saying that we should simply consider other languages that might be slower or might be faster. What gives?

    Reply
  43. Mark C. Chu-Carroll

    billb:
    I didn’t intend to switch arguments :-). I just went a bit overboard in how I stated it. I never intended my point to be that no scientific code should be written in C/C++; I was responding to the idea in the linked article that scientific/numeric code should *only* be written in C++, and that *nothing else* is *as fast* as C/C++, and that further nothing else *will ever be* as fast as C/C++. But in fact, not only is C/C++ always the fastest thing, but it’s often *not* the fastest, and there are numerous choices of languages that are at least as fast as C++.
    Personally, *I* would *never* write scientific or numeric code in C or C++. But then, these days, I can’t think of much of anything that I would voluntarily write in C++. I still use C from time to time, and think it’s a very good language for low level OS-type code; I don’t think it’s a good choice for numeric/scientific. I’ve come to *despise* C++, and personally, I don’t think it’s suitable for much of anything. It’s a giant wretched pile of muck.
    You *can* write scientific code in C or C++. But it’s a whole lot harder to make it run as fast as possible using a language like C/C++ than it is in a lot of other languages.
    It really is a lot easier to write most kinds of numeric/scientific code in Fortran than in C++, and it’s much easier for the compiler to generate optimal code for Fortran than for C/C++. Sisal routinely beat Fortran by 20%, and Fortran routinely beats C and C++.

    Reply
  44. billb

    Considering what can be done in C++, I’d say the level of muck (from negative muck to decidedly no muck to considerable muck) is up to the author. Of course, I’m getting ready to replace my input parser with embedded lua, so what do I know?
    I’m not a C++ bigot by far, but I can make it do what I like, and it interfaces seamlessly with the C and POSIX standard libraries on the UNIX-like sytems I do all my work on (large clusters). Those two together make it hard for me to beat with anything else.

    Reply
  45. Markk

    “Actually, writing inline assembly code really isn’t a good thing on modern architectures. On modern architectures,
    making code run fast is incredibly hard – probably beyond the ability of almost any human being in a reasonable amount of time. You need to work out the memory access times, to figure out when to issue fetches and stores; you need to arrange instructions to keep the pipelines full without blocking, etc”
    This is true but not true – for small functional units, especially in constrained systems, we have never found the compiler better than hand doing it. The compiler is often good enough that it isn’t worth the effort, but once you get down to less than 50 “important” instruction units, people can iterate and get things as good or faster. You learn how many times you can unroll with a given workload, you can usually throw out some instructions, you really can do a better job than a compiler most of the time in getting superscalar order. There aren’t really that many ways to be out of order that matter, and cache is of course really of supreme importance – data cache mostly.
    Of course the amount of code this matters for is almost zero really. If we had just dumped doing this I don’t know if anyone would really have noticed, ecept for not having hard limits, but it did make the programmers aware of things. (This was a couple years ago, but nothing has changed much I hear.)

    Reply
  46. Alexei K

    for those interested, here is a bunch of straight forward (as in not part of huge projects) algorithms common in many fields of computer science, though you mustn’t take it as some absolute benchmark like 3dmark is for example:
    http://shootout.alioth.debian.org/gp4sandbox/
    this does at times show the superiority of non C languages, though in grand majority of cases, C family holds the crown of performance.
    Though what I find interesting about that benchmark is the “alternative” implementation section (multiple implementation in the same laguage). I particularly like the example of alt algorithm in Lua taking about 80% less time than “naive” version. The trick was that for computing alternating sum it’s better to do 2 loops, one for plusses and one for minuses, and then subtract the results, than it is to do a one loop alternating summation. Though apparently that was interpreter’s fault and has been fixed now, but there are other (less dramatic in performance diff) examples there.
    That’s the type of thing I love about that site.

    Reply
  47. anjan bacchu

    hi there,
    nice post.
    “About a year later, testing a new JIT for Java, the Java time was down to 0.7 seconds to run the code, plus about 1 second for the JVM to start up.”
    I’m wondering what type of hardware you had. Even now, a JVM on a Pentium 4 2.x GHz 2 GB RAM on windows xp, it takes more than a second for the JVM to load and give control to the main() method.
    Also, it should be noted that the C startup times can be as big/small as you want. You can trim down the libraries that get initialized and called to what you really need in the application. For the hardware that was popular 6 years ago (say year 2000), the C startup time can be less than 0.1 second.
    BR,
    ~A

    Reply
  48. John

    Are there any languages out there that are designed to take advantage of SMP machines? It would seem that would be awfully useful for scientific work.

    Reply
  49. Mark C. Chu-Carroll

    anjan:
    I don’t remember the exact model. It was the IBM distro of the JVM (IBMs JIT was better than Suns), running under Linux on a single processor intel PC.
    I’m surprised that you see such a slow startup for Java. My machine (a brand new MacBook Pro, Intel Core Duo 2Ghz) can run Java code much faster than that… Quick testing looks like it’s under 0.2 seconds to main. Hard to nail it down any more precisely than that.

    Reply
  50. Mark C. Chu-Carroll

    John:
    There are lots of languages that try to take advantage of SMP. A few examples are: Sisal, which I mentioned above, was designed for parallel processing, and did a great job. One of the design points in Fortran-98 was to make it easy to write programs that could be compiled for various kinds of parallel machines, including SMP. SCL is a common lisp designed for scientific computing, and which has support for SMP. Lots of C/C++ compilers provide SMP support.

    Reply
  51. Koray

    Nice thread.
    Even though it’s admittedly flawed, check out the Programming Languages Shootout (http://shootout.alioth.debian.org/). If you dive deep, you can see some surprising results.
    However, I must admit that C is still not necessarily slower than anything else given manually inserted aliasing pragmas, profile based branching optimizations, CPU manufacturer provided compilers, etc.
    Fortunately, only a tiny percentage use C like that. I bet most people don’t even care enough to set the CPU arch flags right.
    However, a lot of people do care about things like threading and multiprocessing to get performance. C can be in serious trouble here.

    Reply
  52. Alexei K

    @John: that’s why we have the all mighty MPI and its lesser brethren (by lesser, I’m talking about my opinion of them, not their theoretical ability). Languages as such don’t really have parrallelism built in in any way I’d call serious, but some are made receptive to integration with other API, of which MPI is one. (you might be more familiar with Thread like api’s that most languages have built in, but I don’t like that programming paradigm and would never code using it)
    @Derek: owned ya there by 26 min 🙂

    Reply
  53. BC

    You *can* write scientific code in C or C++. But it’s a whole lot harder to make it run as fast as possible using a language like C/C++ than it is in a lot of other languages.

    Wha? There’s plenty of evidence that C/C++ runs quickly in a lot of different situations. In fact, you yourself state:
    * C: 0.8 seconds.
    * C++: 2.3 seconds.
    * OCaml: 0.6 seconds interpreted, 0.3 seconds fully compiled.
    * Java: 1 minute 20 seconds.
    * Python: over 5 minutes.
    In this list, C/C++ aren’t the fastest, but they’re a whole lot better than some other languages on the list. Derek’s link says the same thing: http://shootout.alioth.debian.org/gp4/benchmark.php?test=all〈=all
    Further, you’re making a mistake by saying that because C/C++ handles parallelism bad, it’s all-around bad.

    I’ve come to *despise* C++, and personally, I don’t think it’s suitable for much of anything. It’s a giant wretched pile of muck.

    And you’re showing that you’re a language snob with that statement. All you’re doing is trying to justify your own personal preferences by trying to find situations where C/C++ does badly and acting like it’s all-around bad.
    BTW, I’ve seen C++ code which is optimized for different CPUs, and setup to take advantage of the parallel processing inherent in those CPUs – you can get a big speed boost out of that kind of code. You’re making the same mistake that so many other people make: you don’t understand the language well enough or don’t want to write the opimized routines, and then complain that the language is just inherently inferior, justifying your own preconceived ideas. There are lots of ways to judge languages (ease of use, personal preferences, ability to create programs quickly, speed, memory usage, etc), and even I don’t like all programming languages, but I’m not going to construct some bad argument about why you shouldn’t use language X, that I happen dislike.

    Reply
  54. Alexei K

    This is totally random, but I’d like to point out that there was a game for Playstation 2 (can’t remember the title, but it was one of the major blockbasters of the time) that was written *entirely* in Scheme, but it was some sort of compiled version of Scheme that actually produced assembler code for PS2.
    I don’t really mean this as a proof of anything, just thought it was a nice thing, and it made me for the first time consider that perhaps C family isn’t the be-all-end-all of programming.

    Reply
  55. los alamos

    comments about stl efficiency are a little too hasty, imho.
    have you seen blitz++?
    http://www.oonumerics.org/blitz/
    also see http://www.oonumerics.org and http://acts.nersc.gov/pooma/
    “””Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. It uses template techniques to achieve high performance. The current versions provide dense arrays and vectors, random number generators, and small vectors and matrices.”””
    “”” Goals of the Blitz++ library
    The C++ programming language offers many features useful for tackling complex scientific computing problems: inheritance, polymorphism, generic programming, and operator overloading are some of the most important. Unfortunately, these advanced features came with a hefty performance pricetag: until recently, C++ lagged behind Fortran’s performance by anywhere from 20% to a factor of ten. As a result, the adoption of C++ for scientific computing has been slow.
    Is there a way to soup up C++ so that we can keep the advanced language features but ditch the poor performance? This is the goal of the Blitz++ project: to develop techniques which will enable C++ to rival — and in some cases even exceed — the speed of Fortran for numerical computing, while preserving an object-oriented interface. The Blitz++ Numerical Library is being constructed as a testbed for these techniques.
    Recent benchmarks show C++ encroaching steadily on Fortran’s high-performance monopoly, and for some benchmarks, C++ is even faster than Fortran! These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time.
    Another goal of Blitz++ is to extend the conventional dense array model to incorporate new and useful features. Some examples of such extensions are flexible storage formats, tensor notation and index placeholders. “””

    Reply
  56. los alamos

    comments about stl efficiency are a little too hasty, imho.
    have you seen blitz++?
    http://www.oonumerics.org/blitz/
    also see http://www.oonumerics.org and http://acts.nersc.gov/pooma/
    “””Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. It uses template techniques to achieve high performance. The current versions provide dense arrays and vectors, random number generators, and small vectors and matrices.”””
    “”” Goals of the Blitz++ library
    The C++ programming language offers many features useful for tackling complex scientific computing problems: inheritance, polymorphism, generic programming, and operator overloading are some of the most important. Unfortunately, these advanced features came with a hefty performance pricetag: until recently, C++ lagged behind Fortran’s performance by anywhere from 20% to a factor of ten. As a result, the adoption of C++ for scientific computing has been slow.
    Is there a way to soup up C++ so that we can keep the advanced language features but ditch the poor performance? This is the goal of the Blitz++ project: to develop techniques which will enable C++ to rival — and in some cases even exceed — the speed of Fortran for numerical computing, while preserving an object-oriented interface. The Blitz++ Numerical Library is being constructed as a testbed for these techniques.
    Recent benchmarks show C++ encroaching steadily on Fortran’s high-performance monopoly, and for some benchmarks, C++ is even faster than Fortran! These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time.
    Another goal of Blitz++ is to extend the conventional dense array model to incorporate new and useful features. Some examples of such extensions are flexible storage formats, tensor notation and index placeholders. “””

    Reply
  57. los alamos

    comments about stl efficiency are a little too hasty, imho.
    have you seen blitz++?
    http://www.oonumerics.org/blitz/
    also see http://www.oonumerics.org and http://acts.nersc.gov/pooma/
    “””Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. It uses template techniques to achieve high performance. The current versions provide dense arrays and vectors, random number generators, and small vectors and matrices.”””
    “”” Goals of the Blitz++ library
    The C++ programming language offers many features useful for tackling complex scientific computing problems: inheritance, polymorphism, generic programming, and operator overloading are some of the most important. Unfortunately, these advanced features came with a hefty performance pricetag: until recently, C++ lagged behind Fortran’s performance by anywhere from 20% to a factor of ten. As a result, the adoption of C++ for scientific computing has been slow.
    Is there a way to soup up C++ so that we can keep the advanced language features but ditch the poor performance? This is the goal of the Blitz++ project: to develop techniques which will enable C++ to rival — and in some cases even exceed — the speed of Fortran for numerical computing, while preserving an object-oriented interface. The Blitz++ Numerical Library is being constructed as a testbed for these techniques.
    Recent benchmarks show C++ encroaching steadily on Fortran’s high-performance monopoly, and for some benchmarks, C++ is even faster than Fortran! These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time.
    Another goal of Blitz++ is to extend the conventional dense array model to incorporate new and useful features. Some examples of such extensions are flexible storage formats, tensor notation and index placeholders. “””

    Reply
  58. KeithB

    MarkCC wrote:
    “The example that I mentioned with OCaml is a good one. The original code had two separate loops; OCaml merged them and then extracted local constants.
    Could I have done the same thing by hand in the C code? Yes. But it would have been a stupid thing to do. The code was ultimately going to be used as a part of a large complicated system. The algorithm was based on taking two passes over the input arrays. The results of the first pass are used as inputs to the second pass. Semantically, the two are *really* very different. ”
    Is their any reason why this is unique to OCaml? Couldn’t *any* compiler do the same thing?
    If this thread keeps going are you going to make C/C++ Friday’s Pathological Language?
    8^)

    Reply
  59. Mark C. Chu-Carroll

    KeithB:
    Lots of compilers *could* do something like what the OCaml did; the key is whether or not it can determine enough information about the semantics of the program to be certain that the transformation is valid.
    Most functional language compilers would do the equivalent transformation. I’m pretty sure that most of the really good CommonLisp compilers would; Sisal *definitely* did that kind of transformation (Sisal was designed specifically to make transformations like that easy); SML/NJ *might*.
    I know that some Java JITs do that kind of transformation on a dynamic basis; the information to ensure its validity isn’t there in a form that can be easily detected statically in Java, but the JIT gets to use dynamic information as well, and most JITs can recognize a loop as a hot spot in the code, analyze and generate specialized code with a very fast dynamic independence check.
    Pretty much *every* high performance fortran compiler would do the equivalent transformation; most Fortran code would be tractable to recognize the validity.
    I won’t go so far as to say that no C++ compiler does it, but it’s *very* hard to get enough static information about multidimensional arrays in C++ to be able to do something like that safely.

    Reply
  60. Mark C. Chu-Carroll

    And WRT the question about whether I’d make C++ the weeks pathological language… In general, I don’t consider C++ to be particularly pathological. Poorly designed? Yes. Pathological? No.
    (The reason I said “in general” is because the template system in C++ can get pretty damned pathological. The C++ template system is, I *think*, Turing complete. I know for certain that it’s at least primitive recursive. One of the tests that I saw a member of the C++ compiler team write was a program that made the compiler compute a factorial *at compile time*. Templates are very close to being a functional programming language of a particularly bizarre sort.)

    Reply
  61. Dan Vanderkam

    There’s nothing stopping the compiler from writing two versions of the function, a slow version for the aliased case and a fast, parallel version for the more common case. Or simpler, just throw a giant if statement around the whole function:
    if (overlap(x,y)) {
    // ... slow ...
    } else {
    //... fast, parallel ...
    }

    If there are 20,000*20,000 iterations, then this optimization would be well worth it.

    Reply
  62. JJS

    Erwin Unruh showed in 1994 (in connection with the ANSI C++ commitee) that you could use C++ templates to compute an arbitrary number of primes at compile time, and output them as compiler error messages.
    More recently, Todd Veldhuizen produced something more direct, showing how to embed Turing machines as template instantiations.
    As to your comment about templates being close to a bizarre functional programming language, it only takes a couple of lines of template code to produce a template providing the typical functional way of calculating factorials — at compile time.

    Reply
  63. Mark C. Chu-Carroll

    JJS:
    Thanks for the pointer. I didn’t know about Veldhuizen; his paper is a piece of diabolical genius. For anyone who’s interested, here’s the URL for a paper showing how to build a Turing machine using C++ templates, so that the compiler needs to run the TM simulation to compile the program!
    http://osl.iu.edu/~tveldhui/papers/2003/turing.pdf
    Brilliant!

    Reply
  64. Pseudonym

    My pet peeve is lumping C and C++ into the same basket. C++ is at its most horrid when it’s programmed like you would program in C.
    In the case of numeric programming, for example, C++ has some very nice BLAS bindings these days, like Boost.uBLAS. And unlike C or Fortran, they’re even nice to use, because you can use the addition and multiplication operators instead of having to remember what SAXPY means.
    Incidentally, try writing the LCSS algorithm in Haskell some time. Dynamic programming in Haskell is really, REALLY elegant. You set up a data structure full of unevaluated thunks which depend on other thunks, then just evaluate the element you want. Lazy evaluation takes care of the rest.
    You can even do this with infinite data structures.

    Reply
  65. Mark VandeWettering

    I must admit, I’ve been programming professionally for over twenty years now, and most of that in C, although I’ve done a fair amount of Common Lisp (none lately), C++, Python and a smattering of other languages. I’ve now settled into a fairly consistent mix of C and Python.
    While I understand the point that you are making quite clearly, I think that you exaggerate the degree to which it really matters. Even in your example, C was only about a factor of two off the time for compiled OCaml, whereas both C and OCaml were hundreds of times faster than interpreted Java and Python. While there are certainly downsides to using C (and even more in using C++ if you ask me), C compilers (courtesy of gcc) are virtually ubiquitous, and compilers like icc from Intel do even better at code optimization.
    As programmers, it is part of our duty to pick appropriate tools for appropriate jobs. It simply isn’t as idiotic to pick C for numerical codes as you make it out to be, and other considerations might actually make it a reasonable choice.

    Reply
  66. Mark C. Chu-Carroll

    Pseudonym:
    In Fortran-98, it does array arithmetic using the arithmetic operators. F’98 also has operator overloading, etc. It’s got pretty much everything that’s good about C++ combined with everything that was *already* good about Fortran. It’s actually a surprisingly nice language.
    I would actually love to try something like LCS in Haskell. I’ve really been wanting to do some more coding in Haskell; of the languages I know, it’s the one that I’m fascinated by that I’ve gotten to use the least. LCS is the kind of thing that would be great to try, because it’s something that you’d write *so* differently in Haskell compared to the other languages I use frequently.

    Reply
  67. Ravi

    C, C++ may be fast, but they are difficult to code large programs in. Don’t bring all that REAL PROGRAMMERS USE “X” LANGUAGE shit.
    I am a selfish programmer. Just as everybody wants to have a shiny OS or a shiny app that lets them do more, I like a shiny high level language, that lets me do more in the time I can spend. If it does not run fast enough, dude …. , it will, sometime in the future. And guess what, my code would be far more readable than the one written in C/C++ then. You see, the all your futures are belong to high-level languages :D.

    Reply
  68. jayinbmore

    Mark,
    Not to get too cute, but your own numbers demonstrate that there’s a general purpose C program out there which outperforms your hand tuned C program in that particular task. It’s called the OCaml interpreter.

    Reply
  69. Przemek Klosowski

    Some people here pooh-poohed specialized languages like Octave/Matlab, which is really . If the main computation in the code is doing some special-purpose operation, like linear algebra along the lines of BLAS, it is naive to expect to do better than the tailored low-level code that the special-purpose language will dive into. There is an amazing amount of work put by the likes of Kazuhiro Goto or
    the ATLAS group into optimizing their BLAS speed. They took care of esoteric things like minimizing cache and TLB misses, which I fully expect to be botched by mere mortals in their attempts to reimplement it in C/C++/Fortran.
    A case in point: Octave, which is a Matlab functional equivalent, is not known for top interpreter speed, When doing straight numerics on large arrays, however, it smokes, simply because it just jumps into optimized low-level BLAS or FFTW or other library code. Even if one used such low level libs from C/C++ rather than rewriting them, one would still have to consistently use it in a best possible way.
    This discussion reminds me of another debate going on in the embedded computing/microcontroller area, on relative merits of assembler vs. HLL programming. It is true that top assembler coders beat compilers on selected pieces of code: they spend effort on finding non-obvious code tricks and shortcuts, etc. etc. At the same time, the compilers consistently plow along, generating better code than an average coder, and even beating top assembly programmers on bodies of code that are too large for humans to comprehend in detail required for superoptimization.
    As a bonus assignment, look up the GNU superoptimizer: a program that combinatorially goes through finite sequences
    of machine instructions, trying to find shorter expressions of simple algorithms–exactly the kind of magic that top
    assembly programmers were famous for, except that the superoptimizer never gets tired or distracted.

    Reply
  70. tomvandijck

    “This is totally random, but I’d like to point out that there was a game for Playstation 2 (can’t remember the title, but it was one of the major blockbasters of the time) that was written *entirely* in Scheme, but it was some sort of compiled version of Scheme that actually produced assembler code for PS2.”
    If I recall correctly that was Ratchet and Clank.
    But that had not much to do with performance, although they did a very good job on the compiler.. I think this was chosen mostly because it was easier to define gameplay constructs in that language.
    In that same area I did a full 1.5 compliant Java Virtual machine for the PS2 and Xbox, just to make it easier to write games, without having to relink the game.. giving gameplay programmers a faster turnaround time, and a better tool to debug their games. In the final product we would just ‘pre-jit’ all code into a binairy form anyway.

    Reply
  71. Charlie

    “This is totally random, but I’d like to point out that there was a game for Playstation 2 (can’t remember the title, but it was one of the major blockbasters of the time) that was written *entirely* in Scheme, but it was some sort of compiled version of Scheme that actually produced assembler code for PS2.”
    You’re probably thinking of Jak & Daxter, which was written in a Lisp-like (the syntax was Lispy, and it had macros, but it didn’t have closures or garbage collection) called GOAL.
    Naughty Dog got great performance from GOAL for a couple reasons: one, it offered every assembly instruction as an intrinsic, with very nice Lispy syntax. (ADD R1,R2,R3 becomes (add r1 r2 r3).) So there were great swathes of code written in what amounted to assembly-language-with-variables. Second, it had runtime code replacement – you could make a change and see the results in about half a second.
    That meant that the graphics guys could tune the code *very* finely. They’d pull up a bunch of performance metrics, and fiddle with the game while it was running. (“How about if I swap these two instructions – faster or slower? OK, how about these two…”) This is, of course, only reasonable because the code ran on the Playstation 2, and *no other* hardware. Every processor it would ever run on would be identical.
    There’s also a secondary argument that the productivity advantages of Lisp/Scheme macros meant that the programmers had more time to optimize performance-critical parts of the code.

    Reply
  72. Charlie

    “If I recall correctly that was Ratchet and Clank.”
    Ratchet and Clank was written in C++, although they had access to the Jak & Daxter codebase in GOAL. (Insomniac and Naughty Dog, the two companies in question, used to be in the same building and still have a close relationship.) I don’t know to what extent they used the Jak/Dax assembly in their engine.

    Reply
  73. T C

    Excellent article. But it’s wrong to call the original statement a “fallacy”. It’s a “falsehood”. Those two things are not the same! 🙂

    Reply
  74. jd

    By far, the most important factor in any algorithm’s performance for any language is the skill of the programmer implementing it. The primary reason people say C is always going to be the fastest is that almost every other “higher-level” language is implemented in C. Ergo, if language X can do something in Tx time, and X is implemented in C, then C can do the same task in Tx time. It is pretty hard to get around that logic. All you can do is look for another language that compiles directly to assembly. Then, the same argument about the speed of assembly applies. True, you may need an expert in C or Assembly to get the best speed possible. But given that all code is eventually implemented in (probably) C and (definitely) Assembly, we must conclude that Assembly is, indeed, the fastest language. The reason there are other language is because we don’t all want to have to be experts in C and/or Assembly. But if you want optimum speed, with no other consideration, the only logical choice is C and/or Assembly coded by an expert.
    PS – Take another look at the earlier comment that says, in effect, you are fastest in your favorite language. That person knows what they are talking about. I have similar anecdotal evidence myself for this. A few years ago, some Java person claimed that Java was, in fact the fastest language and posted the code. Of course, the C++ code was awful. I re-wrote the C++ using modern templates and easily beat the Java code. Not only that, the C++ code in question also beat the C code (written by the Java guy). This particular example was complex matrix math. The C code used lots of pointers. My C++ code used the STL and looked, on the surface, relatively inefficient. However, the C++ compiler is made for C++ after all, and it made my code very, very fast. Again, not much evidence for my example, but just as much evidence as you have provided:)

    Reply
  75. penguinpusher

    Umm before claiming C++ sucks rocks for numerics you should really find out how people use it for numerics. In particular expression templates which solve the exact problem you complaining about.
    You may find this enlightening too http://gamearchitect.net/Articles/WhyC++.html a little quote is “It’s easy to hire people who call themselves C++ programmers, but engineers who understand both the low-level details of the C++ object model and the compile-time power of C++ template metaprogramming are hard to come by. And C++, in the hands of an programmer without a good grasp of the language, is a dangerous thing.”

    Reply
  76. Anonym

    Very interesting discussion. On the C++ side you can take a look at Blitz++. On the Python side there is Psyco, that if used well can reduce your running time to 1 minute or less. Shedskin is a Python compiler that often can reach C++ speeds.

    Reply
  77. Christopher Granade

    I would prefer specialized languages like MATLAB were it not for some of the coding style things which give me grief; specifically, the lack of explicit variable typing and declaration annoys me in MATLAB. There’s no doubt that the parallelizing and vectorizing features of MATLAB are superb, but C/C++ could relatively easily be extended to support that, thus giving both a more structured programming approach and the absurdly powerful vectorizing techniques of MATLAB. To some degree, GPU-only languages like Brook and Cg do this to accommodate the massively parallel structure of modern GPUs. With hyper-threading, multi-core processing and vector registers, I see it being only a matter of time before languages like C/C++ are forced to either adapt to vectorization or fall by the wayside.

    Reply
  78. Lambda Llama

    Interesting that you mention Lisp. Usually, I find that Lisp is about 2/3 as fast as C, but is well worth using for scientific programming because 1) fewer lines are needed 2) the selectable safety level lets you catch array and other bugs 3) it is interactive, bypassing much of the tedious edit-compile-run-debug cycle, and allowing easier subunit testing 4) it simplifies much of the high-level data bookkeeping (lists, hashes) that consumes much of ones time. Absolute CPU time is not the only consideration when discussing speed. The real question is “how long does it take you to get the right answer?”

    Reply
  79. Ben St. John

    Excellent comments in general. Mark, I’ll agree with your general point that aliasing is a problem (but not an insurmountable one) in C/C++, and that sometimes other languages can do a better job. However, I think for many performance-intensive tasks, C/C++ is the best offering out there, when you balance a number of factors out. (Speed of naive implementation, ease of finding programmings, portability, interoperability, maturity of tool set, optimization possibilities (including in-line assembly))
    Obviously Fortran is also a contender, although my impression is that it’s more that its primary benefits are the existing libraries, and arrays.
    A few minor points though: there is a significant difference between
    int x[100]
    and
    int * x;
    just take the ‘sizeof’ both, for example; one is 400 (usually!) the other, 4. So the compiler has some idea what’s going on. As others have noted with keywords, annotations, and compiler flags, you can generally get to where you want. You can even (via pragmas) often do nice things like cache pre-fetches, or cache-line alignment.
    Regarding how complex writing assembly is, I would say your argument really only applies to Intel x86. Maybe AMD too, I’m not sure. For other systems (I’ve used MIPS R3000, 68k, and TI DSPs) hand-coded assembly was consistently better. It’s partly dependent on how complex the chip is, partly on how mature the compiler/tool chain is. Intel x86 is extreme by both measures.
    Finally, regarding GOAL, (1) it was only for the animation system of Jak & Daxter, (2) it had inline assembly, (3) after Naughty Dog was bought, in the next version of the game the animation system was re-written *without* GOAL, although apparently more because of infrastructure/programmer issues than language issues. So it’s not really a poster child for non-C efficient coding. The main advantage they cited was being able to change things on-the-fly. The wikipedia article is not the best, but has good links.
    I’m not against other languages — C++ is beginning to feel like a monster to me too. But I still think it’s the best tool for many tasks.

    Reply
  80. AndyC

    An oft overlooked point in this area is that the people writing all this numerically intensive code are not necessarily hardcore programmers. It’s often scientists and grad students who just want to get the right results as quickly as possible. Expecting them to hand tune C/C++ or make the best use of compiler switches is somewhat unrealistic. Any language which allows the naive case to produce something closer to the optimum code has to be a good thing.

    Reply
  81. Wouter Lievens

    Not really. Numerically intensive code isn’t just academic. Just think about the chips inside your digital video recorder.

    Reply
  82. zhaphod

    I found the following very interesting link while doing some research on how to learn OCaml {because some of the comments mentioned that it kicks ass when it comes to performance}. The link provides the bench marking results for various languages.
    http://shootout.alioth.debian.org/gp4/
    I found that C {gcc implementation is invariably faster than all other languages}
    Cheers

    Reply
  83. Jake Ham

    This post has got way out of wack. I don’t know if people aren’t reading his post, or understand his points. People become defensive if you start taking jabs at their favorite language.
    The point of the post is that there are alternatives! Sometimes C/C++ will be faster, and sometimes it won’t, but if you don’t ever try your application in a different language you will never find out.

    Reply
  84. Blake Stacey

    All technical issues aside, I think MarkCC should be proud of this one: ninety comments and a place on the “most e-mailed” sidebar! And all without mentioning string theory or Richard Dawkins.

    Reply
  85. zhaphod

    Follow up to link on shoot out at debian {http://shootout.alioth.debian.org/gp4/}:
    I should say, according to the shootout at debian, C {gcc} is faster than many but not all. My mistake.

    Reply
  86. WarEagleBob

    While interesting, your post is comparing apples and oranges by saying a functional language is better than a proceedural language for your LCS problem. If you want to bitch about a particular language compared to another, you should do it fairly. Show a what languages are best at different classes of problems.

    Reply
  87. Flaky

    One of the main uses of C and C++ is system programming as they are seen as being fairly low-level. It certainly is the case, that the ability to directly manipulate pointers is necessary for interacting with the hardware. As for the other features of those languages, I see no particular reason, why a language with higher-level features and a more orthogonal feature set couldn’t be used for the same purpose. (BTW, Lack of orthogonality is my main grievance with C++. C++ is not a complex language because of its abundance of features, but because its features are interdependent in ways that requires you to be constantly aware of several features that might interplay in any piece of code you are currently writing.)
    Is someone here aware of any language suitable for low-level system programming that would be aligned more to the functional paradigm?
    Such a language could have function closures, (mostly) immutable values (save for pointers and references of course), optional GC (with static+stack allocation as an alternative (requiring some way for the compiler to compute the maximum memory usage, of course)), pattern matching, and all of this (and more) packaged with strict typing for ease of development and optimisation by the compiler.

    Reply
  88. Mark C. Chu-Carroll

    WarEagleBob:

    While interesting, your post is comparing apples and oranges by saying a functional language is better than a proceedural language for your LCS problem. If you want to bitch about a particular language compared to another, you should do it fairly. Show a what languages are best at different classes of problems.

    No, that’s not what this post is saying.
    My main point was that C/C++ is *not* always the best, fastest language. The example was just that – an example, which demonstrated a computationally intensive numerical array-based algorithm for which I happened to have performance numbers for carefully implemented versions in several different languages. It’s the *same algorithm* in each of the languages; and the C++ version was the original(at the time, C++ is the language that I was most comfortable with); the implementations in the other languages were as close to the C++ version as I could reasonably make them.
    The OCaml variant was *not* particularly functional. The functional nature of OCaml had *nothing* to do with the way that it performed better. It’s an array iteration algorithm, which is something that all of the languages happen similarly; and OCaml arrays are *not* functional; array slots are fully assignable just like in any of the other languages I tested.
    OCaml just happened to have an incredible compiler and runtime which were able to infer and make use of static
    properties of the code to optimize it and run it without much overhead.
    Later versions of my system were written in Java, and the Java performance got to the point of matching C, but it never quite made it to OCaml level. But with Java, there was a huge startup cost (which didn’t really hurt us, since our system was a long-running server; 5 seconds at startup is fine for a system that’s going be invoked 1000 times spread over a couple of days.). With OCaml, I got peak performance for the algorithm, and with low overhead. It was *really* impressive.
    I’m not trying to make the argument that everyone should rush out and start writing all of their scientific/numerical code in OCaml; it certainly wouldn’t be *my* first choice. (I’d probably pick one of the better CommonLisp’s out there; SBCL seems pretty damned fast.) In particular, as someone pointed out earlier in the thread, the OCaml system doesn’t take advantage of multicore processor, which really sucks for serious scientific computation. My whole argument is that the claim that “C/C++ is the fastest, and always will be” is a stupid claim.

    Reply
  89. Mark C. Chu-Carroll

    Flaky:
    I don’t know of any functional languages that I would really consider a good choice for really low-level programming. The closest thing to really functional that I’ve seen used for down-to-the-metal programming of things like operating systems and device drivers is a Lisp variant.

    Reply
  90. Michael Chermside

    Thank you… wonderful article.
    I am slowly coming around to the belief that higher-level languages will eventually be FASTER than low-level languages. The trouble with low-level languages is that you’re allowing the programmer to specify some of what the compiler does for the high-level language. As time goes by, compilers get smarter (and programmers get dumber1) so eventually the higher-level language will win. As you point out, the increasing difficulty of optimization just accelerates the trend, and the move toward fancy tricks like JIT is really putting on the pressure.
    —-
    1 – “as time goes by, programmers get dumber”: I know that *I* do anyhow.2
    2 – At least I can say that there’s more I *don’t* know about programming than there was when I was a beginner.3
    3 – Curiously, this makes me a better programmer.

    Reply
  91. Xanthir, FCD

    Taking the tangent about Naughty Dog further…
    I actually read an article by these guys. It was, in fact, Crash Bandicoot that they used Lisp on. Basically, they were *really* pushing the limits with that game in terms of what the PS1 could handle, and so optimizing it became too complex for a human. Instead, they wrote an AI controller that rewrote code, rearranged data organization on the disc, and optimized memory calls so that they could just worry about writing the game instead of tweaking it endlessly to get it to perform well with seamless loads and such.
    Apparently, they leaned on it even more with the later Crash games. They were all pretty good, so I’m impressed.

    Reply
  92. Memet

    I’m going to just add something on the whole billb & Mark thread of conversation. For everyone’s sake I’ll be succinct.
    Foreword: I like C++. I use it often. But I’m not a fanatic either. Also use LISP, VB, js…
    Re C++: sure you can’t prove correctness etc in C++ because there are huge loopholes which can be exploited in the language (like reinterpret_casts). But CC++ *is* close to the metal, and so what certain compilers may lack in brightness, the CC++ programmer can make up for in ingenious ways.
    Does this mean C++ is faster “out of the box”? No. But then again, I can write you a program that will make a Java app leak memory.
    So should we use C++ in enterprise scale applications because it’s just inherently faster? Probably not. Does this mean we shouldn’t use it in enterprise level applications? Not really either: unlike what most people think, the biggest thing C++ has got going for itself is not speed, it’s templates. The kind of templates that only LISP based languages trump. Java don’t got templates, and I don’t think OO programming languages like OCaml do either (but maybe I’m mistaken). And I’m talking advanced templates (cf. ATL, STL).
    And final word: C and C++ are as far apart as 286 and 386. C is basically a high level assembler. C++ is a very rich language with incredible compile time mechanisms (such as templates) hidden right under the surface.

    Reply
  93. Xanthir, FCD

    The kind of templates that only LISP based languages trump. Java don’t got templates, and I don’t think OO programming languages like OCaml do either (but maybe I’m mistaken). And I’m talking advanced templates (cf. ATL, STL).
    And final word: C and C++ are as far apart as 286 and 386. C is basically a high level assembler. C++ is a very rich language with incredible compile time mechanisms (such as templates) hidden right under the surface.

    Must… resist… urge to pimp Lisp…

    Reply
  94. dave

    It seems to me that the real fallacy that Mark is addressing is the assumption that being closer to the machine makes code run faster. If what you need done is fiddling bits and chasing pointers in particular ways, then a language like C or C++ is going to give you fast code because it lets you say exactly what you mean (and your C code will probably be almost as fast as, and (if you do it right) far more portable than, writing it directly in assembly).
    The problem with the “C is fast because it’s close to the machine” argument is that there is a large (and growing) class of problems with solutions that are made easier to optimize by moving *farther away* from the machine, into a higher-level language that gives the compiler more information and/or lets it make more assumptions about what’s really going on, and lets the optimizer work its magic based on that instead of making the programmer do all the work. Lower-level languages can sometimes make some of this up with (specialized, usually non-portable, often hard to get right) hints to the compiler, but they’re starting so far behind that it’s usually not worth the effort.

    Reply
  95. Ron Avitzur

    While on the topic of efficency, let me emphasize a concern, though orthogonal to this discussion, which is sometimes overlooked at high cost: with main memory access speeds lagging ever behind CPU speeds, the difference between a calculation which fits entirely in on-chip cache and one which saturates the main memory bus may incur more than 100x performance penalty on the “same” algorithm and code. The typical example is to benchmark a calculation with a nested loop over an array to see the performance difference of changing the order of nesting. One of the reasons matrix libraries are so fast is that they use block algorithms whenever possible to operate of portions which fit in the on-chip cache. This issue becomes more complicated on multiprocessor systems where each CPU core has its own cache, possibly shared with other cores on the same chip, maybe not. This is both a low-level detail of the system architecture, but also directly influences the design of high level algorithms to maximize locality of reference.

    Reply
  96. Xerxes1729

    As a mathematician with limited programming skills, I actually like Mathematica quite a bit. I know it’s not the most efficient system out there, but I find that it’s easy to use for solving mathematical problems. Also. it’s awfully nice to have a huge library of built-in functions and tools for things like solving NDEs. If you’re not doing applied math, Mathematica is much more useful than MATLAB because of it’s ability to handle symbolic computations.

    Reply
  97. Gaurav

    Bilb:
    You said SMP isn’t enough parallelism. Could you expand on this a little bit?
    Also, isn’t this something towards which chip vendors are making a push with their multi-core stuff? Any takes on that?
    Thanks.

    Reply
  98. AndyS

    Boy! I sure enjoy a passionate programming language discussion. For some people _any_ criticism of their favorite language is like telling them their child is ugly.
    (This is a great thread. I learned a lot. Thanks to everyone.)

    Reply
  99. Alexei K

    @AndyS: the problem is that this article isn’t _any_ criticism, but a fundamental acusation of utter failure of a language in an entire field of algorithms. Which is nonsense, and rests on very few very weak examples. Such argument only needs few examples of other languages failing to other flaws to counteract, and that has been done (link to that benchmark site).
    This article started out as a rant and should’ve been left as such, but the fact that Mark tried to defend it just made him look exactly like Granville Sewell whome Mark criticized last month in his post.

    Reply
  100. Xanthir, FCD

    Alexei:
    Did you *read* the post? Mostly, Mark just criticized the other blog’s assertion that C is the fastest language and will always be. He held that up with some examples of where other languages can perform better than C (taken from a period when he was programming extensively *in* C).
    He goes further when he says that C sucks for numerical computing, and that can be challenged. But he admitted that he may have gone too far there. His main point, though, still stands – any assertion that “C is always going to be the fastest language” is false, and will get more and more false as time goes on. Playing close to the machine is simply too difficult on modern, single-core machines. Multi-cores will make the whole thing even more complex, and the situation can only continue to deteriorate.

    Reply
  101. Mark C. Chu-Carroll

    Marius:
    No, sorry, the classic algorithm for LCS is n3. I’m not sure if that’s a proven lower bound, but the major algorithms for computing it are all variations on dynamic programming, and have an upper bound of O(n3).
    The confusion is probably because “subsequence” doesn’t quite mean what most people think seeing the term.
    Suppose we’ve got two sequences:
    [1, 3, 6, 9, 10, 11, 12, 14, 18, 19, 21, 22, 23]
    and
    [2, 4, 6, 8, 10, 11, 12, 14, 16, 18, 20, 22]
    Most people’s intuition on hearing the term “longest common subsequence” is that the LCS of those two would be [10, 11, 12, 14]. But in fact, the LCS is [6, 10, 11, 12, 14, 18, 22]. A subsequence is an arbitrary subset of the values in a sequence, which occur in the same order in the subsequence as they do in the original sequence.
    If the elements of the list always appear in sorted order, like those two lists above, then it’s easy. But the classic LCS problem allows arbitrary re-orderings. So, for example,
    taking the LCS of [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and
    [2, 6, 3, 4, 8, 7, 10, 9]. The answer is that there are two equally long longest subsequences: [2, 3, 4, 8, 9] and [2, 3, 4, 8, 10].
    Now, think about computing that on a two arrays of 10,000 values. (Stellation was an SCM system which attempted to be smarter than typical line-based systems; so a source file that was 1000 lines long could easily balloon into a list of 50,000 tokens; and we wanted the LCS of the two file versions. We reduced the tokens into numeric hashcodes for computing the LCS. If the idea had come out looking promising, we would probably have taken something like Wegman’s perfect hash function generator to prevent collisions, but it never got that far; the sub-line based diff computation is just too expensive in terms of memory to be tractable. Just think about two arrays of 10,000 elements each, and an algorithm which uses *on average* n lg n space, but can use up to n2 space. If if you only use one byte per compared line-pair, that means that you could wind up eating 100 megabytes on the comparison table! And every trick we could come up for partitioning, merging, etc., to reduce the comparison table size couldn’t make word/token based diff computation manageable in general.)

    Reply
  102. Mark C. Chu-Carroll

    George:
    (1) OCaml is implemented in OCaml. It’s got its own native code generator. The only C code is interfaces to system routines. Ocaml came about because of a brilliant coder (I think it was Xavier Leroix) who was working with the CaML group at INRIA, and thought the regular CaML compiler was too big, so he implemented a small one, which he called Caml-Light. CaML light was implemented in the subset of CaML that it was designed to compile, and was bootstrapped. Since then, it’s been continually developed using itself. CaML light eventually replaced the old heavyweight CaML compiler, and Xavier et al extended it until it became the OCaml of today. Damn shame that they don’t seem to be doing anything to take advantage of the parallelism in modern systems.
    (2) Even if Ocaml were implemented in C, it doesn’t really change my point. Suppose you have a task, T, and you’ve got a choice between two languages, language A and language B. In language A, you can write an algorithm in a clear and simple form which the compiler can optimize into an incredibly efficient executable; and in language B, you can get the same speed, but you need to go through all sorts of contortions to try to force the compiler to ultimately generate the fast code. My argument is that for the problem domain of T, A is well-suited to the task, and B is not.

    Reply
  103. billb

    Gaurav: My work involves distributed memory parallelism (clusters and other supercomputers). There are some PGAS language extensions out there (UPC, Co-array Fortan, Fortess, X10, etc.) that give some old languages the ability to have a global address space across the memory on these machines, but they don’t take advantage of the inherent parallelism that the languages mentioned above might. That’s what I meant by SMP not being enough.

    Reply
  104. Nicky Mclean

    It wasn’t until I encountered C and its derivatives that PL/I was raised in my estimation, which took some doing. Avoiding a rant at C’s inane syntax, I would merely point out that a simple mistype (adding, omitting or mistyping a single symbol) easily produces a syntactically correct source file. If say thirty minutes a week are lost in tracking down one such mistake, will it likely ever be recouped in any supposed faster-running of a C-source code file? Why else have one programmer typing and a second one watching for miscodes and mistypes?
    It is ridiculous to assert that some programming language is always the best for all possible purposes. Even assembler, delivering the maximum possible performance (for the specific task on its specific system), the only possible candidate for this acclaim, is unsuitable because the programming costs are so high as to make its use unproductive except in very special cases. Why else do we use higher-level languages? Even if C is an elaborated assembler. And yes, I have written a multi-precision calculation of e, and pi, in assembler and not.
    Fortran was designed from the start for number crunching, and it excels at this function. Its first implementation also considered complex array indexing optimisation, too error-prone to be done by hand when writing assembler. This was to counter one criticism from assembler programmers. Computer science types tend to mess about with text and complex data structures; they create languages for doing this expediently. Number crunchers from the science community want many large arrays and massive arithmetic, as found in Fortran. Just look at the silly syntax for arrays in C, and the obsolete idea of fixed lower bounds for arrays.
    When dealing with statements such as X = A + B (all being arrays) there is no excuse for the compiler not generating good code for this, as distinct from having to write something like
    for i:=1:N do
    for j:=1:M do
    x(i,j):=A(i,j) + B(i,j);
    next j;
    next i;
    (I’m not going to besmirch the page with C’s redundant gibberish) which is not just more tedious and error-prone to type (array bounds? transposed index variables? plain mistypes with so much to type?) but also presumes that accessing the arrays rowwise (varying the second index to run along the columns of row i) is better than column-wise. With the statement of intent, X = A + B, the compiler would be free to implement it in any manner expedient. Rowwise, columnwise, in chunks of eight (or 16, or…) at a time: whatever, similarly the array indexing and iteration control would be its own. Possibly a comprehensive optimisation analysis would convert the explicit loops to such code, and probably, not.
    Comparison between computer languages is made more difficult by a modern trend, the provision of a compiler for language X by in fact providing a transliteration from X to C, and then compiling the C stuff that results. It seems unlikely that features not found in C (amazement! puzzlement!! incomprehension!!!) would result in code that improved on that which would be produced by a compiler explicitly recognising them and their meaning, rather than just their transliteration.

    Reply
  105. Brooks Moses

    One minor correction: There isn’t any such thing as “Fortran 98”. I presume you’re meaning to refer to Fortran 95?
    (Well, I suppose it’s possible that you’re implying Fortran 95 plus the little bit of Fortran 2003 that allows allocable object components, which is a pretty common “standard” to write to, and could humorously be called “Fortran 98” or somesuch — though I’ve never seen anyone do so!)

    Reply
  106. Olivier

    dileffante:
    Nothing to do with the main subject, just precisions about
    Managed C++: this language is able to compile targetting .NET CLR (Microsoft VM) and to do so benefits from new syntax. But it is also fully (or 99%) compatible with previous versions of MS C++. So, Visual C++ Express should compile (to native code) your old VC++ 6 programs. The projects should even be opened and converted by the new IDE. MFC and ATL are supported.
    Well, my experience is based on using the commercial version (Visual Studio 2005). But I suppose this should also be true for VC++ Express. It’s definitely not an obligation to convert your code to the managed flavour of C++. And the resulting binary will be pure native Windows code.
    In fact this new language (MC++) is able to mix in a single binary native and managed code…

    Reply
  107. JY

    Mark,
    I think it’s the case that the calculation of the length of the LCS is O(n2), but the calculation of the LCS itself is O(n3). In any event, here is the alg. for the length in Haskell, which is clearly O(n2).

    import Array

    calccell a0 a1 mat (row,col) =
      let (_, (rmax,cmax)) = bounds mat in
        if row == rmax || col == cmax then mat //[((row, col),0)]
        else
          if (a0!row) == (a1!col) then mat //[((row, col),(1 + (mat ! (row+1,col+1) ) ) )]
          else mat // [((row,col),(max (mat!(row+1,col)) (mat!(row,col+1))))]

    lcs as bs =
      let rows = length as
        cols = length bs
        aArr = listArray (0, rows) as
        bArr = listArray (0, cols) bs
        mat = array ((0,0),(rows,cols)) [] in
          (foldl (calccell aArr bArr) mat (reverse $ indices mat)) ! (0,0)

    Reply
  108. Mark C. Chu-Carroll

    JY:
    That could be the source of the confusion. I’ve never looked for the *length* of the LCS, so I don’t know what the complexity is if you don’t try to actually generate the LCS, but only identify how long it is.

    Reply
  109. billb

    Just to add fuel to the fire (on what might be considered the “other side” from the position I originally adopted), FFTW (literally the Fastest Fourier Transform in the West, which is portable and usually compettitive with vendor-optimized DFTs) is written in ANSI C by an OCaml code generator! See http://www.fftw.org/faq/section2.html#languages
    🙂

    Reply
  110. Stephen

    I’ve been running a matrix multiply benchmark on many machines since about 1982. It is written in C, Perl, Fortran, Java, BASIC, Pascal, and APL. There are two versions in C. One uses pointers and the other uses arrays. The array version looks like this:

    static void
    matmult(void)
    {
    register int i, j;
    int k;
    for (i = 0; i
    The pointer version looks like this:

    static void
    matmult(void)
    {
    register ARTYP *ap, *bp; /* in order of importance */
    register int k;
    register int i;
    register int j;
    register ARTYP *cp, *aptmp;
    cp = &c[M-1][N-1];
    ap = &a[M-1][N-1];
    i = M;
    do {
    aptmp = ap;
    j = N;
    do {
    bp = &b[M-1][j-1];
    k = M;
    ap = aptmp;
    do {
    *cp += *ap * *bp;
    ap–;
    bp -= M;
    } while (–k);
    cp–;
    } while (–j);
    } while (–i);
    }

    In the 80’s, with most C compilers on most machines, the pointer version was faster. This started to change in the early 90’s. These days, the array version is faster almost all the time.
    The benchmarks were run with ARTYP set to “float” and also “double”. In the old days, “double” ran faster – sometimes twice as fast. This was due to the old compilers converting float to double before doing an operation, then converting back. These days, “float” is faster.
    In the 80’s, the do…while loop syntax was lower overhead than the “for” loop syntax. It compiled to a simple “subtract one and branch” instruction. These days it appears to make less difference.
    Now, these benchmarks have been run using whatever was handy. Seldom has a commercial compiler been handy. Most of the systems were running Linux, with gcc, f77, and maybe Java was downloaded from Sun, and/or gcj was used. Generally, the commercial compilers produce faster code, with the right options. Over time, C has been the winner on almost every machine. Fortran was second, but steadily improving with respect to C. The notable exception was a supercomputer, where Fortran was much faster.
    With C as the standard, Perl is about 1000x slower. Compiled Perl is about 200x slower. Java started out 300x slower, but has improved steadily to be only 3x slower. Fortran started out 2x slower, but has steadily migrated toward C’s speed.
    Now, alot of what I do is web CGI programming. Lots of short little things that put up a page and exit. For one application, written in Perl, single hits were taking some 7 seconds to complete. I decided to rewrite it in C. I achieved a 30x improvement – which was less speed up than expected. Further, most of the improvement came from superior algorithms. Constant time is much better than exponential. It turns out that things like opening files, going to databases, and just startup tend to dominate my work. C is good, but so is Perl. I’m much less happy with Java. In particular, the JVMs used where I work are huge. Perl and C have a much smaller memory footprint, and respond faster.
    Over time, assembly language has always been faster than C. Here’s how it works. You write the thing in C, look at the instructions generated, and goof around with it until you have some improvement. There’s always something you can do, so you always get something.

    Reply
  111. Rüdiger

    I’m not sure if it was already mentioned But you might consider using the c99 keyword restrict to tell the compiler that there is no aliasing

    Reply
  112. vlad

    Mark,
    I fully agree with your statement that today a human cannot write effective assembly and this is compiler who can produce really fast code. But I disagree that it’s impossible with c++. I’ll try to prove this (more or less) formally – someone already mentioned managed c++ here, so my point is that this generated code will in no way differ from a similar code generated from java or c#, so it’s either ms’s bytecode has a big design flaw – that it cannot be optimized for smp, or the opposite – which implies that it’s somehow possible to generate smp-optimizable c++ code. and in fact most compilers can optimize assuming no aliases…

    Reply
  113. Lars Petter Endresen

    Hello,
    Thank you for a fascinating discussion!
    Modern compilers allow intermixing of multiple levels of parallelism,
    taking advantage of whatever parallelism and locality is available:
    DLP, TLP or ILP. For example, both Intel Fortran 9.1 and Intel C++ 9.1
    automatically vectorizes (DLP) parallelizes (TLP) and unrolls (ILP)
    the matrix multiplication example neatly,
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #define SIZE 1024
    float a[SIZE][SIZE];
    float b[SIZE][SIZE];
    float c[SIZE][SIZE];
    void init_matrix()
    {
    int i, j;
    srand( (unsigned int)(time( NULL )) );
    for(i = 0; i < SIZE; i++)
    for(j = 0; j < SIZE; j++)
    b[i][j] = a[i][j] = (double)rand()/RAND_MAX;
    }
    void matmult()
    {
    int i, j, k;
    for(i = 0; i < SIZE; i++)
    for(j = 0; j < SIZE; j++)
    for(k = 0; k < SIZE; k++)
    c[i][j] += a[i][k] * b[k][j];
    }
    int main(void)
    {
    clock_t start, finish;
    init_matrix();
    start = clock();
    matmult();
    finish = clock();
    printf(“n%2.3f secondsn”, (double)(finish – start)/CLOCKS_PER_SEC);
    printf(“s=%fn”,c[0][0]); // Avoid dead code elimination
    }
    The below code gives the asm of the unrolled and vectorized multiplication loop –
    loop is threaded too but this is not shown here,
    xor ebx,ebx
    movss xmm0,dword ptr [esi+edi*4+827500h]
    xor ecx,ecx
    shufps xmm0,xmm0,0
    movaps xmm1,xmmword ptr [ebx+ecx*4+0C2F600h]
    movaps xmm2,xmmword ptr [ebx+ecx*4+0C2F610h]
    movaps xmm3,xmmword ptr [ebx+ecx*4+0C2F620h]
    movaps xmm4,xmmword ptr [ebx+ecx*4+0C2F630h]
    mulps xmm1,xmm0
    mulps xmm2,xmm0
    mulps xmm3,xmm0
    mulps xmm4,xmm0
    addps xmm1,xmmword ptr __pow10neg+160h (427400h)[edx+ecx*4]
    addps xmm2,xmmword ptr __pow10neg+170h (427410h)[edx+ecx*4]
    addps xmm3,xmmword ptr __pow10neg+180h (427420h)[edx+ecx*4]
    addps xmm4,xmmword ptr __pow10neg+190h (427430h)[edx+ecx*4]
    movaps xmmword ptr __pow10neg+160h (427400h)[edx+ecx*4],xmm1
    movaps xmmword ptr __pow10neg+170h (427410h)[edx+ecx*4],xmm2
    movaps xmmword ptr __pow10neg+180h (427420h)[edx+ecx*4],xmm3
    movaps xmmword ptr __pow10neg+190h (427430h)[edx+ecx*4],xmm4
    The present program takes about 0.7 second on my 1.66 GHz Intel Core 2 Duo CPU.
    By applying special compiler options like /O3 /Qipo /Qparallel the software
    engineer do not need to worry so much about performance; the matrix multiplication
    loop is automatically threaded, vectorized and unrolled.
    In my opinion Intel C++ may at times be as fast as Intel Fortran, but it is usually
    a little harder to gain the same performance in C++ as in Fortran. The reason may be
    the usage of pointers instead of fixed size arrays, and many other issues also.
    Thus I tend to agree with Mark that C++ is not always the best choice, for certain
    applications yes, but there are so many examples where Fortran or C is much better.
    BTW: Take a look at the C99 performance of Intel C++, using restrict and the
    right compiler settings may resolve many performance issues.
    Best Regards,
    Lars Petter Endresen

    Reply
  114. Lars Petter Endresen

    Hello,
    Today I have checked the matrix-matrix product on a Intel Core 2 Duo 2.0 GHz T7200 4MB L2 CPU, with 3 different compilers. Results are as follows:
    1. Intel C++ 9.1032
    0.375 seconds
    /O3 /QxT /Qparallel
    2. GCC 4.1.1
    7.5 seconds
    -O3 -march=prescott -mfpmath=sse -msse3 -ffast-math -funroll-loops
    3. Microsoft C++ 2005
    8.766 seconds
    /Ox /Ob2 /Oi /Ot /Oy /GT /GL /GS- /GR- /arch:SSE2 /fp:fast
    Now, that’s a big difference. Also I checked Mark’s original C example, which is also easily vectorized and parallelized using Intel C++ 9.1.
    for (int i=0; i < 20000) {
    for (int j=0; j < 20000) {
    x[i][j] = y[i-2][j+1] * y[i+1][j-2];
    }
    }
    Linking…(Intel C++ Environment)
    IPO: performing single-file optimizations
    IPO: generating object file ipo_29246obj.obj
    .matmult.c(35) : (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
    .matmult.c(35) : (col. 2) remark: LOOP WAS VECTORIZED.
    So, the selection of compiler is also important, not only the selection of language. And, if you are really skilled, a C code can usually be as efficient as a Fortran code I think. I have more problems with higher level languages, as optimizing these becomes too much nitty-gritty manual work. In Fortran and C it is possible to keep it simple and make the compiler do most of the optimization, and I have not seen that higher level languages have such capabilities yet.
    Best Regards,
    Lars Petter Endresen
    P.S: I have fixed a problem with my original test-program, c array was not initated to zero. The updated version is:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #define SIZE 1024
    float a[SIZE][SIZE];
    float b[SIZE][SIZE];
    float c[SIZE][SIZE];
    float x[SIZE];
    float y[SIZE];
    float z[SIZE];
    void init_matrix()
    {
    int i, j;
    srand( (unsigned int)(time( NULL )) );
    for(i = 0; i < SIZE; i++){
    for(j = 0; j < SIZE; j++){
    b[i][j] = a[i][j] = (double)rand()/RAND_MAX;
    c[i][j] = 0.0;
    }
    }
    }
    void matmult()
    {
    int i, j, k;
    for(i = 0; i < SIZE; i++)
    for(j = 0; j < SIZE; j++)
    for(k = 0; k < SIZE; k++)
    c[i][j] += a[i][k] * b[k][j];
    }
    int main(void)
    {
    clock_t start, finish;
    init_matrix();
    start = clock();
    matmult();
    finish = clock();
    printf(“n%2.3f secondsn”, (double)(finish – start)/CLOCKS_PER_SEC);
    printf(“s=%fn”,c[0][0]); // Avoid dead code elimination
    }

    Reply
  115. HeWhoSpeaksOfDarkness

    MarkCC:
    “*despise* C++”
    Why do you despise it?
    It the only real programing language i know well so i would like to know why other languages are better. Personally i find any non OO way of solving something very counter intuitive, but would like to change my intuition. what other models work better and why?
    PS – any online or print references you would recommend for learning heskel? i want to braden the ways i can think

    Reply
  116. Mark C. Chu-Carroll

    HeWhoSpeaksOfDarkness:
    Why do I hate C++?
    Because it’s a sloppy, poorly designed language with bizarre semantics and ugly syntax.
    Templates are a wretched monstrosity – a feature designed to provide generic types which accidentally be evolved into
    a Turing-complete meta-language?
    The type declarations are utterly incomprehensible, due to the pointless mix of prefix and postfix modifiers.
    The automatic type conversions are hideous, particularly the
    fact that a one-argument constructor implicitly defines an automatic type conversion – so you can’t define a one-argument constructor without it turning into an implicit conversion. Back before STL, I was writing some code using
    an IBM XLC library. It should have been linear-time. But we were seeing quadratic time behavior. Turned out that there was an typo which caused a chain of two implicit type conversions, one of which was doing a complete list copy. The particularly frustrating thing about it was that there was no reason that anyone would ever want to have that kind of type conversion happen for the collection types I was using – but since a one-parameter constructor is a type-conversion…
    Need I mention silent truncation on assignment of subtype to supertype? Or the wretched mess that is built-in arrays in C++? Or the horrors of the preprocessor? Or the twisted mess of exception-handling? Or RTTI?
    C++ is just a thorough mess. It’s a mass of ill-considered features that were slopped together without consideration of how they’d interact, and without looking at the history of other programming languages to see what worked and what didn’t – so that the repeated mistakes that had been solved decades earlier.

    Reply
  117. Mark C. Chu-Carroll

    HeWhoSpeaksOfDarkness:
    In answer to the second part of your question:
    It isn’t so much the object-orientation of C++ that bugs me; but I think it’s important for a programmer to understand more than just OO. OO is one good way of doing things – and for a lot of problems, it’s a fantastic way of decomposing a system into manageable pieces. But it’s far from the only way.
    Here’s a list of some other great languages that are worth the time to learn. Even if you never do any real programming in them, learning them enough to be able to do some interesting exercises will influence your thinking about programming, and make you a better programmer in whatever language you use.
    (1) Scheme. Everyone should have some experience with Scheme, or one of the other really good lisp variants. Writing Lisp code can be a mind-altering experience, and the way that you can create things in Lisp is just astonishing. I particularly recommend learning it with the SICP textbook linked in my sidebar.
    (2) Prolog. Prolog is a fascinating language. It’s based on predicate logic, where you program by writing a description of a solution – and that description is interpreted to guide a search for a solution that matches the description. What’s particularly fascinating is that Prolog has an amazing kind of reversability. I’ll never forget one assignment from my undergrad programming languages course. We wrote a prolog program that took a description of an FSM and an input string as an input, and generated the sequence of states that the input caused the FSM to go through as an output. Then we took the same program, and gave it an input string and a sequence of states – and it generated FSMs that would generate that sequence of states for the specified input. And then we took the ame program, and gave it an FSM and a state trace – and it generated a list of strings that could cause that machine to go through that sequence of states! Amazing.
    (3) Smalltalk. Think C++ or Java is really OO? Try doing some programming in Smalltalk sometime. It’ll change your mind. There’s object-orientation a-la C++/Java – and then there’s OO a la Smalltalk. They’re much more different than you’d expect!

    Reply
  118. zzo38

    So, C is still very fast, but OCaml is faster. Still, when programming embedded code (including Nintendo DS), C is very fast. Probably programming it in assembly language is faster, if you know how

    Reply
  119. HeadMonkey

    “In C and C++, there’s no such thing as an array – there’s just pointers…”
    Is this proof that your phd is just an expensive piece of paper hanging on the wall ?

    Reply
  120. Xanthir, FCD

    Care to elaborate, HeadMonkey? I suspect this was just a drive-by defacement, but if you *do* return, please expand this.
    Because, um, that really is all arrays are in C/C++.

    Reply
  121. HeadMonkey

    I’m sure you are qualified to go and read the various C standards. I think you’ll find that they use the word ‘array’ somewhere in them, as well as the restrictions the standards place on compilers as to their behavior.
    Also, the AMD x86 Code Optimization Guide has a nice chapter on why arrays can be faster than pointers. Just because the C standard says that a reference to a non-indexed array, is interpreted, by the compiler, to be a pointer to the start of that array, does not mean that “arrays are pointers.” Compilers are able to make various optimizations at compile time on array based code, involving memory location aliasing, that cant otherwise be made when using pointers.
    Also, not to add insult to injury here, but the statement “They’re good at things that need to get very close to the hardware – not in the efficiency sense, but in the sense of needing to be able to fairly directly munge the stack, address specific hardware registers, etc.” is just as bogus as the “arrays are pointers” statement. C, and so far as I know, C++ ( im not a ++ programmer so someone please correct me if this is wrong ) do not provide “direct access to hardware registers” … this is implemented as inlined assembly, for whatever platform is being coded for, and is not part of any C or C++ language specification.
    This article seems to be, nothing more than, a justification for a language the author doesn’t like, based on false assumptions and a misunderstanding of the difference between “Library Code A” and “Compiler Implementation B.” It’s definitely not an apples to apples comparison.
    Let me know if the school decides to refund your phd monies.
    ps. C will also let you execute code stored at an offset in memory, by writing machine language to that memory location and using a function pointer. In this regard, you could say, C will let you execute any valid machine code that your cpu and/or hardware supports. Therefore, for any given hardware, C will allow you to execute all combinations of valid machine code, and in this manner, can run as fast as the hardware allows. The best any language that does not support this ability can do, is match its speed. However, this assumes a programmer with the knowledge required to write valid assembly, or a C library to do this for him/her, and I don’t think the article was really talking about this scenario, as it involves optimized assembly language programming. Nevertheless, C will let you do it 😉

    Reply
  122. Henrique

    The idea about aliased pointers is reasonable, but you are using a wrong argument. The C language _has_ a way to avoid aliases using the restrict keyword. Consider this code:
    void multiply(float * restrict * restrict x, float * restrict * restrict y)
    {
    for (int i=0; i

    Reply
  123. Chris L.

    Working on compilers for one of the biggest vendors, I like to pretend I know a little about this stuff sometimes.
    C and C++ are both pretty bad at aliasing. Compilers work hard to avoid the issue, but the truth is Fortran and other high performance languages are considerably better at that problem. Essentially, free pointers cause more register spill, less auto-vectorization and auto-parallelization, greater stack consumption, and so on, because the compiler has to use loose and approximate aliasing rules such as type-based aliasing. The problem is at the language level. For some simple numeric code like matrix multiply and the like one can use `restrict` to gain some of the Fortran subroutine aliasing rules, but the problem is systemic to these free-pointer languages.
    in Fortran, one can say:
    integer, target :: i
    integer, pointer :: p
    call sub foo(p)
    contains
    subroutine foo(a)

    and then if those are the two declarations, the compiler can actually optimize the pointer away, and replace all accesses of P with a direct access of I. This saves a register, saves an indirect load from an address, and it lets you apply this optimization transitively to the body of foo.
    All of this is possible because i is the only possible target of i in this compilation unit. C pointers can point at all sorts of things, and some users expect this behaviour (casting their pointer types, for example) so even loosey-goosey type-based aliasing isn’t possible.

    Reply
  124. Anonymous

    Just use the native SIMD types and intrinsics. They’ll be available in C and almost certainly *not* available in any other language. The resulting code will out-perform anything a compiler can come up with.

    Reply
  125. Ken

    I wouldn’t even bother responding to the greythumb.org guys. If they say:
    “To give you an idea of just how much faster C can be for tasks like this, I have found that evolvable instruction set based evolutionary computation is almost twice as fast when competently implemented in C than a similar competent implementation in Java.”
    I say, wow, almost twice as fast? I’d give up 50 of my performance in a heartbeat if it meant I’d get increased reliability. Heck, I’ve got a dual-core CPU here with one core sitting idle 99.9%+ of the time. I have a desktop C++ app crash at least once a week (often Firefox, but frequently others). Recreating my work after something dies is a lot more painful than a slightly slow computer.
    And this was the best example they could find? Ouch.

    Reply
  126. Anonymous

    c++/c is not for parallel execution. If you really need it, why don’t you use OpenMP with c++/c?
    The real thing is that compiler still cannot beat the carefully crafted c/c++ program in most cases.
    And I doubt the different between c/c++ in your benchmark. What is the reason that they are different? I don’t see any reason that LCS needs virtual functions and such, so c/c++ performance should be the same.

    Reply
  127. Shane

    I don’t understand why everyone in the comments is getting bent out of shape about this article. I thought it was a great rebuttal to the cited article and didn’t imply that one should never use C or C++. It was merely correcting the notion that C or C++ aren’t the only game in town for speed, and it had numbers to back up that claim. What more do you want?

    Reply
  128. Jackson

    I think the reason why there is so much speculation, is due to the coders haven’t a hard time believing that the code used in this experiment was actually that efficient. There could be problems in the code, and problems in the compilation. If this article had listed the code for all the languages, and the various ways it was compiled, then I’m sure the comments list would be much shorter. Nevertheless, an interest article. If Ocaml is really that fast, then I know what languages password crackers are going to come in.something.

    Reply
  129. Scott R.

    No doubt this is why most video game rendering engines, video, audio and image editing software is written in OCaml and not C or C++.

    Reply
  130. Mark C. Chu-Carroll

    Re #147:
    I’ve been trying to ignore the new comments here, since they’re mostly just repeats of stuff from a two-year-old post. But this is just particularly stupid. It’s basically the old bullshit argument that popularity implies quality – the same argument that Britney Spears music is better that Bach – after all, who sells more records?
    (A) Where in the world is there a group of programmers who always have the option to use the best possible tools for what they’re doing? I work at Google, one of the coolest, most dynamic, creative, open workplaces anywhere – and I don’t get to choose any programming language I want; I’m constrained by lots of other factors. Some of my code might be as fast or faster written in OCaml instead of C++ – but the fact that upwards of 90% of the engineers at Google have a deep understanding C++, and probably less that 5% (if that many) know Ocaml means that OCaml isn’t a good choice for typical Google projects.
    (B) Even if the social aspects of development don’t restrict language choices, practical ones often do. You can’t use a language that doesn’t exist on your platform. If I want to write a mapreduce
    at Google, I can’t write it in a language that doesn’t run on our computing infrastructure. Even if I could write a reducer that in some language that was ten times faster than it would be in C++, it wouldn’t matter. That’s the story for most video games systems: they’re writing programs for very unusual hardware, and they have to use the languages that have support for those environments. Even if there’s a brilliant language that could, in theory, let them write faster code, it wouldn’t matter unless someone ported it and integrated the necessary support.
    (C) Even given the social and technical ability to use alternate languages, most programmers don’t know about them. Unfamiliarity/obscurity is a killer. A few years back, there was a wonderful language called Sisal, which was targeted at the kinds of scientific applications that are normally written in Fortran. Sisal achieved uniformly better performance than Fortran; it took better advantage of vector processing, was able to automatically do more in parallel. It was pretty much never slower than Fortran, and on real applications, generally managed to be at least 20% faster (and in scientific applications, getting a 2% speedup was considered significant). How much computational fluid flow code is written in Sisal? None. It’s pretty much all Fortran. Because the physicists that write most of the CFF know Fortran.

    Reply
  131. Matthew

    No doubt this is why most video game rendering engines, video, audio and image editing software is written in OCaml and not C or C++

    Software also has external constraints to deal with. There are probably a lot more libraries for game developers to use in C and C++ than in OCaml.

    Reply
  132. Bob Foster

    One anecdote does not an argument make. You wrote the C code. Perhaps you’re not a good C programmer. You don’t say what compiler you tested with with what level of optimization. In fact, I’m so deeply suspicious about the interpreted OCaml beating compiled C that I suspect you chose the example to show how great OCaml is. You’re correct that C is not a good language for numerical applications and FORTRAN is, so if you were doing a numerical application that required sophisticated array optimizations where are the numbers for optimized FORTRAN?

    Reply
  133. Mark C. Chu-Carroll

    For goodness sakes, read the original fucking post.
    First – this stuff is old. This post is 2 1/2 years old. The experiment was at least 4 years before that!
    Second – I had *no* reason to bias the experiment. I went in with no preference, and I wanted to know what would give me the performance I needed. I included OCaml based on the recommendation of a *very* smart friend. (In fact, my preferred outcome would have been Java, since that was what I’d been working in at the time.)
    Third – it wasn’t a numerical program. It was longest common subsequence. And it was for an SCM system. So Fortran wasn’t an option.
    Fourth – the way that I got involved in the project was by being on the team that implemented a C++ compiler. So I do know a thing or two about how to write good C code.
    Fifth – the point of the experiment wasn’t just to find the fastest – it was to find a complexity/speed balance. My expectation was that raw C would be the best, with C++ a close second, but that the complexity would be worst for C. The code *deliberately* wasn’t hand-optimized. It was written for clarity.
    Finally – OCaml uses a remarkable interpreter. I was absolutely astonished by the results, and even 7 or 8 years after first studying how it works, still marvel at it. It’s truly brilliant. There’s very little difference between OCaml interpreted code, and native compiled code; for most opcodes, the “interpretation” of code loaded into the interpreter is one machine instruction – an indirect jump. Take a shallow interpreter like that, combine it with aggressive inlining, and you get seriously kick-ass code. I’m not *nearly* smart enough to have written it; but I am start enough to take advantage of brilliant work done by someone else.

    Reply
  134. Joaquín M López Muñoz

    I know this thread had probably gotten too long for your taste, but there’s one particular detail I’d like to comment on: you registered the following times for C and C++, respectively:
    · C: 0.8 seconds.
    · C++: 2.3 seconds.
    How come the C++ program was almost 3x slower than the C program? For starters, you could have compiled the C code with your C++ compiler (modulo negligible incompatibilities between both languages) and should have gotten the same performance (modulo compiler quality), so either your C++ compiler was really poor or your C++ code differed in significant ways from the C code. Maybe you can shed some light on this?
    Thank you,

    Reply
  135. Xiao

    I just think the author of the original site probably meant that C won’t be replaced by the newer “cooler” languages like .net/java/python etc which you proved to be true with your own experimentations. It’s probably not a claim for them to be the best languages ever

    Reply
  136. SK

    If Mark doesn’t know the difference between int x[100] and int *x = malloc (100 * sizeof(int)) then I have to take his evaluation of C with a pinch, no, a pound of salt. Also, for the example mentioned, you can always check for array overlaps using simple pointer and length comparison. Heck, that’s how memmove works properly when the src and dest overlap.
    I agree that the C compiler isn’t going to parallelize every loop that can be parallelized. May be if you play around with the optimization flags and ask it to optimize for speed and not size, then it might end up parallelizing part of the loop. But I wouldn’t expect Mark to be a master at C compiler optimization flags.
    Sorry, to sound like a fan boy, but if you are going to criticize a language or evaluate it, you better be an expert on it before you go around bad mouthing it. Your evaluation just tells you that _you_ personally can write the most efficient code only in OCaml.
    I’m not saying C is the best, just that this evaluation is bollocks.

    Reply
  137. dagome

    That pointer based nature means that in a C or C++ program, it’s very hard for a compiler to figure out what things are independent.

    Many users have mentioned the restrict keyword, but technically, the compiler does not have to figure out the may-aliase question for x and y. It can force the issue by generating an explicit check:

    if( x & y overlap )
    unoptimized original
    else
    optimized code

    This has the obvious drawback of code bloat, but that’s probably not an issue for small inner loops like your example. I suspect that the issue is less to do with C and C++ as languages, than it is with the fact that C/C++ compiler writers have not targeted numeric code. (Which may in part be a self-fulling prophecy, given that writers of numeric code are happy with Fortran.)
    Regarding language issues, I would think the explicitly parallel constructs of high-performance Fortran (e.g., forall) are a better example of a “built-in” advantage over C for numeric domains.
    Also, numeric analysis is only one domain. C/C++ are likely to really excel when explicit memory management is important. (Note this has nothing to do with “getting very close to the hardware.”)

    Reply
  138. Mark C. Chu-Carroll

    Re #155:
    In an acronym, STL.
    As I said in the original post – the experiment wasn’t originally just about speed. I was implementing an SCM system – which is a complex system that includes a couple of CPU intensive algorithms, as well as a whole lot of complex data manipulation. I was looking for a balance of speed and expressiveness – speed for making the intensive algorithms fast, and expressiveness to make it easier to implement the more semantically complex code.
    The C code was raw, bare-bones C: arrays and pointer arithmetic all the way.
    For C++, I didn’t just translate the C code to C++; I tried to take advantage of the expressiveness of C++. So instead of a pointer to an array of char*s, I used a vector of pointers to strings. That was, pretty much, the entire difference between the C++ and the C: in C++, I used vector, whereas in C I used char**.

    Reply
  139. Mark C. Chu-Carroll

    Re #158:
    It’s not as simple as just adding an overlap check like that. Try looking up array independence checking in a decent optimization textbook. Sometimes you can vectorize or parallelize code even on overlapping arrays, provided you can prove certain data independence properties. But if you can’t be sure whether the arrays are aliased,
    or if there are aliased pointers to elements of the array, you can’t prove the indepedence properties.
    Of course C/C++ is really good for some things. I never intended to suggest otherwise. But it’s *not* the be-all end-all of efficient programming languages. The entire point of this post was that the idea that C/C++ is the *ultimate* in efficiency for *all* applications is wrong. For *some* application domains, C/C++ is absolutely the best choice – both for reasons of efficiency and expressiveness. But for many other domains,
    it’s an awful choice, for a variety of reasons.
    I’m also *incredibly* skeptical of people who claim manual memory management as a performance benefit. My own experience in multiple applications is that it’s as likely to cause *inferior* performance as it is to improve performance. Once again, it depends very much on the problem domain: there are tons of applications where some kind of manual memory allocation is simple, maintainable, and super fast. There are also tons of applications where manual memory management is complex, slow, and error prone. GC can kick the ass of manual memory management for some problems.

    Reply
    1. David Starner

      As for manual memory management, the latest release of Chrome closed 7 CVEs, 4 of which were “use after free”. Since September 2010, there have been 157 bugs with a CVE filed on them closed labeled “use after free” in the changelog. It takes quite a bit of win over garbage collection to justify 40 security holes a year caused by manual memory management.

      Reply
  140. dagome

    Re #160:
    I’m not as up on array vectorization optimizations as I could be, but I’ll stick by the original point. It is not always necessary for the compiler to statically identify potential aliases in order to apply loop and array optimizations. In many cases — such as the example presented here — it is easy and relatively cheap to put in disambiguating dynamic checks.
    I suppose I am only moderately (not *incredibly*) skeptical of automatic GC beating manual memory management for most applications, but “each wins for certain problems” is hard to disagree with. For my part, I also would not argue with “C/C++ is good for some things and bad for others.” I just meant to emphasize that “proximity to hardware” might not be their only strong point.
    At any rate, I would be interested to hear more about problem domains where /performance/ of GC clearly dominates (or even consistently matches) manual collection. (Outside of performance, GC is clearly the bee’s knees.) While there’s been a lot of research into memory management, I have the impression that it’s still a rich area to explore. For example, I’ve seen the choice of malloc implementation have a huge impact on performance. (Even “manual” memory management is still pretty abstracted.)

    Reply
  141. anonymous

    This is a response to #128
    Since 1995 and definitely with the introduction of ISO C++ (1998) the problem with the conversion constructors is solved. It is possible to prevent such conversions. This lead us to the important question why you don’t know about that fact in 2006?

    Reply
  142. Mark C. Chu-Carroll

    Gosh, I dunno, maybe because that particular incident happened in 1994? (Note that I specifically mentioned that it was pre-STL. That particular problem happened when I was a summer student working at IBM to port the Concert/C distributed programming constructs to C++.)
    Further, the fact that you can write C++ code in a way that prevents things like conversions, that doesn’t mean that library code uses them. The specific problem that I was talking about was in a library. The fact that C++ includes that kind of implicit type conversion is, in my opinion, a serious flaw in the language. The cases where it’s actually really beneficial are, in my experience, very limited; the cases where it causes unexpected problems are very common. It’s a lousy tradeoff.

    Reply
  143. borofergie

    How is it that the “big 3” commercial CFD codes, Fluent, STAR-CCM+ and CFD all dumped FORTRAN a long time ago in favour of C/C++?

    Reply
  144. Graham

    These articles are funny.
    “Finally after all these years, this one guy has demonstrated that C/C++ is not efficient.” What hubris and arrogance.
    A couple of the comments after the article straightens things out.
    One item of note:
    The author is not fundamentally wrong about C/C++ alias detection issues for optimization purposes. Yet, the example is contrived.
    No reasonable C/C++ engineer would create a nested loop, as described, if fast execution for numerical computing was desired.
    They would use a library(ies) that handles optimization:
    http://www.boost.org/doc/libs?view=category_Math
    ..or create their own library. Other opportunities, such as GPGPU (CUDA, OpenCL) also exist.

    Reply
  145. Lauren

    I saw your article by searching some key words on internet and I am very glad to read completely your article now. Wow, I am highly respect that you have much experience about programming. But it’s not easy for me to do this since my English is not very good. I can pratice my English reading ability and programming knowledge by reading your web site.
    I am a software engineer in Taiwan. I always think that it’s very important to enhance software effectiveness. In this time, since we regard as we have a faster CPU and cheaper RAM, many people design software without careful planning and implementing. It’s not a correct attitude, oppositely, we must to develop web or pc software as attentive as developing the feature phone or mobile phone. As possible as perfect, we must distinguish all of technologies; understand the computing method and OS, and familiar with relating platform structure and principle. We just probably can create the perfect system.
    If my English is good enough, I can describe much correctly. The language is indeed a barrier for knowledge sharing. Isn’t it? Please give me some advices. Thanks ^_^

    Reply
  146. so4pro

    I know this thread had probably gotten too long for your taste, but there’s one particular detail I’d like to comment on: you registered the following times for C and C++, respectively:
    · C: 0.8 seconds.
    · C++: 2.3 seconds.
    How come the C++ program was almost 3x slower than the C program? For starters, you could have compiled the C code with your C++ compiler (modulo negligible incompatibilities between both languages) and should have gotten the same performance (modulo compiler quality), so either your C++ compiler was really poor or your C++ code differed in significant ways from the C code. Maybe you can shed some light on this?
    Thank you,

    Reply

Leave a Reply