Erlang and its „99.9999999 % uptime“ (continued)

In a recent posting I said „if you are claiming Erlang has proven to reach „nine nines“ of reliability you are dishonest or don’t know what you are talking about – choose whatever you prefer.“

Alexis considered this „a bit harsh“. Probably we have a misunderstanding about the term „dishonest“. I didn’t use it as an ethical category but as in „dishonest to oneself“. Perhaps „not completely candid“ would have been better. What I really mean is: „doesn’t follow good scientific reasoning“. When we are talking about computer science, most things are provably correct or incorrect. Often we have a mathematical proof; in other instances we have a „softer“ proof, like the one in front of a court. But it is not about believing and feeling. Reliability engineering is a nice hard science, and either you reach 99.9999999 % reliability or you don’t. And reaching that is always demonstrated by proof – be it statistical, formal or some softer kuddelmuddel. If you are talking about high reliability, just observing doesn’t cut it.

The thing about nine nines is that it is an extremely ambitious goal. I’ve spent some years as a lecturer in „dependable distributed systems“, and at our institute we were happy to build systems with 99.999 % reliability in a non-stop environment. Actually there are many critical systems (e.g. trading floors) which get by with much lower reliability by being idle 12 hours a day. With that it is somewhat easy to get decent reliability during daytime. If you act in global marketplaces you can shift your processing around the world as the day progresses.

With 99.999 % you have 5 minutes of downtime a year. Or a medium outage (half an hour) every 6 years, or a major outage (6 hours) roughly every 70 years. Quite good.
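The arithmetic is simple enough to check. Here is a minimal sketch (nothing Erlang-specific is assumed, and the function name is my own) that turns an availability figure into the downtime budget it allows:

```python
# Convert an availability figure into the downtime budget it allows per year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def downtime_per_year(availability):
    """Seconds of downtime allowed per year at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR

print(downtime_per_year(0.99999) / 60)        # five nines: ~5.3 minutes per year
print(downtime_per_year(0.999999999) * 1000)  # nine nines: ~31.6 milliseconds per year
```

Note that nine nines leave you roughly 31 ms of downtime per year – the figure that shows up in the quotes further down.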

So there are claims that Erlang is built to be 10.000-fold better than that. 99.9999999 % is an outrageous claim, and you had better back it up with facts.

Suppose an Erlang VM had only a single worry: bit errors introduced by cosmic rays or whatever. Estimating the chances of bit flips is a somewhat involved process, but let’s assume that per Mbit a single bit error occurs every 50 years. That means that you see a single bit error about every second day on a machine with 1 GB RAM. Let’s assume somehow you use „better RAM“ (ECC, concrete shielding, whatever) which is 100 times more reliable. So you get roughly two bit errors per year.
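To see where the „every second day“ and „two per year“ figures come from, here is a back-of-the-envelope sketch; the one-error-per-Mbit-per-50-years rate and the factor 100 for „better RAM“ are the assumptions from the paragraph above, not measured values:

```python
# Back-of-the-envelope soft-error estimate for 1 GB of RAM.
MBIT_PER_GB = 8 * 1024               # 1 GB = 8192 Mbit
errors_per_mbit_per_year = 1.0 / 50  # assumed: one bit error per Mbit every 50 years

errors_per_year = MBIT_PER_GB * errors_per_mbit_per_year
print(errors_per_year)               # ~164 per year
print(365 / errors_per_year)         # ~2.2 days between bit errors
print(errors_per_year / 100)         # with 100x better RAM: ~1.6, roughly two per year
```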

Let’s assume there are „good“ and „bad“ bit errors. A „good“ bit error is one which can be fixed automatically, e.g. by dropping a connection and restarting the process or something like that. A „bad“ bit error crashes the system, e.g. because some internal data structures of the VM or the operating system got corrupted. We also assume that a „bad“ bit error results in 5 minutes of downtime for rebooting. To get 99.9999999 % reliability such a reboot can happen only once every 10.000 years. That means only one in 20.000 bit errors is allowed to be a „bad“ one. This in turn means that of your 1 GByte of RAM only 52 kByte may contain critical data structures where a bit flip would be „bad“ and result in a reboot.
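Again a small sketch of that budget, under the assumptions from the paragraph above (five minutes per reboot, roughly two bit errors per year, 1 GB of RAM):

```python
# How often may a "bad" bit error reboot the machine at nine nines?
SECONDS_PER_YEAR = 365.25 * 24 * 3600
allowed_downtime_per_year = 1e-9 * SECONDS_PER_YEAR   # ~31.6 ms per year

reboot_seconds = 5 * 60
years_between_reboots = reboot_seconds / allowed_downtime_per_year
print(years_between_reboots)                          # ~9500 years between reboots

bit_errors_per_year = 2
errors_between_reboots = bit_errors_per_year * years_between_reboots
print(errors_between_reboots)                         # ~19000: only about 1 in 20000 errors may be "bad"

ram_kbyte = 1024 * 1024                               # 1 GB in kByte
print(ram_kbyte / errors_between_reboots)             # ~55 kByte of critical data allowed
```

Depending on how you round the intermediate numbers you end up with the 52 kByte from above or slightly more – either way, a tiny fraction of the 1 GB.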

One actually might be able to construct a system where only 52 kByte contain critical data (stack return addresses, process table, MMU data, etc.), but it will be a tall order. And this gets you 99.9999999 % only with regard to a single error source: RAM failures. What about other hardware failures, infrastructure failures and operator errors?

I’m convinced Erlang is a very good base for building reliable systems. Probably you can build more reliable systems in less time with the Erlang/OTP stack than with any other mainstream approach – and yes, I consider Erlang mainstream here. For a much more heavy-handed approach to building highly reliable systems see „the Space Shuttle main engine controllers“, „the redundancy management in the Space Shuttle Avionics System“ and „An Assessment of Space Shuttle Flight Software Development Processes“ (1993).

But if a bunch of rocket scientists (OK, the CS crowd wrote the software) have gone through an extremely well-thought-out process and were not able to get to 99.9999999 % reliability, everybody else who claims to have gotten there should have better evidence than marketing literature.

While I’m perfectly willing to believe that an AXD301 cluster has so far dropped only 1 in 1.000.000.000 calls, that doesn’t mean it has nine nines of reliability. I think the Erlang community is basically doing itself a disservice by claiming 99.9999999 % reliability. Like the teenage boy who claimed he had already kissed 10.000 girls: it theoretically could be true, but it is much more likely that so far the boy hasn’t kissed a girl at all.

Why Dr. Armstrong in 2003 writes „For the Ericsson AXD301 the only information on the long-term stability of the system came from a power-point presentation showing some figures claiming that a major customer had run an 11 node system with a 99.9999999% reliability, though how these figures had been obtained was not documented.“ (Armstrong, „Making reliable distributed systems in the presence of software errors“, p. 191) and in 2007 „The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable … but we did 9.“ (What’s all this fuss about Erlang?) is a mystery to me. But if an academic source tells me there is no documented evidence of nine nines and some language advocacy page claims otherwise, the PhD thesis wins with me.

But to be frank, I’m also somewhat disturbed by these two claims on reliability. Compare the wording: „figures claiming that“ (2003) to „has achieved“ (2007). Perhaps there is data I’m not aware of. But as long as it is unpublished, it is more or less irrelevant and just marketing hype.

If I search Google for 99.9999999 reliability I get 513 hits; searching for 99.9999999 reliability Erlang accounts for more than 10 % of them. So it seems that more than 10 % of all discussion of 99.9999999 % reliability on the internet discusses Erlang (another 70 % or so seems to discuss the power grid, and we know how reliable that is).

Let’s see what we find on the Web about Erlang and reliability:

* [Erlang/OTP] has been used by Ericsson to achieve nine nines (99.9999999%) of availability.
* The most reliable Erlang-based systems [… have …] 99.9999999% uptime.
* AXD301 […] has a fault tolerance of 99.9999999% (9 nines!) That’s 31 ms a year.
* Erlang powers the telephone system in the UK with 31 ms downtime per year – that’s 99.9999999% availability
* Erlang was designed for 99.9999999% uptime

The general sentiment seems to be that it is a fact that the AXD 301 reaches nine nines of availability/uptime/whatever. And as stated above I have big problems believing that. And just because a system didn’t go down for some time doesn’t allow you to reason about reliability. Before a power failure drained the UPS, the server this blog is running on had an uptime of about 420 days. So it had NO downtime in a year. Does this mean 100 % reliability? No.

Erlang is an interesting language. OTP is great engineering. Erlang has considerable momentum compared to other languages with unusual concepts. There is no need to use 99.9999999 % claims which ring so hollow.

BTW: Besides various PowerPoint slides and such I found a nice set of numbers on the AXD301 in „Four-fold Increase in Productivity and Quality – Industrial-Strength Functional Programming in Telecom-Class Products“, Ulf Wiger, 2001. There it is assumed an AXD301 runs 1.460.000 LoC C/C++, 1.240.000 LoC Erlang and 27.000 LoC Java if you include Erlang itself. But then you would also have to include the C runtime, the Java VM & compiler and possibly the OS (Solaris), which would result in C being the absolutely dominant language. If we look only at the ATM-switch-specific software, Wiger reports 1.000.000 LoC Erlang, 1.000.000 LoC C/C++ and 13.000 LoC Java. The popular Erlang advocacy documents report that the AXD 301 system includes 1.7 million lines of Erlang (e.g. Byte).

So obviously Erlang is not the only thing which makes an AXD 301 tick. I assume there is also a lot of clever reliability engineering in the C code and in the hardware.

6 comments on „Erlang and its „99.9999999 % uptime“ (continued)“

  1. arnonrgo
    2008-10-16 at 00:07 #

    MTCBF vs. MTBF

    What these claims say is not that no component within the Erlang-based system fails – the claim is that the overall system has that availability.
    Erlang is built on the notion that the failure of one component would make another computation path execute instead.

    Arnon

    This comment was originally posted on 20070829T11:58:42

  2. alexis
    2008-10-16 at 00:07 #

    arnon beat me to it

    Arnon is correct. What people talk about is really availability. You can keep on adding more nodes to get more availability.

    You are right that people often make claims without fully evidencing them. But the standards of evidence are different between different practices. As an ex mathematician I personally consider the standards of ‚proof‘ in computer science to be lower than those in mathematics, for instance.

    This comment was originally posted on 20070829T12:27:28

  3. alexis
    2008-10-16 at 00:07 #

    links on availability / reliability

    I’ve always enjoyed Cameron Purdy’s slides on this topic, which are available here: http://85.92.73.37/downloads/11-08-05_Clustering_Performance_Challenges_Solutions.pdf

    If the link is broken try googling for „tangosol availability mtbf“

    This comment was originally posted on 20070829T12:33:08

  4. mdornseif
    2008-10-16 at 00:07 #

    Arnon: what I’m interested in are failures which can hit the overall system: be it the Erlang VM, the OS (mentioned above), the load balancer, the switches, the heartbeat/failover protocol used by the switches, the power supply or whatever.

    And that’s the reason why, in the real world, you can’t just add more nodes to get more availability: besides the nodes you have to replicate all the surrounding infrastructure – power supply, HVAC and especially the infrastructure to route requests to the nodes and to interconnect the nodes. While there are all kinds of approaches to reduce the shared infrastructure needed (e.g. Wackamole), you still need some.

    And that’s the reason why, to my knowledge, there is no (N-node) system in existence reaching 99.9999999%. I’m happy if somebody shows me one which has been proven to reach 99.9999999% reliability/uptime – be it mathematical proof, CS proof, legal proof or any other proof above „marketing proof“.

    This comment was originally posted on 20070829T18:35:11

  5. alexis
    2008-10-16 at 00:07 #

    well..

    Max

    By all means consider a ‚whole system‘.

    In most of the serious systems I have come across (trading systems for banks), there are no SPFs and everything is replicated (often active/active these days). This includes multiple power sources and data centres.

    Note that these systems are independent in the sense that the most likely explanation for their simultaneous unavailability is intent (eg disgruntled former employee).

    Many of these systems are – individually – at least 99.999% available and hence 0.001% unavailable. As you mentioned above, this is straightforward.

    If any one system is unavailable 0.001% of the time then two are simultaneously unavailable 0.000001% of the time, in which case ‚availability‘ is eight nines.

    Such systems tend to have a third backup which can be brought alive in the event of D/R to the secondary.

    So you can have lots of nines, even replicating power. People in banking and telco can get excited about this kind of thing. In practice I do not think it is needed. Many big web sites don’t seem to me to be that available for example. BTW, I suspect Joe’s example in reality is isolated in some sense – but we would need to ask him.

    On a lighter note..

    I once arranged to meet a friend, from a well known large bank, after work. When I arrived he was surrounded by his entire team, and it was clear that they had all been drinking a fair amount of beer. I asked why they were not at work. „We discovered a single point of failure today“. Of course I asked what it was and why they were not trying to fix it. It was indeed some small component of the power supply. The London trading rooms now had no power to all their PCs and local servers, and the repairs could not be done until parts arrived later that night. My friend explained that once the traders had realised that each trader was equally affected, they stopped yelling and screaming at the poor IT guys, and simply left the office for the day. Realising that nobody cared any more, the IT guys went to the pub….

    So – everything is relative :-)

    This comment was originally posted on 20070830T06:02:21

  6. uwiger
    2008-10-16 at 00:07 #

    basically correct

    The figure 99.9999999% availability actually came from a customer, and was based on six months of commercial operation in a very demanding environment, and it was in fact an average calculated for a number of installed nodes. One reason why that particular figure was used was that it had been cleared for use in public.

    You are absolutely right in noting that this was one *data point*, and should not be read as a claim that the AXD 301 consistently performs up to that standard. It has spread more than expected, however, and I sometimes think that perhaps more people should stop to consider, as you’ve done, what this data point could possibly refer to.

    We have been measuring availability in the field systematically for years, but these figures are not publicly available. I can probably at least reveal that, while I was tracking those figures (for a few years), the AXD 301 consistently scored better than 99.999% availability, including planned maintenance.

    Does this strike you as a more realistic figure?
    I believe it is still impressive. (-:

    This comment was originally posted on 20080229T10:12:49
