Failure-Oblivious Computing
I found a pretty interesting paper the other day about Failure-Oblivious Computing. The idea is simple: when a program encounters an error, have the runtime return some garbage value instead of crashing, throwing an exception, or going into error-handling mode.
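To make that concrete, here is a minimal sketch in Python of what failure-oblivious behaviour might look like for out-of-bounds list access. The class name ObliviousList and the choice of 0 as the made-up value are my own assumptions for illustration, not something taken from the paper. This sketch also assumes that invalid writes are simply dropped.

```python
class ObliviousList(list):
    """Hypothetical list that never raises IndexError."""

    MANUFACTURED = 0  # assumed stand-in value returned for invalid reads

    def __getitem__(self, index):
        try:
            return super().__getitem__(index)
        except IndexError:
            # Invalid read: hand back a made-up value and keep going.
            return self.MANUFACTURED

    def __setitem__(self, index, value):
        try:
            super().__setitem__(index, value)
        except IndexError:
            # Invalid write: silently drop it instead of failing.
            pass


buf = ObliviousList([10, 20, 30])
buf[7] = 99             # out of bounds, so the write is discarded
in_bounds = buf[1]      # a normal read still works as usual
out_of_bounds = buf[7]  # a made-up value comes back instead of IndexError
```

The calling code never sees the error, which is exactly what makes the idea feel so uncomfortable at first.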
If you’re a programmer, this suggestion will likely cause you to recoil in horror, because the very idea that your functions will be getting back undefined values seems contrary to everything we’ve been taught. However, it’s hard to argue with the results: the authors tested eight fairly well-known programs, from mail user agents to Apache and Samba, and in every case the failure-oblivious version arguably did the right thing, and did a better job than the ‘safe’ versions that would throw an exception or shut down in a controlled manner when they hit unexpected territory. Read the paper if you’re doubtful about this.
This idea is somewhat in opposition to the Erlang philosophy of Let it crash. However, in both these scenarios, the underlying motivation is the same: large complex systems will inevitably have bugs, and both philosophies not only plan for those bugs but are designed so that the system as a whole keeps running in the face of serious errors.
It’s quite easy to react emotionally to these ideas and say that it’s all just too dangerous and unpredictable: coders have always had it hammered into them to check for error values and exceptional conditions. However, there’s also something to be said about the brittleness of software vs more organic systems: the latter will often recover successfully in the face of unexpected conditions, whereas software will simply break. Failure-oblivious computing may not be the answer, but it’s a pretty good first research step. It would be an interesting follow-up experiment to modify the runtimes of dynamic languages such as Python and Ruby and make them return sentinel values instead of throwing exceptions. How many dynamic programs would continue to run successfully rather than die with some weird programmer-centric error?
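Modifying a real runtime is a big job, but the flavour of that experiment can be approximated at the level of individual functions. The following is only a rough sketch under my own assumptions: the decorator name failure_oblivious and the example function parse_port are hypothetical, and swallowing every Exception is a deliberate oversimplification.

```python
import functools


def failure_oblivious(sentinel=None):
    """Hypothetical decorator: return `sentinel` instead of raising."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Swallow the error and hand back the stand-in value.
                return sentinel
        return wrapper
    return decorate


@failure_oblivious(sentinel=0)
def parse_port(text):
    # int() raises ValueError on non-numeric input.
    return int(text)


good = parse_port("8080")   # parses normally
bad = parse_port("eighty")  # would raise ValueError; returns 0 instead
```

Whether the rest of the program does something sensible with that 0 is precisely the open question.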