Failure and recovery: MTTF and MTTR


1. Failure and recovery: MTTF and MTTR
There are two ways to detect and handle errors: maximize the MTTF (avoid failures) or minimize the MTTR (recover quickly).

2. Mean time between failure
The MTTF (Mean Time To Failure) is a measure of the average time between failures. The MTTF of a (real or practical) system is less than the MTTF of the individual components.
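
The claim that a system fails more often than its parts can be illustrated with a minimal sketch. It assumes a series system (the system fails as soon as any one component fails) whose components fail independently at constant rates; under those assumptions the failure rates add. The function name series_mttf and the component values are hypothetical, chosen only for illustration.

```python
def series_mttf(component_mttfs):
    """MTTF of a series system, assuming independent components with
    constant failure rates (rate = 1 / MTTF), so that the rates add."""
    total_rate = sum(1.0 / mttf for mttf in component_mttfs)
    return 1.0 / total_rate


# Hypothetical component MTTFs, in hours.
components = [10_000.0, 20_000.0, 20_000.0]
system = series_mttf(components)

# The system MTTF (roughly 5000 hours here) is below every component MTTF.
print(f"system MTTF: {system:.1f} hours, best component: {max(components):.1f} hours")
```

Under these assumptions, adding more components can only lower the system MTTF, which is one reason to look at recovery time as well.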

3. MTTF
What is the goal for MTTF? Sometimes this is referred to as MTBF (Mean Time Between Failures). One approach to avoiding failure: never do anything. What is another approach?

4. MTTR
What is the goal for MTTR (Mean Time To Recovery)?

The goal for MTTR is zero. That is, recover as soon as possible such that the cost of failure is minimal.

What are some examples of trading the cost of increasing the MTTF against the cost of lowering the MTTR?
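
One way to reason about this trade-off is the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The formula is not stated in these notes; it is brought in here as an assumption, and all numbers below are hypothetical.

```python
def availability(mttf, mttr):
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


# Hypothetical figures, in hours.
baseline = availability(mttf=1000.0, mttr=10.0)
double_mttf = availability(mttf=2000.0, mttr=10.0)  # fail half as often
halve_mttr = availability(mttf=1000.0, mttr=5.0)    # recover twice as fast

print(f"baseline:    {baseline:.6f}")
print(f"double MTTF: {double_mttf:.6f}")
print(f"halve MTTR:  {halve_mttr:.6f}")
```

Under this model only the ratio of MTTR to MTTF matters, so halving the recovery time buys the same availability as doubling the time to failure. Which of the two is cheaper to achieve is the engineering question.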

5. Error detection and recovery
Unless program proofs and verification are done, software bugs are inevitable. There are at least two ways to handle such bugs. In the limit, maximizing the MTTF requires program proofs and verification. Testing alone cannot accomplish this and, barring proofs, bugs are inevitable, so the cost of pushing the MTTF ever higher can become very high. Minimizing the MTTR means designing the system such that, when the inevitable bugs happen, recovery is quick, efficient, and effective.

A good testing philosophy makes judicious trade-offs between MTTF and MTTR.

6. Software today
What do real software companies do today? What helps in this process? Such concepts can be introduced in a beginning programming course.

7. My systems
I use this technique in my programming, web systems, etc.

8. Old CS 101 web site
Old CS 101 web site: (static web pages)

My CS 101 web site: (dynamic pages)

Redundancy is minimized, and changes and updates are controlled by the computer.

9. System scale
The web system has this scale. (2020-04-23) This web system did not exist 9 months ago (though parts of code and data existed and were adapted). The following helps in the process of reacting quickly to the present (minimize MTTR). All this helps in minimizing MTTR instead of maximizing MTTF.

10. Bank example
For example, if a bank, in avoiding bank robberies, attempts to maximize MTTF, the cost can be high, people can get hurt, etc. Instead, when faced with a robbery, banks just let it happen: they hand over the (marked) money, trigger a silent alarm, and are then, hopefully, back in business in a short while.

11. Risk
Avoidance means it will not happen.

Mitigation means it will not cost too much if it happens.

12. Credit card fraud
It would cost a lot to avoid credit card fraud. Can the cost of the fraud be kept manageable?

You can afford to lose some money if, in effect, you then make more money than you lose.

Risk avoidance is very expensive.

Risk mitigation balances risk with cost.

13. Drive configurations
RAID drive configurations (redundancy, hot-swappable) in networking systems.

Search engine companies:

14. Operating systems
Memory access in operating systems.

15. University example
MTU (Mythical Typical University): MTTF to MTTR

Increase the MTTF: the university puts strict procedures in place, with lots of paperwork and coordination, to ensure that students do not sign up for courses that they should not be taking.

Decrease the MTTR: the university allows you to sign up for any course you want, but immediately disallows any invalid choice after checking the database, leaving you free to pick other options.
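
The second approach can be sketched as an immediate check against a database at sign-up time. This is only an illustration: the prerequisite table, the course names other than CS 101, and the function try_enroll are all hypothetical and do not describe any real registration system.

```python
# Hypothetical prerequisite "database": course -> set of required courses.
PREREQUISITES = {
    "CS 101": set(),
    "CS 201": {"CS 101"},
    "CS 301": {"CS 201"},
}


def try_enroll(completed, course):
    """Check a sign-up request immediately.

    Returns (accepted, missing), where missing is the sorted list of
    prerequisites the student has not yet completed.
    """
    missing = PREREQUISITES[course] - completed
    return (not missing, sorted(missing))


completed = {"CS 101"}
for course in ["CS 301", "CS 201"]:
    accepted, missing = try_enroll(completed, course)
    if accepted:
        print(f"{course}: enrolled")
    else:
        # The error is caught right away, so the student can choose again.
        print(f"{course}: rejected, missing {missing}")
```

The mistake (asking for the wrong course) is allowed to happen, but it is detected and corrected within one request instead of being prevented by paperwork up front.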

16. Forecasting
Michael Hammer: Perhaps the most startling notion that arises from process-centered planning is the suggestion that long-range forecasting is a waste of time. Hammer, M. (1996). Beyond reengineering. New York: Harper Business, p. 203.

17. McCarthy: Decisions
Jim McCarthy: The goal on a software development project is not to have the correct plan in advance but to make the right decisions every day as things that were unknown become known. McCarthy, J. (1995). Dynamics of Software Development. Redmond, WA: Microsoft Press, p. 101.

There are crucial elements to systems that cannot be known in advance.

18. Microsoft: Specifications

19. Initial specifications

20. Customers and market surveys
It can be hard to use market surveys to make certain types of decisions, especially when they involve something new, whether that something is a software product, an engineering project, etc.

21. Marketing surveys
Michael Hammer: ... this fundamental precept - that marketing research done for a product that does not yet exist is useless. Hammer, M., & Champy, J. (1993). Reengineering the corporation. New York: HarperBusiness, p. 88.

Yet, market surveys of this type continue to be done.

22. Sony Walkman
The example used by Hammer was/is the Sony Walkman. A market survey would not have been of much use because the product was revolutionary, a completely new product.

23. Food products
What do many (e.g., fast-food) companies really do? Most people will try new things at least once, if the cost is not too high.

24. McCarthy: Customers
Jim McCarthy: Customers often won't tell you what they really want, particularly if it goes against conventional wisdom. Because they're insecure, they'll tell you instead what they think they're supposed to say they want. McCarthy, J. (1995). Dynamics of Software Development. Redmond, WA: Microsoft Press, p. 74.

25. Tickets
Why are tickets sold for events when the event could be done for free?

Selling tickets is one way to get a more accurate count of who will actually attend the event.

26. Capacity planning
Jon Bentley says that users will ask for a certain amount of capacity, but then use the system much more heavily, as much as 10 times the capacity they asked for.

His classic example is the Pennsylvania Turnpike.

27. Pennsylvania Turnpike
Before building the PA turnpike in the late 1930s, extensive surveys were done in order to predict customer demand and usage for the new turnpike. What happened? Once the turnpike opened, people started using it for things the planners had never imagined, like visiting relatives many hours away.

So, even though they built it to handle about 10 times the traffic from the survey, it was used about 10 times as much as was planned for.

Customers' expectations changed.

28. Steve Jobs: Customers
Some people say, "Give the customers what they want." But that's not my approach. Our job is to figure out what they're going to want before they do. I think Henry Ford once said, "If I'd asked customers what they wanted, they would have told me, 'A faster horse!'" People don't know what they want until you show it to them. That's why I never rely on market research. Our task is to read things that are not yet on the page. Steve Jobs (1955-2011)

29. Steve Jobs

30. Programming
The same problems arise in programming computers and/or analyzing system behavior.

31. Measure twice, cut once
Computers can be very useful.

Have you heard the following saying? Can you actually do this? Why or why not?

32. Do it
If you believe in always "doing it right the first time", then try writing a paper with a manual typewriter, from beginning to end, in one try.

What is a more realistic philosophy?

33. More realistic
A more realistic philosophy is as follows. This is where the storage aspect of computers is critical.

You can work on a document, save what you were doing, and then, later, pick up where you left off.

34. Saying
Traditional adage: Measure twice, cut once.

In programming terms: Think about what will happen before changing your program. Be ready to go back to the previous version.

Generalization: Keep doing it until the result is acceptable or until a fixed point is reached.
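
The generalization can be sketched as a loop that repeats a step until the result stops changing. This is a minimal illustration, not a prescribed method: the function iterate_until_stable, its tolerance, and the round limit are hypothetical choices, and the example equation x = cos(x) is used only because it has a well-known fixed point near 0.739.

```python
import math


def iterate_until_stable(step, start, tolerance=1e-12, max_rounds=1000):
    """Apply `step` repeatedly until the value stops changing (a fixed
    point, within `tolerance`) or `max_rounds` is used up."""
    current = start
    for _ in range(max_rounds):
        following = step(current)
        if abs(following - current) <= tolerance:
            return following  # acceptable: another round changes nothing
        current = following
    return current  # give up and return the latest attempt


fixed_point = iterate_until_stable(math.cos, 1.0)
print(f"x = cos(x) near x = {fixed_point:.6f}")
```

The round limit plays the role of "be ready to stop": if no fixed point is reached, the loop still ends and returns the last version it had.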

35. Software goal
Software design/implementation goal: Minimize non-computer-checked redundancy (repetition).
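
One possible reading of this goal, sketched under assumptions: define each fact once, generate every repeated use from that single definition, and let the computer check whatever repetition remains. The table ACRONYMS and the functions expand and undefined_acronyms are hypothetical names for illustration.

```python
# Single source of truth: each acronym is defined exactly once.
ACRONYMS = {
    "MTTF": "Mean Time To Failure",
    "MTTR": "Mean Time To Recovery",
}


def expand(acronym):
    """Generate the expanded form from the table instead of retyping it."""
    return f"{acronym} ({ACRONYMS[acronym]})"


def undefined_acronyms(used):
    """Computer-checked redundancy: list acronyms used but never defined."""
    return sorted(set(used) - set(ACRONYMS))


print(expand("MTTF"))
print(undefined_acronyms(["MTTF", "MTBF", "MTTR"]))
```

If a definition changes, only the table changes, and the check reports any use that has drifted away from it.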
