Friday, November 27, 2015

Project time estimates and Murphy's law


As we saw in the last few iterations, Murphy's law ("Anything that can go wrong will go wrong") tends to be
a very strong resident in our midst.

Let's take a short client-server project as an example. When you ask the client developer how much time it will take, he answers: 10 hours of uninterrupted work.
Usually a developer also interacts with the world and his teammates (eats lunch, does code review, attends a meeting), so let's count this as 2 days.


Now let's turn to Murphy:

client code:
    1. Complexity in library code: you thought the default one line of code would do the work for you, but it turns out it does not support your use case, or it fails unexpectedly only on your input.
    2. Complexity in code: changing one element causes an unexpected bug in another, unrelated component.
    3. Infrastructure failures: your machine, network, or server can be down for a few hours.
    4. A computer, network, or corrupted int server problem can take 0-5 hours to overcome.
    5. Product requirements may change slightly once the product people see your finished feature. They are not asking for a new feature; they are just explaining better what they meant before.
UI related code:
    1. UI/Art integration may not be flawless, depending on your workflow.
    2. Sometimes you can't use the Art "as is", and that is not apparent until you actually try.
client-server potential issues:
    1. A misunderstanding of the API usage can take some time to debug and understand.
    2. Sometimes it is not a misunderstanding, but a plain bug in the server code.
    3. Sometimes it's not a bug, but a misconfiguration, code that is not the latest version, etc.

So, there are at least 10 potential risks. What should be the estimate in this case?

One approach is to look at the optimistic scenario and say 2 days. It will sometimes work, but a lot of the time it will fail badly.

A second, opposite approach is to look at the worst-case scenario and assume all the problems will happen simultaneously. This approach says: let's estimate 8 days. One might say "better safe than sorry", and if the work goes faster, the developer will simply say he is done.

A third approach is to aim to be correct most of the time (80-90%) by assuming that some problems might happen, but not all of them in the same week.
Someone in upper management has to fully understand this and manage the risk accordingly.

If you actually want numbers, let's say this is an example of a (low) risk matrix for the 2-day task. No real human thinks about it in this manner, but you do have "hunches" about it. In this case, I include a 0.1 chance of 2 extra days due to bad library behavior, a 0.2 chance of 1 extra day if the server code does not work, etc.

risk         0.1    0.2    0.05   0.1    0.2    0.1
extra days   2      1      5      1      1      3

The cumulative probability of being at most a given number of days late, for these risks, is:
46% - 0d late
71% - up to 1d late
83% - up to 2d late
90% - up to 3d late
94% - up to 4d late
97% - up to 5d late
98.7% - up to 6d late
99.1% - up to 7d late
99.8% - up to 8d late (remember it's on a 2d task....)
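
For readers who want to reproduce this kind of list, here is a minimal Python sketch. It rests on two assumptions that the post does not state: that the matrix reads as six (probability, extra days) pairs, namely (0.1, 2), (0.2, 1), (0.05, 5), (0.1, 1), (0.2, 1), (0.1, 3), and that the risks are independent of each other. The function names are made up for illustration. Under those assumptions the exact figures land within about two percentage points of the list above.

```python
from itertools import product

# (probability, extra days) pairs as read from the risk matrix.
# ASSUMPTION: the risks are independent; the post does not say so.
RISKS = [(0.1, 2), (0.2, 1), (0.05, 5), (0.1, 1), (0.2, 1), (0.1, 3)]


def delay_distribution(risks):
    """Return {total extra days: probability} over every combination of risks."""
    dist = {}
    # Enumerate each risk as "did not happen" (False) or "happened" (True).
    for outcome in product([False, True], repeat=len(risks)):
        prob = 1.0
        days = 0
        for happened, (p, extra) in zip(outcome, risks):
            prob *= p if happened else 1.0 - p
            if happened:
                days += extra
        dist[days] = dist.get(days, 0.0) + prob
    return dist


def cumulative(dist):
    """Return [(days, P(delay <= days))] sorted by days."""
    running = 0.0
    rows = []
    for days in sorted(dist):
        running += dist[days]
        rows.append((days, running))
    return rows


if __name__ == "__main__":
    for days, prob in cumulative(delay_distribution(RISKS)):
        print(f"{prob:6.1%} - up to {days}d late")
```

With six risks there are only 64 combinations, so brute-force enumeration is enough. The output also continues past 8d, because all six risks hitting together would add 13 days.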


Now let's talk about who takes the buffer.
Let's say a developer gives the matrix to his manager and asks the manager to tell him how to estimate the task. What should the manager answer?
Because of Parkinson's law ("work expands so as to fill the time available for its completion"), a reasonable manager should do three things:
  • Ask the developer to be 80% right, which in this case means adding 2d to the estimate: 4d. Tell him that it is OK, and fully expected, to finish a day early, and that if all hell breaks loose, the manager has his back with a few extra days.
  • Keep a management buffer for the next 17% (3 more days), which the developer can use, and which is totally OK to use if the developer explains that a risk actually happened.
    One buffer pool can sometimes be shared among multiple team members.
  • Inform his own management about the chances. If it's a live-or-die scenario, they may keep their own buffer for the remaining 3% risk.
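
The split between the developer's estimate and the management buffer can be read straight off the cumulative list. The sketch below is one way to do that lookup. The helper names and the 0.80 and 0.97 targets are illustrative choices that follow the numbers in the bullets, not a prescribed method.

```python
# Cumulative chance of being at most N days late, copied from the list in this post.
CUMULATIVE = {0: 0.46, 1: 0.71, 2: 0.83, 3: 0.90, 4: 0.94,
              5: 0.97, 6: 0.987, 7: 0.991, 8: 0.998}

BASE_ESTIMATE_DAYS = 2  # the optimistic estimate for the task


def days_for_confidence(cumulative, target):
    """Smallest number of extra days whose cumulative probability reaches target."""
    for days in sorted(cumulative):
        if cumulative[days] >= target:
            return days
    raise ValueError("target is higher than anything in the table")


def split_buffers(cumulative, developer_target, management_target):
    """Return (developer extra days, management buffer days on top of that)."""
    dev = days_for_confidence(cumulative, developer_target)
    total = days_for_confidence(cumulative, management_target)
    return dev, total - dev


if __name__ == "__main__":
    dev_extra, mgmt_buffer = split_buffers(CUMULATIVE, 0.80, 0.97)
    print("developer estimate:", BASE_ESTIMATE_DAYS + dev_extra, "days")
    print("management buffer:", mgmt_buffer, "days")
```

With the table above, this gives a 4-day developer estimate and a 3-day management buffer, matching the bullets.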