
Re: Reliability and assurances



There may be underlying points with this as well.  I have had to deal with 
a situation that touches upon this.  Please allow me to elaborate...

We use "computer generated" forms; that is, the database server sends ASCII 
data to a formatting server that turns the raw text into our invoices with 
logos, etc.  The problem that we faced is that the formatting system is 
prone to going down for various reasons.  Because of this someone (usually 
me) gets called in the middle of the night when it doesn't work.  Since 
"once is enough" for me I came up with a way to add some redundancy to the 
system so that I could deal with the problems in the morning and not in the 
middle of the night.  (This was not simple, and did take some time.)
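
For readers who want something concrete, here is a minimal sketch of one way such redundancy could look.  It is not the setup described above, whose details are not given in this message; the names `format_with_fallback`, `flaky_primary`, `plain_backup`, and the in-memory `spool` list are all hypothetical.  The idea is to try each formatter in order and, if every one fails, hold the raw text so it can be dealt with in the morning.

```python
from typing import Callable, List, Optional, Sequence

# A formatter takes raw ASCII text and returns the finished form.
Formatter = Callable[[str], str]


def format_with_fallback(
    raw: str, formatters: Sequence[Formatter], spool: List[str]
) -> Optional[str]:
    """Try each formatter in order and return the first successful result.

    If every formatter raises, append the raw text to `spool` so it can be
    re-run later, and return None.
    """
    for fmt in formatters:
        try:
            return fmt(raw)
        except Exception:
            # This formatter is down; move on to the next one.
            continue
    spool.append(raw)
    return None


def flaky_primary(raw: str) -> str:
    """Hypothetical stand-in for a formatting server that is currently down."""
    raise ConnectionError("formatting server not responding")


def plain_backup(raw: str) -> str:
    """Hypothetical stand-in for a simpler backup formatter (no logos)."""
    return "INVOICE\n" + raw
```

Calling `format_with_fallback(text, [flaky_primary, plain_backup], spool)` returns the backup's output while the primary is down; with no working formatter at all, the text lands in `spool` instead of being lost.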

My boss saw what I was doing and expressed the stance that I was wasting my 
time since it only acts up every month or so.  I then rebutted with the 
position that we could not accurately predict when it would or wouldn't 
work.  Murphy's law then dictates that the system will fail at the worst 
possible time (e.g. high workload, no one able to make it in to fix 
it).  I was still told to stop what I was doing.  Fortunately I had 
documented this.

Several months later, I was out of town and my boss was delayed in getting 
to the site when the system went down.  The end loss was several hours of 
labor: no invoices could be printed, so trucks could not be loaded, and 
so on.  I returned to a small inquisition about the failure in the system.  
It seems that my boss was busy covering himself.  (He claimed that I, or 
someone, was supposed to restart the system every week.  This was 
documented to have no predictable effect: restarting fixed the problem once 
it occurred, but would not necessarily prevent it.)

I quietly showed "upper mgmt." the documentation I had kept with the line 
"we can not accurately predict the reliability of this system".  Nothing 
was done until several months later when the scenario repeated itself.  We 
have since looked into other solutions and, with cost being a major factor, 
my original idea of a redundant system was put into place.  The system 
still goes down from time to time, but it is not a crippling event anymore.


The underlying point might be that just because a system is prone to 
failure doesn't mean it can/should be replaced.  In our case it was not 
cost effective to replace the system, but it was cost effective to add 
redundancy and increase the level of reliability.
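
To put rough numbers on that, here is a small sketch.  It assumes the redundant copies fail independently of one another, which real systems only approximate, and the figures are illustrative rather than measured; `combined_availability` is a made-up helper name.

```python
def combined_availability(availabilities):
    """Availability of a set of redundant components, any one of which suffices.

    Assumes independent failures: the whole set is down only when every
    component is down at the same time.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


# Illustrative figures only: one unit that is up 90% of the time,
# versus two such units side by side.
single = combined_availability([0.90])
paired = combined_availability([0.90, 0.90])
```

Under these assumptions a second 90% unit brings the pair to roughly 99%, which is the sense in which adding redundancy can be cheaper than replacing an unreliable part outright.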

"Sometimes the obvious answer is not the correct one." (me)

Hope this helps... someone :-)
- Bennett

At 01:34 PM 1/8/00 -0500, Mark R. Lindsey manipulated the electrons to say:
>I'm working on a theory: if you can't be assured that a subsystem is going
>to work all of the time, then you can't be assured that a subsystem is
>going to work any of the time.
>
>Does that seem reasonable?
>
>When I say `subsystem' here, it's for lack of a better term to describe
>something atomic; e.g., a source of power for computer X, or a database
>server for application Y, &c. Obviously, everything depends on something
>else, and I'm not talking here about an analysis that extends up a
>reliability tree.