Re: Reliability and assurances
There may be underlying points with this as well. I have had to deal with
a situation that touches on this. Please allow me to elaborate...
We use "computer generated" forms; that is, the database server sends ASCII
data to a formatting server that turns the raw text into our invoices,
complete with logos, etc. The problem that we faced is that the formatting system is
prone to going down for various reasons. Because of this someone (usually
me) gets called in the middle of the night when it doesn't work. Since
"once is enough" for me I came up with a way to add some redundancy to the
system so that I could deal with the problems in the morning and not in the
middle of the night. (This was not simple, and did take some time.)
My boss saw what I was doing and expressed the stance that I was wasting my
time since it only acts up every month or so. I then rebutted with the
position that we could not accurately predict when it would or wouldn't
work. Murphy's law then dictates that the system will fail at the worst
possible time (e.g., high workload, no one able to make it in to fix
it). I was still told to stop what I was doing. Fortunately I had
documented this.
Several months later, I was out of town and my boss was detained in getting
to the site when the system went down. The end loss was several hours of
labor as no invoices could be printed, therefore trucks could not be
loaded, etc. etc. I returned to a small inquisition about the failure in
the system. It seems that my boss was busy covering himself (claiming that
I/someone was supposed to restart the system every week... this was
documented to have no predictable effect, e.g., restarting fixed the problem
but would not necessarily prevent it).
I quietly showed "upper mgmt." the documentation I had kept with the line
"we can not accurately predict the reliability of this system". Nothing
was done until several months later, when the scenario repeated itself. We
have since looked into other solutions and, with cost being a major factor,
my original idea of a redundant system was put into place. The system
still goes down from time to time, but it is not a crippling event anymore.
The underlying point might be that just because a system is prone to
failure doesn't mean it can/should be replaced. In our case it was not
cost effective to replace the system, but it was cost effective to add
redundancy and increase the level of reliability.
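To put rough numbers on that, here is a minimal Python sketch of the usual
parallel-redundancy arithmetic. It assumes the redundant units fail
independently and that any one working unit is enough to print invoices.
The 99% figure and the function names are made up for illustration; they
are not measurements from our system.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant copies when any one suffices.

    Assumes failures are independent, which real systems only approximate.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    # The whole setup is down only when every unit is down at once.
    return 1.0 - (1.0 - unit_availability) ** units


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year in which nothing is available."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical: each formatting server is up 99% of the time.
    for n in (1, 2, 3):
        a = parallel_availability(0.99, n)
        print(f"{n} unit(s): availability {a:.6f}, "
              f"about {expected_downtime_hours(a):.2f} hours down per year")
```

Under those assumptions a second unit cuts expected downtime by a factor
of a hundred without making either unit any more reliable, which is
roughly what we saw: it still goes down, it just doesn't stop the trucks.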
"Sometimes the obvious answer is not the correct one." (me)
Hope this helps... someone :-)
- Bennett
At 01:34 PM 1/8/00 -0500, Mark R. Lindsey manipulated the electrons to say:
>I'm working on a theory: if you can't be assured that a subsystem is going
>to work all of the time, then you can't be assured that a subsystem is
>going to work any of the time.
>
>Does that seem reasonable?
>
>When I say `subsystem' here, it's for lack of a better term to describe
>something atomic; e.g., a source of power for computer X, or a database
>server for application Y, &c. Obviously, everything depends on something
>else, and I'm not talking here about an analysis that extends up a
>reliability tree.