Re: [SAGE] Strategies for taking ownership of existing infrastructure?



Jesús,

I second all the advice so far.

Life may be stressful at first, especially if a machine goes down, you 
lose power, or you have to reboot when you'd rather not... and the 
infrastructure doesn't come all the way up.

Once something like this happens, your life won't be the same until you 
know that you can freely reboot all your servers, and you know what their 
interdependencies are.  In a "loose" environment, many servers can be 
put up, run for 400+ days, and things can get to the point where you 
realize you don't exactly know how to shut things down and bring them up 
in an orderly fashion, even when your own team set it all up!
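
Once you have written down which server needs which, working out an 
orderly bring-up and shutdown sequence can be automated.  Here is a 
minimal sketch in Python, assuming you keep the interdependencies as a 
simple table.  The host names ("dns", "nfs", "db", "web") and the 
function names are made up for illustration; they are not from any 
particular site or tool.

```python
from collections import deque


def boot_order(depends_on):
    """Return hosts so that each one comes after everything it depends on.

    depends_on maps a host name to the hosts that must be up before it.
    """
    # Collect every host mentioned, whether as a key or as a dependency.
    hosts = set(depends_on)
    for deps in depends_on.values():
        hosts.update(deps)

    # remaining[h] = dependencies of h that are not yet up.
    remaining = {h: set(depends_on.get(h, ())) for h in hosts}
    # dependents[h] = hosts that are waiting on h.
    dependents = {h: [] for h in hosts}
    for host, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(host)

    # Start with hosts that need nothing; sort for a repeatable result.
    ready = deque(sorted(h for h, deps in remaining.items() if not deps))
    order = []
    while ready:
        host = ready.popleft()
        order.append(host)
        for waiting in sorted(dependents[host]):
            remaining[waiting].discard(host)
            if not remaining[waiting]:
                ready.append(waiting)

    # Anything left over is part of a dependency loop.
    if len(order) != len(hosts):
        stuck = sorted(h for h in hosts if h not in order)
        raise ValueError("circular dependency among: " + ", ".join(stuck))
    return order


def shutdown_order(depends_on):
    """Shut down in the reverse of the bring-up order."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    # Hypothetical example site.
    site = {
        "dns": [],
        "nfs": ["dns"],
        "db": ["dns", "nfs"],
        "web": ["db", "nfs"],
    }
    print("bring up: ", " ".join(boot_order(site)))
    print("shut down:", " ".join(shutdown_order(site)))
```

A circular dependency (two servers that each wait on the other) raises 
an error instead of producing an order, which is exactly the sort of 
thing you would rather find out on paper than during an outage.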

My advice to you is to tell management that you can't promise /anything/ 
until you have done some fire drills.  Always inform your users and 
management when you must reboot an unknown server, and tell them that 
you will know better NEXT TIME how long it will be before things are 
back on-line.

Also insist on carving out some time at night on the weekend where you 
have scheduled downtime if you need it.  Even though many sites want 
24x7 (because they think they can have it), build a case for having 
everything unavailable from 11 PM Saturday to 6 AM Sunday, for example, 
then use it.

Remain calm if something fails to come up.  Use your head.  Talk things 
through with a user of the system - even if they don't know system 
admin, they may prove to be a useful sounding board as you try to 
explain what isn't working.  (They won't have anything better to do 
anyway, if their server is down.  :^)

Have fun!

John Miller
http://www.metro-region.org