Re: [SAGE] Pros/Cons of complete rejumstart v.s. just patching OS images
Twas brillig, on Sun Jan 06 at 08:47:37 AM, and Mark R. Lindsey burbled:
> Fortunately, if you can decouple the OS from the configuration,
> the OS needn't be `backed up' in the conventional sense -- you can
> always re-install it.
>
> Next, the system configuration needs to be applied. cfengine provides
> a good framework for this task; with it, you can conveniently write
> scripts that encode your configuration. Then on a newly-installed/restored
> OS installation, you can run cfengine with the scripts you've written
> to restore the configuration.
One additional reason to keep your OS configs out of your automated
install mechanism is that, in a multi-vendor shop, you end up duplicating
a lot of effort whenever you need similarly-configured hosts running
different OSes.
We do as little configuration in the kickstart/jumpstart/ignite/ris/etc.
scripts as possible.
This also makes it much easier to migrate to a new OS for a given service,
and it makes it easier to determine that, for example, you've applied the
correct changes to the sendmail.cf on all the hosts, not just your Solaris
hosts.
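As a rough illustration of that last point, here is a minimal Python sketch of checking one config file across every host, whatever OS it runs. The host names, the inventory layout, and the find_mismatches helper are all hypothetical; in practice the checksums would come from whatever tool you already use to collect them.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Return the hex SHA-256 digest of a config file's contents."""
    return hashlib.sha256(content).hexdigest()


def find_mismatches(expected: str, reported: dict) -> list:
    """Return the sorted host names whose checksum differs from `expected`.

    `reported` maps host name -> (os_name, checksum). The OS name is kept
    only for reporting; the comparison itself ignores it.
    """
    return sorted(
        host for host, (_os, digest) in reported.items() if digest != expected
    )


# Hypothetical inventory: three hosts, three OSes, one stale sendmail.cf.
good = checksum(b"# sendmail.cf, current revision\n")
stale = checksum(b"# sendmail.cf, previous revision\n")
inventory = {
    "mail1": ("solaris", good),
    "mail2": ("linux", good),
    "mail3": ("hpux", stale),
}
print(find_mismatches(good, inventory))
```

Because the check is keyed on the file rather than on the install script, the same call covers the Solaris hosts and everything else.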
There are a lot of different data types to deal with, and sometimes
variations within a type (your user accounts may need to be updated
more frequently than, for example, /etc/hosts).
The considerations we've used to distinguish between data types are:
- recreatability
    (solutions include no backups, tape backups, hot/warm/cold
    spares and failovers, or simply making a copy to another
    machine, possibly in another datacenter/country)
    - does it need to be backed up?
        - if so, how fast do we need to get at it, and how far back
          in time?
- distribution requirements
    (solutions include NFS, local disk, pushing the data,
    pulling the data, and touching each host individually)
    - how synchronized across hosts/datacenters/countries?
        - how much 'drift' is tolerated?
    - how fast do changes have to get out there...
        - normally?
        - in rare 'emergency' situations?
    - who can change/distribute the data?
    - are there other security requirements?
    - do we need to distribute changes sometimes to just a
      subset of the targets?
    - do changes need to be tested first, somehow?
    - can other entities (other than the target host) have
      access to the data?
    - if a host is down during a data change...
        - do we have to know it was down?
        - can the change wait until the host gets back up?
        - does it need to get the update before it begins
          normal operations?
            - if not, how long can it go?
    - do we have a complete list of the target hosts?
        - what happens when the list changes?
        - how likely is it to be kept up to date?
    - what happens if a host NEVER gets a change?
- volatility and size
    - how much does it change, if ever?
    - how big does it get...
        - normally?
        - at peak?
    - how big is the peak?
    - any seasonal considerations?
        - i.e. predictable variations in the above at month end,
          year end, etc.
(This probably isn't the optimal arrangement of these issues; I don't have
this written down succinctly anywhere, my apologies.)
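One way to keep such a checklist honest is to write it down as a record per data type and see which questions are still unanswered. The following is only a sketch in Python: the DataProfile class, its field names, and the unanswered helper are hypothetical, and they cover just a handful of the questions above. The sample values are the user-account answers given in this message.

```python
from dataclasses import dataclass, fields
from datetime import timedelta
from typing import Optional


@dataclass
class DataProfile:
    """Answers to a few of the checklist questions for one data type.

    A field left as None means the question has not been answered yet.
    """
    name: str
    per_host_backup: Optional[bool] = None          # recreatability
    max_drift: Optional[timedelta] = None           # tolerated drift, normally
    emergency_push_needed: Optional[bool] = None    # rare 'emergency' changes
    readable_only_by_target: Optional[bool] = None  # access by other entities
    update_before_normal_ops: Optional[bool] = None # host was down for a change
    peak_size_bytes: Optional[int] = None           # volatility and size


def unanswered(profile: DataProfile) -> list:
    """Return the names of fields that are still None, in declaration order."""
    return [f.name for f in fields(profile) if getattr(profile, f.name) is None]


accounts = DataProfile(
    name="user accounts",
    per_host_backup=False,
    max_drift=timedelta(hours=4),
    emergency_push_needed=True,
    readable_only_by_target=True,
)
print(unanswered(accounts))
```

Anything the helper reports is a question nobody has decided yet for that data type, which is usually where the surprises come from.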
A good example is user accounts:
We don't need these backed up on an individual machine basis, but the
central datastore needs to be recoverable very quickly.
We've decided we can have a 'drift' of up to 4 hours normally, but in an
emergency situation (e.g. a high-profile or high-access adverse employee
termination) we need to be able to get the update out there in X minutes
instead.
No one other than the target host should have access to the crypted password
strings (we use shadowed passwords.)
etc. etc. etc.
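The drift requirement can be checked mechanically. Here is a minimal Python sketch, assuming you can get a last-sync timestamp per host from somewhere; the overdue_hosts helper, the host names, and the timestamps are hypothetical. The 4-hour window is the one above; the emergency window is a pure placeholder, since X is whatever your site decides.

```python
from datetime import datetime, timedelta

NORMAL_DRIFT = timedelta(hours=4)  # the normal tolerance stated above


def overdue_hosts(last_sync: dict, now: datetime, window: timedelta) -> list:
    """Return the sorted hosts whose last sync is older than `window`."""
    return sorted(host for host, t in last_sync.items() if now - t > window)


# Arbitrary reference time and hypothetical sync timestamps.
now = datetime(2000, 1, 1, 12, 0)
last_sync = {
    "hostA": now - timedelta(minutes=5),
    "hostB": now - timedelta(hours=1),
    "hostC": now - timedelta(hours=5),
}

# Placeholder only: stands in for the unspecified "X minutes".
emergency_window = timedelta(minutes=30)

print(overdue_hosts(last_sync, now, NORMAL_DRIFT))
print(overdue_hosts(last_sync, now, emergency_window))
```

The same function serves both cases; only the window changes, which makes it easy to see which hosts would fail the emergency requirement even though they are fine under normal drift.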
I've found that looking at our data (OS, OS configs, user-land data,
databases, home-grown software, open source software, commercial software,
etc.) with these considerations in mind has made it much easier to figure
out where in the lifecycle you should make your changes, how you should make
them, and with what tools.
It's easy to over- or under-engineer a solution and end up causing
unnecessary expenses (either through too much network bandwidth, or through
not being able to fix a problem in a timely manner.)
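To make the bandwidth side of that trade-off concrete, here is a back-of-the-envelope Python sketch. It assumes the simplest possible scheme, where every host pulls the full payload once per interval; the function name and all of the numbers (500 hosts, a 200 kB payload) are made up for illustration.

```python
def daily_transfer_bytes(hosts: int, payload_bytes: int, interval_hours: float) -> float:
    """Bytes moved per day if every host pulls the full payload each interval."""
    return hosts * payload_bytes * (24 / interval_hours)


# Hypothetical site: 500 hosts pulling a 200 kB file at three intervals.
for interval in (4, 1, 0.25):
    mb = daily_transfer_bytes(500, 200_000, interval) / 1e6
    print(f"{interval:>5} h interval: {mb:,.0f} MB/day")
```

Shrinking the interval from 4 hours to 15 minutes multiplies the traffic by 16 under this model, which is the kind of cost worth weighing against how quickly a change really has to land.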
-M.