
Re: [SAGE] number of eggs in a basket



On Thu, 6 Jan 2005, Jan Schaumann wrote:

> Hi,
>
> I'd like to get some opinions regarding best practices for mission
> critical systems with multiple services.
>
> I have a system that basically is a single point of failure:  if it's
> down, nothing goes.  The services on that machine are WWW, NIS, NFS and
> mail.  Mail is delivered to ~/.mail so mail can be read via NFS and need
> not be fetched.
>
> I do not like having all my eggs in this one basket, but on the other
> hand distributing the services to several machines seems to complicate
> things and increase the likeliness of one of the services failing.
>
> So... what are your comments/experiences?  How many eggs do you keep in
> your basket(s)?
>

I like to keep my eggs in a quantum state hovering between two baskets,
but unlike Schrodinger's cat, if you look in the one basket, you can
guarantee they'll be in the other.

Anyway, service redundancy/failover is a good thing if available.
With NIS, that's easy: you can have slave servers. (If the master is down
for a while, it can be a non-issue with the right architecture.)

With mail, have a backup MX server in case the primary is down. With
things like Cyrus Murder, you can have a lot of redundancy at the cost
of a little more complication and architecture. There are lots
of options. If you have a SAN or network disk space, you can always
take over mail service via cold failover to another machine. If not,
have parts ready for cold swapping of disks, cables, etc.
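
To make the backup MX idea concrete, here is a minimal Python sketch of how a
sending host walks a domain's MX records: lowest preference number first,
ties in random order, moving on when a host does not answer. The host
names and the try_host callback are hypothetical stand-ins, not part of
any particular MTA.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple

MxRecord = Tuple[int, str]  # (preference, host); lower preference is tried first


def delivery_order(mx_records: Iterable[MxRecord]) -> List[str]:
    """Return MX hosts in the order a sender should try them."""
    records = list(mx_records)
    random.shuffle(records)            # randomise hosts that share a preference
    records.sort(key=lambda r: r[0])   # stable sort keeps the shuffle within ties
    return [host for _, host in records]


def deliver(mx_records: Iterable[MxRecord],
            try_host: Callable[[str], bool]) -> Optional[str]:
    """Try each MX host in order; return the first that accepts, else None."""
    for host in delivery_order(mx_records):
        if try_host(host):
            return host
    return None  # nobody answered: the sender queues the mail and retries later


if __name__ == "__main__":
    # Hypothetical records: the primary is down, so the backup takes the mail.
    records = [(20, "backup.example.org"), (10, "primary.example.org")]
    print(deliver(records, lambda host: host != "primary.example.org"))
```

The backup only has to hold the mail until the primary returns, so it can
be a much smaller machine.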

With WWW, the top end is load balancers that detect failures
automatically. Then there's replication of content, clustering,
and cold sparing. Have a plan for what to do if the web service
is unavailable. Things like freeHA (hi Phil) would be a good thing
for this too.
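
The failure detection a load balancer does can be sketched roughly as
below. This is an illustration only, assuming a plain TCP connect is a
good enough health check; real balancers usually probe at the HTTP level
and cache the results. The backend names and the function names are
made up for the example.

```python
import socket
from typing import Callable, Iterable, List, Optional, Tuple

Backend = Tuple[str, int]  # (host, port)


def tcp_probe(backend: Backend, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend succeeds in time."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure
        return False


def healthy_backends(backends: Iterable[Backend],
                     probe: Callable[[Backend], bool] = tcp_probe) -> List[Backend]:
    """Keep only the backends that currently pass the probe."""
    return [b for b in backends if probe(b)]


def pick_backend(backends: Iterable[Backend], request_number: int,
                 probe: Callable[[Backend], bool] = tcp_probe) -> Optional[Backend]:
    """Round-robin over the healthy backends; None if all are down."""
    alive = healthy_backends(backends, probe)
    if not alive:
        return None  # time for the "web service is unavailable" plan
    return alive[request_number % len(alive)]


if __name__ == "__main__":
    pool = [("www1.example.org", 80), ("www2.example.org", 80)]  # hypothetical
    print(pick_backend(pool, request_number=0))
```

The probe is passed in as a parameter so the selection logic can be
exercised without touching the network.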

NFS: read-only is much easier than read-write. In some
operating systems you can have automatic failover if a read-only
NFS server fails. For read-write, you're more apt to have a problem
with stale file handles. Those can be difficult to deal with.
Have a good RAID solution and resilient hardware (redundant
power supplies, and parity RAM banks (not chips, though ECC chips are
good) if available). Multipathing network is ~free these days in
many OSes. Have parts available.

That's not to say you need 6 machines. It might
be that you have one master machine for each service and then
one 'spare parts' machine. Or you might have one for each service, or
clusters. Figure out which services are most important to you and your
users and start beefing them up one by one. Which one stops business?
Which one is merely inconvenient? Many places can suffer a 10 minute
mail outage without a hue and cry. Internal vs external (customer-facing)
web makes a big difference.  NIS is resilient out of the box with slaves
(particularly if you don't have frequent updates).

There's no one answer. Do your risk analysis. Try some of the
low-hanging fruit first (e.g. NIS). More machines don't have to mean
more complication. Have a good distributed management plan. Make use of
things like JumpStart/Kickstart/FAI/etc. Automate. Even with a minimal
budget you can have really good uptimes.
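
For the risk analysis, a back-of-the-envelope calculation helps with the
worry in the original question that more machines means more failures.
The sketch below assumes machines fail independently and uses a made-up
99% availability per machine. Both are assumptions for illustration,
not measurements.

```python
from typing import Iterable


def all_up(availabilities: Iterable[float]) -> float:
    """Probability that every independent machine is up at the same time."""
    p = 1.0
    for a in availabilities:
        p *= a
    return p


def redundant(availability: float, copies: int = 2) -> float:
    """Availability of a service when any one of `copies` replicas suffices."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    a = 0.99  # hypothetical per-machine availability

    # One basket: all four services live or die with a single machine.
    print("everything up, one machine:   ", a)

    # Four baskets: more chances for *something* to be broken...
    print("everything up, four machines: ", all_up([a] * 4))

    # ...but a failure now takes out one service instead of all four,
    # and a second copy of a service makes that service far more available.
    print("one service, two replicas:    ", redundant(a, 2))
```

With these numbers, "everything up at once" drops from 0.99 to about 0.96
when the services are split across four machines, while any single service
with a second replica rises to about 0.9999. Whether that trade is worth it
depends on which service stops business and which is merely inconvenient.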