
Re: [SAGE] number of eggs in a basket (sun NFS please read)



On Thu, Jan 06, 2005 at 04:50:22PM -0600, Doug Hughes wrote:
> 
> NFS - for read-only, NFS is much easier than read-write. In some
> operating systems you can have automatic failover if a read-only
> NFS server fails.


Speaking of which... I just found out something really, really annoying
about Sun.

Sun's NFS client failover is broken.
Has been for years. At least since Solaris 8, probably longer.
Furthermore, Sun has KNOWN ABOUT IT for over a year, but won't fix it,
because "it's too much work".

On top of that, their NFS engineering staff had the gall to tell me,
"well the documentation doesn't explicitly SAY it behaves
 [the way anyone here would expect it to], so that isn't a bug, that's just
 undocumented behaviour. We'll _fix_the_documentation_".


The problem:

 Take an NFS client, configured to do read-only failover between
  servers "nfs1" and "nfs2"

 make nfs1 unavailable

 reboot your NFS client (or manually unmount and remount the filesystem)

 bring nfs1 back up

 take nfs2 offline

No problem, right? Because after all, you always had at least one
NFS server available, right?
Wrong.
Your NFS client is now dead in the water. It will NOT fail over to nfs1,
since nfs1 was not up at the time you mounted the NFS filesystem.
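
To make the sequence concrete, here is a toy Python model of the behaviour
as described above. It is only a sketch: the class and method names are
made up for illustration, and the "remember only the servers that answered
at mount time" rule is my assumption about the mechanism, not Sun's actual
client code.

```python
class FailoverClient:
    """Toy model of a read-only NFS failover client (hypothetical names)."""

    def __init__(self, replicas):
        self.replicas = list(replicas)  # servers named in the mount
        self.usable = []                # servers the client will ever try
        self.current = None

    def mount(self, up):
        # Assumption: only servers reachable at mount time are remembered.
        self.usable = [s for s in self.replicas if s in up]
        self.current = self.usable[0] if self.usable else None
        return self.current

    def read(self, up):
        # Fail over, but only among the servers recorded at mount time.
        if self.current in up:
            return self.current
        for server in self.usable:
            if server in up:
                self.current = server
                return server
        return None  # a real client would hang here; the model returns None


def replay():
    """Replay the five steps listed above and report what happens."""
    client = FailoverClient(["nfs1", "nfs2"])
    up = {"nfs2"}                # nfs1 is unavailable
    mounted = client.mount(up)   # client reboots, or unmounts and remounts
    up = {"nfs1", "nfs2"}        # nfs1 comes back up
    up = {"nfs1"}                # nfs2 goes offline
    return mounted, client.read(up)


def control():
    """Same outage of nfs2, but both servers were up at mount time."""
    client = FailoverClient(["nfs1", "nfs2"])
    client.mount({"nfs1", "nfs2"})
    return client.read({"nfs1"})
```

In this model replay() mounts from nfs2 and then finds nothing to read
from, even though nfs1 is up, while control() fails over to nfs1 as
you would expect.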


So in other words, if "hypothetically", half of your datacenter crashed,
and all of your low-load clients happened to come up before your big beefy
NFS server... and then your other NFS server happened to go offline for
whatever reason... you now have half of your datacenter out of action
again.

Good thing this sort of thing never happens twice in one month.
Oh, yeah.



BugID 4931782. Open since at least Aug 2003, I'm told.

For anyone who has a gold level contract, would you please take a minute
to file an escalation on this bug ID, and also note that as an enterprise
customer, you are shocked that Sun is deliberately choosing not to fix this
bug.

We have already filed an escalation. But apparently, only one customer filing
a support ticket, against an issue that has taken 30 production servers
offline, isn't seen as something that's worth fixing at the code level by
Sun.

It's a publicly visible bug, btw:

http://sunsolve.sun.com/search/document.do?assetkey=1-1-4931782-1

The last sentence of the pre-existing bug report is particularly succinct:

"Basically, the end result is that the -ro failover is only useful when
 both servers remain up at all times, which defeats the purpose."