Re: [SAGE] simple database problem
At 12:42 2005-01-03 -0500, Andrew Hume wrote:
> i have an application which needs to maintain a mapping of
>(name,md5sum) to pathname for 10k-1000k mappings.
>currently, we convert the key to a string, and use gdbm (or ndbm).
>(the application runs on Linux, FreeBSD, MacOSX, Solaris and Irix.)
>30% of the time, we add a single mapping, 20% of the time we delete
>a mapping, and 50% of the time, we print out all mappings.
>rarely, we add or delete a largish number of mappings.
>
> the problem is that on Linux (actually, i could just stop here, couldn't I?),
>the 'print all' operation can take 30mins or more on a busy machine,
>(busy here means lots of I/O) as opposed to the normal 2-3secs,
>apparently because of the random seeking around in the database file.
>performance is significantly helped by simply running 'wc db.dbm'
>just prior to using the database.
>
> is there a better way to implement this database that will not be prone
>to this kind of 'failure' (and make no mistake, taking 30mins is for all
>intents and purposes, a failure)? of course, this does not manifest itself
>on any of our other platforms, but Linux performance has always been
>unusually fragile with respect to the contents of the buffer cache.
It shouldn't need to be seeking all over the database simply to read out
all the contents. You aren't perchance sorting the keys, and then accessing
the elements, are you?
From perl, the difference between:
    foreach $i (keys %db)
and
    foreach $i (sort keys %db)
(where %db has been associated with the dbm file) can be extreme.
Basically, you should access the database elements in whatever order dbm
wants to give them to you, then sort them later if that's what you want.
(Note: my experience with this is very dated... it may have changed since
the mid-90s. :-)
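
As a rough illustration of "read in the library's order, sort afterwards", here is a minimal sketch in Python rather than perl, using the standard-library dbm.dumb module as a stand-in for gdbm/ndbm. The function name dump_sorted, the "name:md5sum" key format and the sample paths are all made up for the example; they are not from the original application.

```python
import dbm.dumb
import os
import tempfile


def dump_sorted(path):
    """Return all (key, value) pairs, sorted, reading the store only once."""
    with dbm.dumb.open(path, "r") as db:
        # Fetch records in whatever order the library hands out the keys.
        records = [(key, db[key]) for key in db.keys()]
    # The sort runs on the in-memory list; the database is already closed.
    records.sort()
    return records


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    store = os.path.join(workdir, "mappings")  # hypothetical store name
    with dbm.dumb.open(store, "c") as db:
        # Hypothetical "name:md5sum" keys mapped to pathnames.
        db[b"zeta:ffff"] = b"/data/zeta"
        db[b"alpha:0000"] = b"/data/alpha"
    for key, value in dump_sorted(store):
        print(key.decode(), value.decode())
```

With a real gdbm or ndbm file the same shape should apply: one pass over the keys as the library yields them, then a sort of the collected list. Whether that removes the seeking seen on Linux would need to be measured on the actual database.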
Greg.