Tuesday, July 14, 2009

Re: [BLUG] Large RAID config suggestions?

Wow, great information. I'd never heard of RAID 5EE before, so I
looked it up and found some interesting information about it on
Wikipedia, of course. I think you should read the cons about it before
using it:

http://en.wikipedia.org/wiki/Non-standard_RAID_levels

The whole recompression of the array after a drive failure, which can
take hours or days, sounds VERY risky. I wouldn't do it unless you are
experimenting. Some of these non-standard RAID levels are just companies
coming up with new combinations to have an extra feature over the
competition; they aren't necessarily good things.

Also, PostgreSQL is different from MySQL when it comes to the number
of files. MySQL has 2 or 3 files per table, whereas PostgreSQL creates
many more (at least one per table and index), though the total still
may not get all that high. I have a decent-sized database with lots of
data and somewhere over 120 tables, and it's only 2000 files in
/var/lib/pgsql. But if you are doing a large database like FlyBase, you
might want to keep an eye on how many files you're using on the
filesystem, like this:

find /var/lib/pgsql -type f | wc -l
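
If you want a rough idea of where those files come from, you can
compare that count against the number of relations PostgreSQL itself
knows about. Every table, index, TOAST table, and sequence gets at
least one file on disk (more once a relation grows past 1 GB), so
something like this gives a ballpark cross-check. The "yourdb" name is
just a placeholder, and it assumes you can connect as the postgres
user:

psql -U postgres -d yourdb -c "SELECT count(*) FROM pg_class WHERE relkind IN ('r','i','t','S');"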

Running "df -i" can quickly show you inode usage, but not all
filesystems have a fixed inode limit, so the numbers aren't always
meaningful.
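
For example, this will report on whichever filesystem holds the
PostgreSQL data directory:

df -i /var/lib/pgsql

The Inodes/IUsed/IFree columns tell you how close you are to the
limit; on ext3 you can run out of inodes even while "df -h" still
shows plenty of free space.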

It's nice talking about sysadmin-type stuff on the BLUG list again;
we don't do it enough.

Mark

On Tue, Jul 14, 2009 at 03:08:38PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
> Hi Mark, Steven, and David,
>
> I have been bit by problems with large file systems as well in the past. Some related to
> performance and some just plain bugs in kernel code that were specific to RHEL v4. My plan is to
> break it up as much as possible but I hadn't considered using multiple volume groups. In terms of
> databases, we use both PostgreSQL and MySQL. PostgreSQL is used for our production database and
> manages all the reading, writing, and management of the data, whereas MySQL is used for a read only
> denormalized version of the same data for backing web applications. More generally speaking, this
> is for FlyBase (flybase.org), which is a database for Drosophila (fruit fly) genetics. In the past
> we have gotten by with much more modest disk needs but the advent of fairly cheap and fast DNA
> sequencing technologies coupled with some other new techniques is pushing the limit quite quickly.
>
> I did consider ZFS and I have a fellow admin in our department who has been trying to proselytize me
> in that direction. He has deployed it quite successfully and at quite a large scale. However, I
> decided against it because I'd prefer to not manage a mix of Solaris/Linux systems and we would have
> to spend a fair amount of time porting and debugging code on OpenSolaris. For now I'm keeping an
> eye on some of the Linux ZFS projects (http://en.wikipedia.org/wiki/ZFS#Linux) and hoping for the best.
>
> For backups, we benefit from being part of the university so we can partake of the MDSS for our
> backup needs. As Steven probably knows, the MDSS is a large capacity tape library system that
> provides offsite backups. It also copies the data in duplicate (one here in Bloomington and one up
> at IUPUI) to reduce the impact of pesky tape problems. If interested, more info on the MDSS can be
> found here http://kb.iu.edu/data/aiyi.html. We send data to it over the network via their tar like
> client so backups are fairly easy to implement and quite fast.
>
> I did consider a hot spare(s), but I was hoping that a RAID 60 would provide enough fault tolerance
> with up to 2 disk failures per set (6 across the entire RAID) to allow me to get by in degraded mode
> until I can swap in a new disk. I bought 6 extra spares with this in mind but it might be worth
> reconsidering. I did use a hot spare for the system drives (4x 146 GB drives). Specifically, I
> went with a RAID 5EE setup which stripes the spare across all drives instead of relying on a
> dedicated spare waiting for a failure. It is my first use of RAID 5EE so we will see how it works out.
>
> Thanks for all your excellent comments.
>
> Cheers,
> Josh
>
> Mark Krenz wrote:
> > Wow, that's a lot of disks.
> >
> > I have one major suggestion. Don't make one big filesystem. Don't
> > even make one big Volume Group. With that much space, I'd recommend
> > dividing it up somehow, otherwise if you need to recover, it can take a
> > day or more just to copy files over.
> >
> > At last year's Red Hat Summit, Rik van Riel gave a presentation called
> > "Why Computers Are Getting Slower". He mostly talked from a low level
> > point of view since he's a kernel developer, which was great. One of
> > the things that he talked about is how we're getting to the point where
> > filesystem sizes are getting too large for even fast disks to handle a
> > recovery in a reasonable amount of time. And the algorithms need to be
> > better optimized. So he recommended breaking up your filesystems into
> > chunks. So on a server your /home partition might be /home1 /home2
> > /home3, etc., and on a home machine you probably should put media files
> > on a separate partition or maybe even break that up. Plus, using volume
> > management like LVM is a necessity.
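> >
> > For example, something along these lines (just a rough sketch; the
> > device, volume group, and logical volume names are placeholders for
> > whatever your setup uses):
> >
> > pvcreate /dev/sdb1 /dev/sdc1
> > vgcreate vg_home1 /dev/sdb1
> > vgcreate vg_home2 /dev/sdc1
> > lvcreate -L 200G -n home1 vg_home1
> > lvcreate -L 200G -n home2 vg_home2
> > mkfs.ext3 /dev/vg_home1/home1
> > mkfs.ext3 /dev/vg_home2/home2
> >
> > That way an fsck or a restore only ever has to deal with one chunk
> > at a time.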
> >
> > On something like a mail server, with lots of little files, you may
> > have millions of files and copying them over takes a lot of time, even
> > on a SAN. I recently had to recover a filesystem with 6 million files
> > on it and it was going to take about 16 hours or more just to copy stuff
> > over to a SATA2 RAID-1 array running on a decent hardware raid
> > controller, even though it was only about 180GB of data. This was direct
> > disk to disk too, not over the network. What I had to do instead was
> > some disk image trickery to get the data moved to a new set of disks (I
> > had a RAID controller fail). If I had lost the array, I couldn't have
> > done it this way.
> >
> > It was the first time I've had to do such a recovery in a while and
> > wasn't expecting it to take so long. Immediately afterwards I decided
> > that we need to break up our filesystems into smaller chunks and also
> > find ways to reduce the amount of data affected if a RAID array is lost.
> >
> > The short answer is that you can build all kinds of redundancy into
> > your setup, but can still end up with the filesystem failing or
> > something frying your filesystem that leads to major downtime.
> >
> > Of course this mythical thing called ZFS that comes with the J4400 may
> > solve all these problems listed above.
> >
> > What kind of database system are you going to be using?
> >
> > Mark
> >
> > On Mon, Jul 13, 2009 at 06:45:54PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
> >> Hi all,
> >>
> >> I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
> >> and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
> >> hardware RAID controller. The disk space will be used for NFS and a database server with a slight
> >> emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
> >>
> >> Here are 2 initial configuration ideas:
> >>
> >> * RAID 50 (4x RAID 5 sets of 6 drives)
> >> * RAID 60 (3x RAID 6 sets of 8 drives)
> >>
> >> I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
> >> RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
> >> route to go. I'm interested in hearing what others have to say especially if I've overlooked other
> >> possibilities.
> >>
> >> I'm off to start simulating failures and benchmarking various configurations.
> >>
> >> Cheers,
> >> Josh
> >>
>

--
Mark Krenz
Bloomington Linux Users Group
http://www.bloomingtonlinux.org/
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug
