Tuesday, July 14, 2009

Re: [BLUG] Large RAID config suggestions? (This e-mail explains RAID for newbies)

On Tue, Jul 14, 2009 at 07:29:39PM +0000, Mark Krenz wrote:
> On Tue, Jul 14, 2009 at 04:51:04PM GMT, Steven Black said the following:
> > I mean, with a hotspare, the array is rebuilt on to the hotspare when a
> > drive fails. That's one major I/O operation.
>
> Just to make sure it's clear. Technically, the array isn't rebuilt (as
> in the data that didn't fail stays put), it just writes the missing data
> onto the spare drive based on parity calculations of what is on the
> other drives.

You are correct. I used the wrong language. The correct term for what
happens to the data on a single drive is just that: the drive gets
"reconstructed", not rebuilt.

Cheers,

--
Steven Black <blacks@indiana.edu> / KeyID: 8596FA8E
Fingerprint: 108C 089C EFA4 832C BF07 78C2 DE71 5433 8596 FA8E

Re: [BLUG] Large RAID config suggestions? (This e-mail explains RAID for newbies)

On Tue, Jul 14, 2009 at 04:51:04PM GMT, Steven Black [blacks@indiana.edu] said the following:
>
> I mean, with a hotspare, the array is rebuilt on to the hotspare when a
> drive fails. That's one major I/O operation.
>

Just to make sure it's clear. Technically, the array isn't rebuilt (as
in the data that didn't fail stays put), it just writes the missing data
onto the spare drive based on parity calculations of what is on the
other drives.

-------------------------------------------------------------------------
Quick explanation of RAID for those who don't know.

RAID, in this case RAID-5, works a bit like having 3 numbers that add up
to another number. If you remove one of the numbers, you can figure out
what it was by subtracting the other 2 from the total:

RAID-5 array intact:

   A    B    C   ParitySum
  10 + 12 +  3 =    25

RAID-5 array with a missing drive C:

   A    B    C   ParitySum
  10 + 12 +  ? =    25

Figuring out what is supposed to be on drive C is as simple as:

  ParitySum - A - B = C
     25     - 10 - 12 = 3

On an actual array this is done in binary (the parity is a bitwise XOR
rather than a decimal sum) and over larger stripes of data, but the idea
is the same. When one of the drives that makes up the array fails, the
data on the filesystem stays intact and you can go on using your
computer while the missing data is recalculated on the fly (this is
called degraded mode). Putting in a replacement drive takes you out of
degraded mode, restores your performance, and means another drive can
fail without data loss. On a server, you usually use this with hardware
that allows you to take out a hard drive without shutting down the
system, like hot swap drive bays.
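
If you want to see the actual mechanism, here is a toy sketch in shell
(the values are just the ones from the example above; a real array does
this per stripe, not per whole drive):

  # RAID-5 parity is the XOR of the data blocks
  A=10; B=12; C=3
  PARITY=$(( A ^ B ^ C ))

  # drive C fails: recompute its contents from the survivors
  C_RECOVERED=$(( PARITY ^ A ^ B ))
  echo "recovered C = $C_RECOVERED"    # prints: recovered C = 3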

The RAID can be done in software or via a specialized device called a
RAID controller. Software RAID can be done in Linux using the mdadm
tool. It's slower and uses the main CPU(s), but it's a cheap way to get
redundancy. Hardware RAID cards take the load off the CPU and can do
things that software RAID can't, like letting you boot from a RAID-5
array.
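
For example, a 3-disk software RAID-5 array with one hot spare can be
created with something like this (the device names are made up; adjust
for your system):

  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        --spare-devices=1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

  # watch the initial build and any later reconstruction
  cat /proc/mdstat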

RAID-1 is simply where the bits on one drive are mirrored to another
drive.

RAID-0 shouldn't be used unless you know what you are doing. I don't
want to even explain it because many people use it incorrectly.

RAID-6 is more complicated, but allows multiple drives to fail. I'll
leave the explanation to Wikipedia:

http://en.wikipedia.org/wiki/RAID-6#RAID_6

Now you know.

--
Mark Krenz
Bloomington Linux Users Group
http://www.bloomingtonlinux.org/
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

Hi Steven and Mark,

You all raise exceptional points. What lured me into the idea of 5EE was the prospect of "enhanced
performance" and my being overly optimistic about what the RAID card could deliver. My initial
tests this morning show that the compression phase would take ~2 hours to complete, followed by
another 3-4 hours for the decompression phase; too risky for my taste.

Our current database setup is a product of many different factors. Our choice of PostgreSQL was in
part dictated by our use of an open source biological database schema called Chado
(http://gmod.org/wiki/Chado) that we helped develop. The mix of PostgreSQL and MySQL came about
because this work started 5 years ago when, as you pointed out, the two databases were very
different beasts. We also rely on a mixed bag of open source bioinformatics tools, each of which
supports only MySQL or only PostgreSQL.

Our MySQL servers right now are mostly v5, though we still have a couple of v4 instances in
production that are slated to be replaced soon. For PostgreSQL we are currently running v8.1 but
plan to move to 8.4 later this summer to take advantage of some nice new features. Features of
particular interest include improved performance, parallel restores, recursive queries, and no more
max_fsm_pages! I'm also happy to see that they finally have a migration tool in the works. I
really loathed those full dump and restores when upgrading.
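
For anyone who hasn't tried it yet, the parallel restore in 8.4 looks roughly like this; the
database name and job count here are just placeholders:

  pg_dump -Fc -f flybase.dump flybase          # dump in the custom format
  pg_restore -j 4 -d flybase_new flybase.dump  # 8.4's pg_restore can run 4 jobs in parallel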

Josh

Steven Black wrote:
> On Tue, Jul 14, 2009 at 03:37:37PM +0000, Mark Krenz wrote:
>> The whole recompression of the array that can take hours or days
>> sounds VERY risky. I wouldn't do it unless you are experimenting. Some
>> of these non-standard raid levels are just companies coming up with
>> new combinations to have an extra feature over the competition; they
>> aren't necessarily good things.
>
> Yeah, I don't really see the benefit of this compared to having a
> straight RAID5 + hotspare.
>
> I mean, with a hotspare, the array is rebuilt on to the hotspare when a
> drive fails. That's one major I/O operation.
>
> With RAID 5EE you have a possibly similar major I/O operation during the
> compression, then an additional (possibly similar) I/O operation during
> decompression once you add the new drive.
>
> What troubles me is that the system is subject to "a second drive
> failure" during *decompression*. That is, it is subject to data loss if
> a drive fails in the window between inserting the replacement drive and
> the RAID5 finishing its conversion back to RAID5EE. This means a second
> drive failure is a risk for a longer period than it would be under
> RAID5 with a hotspare.
>
> An example: Your RAID5EE has a drive failure. It compresses itself down
> to RAID5. During that time it is vulnerable to a second drive failure.
> (This is on par with having a RAID5 with a hotspare: you're vulnerable
> to a second drive failure until the hotspare becomes a full member of
> the array.) Then you have a happy period where the previously RAID5EE
> array is now a plain RAID5 array that can survive another drive failure.
> Then you plug in the replacement drive and -- unlike RAID5 with a
> hotspare -- you're exposed to a second window in which a drive failure
> means data loss.
>
> Can you be sure that the compression and decompression will take less
> total time than a rebuild onto a hotspare would under plain RAID5? This
> concerns me, as the total I/O transferred for RAID5EE will be more than
> the total I/O for RAID5 with a hotspare.
>
> Since the hotspare is spread across all of the drives, and after
> compression it becomes a standard RAID 5 array, you're still
> transferring a whole disk's worth of data. It's just that with RAID 5EE
> you're doing that full-disk transfer twice. (Okay, not quite a full
> disk's worth of data -- a full disk's worth minus the spare percentage.
> Still, twice whatever-it-is is well more than a disk and a half worth of
> data.)
>
> I mean, most hotspare systems treat the replaced drive as the new
> hotspare. It isn't like slot 10 (or whatever) is always the hotspare
> slot. With a traditional hotspare, you transfer data once. With RAID5EE
> you transfer data twice.
>
>> Also, PostgreSQL is different from MySQL when it comes to number of
>> files. MySQL has 2 or 3 files per table, whereas PostgreSQL has many
>> more files, but it might not get really high. I have a decent size
>> database with lots of data and somewhere over 120 tables, and it's only
>> 2000 files in /var/lib/pgsql. But if you are doing a large database
>> like flybase, you might pay attention to how many files you're using
>> on the filesystem like this:
>
> With MySQL it depends on the database engine being used. MyISAM (and
> probably Maria -- I've not read much about it) uses three files per
> table. InnoDB uses one per table plus three shared data files. (There
> are additional database engines at this point, too.)
>
> InnoDB and Falcon (I'm not sure how many files it has per table) should
> both provide ACID compliance. (I'm not sure about Maria.)
>
> Maria and Falcon are new in MySQL 6. Prior to MySQL 5 you couldn't get
> anything close to full ACID compliance due to missing core features.
>
> InnoDB can run into issues with file-size limits on some systems. This
> can be relieved by enabling "innodb_file_per_table" prior to creating
> the tables. This results in InnoDB creating 2 files per table plus the
> earlier mentioned three shared InnoDB files.
>
> Please tell me you're not using MySQL prior to v5. :)
>
> I'm actually a little surprised by the combined PostgreSQL/MySQL
> approach. (Unless, of course, it started with MySQL prior to version 5,
> at which point it would be understandable. That was before MySQL had the
> features, and before PostgreSQL had the speed.) I would have expected a
> consistent product with replication off to one or more slaves that only
> accept read operations.
>
>> It's nice talking about sysadmin type stuff on the BLUG list again;
>> we don't do it enough.
>
> Indeed.
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> BLUG mailing list
> BLUG@linuxfan.com
> http://mailman.cs.indiana.edu/mailman/listinfo/blug
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

On Tue, Jul 14, 2009 at 03:37:37PM +0000, Mark Krenz wrote:
> The whole recompression of the array that can take hours or days
> sounds VERY risky. I wouldn't do it unless you are experimenting. Some
> of these non-standard raid levels are just companies coming up with
> new combinations to have an extra feature over the competition; they
> aren't necessarily good things.

Yeah, I don't really see the benefit of this compared to having a
straight RAID5 + hotspare.

I mean, with a hotspare, the array is rebuilt on to the hotspare when a
drive fails. That's one major I/O operation.

With RAID 5EE you have a possibly similar major I/O operation during the
compression, then an additional (possibly similar) I/O operation during
decompression once you add the new drive.

What troubles me is that the system is subject to "a second drive
failure" during *decompression*. That is, it is subject to data loss if
a drive fails in the window between inserting the replacement drive and
the RAID5 finishing its conversion back to RAID5EE. This means a second
drive failure is a risk for a longer period than it would be under
RAID5 with a hotspare.

An example: Your RAID5EE has a drive failure. It compresses itself down
to RAID5. During that time it is vulnerable to a second drive failure.
(This is on par with having a RAID5 with a hotspare: you're vulnerable
to a second drive failure until the hotspare becomes a full member of
the array.) Then you have a happy period where the previously RAID5EE
array is now a plain RAID5 array that can survive another drive failure.
Then you plug in the replacement drive and -- unlike RAID5 with a
hotspare -- you're exposed to a second window in which a drive failure
means data loss.

Can you be sure that the compression and decompression will take less
total time than a rebuild onto a hotspare would under plain RAID5? This
concerns me, as the total I/O transferred for RAID5EE will be more than
the total I/O for RAID5 with a hotspare.

Since the hotspare is spread across all of the drives, and after
compression it becomes a standard RAID 5 array, you're still
transferring a whole disk's worth of data. It's just that with RAID 5EE
you're doing that full-disk transfer twice. (Okay, not quite a full
disk's worth of data -- a full disk's worth minus the spare percentage.
Still, twice whatever-it-is is well more than a disk and a half worth of
data.)

I mean, most hotspare systems treat the replaced drive as the new
hotspare. It isn't like slot 10 (or whatever) is always the hotspare
slot. With a traditional hotspare, you transfer data once. With RAID5EE
you transfer data twice.

> Also, PostgreSQL is different from MySQL when it comes to number of
> files. MySQL has 2 or 3 files per table, whereas PostgreSQL has many
> more files, but it might not get really high. I have a decent size
> database with lots of data and somewhere over 120 tables, and it's only
> 2000 files in /var/lib/pgsql. But if you are doing a large database
> like flybase, you might pay attention to how many files you're using
> on the filesystem like this:

With MySQL it depends on the database engine being used. MyISAM (and
probably Maria -- I've not read much about it) uses three files per
table. InnoDB uses one per table plus three shared data files. (There
are additional database engines at this point, too.)

InnoDB and Falcon (I'm not sure how many files it has per table) should
both provide ACID compliance. (I'm not sure about Maria.)

Maria and Falcon are new in MySQL 6. Prior to MySQL 5 you couldn't get
anything close to full ACID compliance due to missing core features.

InnoDB can run into issues with file-size limits on some systems. This
can be relieved by enabling "innodb_file_per_table" prior to creating
the tables. This results in InnoDB creating 2 files per table plus the
earlier mentioned three shared InnoDB files.
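
For reference, that's just a one-line setting under [mysqld] in my.cnf,
something like the following (and note it only applies to tables created
after it is turned on):

  [mysqld]
  innodb_file_per_table = 1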

Please tell me you're not using MySQL prior to v5. :)

I'm actually a little surprised by the combined PostgreSQL/MySQL
approach. (Unless, of course, it started with MySQL prior to version 5,
at which point it would be understandable. That was before MySQL had the
features, and before PostgreSQL had the speed.) I would have expected a
consistent product with replication off to one or more slaves that only
accept read operations.

> It's nice talking about sysadmin type stuff on the BLUG list again;
> we don't do it enough.

Indeed.

--
Steven Black <blacks@indiana.edu> / KeyID: 8596FA8E
Fingerprint: 108C 089C EFA4 832C BF07 78C2 DE71 5433 8596 FA8E

Re: [BLUG] Large RAID config suggestions?

I read the same thing and have plans to do some simulated failures on that RAID as well before
putting it into production. We shall see... I may end up going with a good old RAID 5 plus hot
spare setup.

Our PostgreSQL database is getting a little bit out of hand. Right now we have 17,571 files under
the data directory. Part of that is due to the size of the database and also because we do roughly
monthly releases and keep archived versions in the database. Periodically I drop old versions that
aren't used very often, after ensuring that I have a dump stored locally and on the MDSS. However,
that won't work forever and I may have to look into other solutions.

Thanks,
Josh

Mark Krenz wrote:
> Wow, great information. I've never heard of RAID 5EE before so I
> looked it up and found some interesting information about it on
> Wikipedia, of course. I think you should read the cons about it before
> using it:
>
> http://en.wikipedia.org/wiki/Non-standard_RAID_levels
>
> The whole recompression of the array that can take hours or days
> sounds VERY risky. I wouldn't do it unless you are experimenting. Some
> of these non-standard raid levels are just companies coming up with new
> combinations to have an extra feature over the competition; they aren't
> necessarily good things.
>
> Also, PostgreSQL is different from MySQL when it comes to number of
> files. MySQL has 2 or 3 files per table, whereas PostgreSQL has many
> more files, but it might not get really high. I have a decent size
> database with lots of data and somewhere over 120 tables, and it's only
> 2000 files in /var/lib/pgsql. But if you are doing a large database like
> flybase, you might pay attention to how many files you're using on the
> filesystem like this:
>
> find /var/lib/pgsql -type f | wc -l
>
> df -i can quickly show you the number of inodes, but not all
> filesystems have this.
>
> It's nice talking about sysadmin type stuff on the BLUG list again; we
> don't do it enough.
>
> Mark
>
> On Tue, Jul 14, 2009 at 03:08:38PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
>> Hi Mark, Steven, and David,
>>
>> I have been bitten by problems with large file systems in the past as well. Some were related to
>> performance and some were just plain bugs in kernel code specific to RHEL v4. My plan is to
>> break it up as much as possible but I hadn't considered using multiple volume groups. In terms of
>> databases, we use both PostgreSQL and MySQL. PostgreSQL is used for our production database and
>> manages all the reading, writing, and management of the data, whereas MySQL is used for a read only
>> denormalized version of the same data for backing web applications. More generally speaking, this
>> is for FlyBase (flybase.org), which is a database for Drosophila (fruit fly) genetics. In the past
>> we have gotten by with much more modest disk needs but the advent of fairly cheap and fast DNA
>> sequencing technologies coupled with some other new techniques is pushing the limit quite quickly.
>>
>> I did consider ZFS and I have a fellow admin in our department who has been trying to proselytize me
>> in that direction. He has deployed it quite successfully and at quite a large scale. However, I
>> decided against it because I'd prefer to not manage a mix of Solaris/Linux systems and we would have
>> to spend a fair amount of time porting and debugging code on OpenSolaris. For now I'm keeping an
>> eye on some of the Linux ZFS projects (http://en.wikipedia.org/wiki/ZFS#Linux) and hoping for the best.
>>
>> For backups, we benefit from being part of the university so we can partake of the MDSS for our
>> backup needs. As Steven probably knows, the MDSS is a large capacity tape library system that
>> provides offsite backups. It also copies the data in duplicate (one here in Bloomington and one up
>> at IUPUI) to reduce the impact of pesky tape problems. If interested, more info on the MDSS can be
>> found here http://kb.iu.edu/data/aiyi.html. We send data to it over the network via their tar like
>> client so backups are fairly easy to implement and quite fast.
>>
>> I did consider a hot spare(s), but I was hoping that a RAID 60 would provide enough fault tolerance
>> with up to 2 disk failures per set (6 across the entire RAID) to allow me to get by in degraded mode
>> until I can swap in a new disk. I bought 6 extra spares with this in mind but it might be worth
>> reconsidering. I did use a hot spare for the system drives (4x 146 GB drives). Specifically, I
>> went with a RAID 5EE setup which stripes the spare across all drives instead of relying on a
>> dedicated spare waiting for a failure. It is my first use of RAID 5EE so we will see how it works out.
>>
>> Thanks for all your excellent comments.
>>
>> Cheers,
>> Josh
>>
>> Mark Krenz wrote:
>>> Wow, that's a lot of disks.
>>>
>>> I have one major suggestion. Don't make one big filesystem. Don't
>>> even make one big Volume Group. With that much space, I'd recommend
>>> dividing it up somehow, otherwise if you need to recover, it can take a
>>> day or more just to copy files over.
>>>
>>> At last year's Red Hat Summit, Rik van Riel gave a presentation called
>>> "Why Computers Are Getting Slower". He mostly talked from a low level
>>> point of view since he's a kernel developer, which was great. One of
>>> the things that he talked about is how we're getting to the point where
>>> filesystem sizes are getting too large for even fast disks to handle a
>>> recovery in a reasonable amount of time. And the algorithms need to be
>>> better optimized. So he recommended breaking up your filesystems into
>>> chunks. So on a server your /home partition might be /home1 /home2
>>> /home3, etc., and on a home machine you probably should put media files on
>>> a separate partition or maybe even break that up. Plus, using volume
>>> management like LVM is a necessity.
>>>
>>> On something like a mail server, with lots of little files, you may
>>> have millions of files and copying them over takes a lot of time, even
>>> on a SAN. I recently had to recover a filesystem with 6 million files
>>> on it and it was going to take about 16 hours or more just to copy stuff
>>> over to a SATA2 RAID-1 array running on a decent hardware raid
>>> controller, even though it was only about 180GB of data. This was direct
>>> disk to disk too, not over the network. What I had to do instead was
>>> some disk image trickery to get the data moved to a new set of disks (I
>>> had a RAID controller fail). If I had lost the array, I couldn't have
>>> done it this way.
>>>
>>> It was the first time I've had to do such a recovery in a while and
>>> wasn't expecting it to take so long. Immediately afterwards I decided
>>> that we need to break up our filesystems into smaller chunks and also
>>> find ways to reduce the amount of data affected if a RAID array is lost.
>>>
>>> The short answer is that you can build all kinds of redundancy into
>>> your setup, but can still end up with the filesystem failing or
>>> something frying your filesystem that leads to major downtime.
>>>
>>> Of course this mythical thing called ZFS that comes with the J4400 may
>>> solve all these problems listed above.
>>>
>>> What kind of database system are you going to be using?
>>>
>>> Mark
>>>
>>> On Mon, Jul 13, 2009 at 06:45:54PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
>>>> Hi all,
>>>>
>>>> I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
>>>> and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
>>>> hardware RAID controller. The disk space will be used for NFS and a database server with a slight
>>>> emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
>>>>
>>>> Here are 2 initial configuration ideas:
>>>>
>>>> * RAID 50 (4x RAID 5 sets of 6 drives)
>>>> * RAID 60 (3x RAID 6 sets of 8 drives)
>>>>
>>>> I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
>>>> RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
>>>> route to go. I'm interested in hearing what others have to say especially if I've overlooked other
>>>> possibilities.
>>>>
>>>> I'm off to start simulating failures and benchmarking various configurations.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> _______________________________________________
>>>> BLUG mailing list
>>>> BLUG@linuxfan.com
>>>> http://mailman.cs.indiana.edu/mailman/listinfo/blug
>>>>
>> _______________________________________________
>> BLUG mailing list
>> BLUG@linuxfan.com
>> http://mailman.cs.indiana.edu/mailman/listinfo/blug
>>
>
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

Ah, excellent. I hadn't thought of this but it makes perfect sense. I do have all the drives plus
6 spares. The drives came in 5 boxes of 6 and I naively set aside one box for spares. I've now
gone through and swapped out all the disks so that I have 6 spares manufactured on different dates.

The drives are from Sun but they are re-branded Seagate Barracuda ES.2 1 TB drives that have been
modified to fit the J4400 enclosure. We shall see how these drives perform. Given Seagate's
history the failure rate might be slightly higher than your estimate. Thankfully these drives have
the patched firmware that addresses a slew of problems that plagued them when they were first
introduced in 2007.

Josh

Mark Krenz wrote:
> On Mon, Jul 13, 2009 at 10:50:46PM GMT, Steven Black [blacks@indiana.edu] said the following:
>> It was an otherwise trustworthy drive manufacturer, too. (I think
>> Seagate.) Everybody has a bad batch now and then, and these just managed
>> to slip through.
>>
>
> Steven brings up a good point here. Some administrators go so far as
> to buy each of their drives from different places so as to try to get
> different lot numbers. The theory is that drives in the same lot can go
> bad at the same time (which indeed I've seen happen).
>
> I recently bought 4 drives for a raid 5 array and bought 2 drives from
> NewEgg, 1 from CDW and 1 from somewhere else. But with my luck with
> hard drives, the 2 that came from Newegg were in different lot numbers,
> but the 1 from CDW and from another company had the same manufacture
> date.
>
> I can't remember where, but I saw someplace that would sell you X
> number of drives and make sure they were all in different lot numbers.
>
> Of course, you already bought the drives, right? What brand are they?
> Do they come from Sun when you buy the array? With 24 drives, you can
> count on a few breaking within the first couple months and also several
> breaking around 4-5 years.
>
> Mark
>
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

Wow, great information. I've never heard of RAID 5EE before so I
looked it up and found some interesting information about it on
Wikipedia, of course. I think you should read the cons about it before
using it:

http://en.wikipedia.org/wiki/Non-standard_RAID_levels

The whole recompression of the array that can take hours or days
sounds VERY risky. I wouldn't do it unless you are experimenting. Some
of these non-standard raid levels are just companies coming up with new
combinations to have an extra feature over the competition; they aren't
necessarily good things.

Also, PostgreSQL is different from MySQL when it comes to number of
files. MySQL has 2 or 3 files per table, whereas PostgreSQL has many
more files, but it might not get really high. I have a decent size
database with lots of data and somewhere over 120 tables, and it's only
2000 files in /var/lib/pgsql. But if you are doing a large database like
flybase, you might pay attention to how many files you're using on the
filesystem like this:

find /var/lib/pgsql -type f | wc -l

df -i can quickly show you the number of inodes, but not all
filesystems have this.
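
If you want to see where the files are coming from, something like this
breaks the count down per database (the path is the usual RHEL default
data directory, so adjust to taste; each directory under base/ is one
database):

  for d in /var/lib/pgsql/data/base/*/; do
      echo "$(ls "$d" | wc -l) files in $d"
  done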

It's nice talking about sysadmin type stuff on the BLUG list again; we
don't do it enough.

Mark

On Tue, Jul 14, 2009 at 03:08:38PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
> Hi Mark, Steven, and David,
>
> I have been bitten by problems with large file systems in the past as well. Some were related to
> performance and some were just plain bugs in kernel code specific to RHEL v4. My plan is to
> break it up as much as possible but I hadn't considered using multiple volume groups. In terms of
> databases, we use both PostgreSQL and MySQL. PostgreSQL is used for our production database and
> manages all the reading, writing, and management of the data, whereas MySQL is used for a read only
> denormalized version of the same data for backing web applications. More generally speaking, this
> is for FlyBase (flybase.org), which is a database for Drosophila (fruit fly) genetics. In the past
> we have gotten by with much more modest disk needs but the advent of fairly cheap and fast DNA
> sequencing technologies coupled with some other new techniques is pushing the limit quite quickly.
>
> I did consider ZFS and I have a fellow admin in our department who has been trying to proselytize me
> in that direction. He has deployed it quite successfully and at quite a large scale. However, I
> decided against it because I'd prefer to not manage a mix of Solaris/Linux systems and we would have
> to spend a fair amount of time porting and debugging code on OpenSolaris. For now I'm keeping an
> eye on some of the Linux ZFS projects (http://en.wikipedia.org/wiki/ZFS#Linux) and hoping for the best.
>
> For backups, we benefit from being part of the university so we can partake of the MDSS for our
> backup needs. As Steven probably knows, the MDSS is a large capacity tape library system that
> provides offsite backups. It also copies the data in duplicate (one here in Bloomington and one up
> at IUPUI) to reduce the impact of pesky tape problems. If interested, more info on the MDSS can be
> found here http://kb.iu.edu/data/aiyi.html. We send data to it over the network via their tar like
> client so backups are fairly easy to implement and quite fast.
>
> I did consider a hot spare(s), but I was hoping that a RAID 60 would provide enough fault tolerance
> with up to 2 disk failures per set (6 across the entire RAID) to allow me to get by in degraded mode
> until I can swap in a new disk. I bought 6 extra spares with this in mind but it might be worth
> reconsidering. I did use a hot spare for the system drives (4x 146 GB drives). Specifically, I
> went with a RAID 5EE setup which stripes the spare across all drives instead of relying on a
> dedicated spare waiting for a failure. It is my first use of RAID 5EE so we will see how it works out.
>
> Thanks for all your excellent comments.
>
> Cheers,
> Josh
>
> Mark Krenz wrote:
> > Wow, that's a lot of disks.
> >
> > I have one major suggestion. Don't make one big filesystem. Don't
> > even make one big Volume Group. With that much space, I'd recommend
> > dividing it up somehow, otherwise if you need to recover, it can take a
> > day or more just to copy files over.
> >
> > At last year's Red Hat Summit, Rik van Riel gave a presentation called
> > "Why Computers Are Getting Slower". He mostly talked from a low level
> > point of view since he's a kernel developer, which was great. One of
> > the things that he talked about is how we're getting to the point where
> > filesystem sizes are getting too large for even fast disks to handle a
> > recovery in a reasonable amount of time. And the algorithms need to be
> > better optimized. So he recommended breaking up your filesystems into
> > chunks. So on a server your /home partition might be /home1 /home2
> > /home3, etc., and on a home machine you probably should put media files on
> > a separate partition or maybe even break that up. Plus, using volume
> > management like LVM is a necessity.
> >
> > On something like a mail server, with lots of little files, you may
> > have millions of files and copying them over takes a lot of time, even
> > on a SAN. I recently had to recover a filesystem with 6 million files
> > on it and it was going to take about 16 hours or more just to copy stuff
> > over to a SATA2 RAID-1 array running on a decent hardware raid
> > controller, even though it was only about 180GB of data. This was direct
> > disk to disk too, not over the network. What I had to do instead was
> > some disk image trickery to get the data moved to a new set of disks (I
> > had a RAID controller fail). If I had lost the array, I couldn't have
> > done it this way.
> >
> > It was the first time I've had to do such a recovery in a while and
> > wasn't expecting it to take so long. Immediately afterwards I decided
> > that we need to break up our filesystems into smaller chunks and also
> > find ways to reduce the amount of data affected if a RAID array is lost.
> >
> > The short answer is that you can build all kinds of redundancy into
> > your setup, but can still end up with the filesystem failing or
> > something frying your filesystem that leads to major downtime.
> >
> > Of course this mythical thing called ZFS that comes with the J4400 may
> > solve all these problems listed above.
> >
> > What kind of database system are you going to be using?
> >
> > Mark
> >
> > On Mon, Jul 13, 2009 at 06:45:54PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
> >> Hi all,
> >>
> >> I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
> >> and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
> >> hardware RAID controller. The disk space will be used for NFS and a database server with a slight
> >> emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
> >>
> >> Here are 2 initial configuration ideas:
> >>
> >> * RAID 50 (4x RAID 5 sets of 6 drives)
> >> * RAID 60 (3x RAID 6 sets of 8 drives)
> >>
> >> I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
> >> RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
> >> route to go. I'm interested in hearing what others have to say especially if I've overlooked other
> >> possibilities.
> >>
> >> I'm off to start simulating failures and benchmarking various configurations.
> >>
> >> Cheers,
> >> Josh
> >>
> >> _______________________________________________
> >> BLUG mailing list
> >> BLUG@linuxfan.com
> >> http://mailman.cs.indiana.edu/mailman/listinfo/blug
> >>
> >
> _______________________________________________
> BLUG mailing list
> BLUG@linuxfan.com
> http://mailman.cs.indiana.edu/mailman/listinfo/blug
>

--
Mark Krenz
Bloomington Linux Users Group
http://www.bloomingtonlinux.org/
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

I would tend to second these recommendations, though I have very
little professional experience and no experience with RAIDs.

However, I have lost lots of data several times. The first time, I
was in high school, and I'd had my father buy me a packaged version of
Mandrake Linux (I'd had no real exposure to linux before that). The
partitioner in the installer crashed and ate my drive. The second
time I was in college, and my roommate bumped my power cable where it
was hooked to the power supply. Both hard drives and both media
drives (back in the days when your DVD drive and CD-RW drive were
separate) were fried, along with the power supply (though my
motherboard etc. were fine).

Since then I've kept 3 copies of all (well, most of) my data. I have
a live copy on whatever machine I'm primarily using (right now my Acer
Aspire One), a frequently synched copy on my off-site server at my
parents' house, and a copy on an external drive that I sync less
frequently and leave at home (i.e., when I throw my computer in my
backpack). Having a separate partition for all my data helps a lot,
and rsync makes for a simple but powerful way to back stuff up.
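
Something as simple as this does most of the work for me (the paths
and host are made up):

  # mirror the data partition to the off-site box, deleting anything
  # that no longer exists locally
  rsync -av --delete /data/ me@offsite.example.org:/backups/netbook-data/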

Everyone's backup strategy is going to be a little different, but
putting all your drives in one box sounds a little bit too much like
putting all your eggs in the same basket. A brief loss of power isn't
supposed to fry a power supply (and granted, that old power supply
wasn't top of the line), but things happen, especially when you don't
prepare for or expect them. I think there are some corollaries to
Murphy's Law about this.

--
Jonathan

2009/7/14 Steven Black <blacks@indiana.edu>:
> Back in my youth (1996 or so) I administered a RAID5 system. (I won't
> say how large it was. I said it was 1996, right? Things were smaller
> then.)
>
> Hot spares are the best invention *ever*. Back in 1996 they were not the
> norm.
>
> It was a simple RAID5 system. All the drives were of the same
> manufacturer. That manufacturer had a bad batch. Before the replacement
> drive arrived we had a second drive failure.
>
> It was an otherwise trustworthy drive manufacturer, too. (I think
> Seagate.) Everybody has a bad batch now and then, and these just managed
> to slip through.
>
> More recently, I found out that a drive failed in one of my boxes. It's
> an ancient nightmare of a Solaris box, and I fully expected to need to
> type some obscure command to get the replacement drive up and in the
> system. I did a little investigation and found that all the data was
> already on another drive. Though I replaced the drive, the drive I added
> simply became a new hot spare.
>
> Hot spares become much more important when you deal with more data. If
> I have a drive fail at 4am and I have a hotspare I can show up at the
> office at my normal hours. By the time I come in, much of the data has
> already been copied over to the hot-spare. I can then make a support
> call during normal business hours for the replacement drive.
>
> If you don't have at least one hot spare in your system, you need to
> make sure you have 1-2 of the required drives on hand. Yeah, you could
> rely upon your service contract's same-day service, but it's a lot
> nicer to at least have one drive immediately on-hand. If you don't have
> same-day service, you better have a pair of spare drives because you
> might just need them both.
>
> It is important to have off-site backups, though. Not just backups,
> *off-site* backups. You don't want to explain what happened to the
> data when there was a building fire, flooding problem, etc. There are
> problems that strike that can take down your whole machine room.
>
> You also need a disaster recovery plan that goes from a set of documents
> detailing the process, backup media, and money from the insurance
> coverage, and turns that back into what you have in your machine room.
> (And it needs to be doable by a replacement. -- Assume you've just been
> promoted.)
>
> Just my two cents,
> Steven Black
>
> --
> Steven Black <blacks@indiana.edu> / KeyID: 8596FA8E
> Fingerprint: 108C 089C EFA4 832C BF07  78C2 DE71 5433 8596 FA8E
>
> On Mon, Jul 13, 2009 at 05:35:30PM -0400, David Ernst wrote:
>> Well, I don't think I have anything very sophisticated to say, but I'm
>> inclined to agree with you about the 3x RAID6.  By my calculations,
>> you'll get 18T that way vs. 20T in your other proposal.  I don't know
>> what you're storing, but this is a lot of disk space, so probably no
>> one will mind that sacrifice.  Meanwhile, the RAID 6 option does give
>> a slight emphasis to reliability over performance, as you wanted.  So,
>> basically, I think I'm just saying "your reasoning makes sense to
>> me".
>>
>> I hate to bring this up, but twice in my life I've been affected by
>> the failure of entire RAID arrays... Both were high-quality hardware
>> RAID setups, and people said of both failures "this is supposed to
>> never happen."  In short, I recommend some other kind of backup in
>> addition to the RAID, because things happen, and if your organization
>> is concerned enough with reliability to consider RAID 6, I wouldn't
>> assume that something like this would never happen.
>>
>> David
>>
>>
>> On Mon, Jul 13, 2009 at 02:45:54PM -0400, Josh Goodman wrote:
>> >
>> >Hi all,
>> >
>> >I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
>> >and round on possible configurations.  The system attached to this RAID is a RHEL 5.3 box w/
>> >hardware RAID controller.  The disk space will be used for NFS and a database server with a slight
>> >emphasis given to reliability over performance.  We will be using LVM on top of the RAID as well.
>> >
>> >Here are 2 initial configuration ideas:
>> >
>> >* RAID 50 (4x RAID 5 sets of 6 drives)
>> >* RAID 60 (3x RAID 6 sets of 8 drives)
>> >
>> >I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
>> >RAID 5 set with 6x 1 TB disks.  Having the cushion of one more disk failure per set seems the better
>> >route to go.  I'm interested in hearing what others have to say especially if I've overlooked other
>> >possibilities.
>> >
>> >I'm off to start simulating failures and benchmarking various configurations.
>> >
>> >Cheers,
>> >Josh
>> >
>> >_______________________________________________
>> >BLUG mailing list
>> >BLUG@linuxfan.com
>> >http://mailman.cs.indiana.edu/mailman/listinfo/blug
>> _______________________________________________
>> BLUG mailing list
>> BLUG@linuxfan.com
>> http://mailman.cs.indiana.edu/mailman/listinfo/blug
>
>
> _______________________________________________
> BLUG mailing list
> BLUG@linuxfan.com
> http://mailman.cs.indiana.edu/mailman/listinfo/blug
>

_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

Hi Mark, Steven, and David,

I have been bitten by problems with large file systems in the past as well. Some were related to
performance and some were just plain bugs in kernel code specific to RHEL v4. My plan is to
break it up as much as possible but I hadn't considered using multiple volume groups. In terms of
databases, we use both PostgreSQL and MySQL. PostgreSQL is used for our production database and
manages all the reading, writing, and management of the data, whereas MySQL is used for a read only
denormalized version of the same data for backing web applications. More generally speaking, this
is for FlyBase (flybase.org), which is a database for Drosophila (fruit fly) genetics. In the past
we have gotten by with much more modest disk needs but the advent of fairly cheap and fast DNA
sequencing technologies coupled with some other new techniques is pushing the limit quite quickly.

I did consider ZFS and I have a fellow admin in our department who has been trying to proselytize me
in that direction. He has deployed it quite successfully and at quite a large scale. However, I
decided against it because I'd prefer to not manage a mix of Solaris/Linux systems and we would have
to spend a fair amount of time porting and debugging code on OpenSolaris. For now I'm keeping an
eye on some of the Linux ZFS projects (http://en.wikipedia.org/wiki/ZFS#Linux) and hoping for the best.

For backups, we benefit from being part of the university so we can partake of the MDSS for our
backup needs. As Steven probably knows, the MDSS is a large capacity tape library system that
provides offsite backups. It also copies the data in duplicate (one here in Bloomington and one up
at IUPUI) to reduce the impact of pesky tape problems. If interested, more info on the MDSS can be
found here http://kb.iu.edu/data/aiyi.html. We send data to it over the network via their tar like
client so backups are fairly easy to implement and quite fast.

I did consider a hot spare(s), but I was hoping that a RAID 60 would provide enough fault tolerance
with up to 2 disk failures per set (6 across the entire RAID) to allow me to get by in degraded mode
until I can swap in a new disk. I bought 6 extra spares with this in mind but it might be worth
reconsidering. I did use a hot spare for the system drives (4x 146 GB drives). Specifically, I
went with a RAID 5EE setup which stripes the spare across all drives instead of relying on a
dedicated spare waiting for a failure. It is my first use of RAID 5EE so we will see how it works out.

Thanks for all your excellent comments.

Cheers,
Josh

Mark Krenz wrote:
> Wow, that's a lot of disks.
>
> I have one major suggestion. Don't make one big filesystem. Don't
> even make one big Volume Group. With that much space, I'd recommend
> dividing it up somehow, otherwise if you need to recover, it can take a
> day or more just to copy files over.
>
> At last year's Red Hat Summit, Rik van Riel gave a presentation called
> "Why Computers Are Getting Slower". He mostly talked from a low level
> point of view since he's a kernel developer, which was great. One of
> the things that he talked about is how we're getting to the point where
> filesystem sizes are getting too large for even fast disks to handle a
> recovery in a reasonable amount of time. And the algorithms need to be
> better optimized. So he recommended breaking up your filesystems into
> chunks. So on a server your /home partition might be /home1 /home2
> /home3, etc., and on a home machine you probably should put media files on
> a separate partition or maybe even break that up. Plus, using volume
> management like LVM is a necessity.
>
> On something like a mail server, with lots of little files, you may
> have millions of files and copying them over takes a lot of time, even
> on a SAN. I recently had to recover a filesystem with 6 million files
> on it and it was going to take about 16 hours or more just to copy stuff
> over to a SATA2 RAID-1 array running on a decent hardware raid
> controller, even though it was only about 180GB of data. This was direct
> disk to disk too, not over the network. What I had to do instead was
> some disk image trickery to get the data moved to a new set of disks (I
> had a RAID controller fail). If I had lost the array, I couldn't have
> done it this way.
>
> It was the first time I've had to do such a recovery in a while and
> wasn't expecting it to take so long. Immediately afterwards I decided
> that we need to break up our filesystems into smaller chunks and also
> find ways to reduce the amount of data affected if a RAID array is lost.
>
> The short answer is that you can build all kinds of redundancy into
> your setup, but can still end up with the filesystem failing or
> something frying your filesystem that leads to major downtime.
>
> Of course this mythical thing called ZFS that comes with the J4400 may
> solve all these problems listed above.
>
> What kind of database system are you going to be using?
>
> Mark
>
> On Mon, Jul 13, 2009 at 06:45:54PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
>> Hi all,
>>
>> I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
>> and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
>> hardware RAID controller. The disk space will be used for NFS and a database server with a slight
>> emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
>>
>> Here are 2 initial configuration ideas:
>>
>> * RAID 50 (4x RAID 5 sets of 6 drives)
>> * RAID 60 (3x RAID 6 sets of 8 drives)
>>
>> I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
>> RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
>> route to go. I'm interested in hearing what others have to say especially if I've overlooked other
>> possibilities.
>>
>> I'm off to start simulating failures and benchmarking various configurations.
>>
>> Cheers,
>> Josh
>>
>> _______________________________________________
>> BLUG mailing list
>> BLUG@linuxfan.com
>> http://mailman.cs.indiana.edu/mailman/listinfo/blug
>>
>
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

On Mon, Jul 13, 2009 at 10:50:46PM GMT, Steven Black [blacks@indiana.edu] said the following:
> It was an otherwise trustworthy drive manufacturer, too. (I think
> Seagate.) Everybody has a bad batch now and then, and these just managed
> to slip through.
>

Steven brings up a good point here. Some administrators go so far as
to buy each of their drives from different places so as to try to get
different lot numbers. The theory is that drives in the same lot can go
bad at the same time (which indeed I've seen happen).

I recently bought 4 drives for a raid 5 array and bought 2 drives from
NewEgg, 1 from CDW and 1 from somewhere else. But with my luck with
hard drives, the 2 that came from Newegg were in different lot numbers,
but the 1 from CDW and from another company had the same manufacture
date.

I can't remember where, but I saw someplace that would sell you X
number of drives and make sure they were all in different lot numbers.

Of course, you already bought the drives, right? What brand are they?
Do they come from Sun when you buy the array? With 24 drives, you can
count on a few breaking within the first couple months and also several
breaking around 4-5 years.

Mark

--
Mark Krenz
Bloomington Linux Users Group
http://www.bloomingtonlinux.org/
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug