Tuesday, July 14, 2009

Re: [BLUG] Large RAID config suggestions?

Hi Mark, Steven, and David,

I have been bitten by problems with large file systems in the past as well, some related to
performance and some just plain bugs in kernel code that were specific to RHEL 4. My plan is to
break it up as much as possible, but I hadn't considered using multiple volume groups. In terms of
databases, we use both PostgreSQL and MySQL. PostgreSQL is our production database and handles
all the reading, writing, and management of the data, whereas MySQL serves a read-only,
denormalized copy of the same data for backing web applications. More generally speaking, this
is for FlyBase (flybase.org), which is a database for Drosophila (fruit fly) genetics. In the past
we have gotten by with much more modest disk needs but the advent of fairly cheap and fast DNA
sequencing technologies coupled with some other new techniques is pushing the limit quite quickly.

I did consider ZFS and I have a fellow admin in our department who has been trying to proselytize me
in that direction. He has deployed it quite successfully and at quite a large scale. However, I
decided against it because I'd prefer to not manage a mix of Solaris/Linux systems and we would have
to spend a fair amount of time porting and debugging code on OpenSolaris. For now I'm keeping an
eye on some of the Linux ZFS projects (http://en.wikipedia.org/wiki/ZFS#Linux) and hoping for the best.

For backups, we benefit from being part of the university, so we can use the MDSS for our
backup needs. As Steven probably knows, the MDSS is a large-capacity tape library system that
provides offsite backups. It also stores the data in duplicate (one copy here in Bloomington and one
up at IUPUI) to reduce the impact of pesky tape problems. If interested, more info on the MDSS can be
found at http://kb.iu.edu/data/aiyi.html. We send data to it over the network via their tar-like
client, so backups are fairly easy to implement and quite fast.

I did consider hot spares, but I was hoping that RAID 60 would provide enough fault tolerance,
with up to 2 disk failures per set (6 across the entire RAID, provided no set loses more than 2), to
allow me to get by in degraded mode until I can swap in a new disk. I bought 6 extra spares with this
in mind, but it might be worth reconsidering. I did use a hot spare for the system drives (4x 146 GB
drives). Specifically, I went with a RAID 5EE setup, which stripes the spare across all drives instead
of relying on a dedicated spare waiting for a failure. It is my first use of RAID 5EE, so we will see
how it works out.
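
The fault-tolerance arithmetic above can be sketched as follows. This is a minimal illustration, assuming standard single-parity RAID 5 sets and dual-parity RAID 6 sets striped together; the function name is made up for this example, and it ignores hot spares and rebuild windows.

```python
def failures_tolerated(sets, parity_per_set):
    """Return (guaranteed, best_case) disk failures a striped RAID survives.

    guaranteed: any combination of this many failed disks is survivable,
                since the worst case puts every failure in the same set.
    best_case:  failures land evenly, parity_per_set in each set.
    """
    guaranteed = parity_per_set
    best_case = sets * parity_per_set
    return guaranteed, best_case


raid50 = failures_tolerated(sets=4, parity_per_set=1)  # 4x RAID 5 sets
raid60 = failures_tolerated(sets=3, parity_per_set=2)  # 3x RAID 6 sets

print("RAID 50 (guaranteed, best case):", raid50)
print("RAID 60 (guaranteed, best case):", raid60)
```

The guaranteed figure is the one that matters for planning: a third failure inside a single RAID 6 set loses the whole stripe, no matter how healthy the other sets are.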

Thanks for all your excellent comments.

Cheers,
Josh

Mark Krenz wrote:
> Wow, that's a lot of disks.
>
> I have one major suggestion. Don't make one big filesystem. Don't
> even make one big Volume Group. With that much space, I'd recommend
> dividing it up somehow, otherwise if you need to recover, it can take a
> day or more just to copy files over.
>
> At last year's Red Hat Summit, Rik van Riel gave a presentation called
> "Why Computers Are Getting Slower". He mostly talked from a low level
> point of view since he's a kernel developer, which was great. One of
> the things that he talked about is how we're getting to the point where
> filesystem sizes are getting too large for even fast disks to handle a
> recovery in a reasonable amount of time. And the algorithms need to be
> better optimized. So he recommended breaking up your filesystems into
> chunks. So on a server your /home partition might be /home1 /home2
> /home3, etc. and on a home machine you probably should put media files on
> a separate partition or maybe even break that up. Plus, using volume
> management like LVM is a necessity.
>
> On something like a mail server, with lots of little files, you may
> have millions of files and copying them over takes a lot of time, even
> on a SAN. I recently had to recover a filesystem with 6 million files
> on it and it was going to take about 16 hours or more just to copy stuff
> over to a SATA2 RAID-1 array running on a decent hardware raid
> controller, even though it was only about 180GB of data. This was direct
> disk to disk too, not over the network. What I had to do instead was
> some disk image trickery to get the data moved to a new set of disks (I
> had a RAID controller fail). If I had lost the array, I couldn't have
> done it this way.
>
> It was the first time I've had to do such a recovery in a while and
> wasn't expecting it to take so long. Immediately afterwards I decided
> that we need to breakup our filesystems into smaller chunks and also
> find ways to reduce the amount of data affected if a RAID array is lost.
>
> The short answer is that you can build all kinds of redundancy into
> your setup, but can still end up with the filesystem failing or
> something frying your filesystem that leads to major downtime.
>
> Of course this mythical thing called ZFS that comes with the J4400 may
> solve all these problems listed above.
>
> What kind of database system are you going to be using?
>
> Mark
>
> On Mon, Jul 13, 2009 at 06:45:54PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
>> Hi all,
>>
>> I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
>> and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
>> hardware RAID controller. The disk space will be used for NFS and a database server with a slight
>> emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
>>
>> Here are 2 initial configuration ideas:
>>
>> * RAID 50 (4x RAID 5 sets of 6 drives)
>> * RAID 60 (3x RAID 6 sets of 8 drives)
>>
>> I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
>> RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
>> route to go. I'm interested in hearing what others have to say especially if I've overlooked other
>> possibilities.
>>
>> I'm off to start simulating failures and benchmarking various configurations.
>>
>> Cheers,
>> Josh
>>
>
_______________________________________________
BLUG mailing list
BLUG@linuxfan.com
http://mailman.cs.indiana.edu/mailman/listinfo/blug

Re: [BLUG] Large RAID config suggestions?

On Mon, Jul 13, 2009 at 10:50:46PM GMT, Steven Black [blacks@indiana.edu] said the following:
> It was an otherwise trustworthy drive manufacturer, too. (I think
> Seagate.) Everybody has a bad batch now and then, and these just managed
> to slip through.
>

Steven brings up a good point here. Some administrators go so far as
to buy each of their drives from different places so as to try to get
different lot numbers. The theory is that drives in the same lot can go
bad at the same time (which indeed I've seen happen).

I recently bought 4 drives for a RAID 5 array and bought 2 drives from
Newegg, 1 from CDW and 1 from somewhere else. With my luck with hard
drives, though, the 2 that came from Newegg were in different lots,
while the 1 from CDW and the 1 from the other company had the same
manufacture date.

I can't remember where, but I saw someplace that would sell you X
number of drives and make sure they were all in different lot numbers.

Of course, you already bought the drives, right? What brand are they?
Do they come from Sun when you buy the array? With 24 drives, you can
count on a few breaking within the first couple of months and several
more breaking at around 4-5 years.

Mark

--
Mark Krenz
Bloomington Linux Users Group
http://www.bloomingtonlinux.org/

Monday, July 13, 2009

Re: [BLUG] Large RAID config suggestions?

Back in my youth (1996 or so) I administered a RAID5 system. (I won't
say how large it was. I said it was 1996, right? Things were smaller
then.)

Hot spares are the best invention *ever*. Back in 1996 they were not the
norm.

It was a simple RAID5 system. All the drives were of the same
manufacturer. That manufacturer had a bad batch. Before the replacement
drive arrived we had a second drive failure.

It was an otherwise trustworthy drive manufacturer, too. (I think
Seagate.) Everybody has a bad batch now and then, and these just managed
to slip through.

More recently, I found out that a drive failed in one of my boxes. It's
an ancient nightmare of a Solaris box, and I fully expected to need to
type some obscure command to get the replacement drive up and in the
system. I did a little investigation and found that all the data was
already on another drive. Though I replaced the drive, the drive I added
simply became a new hot spare.

Hot spares become much more important when you deal with more data. If
I have a drive fail at 4am and I have a hot spare, I can show up at the
office at my normal hours. By the time I come in, much of the data has
already been copied over to the hot spare. I can then make a support
call during normal business hours for the replacement drive.

If you don't have at least one hot spare in your system, you need to
make sure you have 1-2 of the required drives on hand. Yeah, you could
rely upon your service contract's same-day service, but it's a lot
nicer to have at least one drive immediately on hand. If you don't have
same-day service, you had better have a pair of spare drives because you
might just need them both.

It is important to have off-site backups, though. Not just backups,
*off-site* backups. You don't want to explain what happened to the
data when there was a building fire, flooding problem, etc. There are
problems that strike that can take down your whole machine room.

You also need a disaster recovery plan that starts from a set of documents
detailing the process, backup media, and money from the insurance
coverage, and turns that back into what you have in your machine room.
(And it needs to be doable by a replacement: assume you've just been
promoted.)

Just my two cents,
Steven Black

--
Steven Black <blacks@indiana.edu> / KeyID: 8596FA8E
Fingerprint: 108C 089C EFA4 832C BF07 78C2 DE71 5433 8596 FA8E

On Mon, Jul 13, 2009 at 05:35:30PM -0400, David Ernst wrote:
> Well, I don't think I have anything very sophisticated to say, but I'm
> inclined to agree with you about the 3x RAID6. By my calculations,
> you'll get 18T that way vs. 20T in your other proposal. I don't know
> what you're storing, but this is a lot of disk space, so probably no
> one will mind that sacrifice. Meanwhile, the RAID 6 option does give
> a slight emphasis to reliability over performance, as you wanted. So,
> basically, I think I'm just saying "your reasoning makes sense to
> me".
>
> I hate to bring this up, but twice in my life I've been affected by
> the failure of entire RAID arrays... Both were high-quality hardware
> RAID setups, and people said of both failures "this is supposed to
> never happen." In short, I recommend some other kind of backup in
> addition to the RAID, because things happen, and if your organization
> is concerned enough with reliability to consider RAID 6, I wouldn't
> assume that something like this would never happen.
>
> David
>
>
> On Mon, Jul 13, 2009 at 02:45:54PM -0400, Josh Goodman wrote:
> >
> >Hi all,
> >
> >I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
> >and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
> >hardware RAID controller. The disk space will be used for NFS and a database server with a slight
> >emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
> >
> >Here are 2 initial configuration ideas:
> >
> >* RAID 50 (4x RAID 5 sets of 6 drives)
> >* RAID 60 (3x RAID 6 sets of 8 drives)
> >
> >I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
> >RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
> >route to go. I'm interested in hearing what others have to say especially if I've overlooked other
> >possibilities.
> >
> >I'm off to start simulating failures and benchmarking various configurations.
> >
> >Cheers,
> >Josh
> >


Re: [BLUG] Large RAID config suggestions?

Well, I don't think I have anything very sophisticated to say, but I'm
inclined to agree with you about the 3x RAID6. By my calculations,
you'll get 18T that way vs. 20T in your other proposal. I don't know
what you're storing, but this is a lot of disk space, so probably no
one will mind that sacrifice. Meanwhile, the RAID 6 option does give
a slight emphasis to reliability over performance, as you wanted. So,
basically, I think I'm just saying "your reasoning makes sense to
me".
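
For reference, the 18T vs. 20T figures can be reproduced with a short sketch. It assumes 1 TB (decimal) drives, one parity drive per RAID 5 set and two per RAID 6 set, and ignores filesystem and LVM overhead; the function name is only illustrative.

```python
def usable_tb(sets, disks_per_set, parity_per_set, disk_tb=1):
    """Usable capacity of a striped RAID built from identical parity sets."""
    data_disks_per_set = disks_per_set - parity_per_set
    return sets * data_disks_per_set * disk_tb


raid50_tb = usable_tb(sets=4, disks_per_set=6, parity_per_set=1)
raid60_tb = usable_tb(sets=3, disks_per_set=8, parity_per_set=2)

print(f"RAID 50 (4 x 6 drives): {raid50_tb} TB usable")
print(f"RAID 60 (3 x 8 drives): {raid60_tb} TB usable")
print(f"Capacity given up for the extra parity: {raid50_tb - raid60_tb} TB")
```

Both layouts use all 24 drives; the 2 TB difference is the price of the second parity drive in each RAID 6 set.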

I hate to bring this up, but twice in my life I've been affected by
the failure of entire RAID arrays... Both were high-quality hardware
RAID setups, and people said of both failures "this is supposed to
never happen." In short, I recommend some other kind of backup in
addition to the RAID, because things happen, and if your organization
is concerned enough with reliability to consider RAID 6, I wouldn't
assume that something like this would never happen.

David


On Mon, Jul 13, 2009 at 02:45:54PM -0400, Josh Goodman wrote:
>
>Hi all,
>
>I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
>and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
>hardware RAID controller. The disk space will be used for NFS and a database server with a slight
>emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
>
>Here are 2 initial configuration ideas:
>
>* RAID 50 (4x RAID 5 sets of 6 drives)
>* RAID 60 (3x RAID 6 sets of 8 drives)
>
>I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
>RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
>route to go. I'm interested in hearing what others have to say especially if I've overlooked other
>possibilities.
>
>I'm off to start simulating failures and benchmarking various configurations.
>
>Cheers,
>Josh
>

Re: [BLUG] Large RAID config suggestions?

Wow, that's a lot of disks.

I have one major suggestion. Don't make one big filesystem. Don't
even make one big Volume Group. With that much space, I'd recommend
dividing it up somehow, otherwise if you need to recover, it can take a
day or more just to copy files over.

At last year's Red Hat Summit, Rik van Riel gave a presentation called
"Why Computers Are Getting Slower". He mostly talked from a low level
point of view since he's a kernel developer, which was great. One of
the things that he talked about is how we're getting to the point where
filesystem sizes are getting too large for even fast disks to handle a
recovery in a reasonable amount of time. And the algorithms need to be
better optimized. So he recommended breaking up your filesystems into
chunks. So on a server your /home partition might be /home1 /home2
/home3, etc. and on a home machine you probably should put media files on
a separate partition or maybe even break that up. Plus, using volume
management like LVM is a necessity.

On something like a mail server, with lots of little files, you may
have millions of files and copying them over takes a lot of time, even
on a SAN. I recently had to recover a filesystem with 6 million files
on it and it was going to take about 16 hours or more just to copy stuff
over to a SATA2 RAID-1 array running on a decent hardware RAID
controller, even though it was only about 180GB of data. This was direct
disk to disk too, not over the network. What I had to do instead was
some disk image trickery to get the data moved to a new set of disks (I
had a RAID controller fail). If I had lost the array, I couldn't have
done it this way.
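
To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes exactly 6 million files, 180 GB in decimal units, and the 16-hour estimate, so the results are only approximate.

```python
files = 6_000_000
data_bytes = 180 * 10**9      # "about 180GB", taken as decimal gigabytes
seconds = 16 * 3600           # the estimated 16-hour copy

files_per_sec = files / seconds
mb_per_sec = data_bytes / seconds / 10**6
avg_file_kb = data_bytes / files / 10**3

print(f"files copied per second : {files_per_sec:.0f}")
print(f"effective throughput    : {mb_per_sec:.2f} MB/s")
print(f"average file size       : {avg_file_kb:.0f} KB")
```

The effective throughput comes out to a few MB/s, far below what the disks can stream sequentially, which suggests the copy is bound by per-file overhead (seeks and metadata updates) rather than raw bandwidth. That is why a block-level disk image can be so much faster than a file-by-file copy.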

It was the first time I've had to do such a recovery in a while and
I wasn't expecting it to take so long. Immediately afterwards I decided
that we need to break up our filesystems into smaller chunks and also
find ways to reduce the amount of data affected if a RAID array is lost.

The short answer is that you can build all kinds of redundancy into
your setup, but can still end up with the filesystem failing or
something frying your filesystem that leads to major downtime.

Of course this mythical thing called ZFS that comes with the J4400 may
solve all these problems listed above.

What kind of database system are you going to be using?

Mark

On Mon, Jul 13, 2009 at 06:45:54PM GMT, Josh Goodman [jogoodman@gmail.com] said the following:
>
> Hi all,
>
> I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
> and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
> hardware RAID controller. The disk space will be used for NFS and a database server with a slight
> emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.
>
> Here are 2 initial configuration ideas:
>
> * RAID 50 (4x RAID 5 sets of 6 drives)
> * RAID 60 (3x RAID 6 sets of 8 drives)
>
> I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
> RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
> route to go. I'm interested in hearing what others have to say especially if I've overlooked other
> possibilities.
>
> I'm off to start simulating failures and benchmarking various configurations.
>
> Cheers,
> Josh
>

--
Mark Krenz
Bloomington Linux Users Group
http://www.bloomingtonlinux.org/

[BLUG] Large RAID config suggestions?

Hi all,

I have a 24 x 1 TB RAID array (Sun J4400) that is calling out to be initialized and I'm going round
and round on possible configurations. The system attached to this RAID is a RHEL 5.3 box w/
hardware RAID controller. The disk space will be used for NFS and a database server with a slight
emphasis given to reliability over performance. We will be using LVM on top of the RAID as well.

Here are 2 initial configuration ideas:

* RAID 50 (4x RAID 5 sets of 6 drives)
* RAID 60 (3x RAID 6 sets of 8 drives)

I'm leaning towards the RAID 60 setup because I'm concerned about the time required to rebuild a
RAID 5 set with 6x 1 TB disks. Having the cushion of one more disk failure per set seems the better
route to go. I'm interested in hearing what others have to say especially if I've overlooked other
possibilities.
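
A rough way to reason about the rebuild-time concern is sketched below. The rebuild rates are purely hypothetical placeholders, not measurements from this controller, and the result is only a lower bound: a rebuild has to write the full replacement drive, and it usually runs slower while the array is also serving NFS and database traffic.

```python
def rebuild_hours(disk_bytes, rate_bytes_per_sec):
    """Lower-bound time to rewrite one replacement disk at a sustained rate."""
    return disk_bytes / rate_bytes_per_sec / 3600


DISK_BYTES = 10**12  # one 1 TB drive, decimal units

# Hypothetical sustained rebuild rates in MB/s, chosen only for illustration.
for mb_s in (25, 50, 100):
    hours = rebuild_hours(DISK_BYTES, mb_s * 10**6)
    print(f"{mb_s:>3} MB/s -> at least {hours:.1f} hours")
```

During that whole window a RAID 5 set has no redundancy left, while a RAID 6 set can still absorb one more failure.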

I'm off to start simulating failures and benchmarking various configurations.

Cheers,
Josh


[BLUG] Fwd: Unix/MTA admin and Deliverability admin people in Indy/poss. to telecommute

The following post hit a mailing list my wife is on.

I stripped out some of the chatter more appropriate to that list, so
that it was more focused on the jobs.

It seemed potentially interesting to folks on the list.

Cheers,

--
Steven Black <blacks@indiana.edu> / KeyID: 8596FA8E
Fingerprint: 108C 089C EFA4 832C BF07 78C2 DE71 5433 8596 FA8E

> ---------- Forwarded message ----------
> Date: Sun, Jul 12, 2009 at 1:00 PM
> Subject: Unix/MTA admin and Deliverability admin people in Indy/poss. to telecommute
>
> If anyone's looking for a job right now, point them our way. I'm
> working for ExactTarget and we're hiring like mad. [...]
> We're growing like mad. We got something like
> $70M in venture capital a few months back, have signed some huge deals
> (including replacing Microsoft's entire internal mail system), have
> won some great awards, and are spreading out to SMS and voice.
>
> http://email.exacttarget.com/Company/Careers/OpenPositions.html
>
> [...] I know we're also hiring an
> Implementation Consultant (my least favorite left a couple weeks ago),
> more support folks, 60 programmers in the next year, and all types of
> crazy growth. Plus, if you hire on before our user conference, you'll
> get to see They Might Be Giants (or you could just go to the show
> they're doing that week in Indy).
>
> Linux Mail and System Administrator - We actually use FreeBSD
> http://tbe.taleo.net/NA1/ats/careers/requisition.jsp?org=EXACTTARGET&cws=1&rid=455
>
> Deliverability Administrator - Set up Sender Authentication for
> customers, assist with DNS set up. It's actually a fairly entry-level
> position. This is with my team.
> http://tbe.taleo.net/NA1/ats/careers/requisition.jsp?org=EXACTTARGET&cws=1&rid=495
>