Tuesday, December 22, 2009

Re: [BLUG] Fwd: possible causes of segfaults

Your ld segfaulted inside the shared library libc-2.9.so, which is
generally a relatively bug-free piece of software: it is so important
that essentially every program calls it. It should already have been
loaded and resident in memory, and the fact that seven separate
invocations of ld (the PIDs are different) all segfaulted in libc
makes me wonder whether the memory is at fault. So it might not be a
bad idea to run a memory test program (such as memtest86+) to check
your memory. Have a good holiday! --Shing-Shong
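
For reference, the kernel "segfault at" lines quoted below can be
decoded with a short script. This is only a sketch: it assumes the
x86-64 message format shown in this log and the usual meaning of the
low three bits of the x86 page-fault error code. The names
parse_segfault and decode_error are made up for illustration, and the
three sample lines are copied (unwrapped) from the quoted dmesg output.

```python
import re

# Matches the x86-64 kernel segfault message format seen in the quoted log.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_error(code):
    """Decode the low three bits of the x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "prog": m["prog"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "lib": m["lib"],
        # Offset of the faulting instruction from the library's load address.
        "offset": ip - base,
        "error": decode_error(int(m["err"])),
    }


LINES = [
    "[17217.872070] ld[16027]: segfault at 0 ip 00002ae3b445411b sp "
    "00007fffc39a74e8 error 4 in libc-2.9.so[2ae3b43d0000+168000]",
    "[17354.753195] ld[20115]: segfault at 0 ip 00002b832843d11b sp "
    "00007fff729f2fa8 error 4 in libc-2.9.so[2b83283b9000+168000]",
    "[263529.064795] convert[24941]: segfault at 7fffd973b6b8 ip "
    "00007fffdb1e2ed9 sp 00007fffd973b660 error 7 in "
    "libMagickCore.so.1.0.0[7fffdb0fe000+1b5000]",
]

for line in LINES:
    r = parse_segfault(line)
    print(f"{r['prog']}[{r['pid']}] {r['lib']}+{r['offset']:#x}: {r['error']}")
# ld[16027] libc-2.9.so+0x8411b: user-mode read, page not present
# ld[20115] libc-2.9.so+0x8411b: user-mode read, page not present
# convert[24941] libMagickCore.so.1.0.0+0xe4ed9: user-mode write, protection violation
```

The offset is the instruction pointer minus the library's load
address, so it can be compared across processes even though each
process maps the library at a different base.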

> Hiya,
>
> I just installed a new file / computation server at my work, and a few
> days into its tenure, I've started noticing some segfaults.
>
> For instance, I was running a bunch of image processing jobs, and
> ImageMagick's "convert" program segfaulted. I ran it again on the
> same data, and it did fine. So I'm wondering if I have bad hardware,
> or bad libraries, or what?
>
> Possible causes of segmentation faults that I know of:
> Hardware --- could be very random and difficult to find, might need to
> totally shut down the server and run a memory tester for days in order
> to find.
>
> Filesystem corruption --- should be reproducible, right? if "convert"
> segfaults once, it should do it again...
>
> Libraries / os problems --- should be reproducible too, right?
>
>
>
> from dmesg:
> [17217.872070] ld[16027]: segfault at 0 ip 00002ae3b445411b sp 00007fffc39a74e8 error 4 in libc-2.9.so[2ae3b43d0000+168000]
> [17354.753195] ld[20115]: segfault at 0 ip 00002b832843d11b sp 00007fff729f2fa8 error 4 in libc-2.9.so[2b83283b9000+168000]
> [19463.265457] ld[3673]: segfault at 0 ip 00002ad2f7a3a11b sp 00007fffd11023a8 error 4 in libc-2.9.so[2ad2f79b6000+168000]
> [19474.653491] ld[3680]: segfault at 0 ip 00002b7f10f0e11b sp 00007fffbac8a978 error 4 in libc-2.9.so[2b7f10e8a000+168000]
> [19507.935271] ld[3687]: segfault at 0 ip 00002af5c9eb511b sp 00007fff12fe00d8 error 4 in libc-2.9.so[2af5c9e31000+168000]
> [19528.740436] ld[3701]: segfault at 0 ip 00002b265616a11b sp 00007fff2ceb8d98 error 4 in libc-2.9.so[2b26560e6000+168000]
> [19606.865585] ld[3754]: segfault at 0 ip 00002ae3a079811b sp 00007fff6bc4d3c8 error 4 in libc-2.9.so[2ae3a0714000+168000]
> [263529.064795] convert[24941]: segfault at 7fffd973b6b8 ip 00007fffdb1e2ed9 sp 00007fffd973b660 error 7 in libMagickCore.so.1.0.0[7fffdb0fe000+1b5000]
> [268495.776398] convert[28595]: segfault at 7fffc7e2b608 ip 00007fffc9e13ed9 sp 00007fffc7e2b5b0 error 7 in libMagickCore.so.1.0.0[7fffc9d2f000+1b5000]
>
>
>
> Any advice?
> Thanks,
> -Thomas
>
> _______________________________________________
> BLUG mailing list
> BLUG@linuxfan.com
> http://mailman.cs.indiana.edu/mailman/listinfo/blug
>

Re: [BLUG] Fwd: possible causes of segfaults

Hiya,

Well, I'm using Ubuntu, and generally upgrades go fine without
rebooting... It's a nice server machine with ECC memory. So yeah, it
should really be at least telling me if I'm getting memory corruption,
if not fixing it outright.
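
On Linux, ECC error counts are normally exposed through the EDAC
sysfs tree, so one way to see whether the memory controller has been
reporting anything is to read those counters. The sketch below assumes
an EDAC driver for this machine's memory controller is loaded; if it
is not, the directory will simply be missing and the function returns
an empty result. The function name read_edac_counts is made up for
illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count is the number of corrected errors and ue_count the number
    of uncorrected errors reported for each memory controller.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # No EDAC driver loaded (or not a Linux system): nothing to report.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this kernel.
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found")
    for mc, entry in found.items():
        print(mc, "corrected:", entry["ce_count"],
              "uncorrected:", entry["ue_count"])
```

A nonzero corrected count would suggest the ECC logic is catching
errors; an empty result only means the counters are not available, not
that the memory is healthy.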

Hmm, once, I had a computer that, on one version of Linux, was fine,
and on the next version (2.6.15->2.6.16 or something) would flip one
bit every 600MB of i/o. It slowly corrupted the entire filesystem and
... yuck. So I guess right now I'm suspecting Linux, or something
subtle in the hardware.
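
One rough way to tell an intermittent fault from a deterministic one
is to rerun the same command on the same input many times and count
how often it dies with SIGSEGV. This is a sketch only, for a
POSIX system, where Python reports death by signal as a negative
return code. The function name count_segfaults and the file names in
the usage line are hypothetical.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs=20):
    """Run cmd `runs` times and return how many runs were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a child killed by signal N has returncode == -N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python repeat.py convert in.png out.jpg
    n = count_segfaults(sys.argv[1:], runs=50)
    print(f"{n} of 50 runs ended in a segfault")
```

If every run crashes, a software or data problem looks more likely;
if only some runs crash on identical input, that leans toward hardware
or some other source of nondeterminism, though it does not prove it.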

Thanks for your advice! It's good to narrow it down a bit. I'll
probably come in sometime over break (my boss will probably work the
whole week, but nobody else is here, and he can deal) and run that
memory test, as you say.

Bleh,
-Thomas

On Mon, Dec 21, 2009 at 4:33 PM, Steven Black <blacks@indiana.edu> wrote:
> Did you upgrade the system without rebooting it? (That should be
> reproducible, though.) Are you using a version of ImageMagick compiled
> for a different distribution of Linux? (I know a lot of RPM-based
> systems do not bundle a lot of programs.)
>
> If you downloaded an RPM that wasn't compiled specifically for your
> distribution/version, I would expect that to be the cause of the
> problem. If that's the cause of the problem, building it from source
> should clear it up.
>
> If you're using the version of ImageMagick that comes with your
> distribution and you've not recently performed an upgrade, memory
> corruption seems the most likely candidate. Does that machine have
> non-parity memory? Parity or ECC memory would likely report an error
> instead of silently producing bogus data. (That is why I hate
> non-parity memory.)
>
> It could also be a CPU fault.
>
> Problems with hard drives tend to show up as errors with the specific
> media. (It'll list the device producing the error.) Hardware problems
> regarding media do not normally produce segfaults, unless the
> application fails to handle the error case.
>
> My recommendation: Pick an upcoming weekend and tell them the services
> of this machine will be unavailable. Then start the memory test at
> 5:15pm (adjusted for the end of your workday) and run a memory checker
> all weekend. Then come in 15 minutes early on Monday to check for errors
> and reboot the system.
>
> Cheers,
> Steven Black
>
>
>
> On Mon, Dec 21, 2009 at 03:58:40PM -0500, Thomas Smith wrote:
>> Hiya,
>>
>> I just installed a new file / computation server at my work, and a few
>> days into its tenure, I've started noticing some segfaults.
>>
>> For instance, I was running a bunch of image processing jobs, and
>> ImageMagick's "convert" program segfaulted.  I ran it again on the
>> same data, and it did fine.  So I'm wondering if I have bad hardware,
>> or bad libraries, or what?
>>
>> Possible causes of segmentation faults that I know of:
>> Hardware --- could be very random and difficult to find, might need to
>> totally shut down the server and run a memory tester for days in order
>> to find.
>>
>> Filesystem corruption --- should be reproducible, right?  if "convert"
>> segfaults once, it should do it again...
>>
>> Libraries / os problems --- should be reproducible too, right?
>>
>>
>>
>> from dmesg:
>> [17217.872070] ld[16027]: segfault at 0 ip 00002ae3b445411b sp 00007fffc39a74e8 error 4 in libc-2.9.so[2ae3b43d0000+168000]
>> [17354.753195] ld[20115]: segfault at 0 ip 00002b832843d11b sp 00007fff729f2fa8 error 4 in libc-2.9.so[2b83283b9000+168000]
>> [19463.265457] ld[3673]: segfault at 0 ip 00002ad2f7a3a11b sp 00007fffd11023a8 error 4 in libc-2.9.so[2ad2f79b6000+168000]
>> [19474.653491] ld[3680]: segfault at 0 ip 00002b7f10f0e11b sp 00007fffbac8a978 error 4 in libc-2.9.so[2b7f10e8a000+168000]
>> [19507.935271] ld[3687]: segfault at 0 ip 00002af5c9eb511b sp 00007fff12fe00d8 error 4 in libc-2.9.so[2af5c9e31000+168000]
>> [19528.740436] ld[3701]: segfault at 0 ip 00002b265616a11b sp 00007fff2ceb8d98 error 4 in libc-2.9.so[2b26560e6000+168000]
>> [19606.865585] ld[3754]: segfault at 0 ip 00002ae3a079811b sp 00007fff6bc4d3c8 error 4 in libc-2.9.so[2ae3a0714000+168000]
>> [263529.064795] convert[24941]: segfault at 7fffd973b6b8 ip 00007fffdb1e2ed9 sp 00007fffd973b660 error 7 in libMagickCore.so.1.0.0[7fffdb0fe000+1b5000]
>> [268495.776398] convert[28595]: segfault at 7fffc7e2b608 ip 00007fffc9e13ed9 sp 00007fffc7e2b5b0 error 7 in libMagickCore.so.1.0.0[7fffc9d2f000+1b5000]
>>
>>
>>
>> Any advice?
>> Thanks,
>> -Thomas
>>
>

--
http://resc.smugmug.com/
