The Microcode. Especially on SMP systems, the CPUs may need a microcode upgrade.
Since the Pentium division disaster, Intel has made its CPUs field-upgradable! The
CPU can be bumped a few versions by a special instruction from the BIOS. These
upgrades usually come with your BIOS, so make sure you're running the latest
BIOS, especially if you have an SMP system. -- Jeffrey Friedl (Email withheld).
QUESTION
RAM timing problems? I fiddled with the BIOS settings more than
a month ago. I've compiled numerous kernels in the meantime and nothing went
wrong. It can't be the RAM timing. Right?
ANSWER
Wrong. Do you think that the RAM manufacturers have a machine
that makes 60ns RAMs and another one that makes 70ns RAMs? Of course not! They
make a bunch, and then test them. Some meet the specs for 60 ns, others don't.
Those might be 61 ns if the manufacturer had to put a number on them. In
that case it is quite likely that the RAM works in your computer as long as, for
example, the temperature is below 40 degrees centigrade (chips become slower
when the temperature rises. That's why some supercomputers need so much cooling).
However "the coming of summer" or a long compile job may push the temperature
inside your computer over the "limit". -- Philippe Troin (ptroin@compass-da.com)
QUESTION
I got suckered into not buying ECC memory because it was
slightly cheaper. I feel like a fool. I should have bought the more expensive
ECC memory. Right?
ANSWER
Buying the more expensive ECC memory and motherboards protects
you against a certain type of error: those caused randomly by passing alpha
particles.
Most people can reproduce "signal 11" problems within half
an hour using "gcc" but cannot reproduce them by memory testing for hours in a
row. That proves to me that it is not simply a random alpha particle flipping a
bit: that would get noticed by the memory test too. This means that something
else is going on. I have the impression that most sig11 problems are caused by
timing errors on the CPU <-> cache <-> memory path. ECC on your main
memory doesn't help you in that case. When should you buy ECC? a) When you feel
you need it. b) When you have LOTS of RAM. (Why not a cut-off number? Because
the cut-off changes with time, just like "LOTS".) Some people feel very strongly
about everybody using ECC memory. I refer them to reason "a)".
QUESTION
Memory problems? My BIOS tests my memory and tells me it's OK. I
have this fancy DOS program that tells me my memory is OK. Can't be memory,
right?
ANSWER
Wrong. The memory test in the BIOS is utterly useless. It may
even occasionally OK more memory than really is available, let alone test
whether it is good or not.
A friend of mine used to have a 640k PC (yeah,
this was a long time ago) which had a single 64kbit chip instead of a 256kbit
chip in the second 256k bank. This means that he effectively had 320k working
memory. Sometimes the BIOS would test 384k as "OK". Anyway, only certain
applications would fail. It was very hard to diagnose the actual
problem....
Most memory problems only occur under special circumstances.
Those circumstances are hardly ever known. gcc seems to exercise them. Some
memory tests, especially BIOS memory tests, don't. I'm no longer working on
creating a floppy with a linux kernel and a good memory tester on it. Forget
about bugging me about it......
The reason is that a memory test causes the
CPU to execute just a few instructions, and the memory access patterns tend to
be very regular. Under these circumstances only a very small subset of the
memories breaks down. If you're studying Electrical Engineering and are
interested in memory testing, a master's thesis could be to figure out what's
going on. There are computer manufacturers that would want to sponsor such a
project with some hardware that clients claim to be unreliable, but doesn't fail
the production tests......
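To make the difference between "regular" and "erratic" access concrete, here is
a minimal Python sketch of the two styles. Everything in it is an assumption
made for illustration: the function names, the buffer size and the number of
rounds are invented, and a user-space script like this goes through the caches
and virtual memory, so it is NOT a usable memory tester. It only shows what the
two access patterns look like.

```python
import random

def regular_pass(buf, pattern):
    """Classic tester style: fill the buffer front to back, then read it back."""
    for i in range(len(buf)):
        buf[i] = pattern
    # Return the offsets whose content no longer matches the pattern.
    return [i for i in range(len(buf)) if buf[i] != pattern]

def erratic_pass(buf, seed, rounds):
    """Compiler-like style: pseudo-random values at pseudo-random offsets,
    with reads and writes interleaved."""
    rng = random.Random(seed)
    expected = {}   # offset -> last value written there
    errors = []
    for _ in range(rounds):
        i = rng.randrange(len(buf))
        # Check an earlier write before overwriting it.
        if i in expected and buf[i] != expected[i]:
            errors.append(i)
        value = rng.randrange(256)
        buf[i] = value
        expected[i] = value
    # Final check of every offset that was ever written.
    errors.extend(i for i, v in expected.items() if buf[i] != v)
    return errors

if __name__ == "__main__":
    buf = bytearray(1 << 16)   # 64k, an arbitrary size for the demonstration
    print("regular:", len(regular_pass(buf, 0x55)), "mismatches")
    print("erratic:", len(erratic_pass(buf, seed=1, rounds=50000)), "mismatches")
```

On healthy hardware both passes report zero mismatches. The point is only that
the second pattern is much harder to predict, which is closer to what gcc does.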
QUESTION
Does it only happen when I compile a kernel?
ANSWER
Nope. There is no way your hardware can know that you are
compiling a kernel. It just so happens that a kernel compile is very tough on
your hardware, so it just happens a lot when you are compiling a kernel.
Compiling other large packages like gcc or glibc also often triggers the sig11.
- People have seen "random" crashes for example while installing using the
slackware installation script.... -- dhn@pluto.njcc.com
- Others get "general protection errors" from the kernel (with the
crashdump). These are usually in /var/adm/messages. -- fox@graphics.cs.nyu.edu
- Some see bzip2 crash with "signal 11" or with "internal assertion
failure (#1007)." Bzip2 is pretty well-tested, so if it crashes, it's likely
not a bug in bzip2. -- Julian Seward (jseward@acm.org)
QUESTION
Nothing crashes on NT, Windows 95, 98, Millennium or XP. It must
be something Linux specific.
ANSWER
First of all, Linux stresses your hardware more than all of the
above. Some OSes like the Microsoft ones named above crash in unpredictable ways
anyway. Nobody is going to call Microsoft and say "hey, my windows box crashed
today". If you do anyway, they will tell you that you, the user, made an error
(see the interview with Bill
Gates in a German magazine....) and that since it works now, you should shut
up.
Those OSes are also somewhat more "predictable" than Linux. This means
that Excel might always be loaded in the exact same memory area. Therefore when
the bit-error occurs, it is always Excel that gets it. Excel will crash. Or
Excel will crash another application. Anyway, it will seem to be a single
application that fails, and not related to memory.
What I am sure of is that
a cleanly installed Linux system should be able to compile the kernel without
any errors. Certainly no sig-11 ones. (** Exception: Red Hat 5.0 with a Cyrix
processor. See elsewhere. **)
Really Linux and gcc stress your hardware more
than other OSes. If you need a non-linux thingy that stresses your hardware to
the point of crashing, you can try winstone. -- Jonathan Bright
(bright@informix.com)
QUESTION
Is it always signal 11?
ANSWER
Nope. Other signals like four, six and seven also occur
occasionally. Signal 11 is most common though.
As long as memory is getting corrupted, anything can happen. I'd expect bad
binaries to occur much more often than they really do. Anyway, it seems that the
odds are heavily biased towards gcc getting a signal 11. Also seen:
- free_one_pmd: bad directory entry 00000008
- EXT2-fs warning (device 08:14): ext_2_free_blocks bit already cleared for
block 127916
- Internal error: bad swap device
- Trying to free nonexistent swap-page
- kfree of non-kmalloced memory ...
- scsi0: REQ before WAIT DISCONNECT IID
- Unable to handle kernel NULL pointer dereference at virtual address
c0000004
- put_page: page already exists 00000046
- invalid operand: 0000
- Whee.. inode changed from under us. Tell Linus
- crc error -- System halted (During the uncompress of the Linux kernel)
- Segmentation fault
- "unable to resolve symbol"
- make [1]: *** [sub_dirs] Error 139
make: *** [linuxsubdirs] Error 1
- The X Window system can terminate with a "caught signal xx"
The first few are cases where the kernel "suspects" a kernel programming error
that is actually caused by the bad memory. The last few point to application
programs that end up with the trouble.
-- S.G.de Marinis (trance@interseg.it)
-- Dirk Nachtmann
(nachtman@kogs.informatik.uni-hamburg.de)
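The "Error 139" in the make output above is not an arbitrary number. By the
usual shell convention (an assumption about your setup, though a very common one
on Linux), a command killed by signal N is reported as exit status 128+N, so
139 is 128 + 11: signal 11 again. A minimal Python sketch of that arithmetic
follows; the function name is made up for illustration.

```python
import signal

def describe_make_error(code):
    """Interpret the N from a make "Error N" line, assuming the shell's
    128 + signal-number convention for commands killed by a signal."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code} (no signal)"

if __name__ == "__main__":
    print(describe_make_error(139))
    print(describe_make_error(1))
```

So the "Error 139" from the sub-make is the sig11, and the "Error 1" on the line
after it is just the top-level make giving up.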
QUESTION
What do I do?
ANSWER
Here are some things to try when you want to find out what is
wrong... note: Some of these will significantly slow your computer down. These
things are intended to get your computer to function properly and allow you to
narrow down what's wrong with it. With this information you can for example try
to get the faulty component replaced by your vendor.
The
hardest part is that most people will be able to do all of the above except
borrowing memory from someone else, and it doesn't make a difference. This makes
it likely that it really is the RAM. RAM used to be one of the priciest parts of
a PC, so you would rather not arrive at this conclusion, but, I'm sorry, I get
lots of reactions that in the end turn out to be the RAM. However don't despair
just yet: your RAM may not be completely wasted: you can always try to trade it
in for different or more RAM.
QUESTION
I had my RAMs tested in a RAM-tester device, and they are OK.
Can't be the RAM, right?
ANSWER
Wrong. It seems that the errors that are currently occurring in
RAMs are not detectable by RAM-testers. It might be that your motherboard is
accessing the RAMs in dubious ways or otherwise messing up the RAM while it is
in YOUR computer. The advantage is that you can sell your RAM to someone who
still has confidence in his RAM-tester......
QUESTION
What other hardware could be the problem?
ANSWER
Well, any hardware problem inside your computer. But things that
are easy to check should be checked first. So, for example, all your cards
should be correctly inserted into the motherboard.
QUESTION
Why is the Red Hat install bombing on me?
ANSWER
The Red Hat 5.x, 6.x and 7.x install has problems on some
machines. Try running the install with only 32M. This can usually be done with
mem=32m as a boot parameter.
It could be that there is a read error on the CD. The installer handles this
less than perfectly. Make sure that your CD is flawless! It seems that the
installer will bomb on marginal CDs!
People report, and I've seen with my own eyes, that Red Hat installs can go
wrong (crash with signal 7 or signal 11) on machines that are perfectly in
order. My machine was and still is 100% reliable (actually the machine I tested
this on, is by now reliably dead). People are getting into trouble by wiping the
old "working just fine" distribution, and then wanting to install a more recent
Red Hat distribution. Going back is then no longer an option, because going back
to 5.x also results in the same "crashes while installing".
Patrick Haley (haleyp@austin.rr.com) reports that he tried all memory
configurations up to 96Mb (32 & 64) and found that only when he had 96Mb
installed, the install would work. This is also consistent with my own
experience (of Red Hat installs failing): I tried the install on a 32M machine.
NEW: It seems that this may be due to a kernel problem. The kernel may
(temporarily) run low on memory and kill the current process. The fix by Hubert
Mantel (mantel@suse.de) is at: http://juanjox.linuxhq.com/patch/20-p0459.html.
If this is actually the case, try switching to the second virtual console
(ctrl-alt-F2) and type "sync" there every few seconds. This reduces the amount
of memory taken by harddisk-buffers... I would really appreciate hearing from
you if you've seen the Red Hat install crash two or more times in a row, and
then were able to finish the install using this trick!!!
What do you do to get around this problem?...
- Use SuSE. It's better: It doesn't crash during the installation.
(Moreover, it actually is better. ;-)
- Maybe you're running into a bad-block on your CD. This can be
drive-dependent. If that's the case, try making a copy of the CD in another
drive. Try borrowing someone else's copy of Red Hat.
- Try configuring a GIGABYTE of swap. I have two independent reports from
people who got through with a gig of swap. Please report to me if it
helps!
- Modify the "settings" for the harddisk. Changing the setting from "LBA" to
"NORMAL" in the bios has helped for at least one person. If you try this, I'd
really appreciate it if you'd EMail
me: I would like to hear from you if it helps or not. (and what you
exactly changed to get it to work)
- I got my machine to install by installing a minimal base system,
and then adding packages to the installed system.
- Someone suggested that the machine might be out-of-memory when this
happens. Try having a swap partition ready. Also, the install may be
"prepared" to handle low-memory situations, but misjudge the situation. For
example, it may load a RAMDISK, leaving just 1M of free RAM, and then try
to load a 2M application. So if you have 16M of RAM, booting with mem=14M may
actually help, as the "load RAMDISK" stage would then fail and the install
would then know to run off the CD instead of off the RAMDISK. (Installs used
to work for >8M machines. Is that still true?)
- Try, in one session to clear the disk of all the partitions that are going
to be used by Linux. Reboot. Then try the install. Either by partitioning
manually, or by letting the install program figure it out. (I take it that Red
Hat has that possibility too, SuSE has it...) If this works for you, I'd
appreciate it if you'd tell me.
- A corrupted download can also cause this. Duh.
- Someone reports that installs on 8Mb machines no longer work, and that the
install ungracefully exits with a sig7. -- Chris Rocco (crocco@earthlink.net)
- One person reports that disabling "BIOS shadow" (system & VIDEO),
helped for him. As Linux doesn't use the BIOS, shadowing it doesn't help. Some
computers may even give you 384k of extra RAM if you disable the shadowing.
Just disable it, and see what happens. -- Philippe d'Offay
(pdoffay@pmdsoft.com).
QUESTION
What are other possibilities?
ANSWER
Others have noted the following possibilities:
- The compiler and libc included in Red Hat 5.0 have an odd interaction with
the Cyrix processor. It crashes the compiler. This is VERY odd. I would think
that the only way that this can be the case is when the Cyrix has a bug that
has gone undetected all this time, and reliably gets triggered when THAT gcc
compiles the Linux kernel. Anyway, if you just want to compile a kernel, you
should get a new compiler and/or libc from the Red Hat website. (Start at the
homepage, and click errata.)
- Compiling a 2.0.x kernel with a 2.8.x gcc or any egcs doesn't work. There
are a few bugs in the kernel that don't show up because gcc 2.7.x does a lousy
job optimizing it. gcc 2.8.x and egcs just dump some of the code because we
didn't tell it not to. Anyway, you usually get a kernel that seems to work but
has funny bugs. For example X may crash with a signal 11. Oh, and before you
ask, no it's not going to be fixed. Don't bother Alan or Linus about this OK?
-- Hans Peter Verne (h.p.verne@kjemi.uio.no)
- The pentium-optimizing-gcc (the one with the version number ending in "p")
fails with the default options on certain source files like floppy.c in the
kernel. The "triggers" are in the kernel, libc and in gcc itself. This is
easily diagnosed as "not a hardware problem" because it always happens in the
same place. You can either disable some optimizations (try -fno-unroll-loops
first) or use another gcc. -- Evan Cheng (evan@top.cis.syr.edu) (In other
words: gcc 2.7.2p crashes with sig11 on floppy.c . Workaround-1: Use plain
gcc. Workaround-2: Manually compile floppy.c with "-O" instead of "-O2". )
- A bad connection between a disk and the system. For example IDE cables are
only allowed to be 40cm (16") long. Many systems come with longer cables. Also
a removable IDE rack may add enough trouble to crash a system.
- A badly misconfigured gcc -- some parts from one version, some from
another. After a few weeks I ended up re-installing from scratch to get
everything right. -- Richard H. Derr III (rhd@Mars.mcs.com).
- Gcc or the resulting application may terminate with sig11 when a program
is linked against the SCO libraries (which come with iBCS). This occurs on
some applications that have -L/lib in their LDFLAGS....
- When compiling a kernel with an ELF compiler, but configured for a.out (or
the other way around, I forgot) you will get a signal 11 on the first call to
"ld". This is easily identified as a software problem, as it always occurs on
the FIRST call to "ld" during the build. -- REW
- An Ethernet card together with a badly configured PCI BIOS. If your (ISA)
Ethernet card has an aperture on the ISA bus, you might need to configure it
somewhere in the BIOS setup screens. Otherwise the hardware would look on the
PCI bus for the shared memory area. As the ISA card can't react to the
requests on the PCI bus, you are reading empty "air". This can result in
segmentation faults and kernel crashes. -- REW
- Corrupted swap partition. Tony Nugent (T.Nugent@sct.gu.edu.au) reports he
used to have this problem and solved it by an mkswap on his swap partition.
(Don't forget to type "sync" before doing anything else after an mkswap. --
Louis J. LaBash Jr. (lou@minuet.siue.edu))
- NE2000 card. Some cheap Ne2000 cards might mess up the system. -- Danny
ter Haar (dth@cistron.nl) I personally might have had similar problems, as my
mail server crashed hard every now and then (once a day). It now seems that
1.2.13 and lots of the 1.3.x kernels have this bug. I haven't seen it in
1.3.48. Probably got fixed somewhere in the meantime.... -- REW
- Power supply? No, I don't think so. A modern heavy system with two or three
harddisks, both SCSI and IDE, will not exceed 120 Watts or so. If you have loads
of old harddisks and old expansion cards the power requirements will be
higher, but still it is very hard to reach the limits of the power supply. Of
course some people manage to find loads of old full-size harddisks and install
them into their big-tower. You can indeed overload a powersupply that way. --
Greg Nicholson (greg@job.cba.ua.edu) A faulty power supply CAN of course
deliver marginal power, which causes all of the malfunctioning that you read
about in this file.... -- Thorsten Kuehnemann (thorsten@actis.de)
- An inconsistent ext2fs. Some circumstances can cause the kernel code of
the ext2 file system to result in Signal 11 for Gcc. -- Morten Welinder
(terra@diku.dk)
- CMOS battery. Even if you set the BIOS as you want it, it could be
changing back to "bad" settings under your nose if the CMOS battery is bad. --
Heonmin Lim (coco@me.umn.edu)
- No or too little swap space. Gcc doesn't gracefully handle the "out of
memory" condition. -- Paul Brannan (brannanp@musc.edu)
- Incompatible libraries. When you have a symlink from "libc.so.5" pointing
to "libc.so.6", some applications will bomb with sig11. -- Piete Brooks
(piete.brooks@cl.cam.ac.uk).
- Broken mouse. Somehow, a mouse seems to be able to break in a way that it
causes some (mouse related) programs to crash with Sig11. I've seen it happen
on an X server that would crash if you moved the mouse quickly. Matthew might
not even have been moving his mouse. -- REW & Matthew Duggan
(stauff@guarana.org).
- Badly seated RAM. Make sure your RAM is correctly seated into the
socket.... -- Carroll Kong (me@carrollkong.com).
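Several of the items above are identified as software problems because the
failure "always happens in the same place", while bad hardware tends to fail at
a different spot each time. That rule of thumb can be sketched in Python as
below. The function names and the idea of comparing the last line of output are
assumptions made for this illustration, not a tool that comes with this FAQ.

```python
import subprocess
import sys

def run_once(cmd):
    """Run cmd and return (exit status, last non-empty line of its output).
    A negative status means the command itself was killed by that signal."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, errors="replace")
    lines = [ln for ln in proc.stdout.splitlines() if ln.strip()]
    return proc.returncode, (lines[-1] if lines else "")

def classify(outcomes):
    """Apply the rule of thumb to a list of (status, last line) pairs."""
    failures = [o for o in outcomes if o[0] != 0]
    if not failures:
        return "no failure seen (this does not prove the hardware is fine)"
    if len(failures) == len(outcomes) and len(set(failures)) == 1:
        return "always fails in the same place: suspect software"
    return "fails intermittently or in different places: suspect hardware"

if __name__ == "__main__":
    # Pass the command to repeat on the command line; the default below is a
    # harmless placeholder so that the script runs as-is.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print(classify([run_once(cmd) for _ in range(3)]))
```

For a fair comparison every run should start from the same state, for example a
freshly cleaned source tree. Otherwise make simply continues where the previous
run stopped and the runs cannot be compared.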
QUESTION
I found that running ..... detects errors much quicker than
just compiling kernels. Please mention this on your site.
ANSWER
Many people email me with notes like this. However, what many
don't realize is that they encountered ONE case of problematic hardware. The
person recommending "unzip -t" happened to have a certain broken DRAM stick. And
unzip happened to "find" that much quicker than a kernel compile.
However, I'm sure that for many other problems, the kernel compile WOULD find
it, while other tests don't. I think that the kernel compile is good because it
stresses lots of different parts of the computer. Many other tests exercise
just one area. If that area happens to be broken in your case, it will show a
problem much quicker than "kernel compile" will. But if your computer is OK on
that area and broken in another, the "faster" test may just tell you your
computer is OK, while the kernel compile test would have told you something was
wrong.
In any case, I might just as well list what people think are good tests,
which they are, but not as general as the "try and compile a kernel" test....
- Run unzip while compiling kernels. Use a zipfile about as large as RAM.
- use "memtest86" found at: http://www.memtest86.com/.
- do dd if=/dev/hda of=/dev/null while compiling kernels.
- run md5sum on large trees.
Note that whatever fast method you may
find to tell you that your computer is broken, it won't guarantee your computer
is fine if such a test suddenly doesn't fail anymore. I always recommend that
after fiddling with things to make it work, you should run a 24-hour
kernel-compile test.
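The "md5sum on large trees" idea boils down to this: read the same unchanging
files several times and check that the checksum comes out the same every time.
A minimal Python sketch follows. The function names and the default of three
passes are assumptions for illustration. Use a tree that nothing else is
writing to, and preferably one larger than your RAM, since otherwise the later
passes may mostly be served from the buffer cache rather than from the disk.

```python
import hashlib
import os

def hash_tree(root):
    """Return {relative path: MD5 hex digest} for every regular file under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.md5()
            with open(path, "rb") as f:
                # Read in 1M chunks so large files do not need to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests

def unstable_files(root, passes=3):
    """Hash the tree `passes` times; return the files whose digest differed
    from the first pass (or that appeared or vanished in between)."""
    baseline = hash_tree(root)
    bad = set()
    for _ in range(passes - 1):
        current = hash_tree(root)
        bad.update(p for p in baseline.keys() | current.keys()
                   if baseline.get(p) != current.get(p))
    return sorted(bad)
```

Calling unstable_files("/usr/src/linux") on an idle source tree (the path is
only an example) should return an empty list. A non-empty list on files that
nobody touched points at the disk, the cabling, or the memory. As said above,
an empty list does not prove that your computer is fine.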
QUESTION
Why isn't "memtest86" the first to try if I suspect memory
problems?
ANSWER
Feel free to do so. Some of this is black magic. However, when
"memtest86" tells you that your RAM is ok, you might be tempted to believe it.
It's telling you that it couldn't find any problems. It's not telling you
that your RAM is flawless.
In my experience, RAM related problems are sometimes not found using a memory
tester. The patterns are all nice and regular. Some problematic RAM simply works
well under that kind of stress, but fails under the more erratic stress patterns
caused by "gcc" or "zip".
So, I still recommend that you try verifying your system using kernel
compiles, and not trusting a memory tester....
QUESTION
I don't believe this. To whom has this happened?
ANSWER
Well for one it happened to me personally. But you don't have to
believe me. It also happened to:
- Johnny Stephens (icjps@asuvm.inre.asu.edu)
- Dejan Ilic (d92dejil@und.ida.liu.se)
- Rick Tessner (rick@myra.com)
- David Fox (fox@graphics.cs.nyu.edu)
- Darren White (dwhite@baker.cnw.com) (L2 cache)
- Patrick J. Volkerding (volkerdi@mhd1.moorhead.msus.edu)
- Jeff Coy Jr. (jcoy@gray.cscwc.pima.edu) (Temp problems)
- Michael Blandford (mikey@azalea.lanl.gov) (Temp problems: CPU fan failed)
- Alex Butcher (Alex.Butcher@bristol.ac.uk) (Memory waitstates)
- Richard Postgate (postgate@cafe.net) (VLB loading)
- Bert Meijs (L.Meijs@et.tudelft.nl) (bad SIMMs)
- J. Van Stonecypher (scypher@cs.fsu.edu)
- Mark Kettner (kettner@cat.et.tudelft.nl) (bad SIMMs)
- Naresh Sharma (n.sharma@is.twi.tudelft.nl) (30->72 converter)
- Rick Lim (ricklim@freenet.vancouver.bc.ca) (Bad cache)
- Scott Brumbaugh (scottb@borris.beachnet.com)
- Paul Gortmaker (paul.gortmaker@anu.edu.au)
- Mike Tayter (tayter@ncats.newaygo.mi.us) (Something with the cache)
- Benni ??? (benni@informatik.uni-frankfurt.de) (VLB Overloading)
- Oliver Schoett (os@sdm.de) (Cache jumper)
- Morten Welinder (terra@diku.dk)
- Warwick Harvey (warwick@cs.mu.oz.au) (bit error in cache)
- Hank Barta (hank@pswin.chi.il.us)
- Jeffrey J. Radice (jjr@zilker.net) (Ram voltage)
- Samuel Ramac (sramac@vnet.ibm.com) (CPU tops out)
- Andrew Eskilsson (mpt95aes@pt.hk-r.se) (DRAM speed)
- W. Paul Mills (wpmills@midusa.net) (CPU fan disconnected from CPU)
- Joseph Barone (barone@mntr02.psf.ge.com) (Bad cache)
- Philippe Troin (ptroin@compass-da.com) (delayed RAM timing trouble)
- Koen D'Hondt (koen@dutlhs1.lr.tudelft.nl) (more kernel error messages)
- Bill Faust (faust@pobox.com) (cache problem)
- Tim Middlekoop (mtim@lab.housing.fsu.edu) (CPU temp: fan installed)
- Andrew R. Cook (andy@anchtk.chm.anl.gov) (bad cache)
- Allan Wind (wind@imada.ou.dk) (P66 overheating)
- Michael Tuschik (mt2@irz.inf.tu-dresden.de) (gcc2.7.2p victim)
- R.C.H. Li (chli@en.polyu.edu.hk) (Overclocking: ok for months...)
- Florin (florin@monet.telebyte.nl) (Overclocked CPU by vendor)
- Dale J March (dmarch@pcocd2.intel.com) (CPU overheating on laptop)
- Markus Schulte (markus@dom.de) (Bad RAM)
- Mark Davis (mark_d_davis@usa.pipeline.com) (Bad P120?)
- Josep Lladonosa i Capell (jllado@arrakis.es) (PCI options
overoptimization)
- Emilio Federici (mc9995@mclink.it) (P120 overheating)
- Conor McCarthy (conormc@cclana.ucd.ie) (Bad SIMM)
- Matthias Petofalvi (mpetofal@ulb.ac.be) ("Simmverter" problem)
- Jonathan Christopher Mckinney (jono@tamu.edu) (gcc2.7.2p victim)
- Greg Nicholson (greg@job.cba.ua.edu) (many old disks)
- Ismo Peltonen (iap@bigbang.hut.fi) (irq_unmasking)
- Daniel Pancamo (pancamo@infocom.net) (70ns instead of 60 ns RAM)
- David Halls (david.halls@cl.cam.ac.uk)
- Mark Zusman (marklz@pointer.israel.net) (Bad motherboard)
- Elizabeth Ayer (eca23@cam.ac.uk) (Power management features)
- Thorsten Kuehnemann (thorsten@actis.de)
- (Email me with your story, you might get to be mentioned here... :-) ----
Update: I like to hear what happened to you. This will allow me to guess what
happens most, and keep this file as accurate as possible. However I now have
around 500 different Email addresses of people who've had sig-11 problems. I
don't think that it is useful to keep on adding "random" people's names on
this list. What do YOU think?
I'm interested in new stories. If you have a problem and are unsure about what
it is, it may help to mail me at R.E.Wolff@BitWizard.nl. My
curiosity will usually drive me to answering your questions until you find what
the problem is..... (on the other hand, I do get pissed when your problem is
clearly described above :-)
This page is hosted by http://www.bitwizard.nl/
This text is (C) 1994-2003 R.E.Wolff@BitWizard.nl. You are permitted to copy and
distribute this text, provided that you leave the attributions, this copyright
message and the reference to the original (and up-to-date) location intact. If
you're in the kingdom of the Netherlands, you're permitted to print or copy this
document, provided you pay the standard fee of EUR 0.045 per page through the
standard channels.