- Unexplained Hang During Boot
- Posted by eon_blue_80@verizon.net on June 28th, 2006
I am experiencing a very bizarre problem with vxWorks and I am hoping
that someone might be able to offer some suggestions on where to start
looking to determine the root of the problem.
VxWorks is being used on a Synergy Microsystems VME SBC which is PPC
based. The problem seems to arise at random times after rebuilding the
OS image. For instance, commenting out a single 'printf' statement
such as printf("Message Received\n"); in an application-level piece of
code that is not even invoked, and then rebuilding the image, can make
the image hang while booting (early in the boot procedure). Uncomment this
'printf' statement, rebuild the image, and the OS will boot without
error. Note that this routine is not called at any time during the boot
procedure so the code containing that printf is never even executed.
This problem has been experienced by multiple developers on different
modules. I am not sure if this is a hardware, or a software type of
problem. Can anyone think of any reason why something as non-intrusive
as commenting out a printf statement, in a function that is never even
invoked, would cause the OS to hang during boot?
The printf statement is only adding a handful of bytes to the resultant
image and larger images than the ones that fail have been booted
successfully.
Similar hangs have been produced by changing array sizes in uncalled
routines, etc. (i.e., add a few more bytes to an array in an uncalled
function and the image hangs during boot; add a few more bytes and the
image loads fine).
- Posted by Bill Pringlemeir on June 28th, 2006
On 28 Jun 2006, eon_blue_80@verizon.net wrote:
[snip]
This sounds like a cache problem. The "printf" is unrelated to the
code. It just changes the image size at the "right" place. You could
add a ".bytes 7" or something in the code section and the same thing
would result.
At some point in the boot sequence, there may be an alias between data
and code cache. It could be when the MMU is turned on. The address
space will change and code must often jump in a very specific
sequence. It may be a conflict with a device. For instance, an "eieio"
instruction may be necessary in some cases, but due to code section
alignment, the code executes with different timing and the "eieio"
becomes necessary/unnecessary depending on the build.
It is very good that you try to hunt this down. I've known several
"senior" people who have let this type of problem go on for ever.
You can toggle an LED or a general purpose I/O watched with a scope, or
you can use some polled console output to provide checkpoints in the
boot sequence to see where the hang occurs.
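A minimal sketch of such a checkpoint mechanism in C. The register
pointer, the function names and the checkpoint codes are hypothetical;
on a real board the register address would come from the hardware
manual, and on a PPC an eieio after the store may be needed so that
the write reaches the pin in program order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace port: an 8-bit memory-mapped GPIO/LED register.
 * It stays NULL until boot_trace_init() is called. */
static volatile uint8_t *boot_trace_reg = NULL;

/* Tell the trace code where the (assumed) GPIO/LED register lives. */
void boot_trace_init(volatile uint8_t *reg)
{
    boot_trace_reg = reg;
}

/* Drive a checkpoint code onto the trace port. Does nothing until the
 * port has been set up, so it is safe to call from very early code. */
void boot_checkpoint(uint8_t code)
{
    if (boot_trace_reg != NULL) {
        *boot_trace_reg = code;
    }
}
```

Calling boot_checkpoint(1), boot_checkpoint(2), ... at successive
stages of the boot code leaves the last value reached on the pins when
the hang occurs. Adding these calls moves code around as well, so the
hang itself may shift or vanish.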
The important point is that the "printf" has nothing to do with the
problem besides making the code move around. You can verify this by
inserting different dummy routines with different lengths (a cache
line is typically 32/64 bytes). Observing a map file of the full
image and knowing the location of these bytes can be helpful. For
instance if code following this is an ethernet driver, then that may
be helpful to know.
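To compare a good and a bad build, it can help to see where a symbol
from the map file lands relative to cache line boundaries. A small
host-side helper in C, assuming a 32-byte line (an assumption; check
the CPU manual) and with hypothetical function names:

```c
#include <stdint.h>

/* Assumption: 32-byte cache lines. Must be a power of two. */
#define CACHE_LINE_SIZE 32u

/* Address of the first byte of the cache line containing addr. */
uint32_t cache_line_base(uint32_t addr)
{
    return addr & ~(CACHE_LINE_SIZE - 1u);
}

/* Byte offset of addr within its cache line. */
uint32_t cache_line_offset(uint32_t addr)
{
    return addr & (CACHE_LINE_SIZE - 1u);
}

/* Non-zero if both addresses fall in the same cache line. */
int same_cache_line(uint32_t a, uint32_t b)
{
    return cache_line_base(a) == cache_line_base(b);
}
```

Feeding it the address of the same symbol from a booting and a hanging
map file shows whether the symbol moved to a different offset within
its line or started to straddle a line boundary.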
It could also be a read of garbage strings, code, or constant data. I
have also seen one section of code round MMU rights and another read
to the byte. Sometimes this rounding is wrong and a "bus error"
happens due to memory not being sized right.
hth,
Bill Pringlemeir.
--
You have the right to remain silent -- so shut up!
vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"
- Posted by MetalHead on June 29th, 2006
eon_blue_80@verizon.net wrote:
Another possibility is that errant code is corrupting memory during the
boot process. The commonest case is the "wild pointer" where an
uninitialized pointer is used to write data. Other possibilities would be
over-running the reserved stack area or using pointers to buffers that
have been returned to the buffer pool and re-used. I have also seen
incorrect function prototypes cause this type of problem. If you are
using vector tables in RAM, walking on them will cause this type of
problem too.
The way I would attempt to solve this problem is with a logic analyzer.
Start out by finding where the code hangs. Then see if the instruction
sequence to get there took any unexplainable jumps. See if the values
at the departure point of the unexplainable sequence match the expected
values for that address. If they don't match the expected values, use
writes to those locations to trigger the logic analyzer and you should
be able to locate the errant code. The departure from expected execution
could also be un-initialized or corrupted vectors in the vector table.
I am not familiar with the particular VME card you mentioned, but memory
management hardware could protect you from a number of the things I
described. Because it is a boot sequence problem, memory management
hardware may not be operational at this point.
Another place to look would be the linker command file. Are all of the
segments large enough and in non-overlapping regions of memory? The
logic analyzer approach would lead you to this type of problem, but it
could be a painful path that could be avoided by careful study.
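The overlap check can also be done mechanically from the start
addresses and sizes listed in the map file. A sketch in C; the struct,
the function names and any region values fed to it are illustrative
only and not taken from any particular BSP:

```c
#include <stddef.h>
#include <stdint.h>

/* One linker region as read off the map file (hypothetical layout). */
struct mem_region {
    const char *name;
    uint32_t    start;
    uint32_t    size;
};

/* Non-zero if the two regions share at least one byte. The end
 * addresses are computed in 64 bits so that start + size cannot wrap. */
int regions_overlap(const struct mem_region *a, const struct mem_region *b)
{
    uint64_t a_end = (uint64_t)a->start + a->size;
    uint64_t b_end = (uint64_t)b->start + b->size;

    return a->size != 0u && b->size != 0u &&
           (uint64_t)a->start < b_end && (uint64_t)b->start < a_end;
}

/* Number of overlapping pairs among n regions; 0 means a clean layout. */
size_t count_overlaps(const struct mem_region *r, size_t n)
{
    size_t i, j, hits = 0;

    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            if (regions_overlap(&r[i], &r[j])) {
                hits++;
            }
        }
    }
    return hits;
}
```

Regions that merely touch (one ends exactly where the next begins) are
not counted as overlapping.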
Good Luck,
Bob
- Posted by ssubbarayan on June 29th, 2006
Bill Pringlemeir wrote:
Bill/Others,
Excellent and enlightening explanations from you all. We are facing a
similar kind of issue with a proprietary STMicroelectronics board on
which we are using a proprietary OS. Though the OS is different, I
believe the problem is similar to the one being discussed here.
We faced a situation where just adding a printf inside one function,
or just introducing a statement such as
i=1 (though we did not use the variable 'i' anywhere else), made the
feature work, and removing this statement made us lose the
feature.
We were trying hard to figure out the problem until one day, when we
inspected the cache and disabled the data cache, the feature worked
just fine.
Now the question I would like answered is: what is the best way to
figure out whether the problem is with cache memory? One more behaviour
I have observed is that when we debug with breakpoints the feature
works fine, and when we use the binary production version of the same
code it never works!
This made debugging even more difficult. Could the cache be what
causes this difference between the debug and production
versions?
I would like to avoid such problems in future, so it would be helpful if
some of you enlightened ones could explain this to me.
I am also posting the query to comp.arch.embedded, as this will help me
get input from a lot of experienced people. Pardon me in case I am wrong.
Looking forward to all your replies, and thanks in advance,
Regards,
s.subbarayan
- Posted by Bo on June 29th, 2006
"ssubbarayan" <ssubba@gmail.com> wrote in message
news:1151564578.472571.35220@b68g2000cwa.googlegroups.com...
What is the solution? Sprinkling cache flushes throughout the code? Or what?
Bo
- Posted by Bill Pringlemeir on June 29th, 2006
On 29 Jun 2006, bo@cephus.com wrote:
There are three possible issues. One is a direct effect of caching,
another is alignment, and the other is timing.
If you have DMA, it will always retrieve from memory (i.e., SDRAM,
flash). If your CPU is using a cache, it might be retrieving data
from the memory or from the cache. For example, on one project we had
a video capture device with a built-in convolution matrix that DMAed
the results to the main processor. The code did not pay attention to
the cache. After much debugging, the software developer for the
imaging code decided that the HW was buggy. I examined this and noted
that the buffer being used was fully cached. It started to work when
we got memory that the MMU had marked as being non-cacheable.
Another example is on the PPC, where there is a "write buffer". The
result can be that the PPC does not commit data to memory in the order
that the instructions are encountered, especially with a write-back cache.
So, for instance, an AMD style NOR flash takes the command AA, 55,
CMD. Without using the eieio instruction on the PowerPC, your flash
driver will not work as the commands can get written to the bus out of
order. An MTD driver might loop forever trying to detect the flash
type or an end of operation, etc. This might cause a hang during
booting. Many HW devices use multiple writes to the same location.
Those are some examples of direct changes the cache might have on the
order of memory accesses.
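A sketch of the flash case in C, assuming GCC-style inline assembly, a
16-bit AMD-style device whose unlock addresses are word offsets 0x555
and 0x2AA (check the data sheet of the actual part), and a flash window
that is already mapped cache-inhibited and guarded. All names are
hypothetical. On a non-PPC host the barrier macro degrades to a
compiler-only barrier so the sequence can be exercised against a RAM
array:

```c
#include <stdint.h>

/* Ordering barrier between device writes. eieio on PowerPC; elsewhere
 * only a compiler barrier, which is enough for a host-side dry run. */
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
#  define IO_BARRIER() __asm__ volatile ("eieio" ::: "memory")
#elif defined(__GNUC__)
#  define IO_BARRIER() __asm__ volatile ("" ::: "memory")
#else
#  define IO_BARRIER() ((void)0)
#endif

/* Assumed unlock offsets (in 16-bit words) for an AMD-style part. */
#define FLASH_UNLOCK_ADDR1 0x555u
#define FLASH_UNLOCK_ADDR2 0x2AAu

/* Issue the three-cycle AA, 55, CMD sequence. The volatile pointer
 * keeps the compiler from merging or dropping the writes; the barrier
 * keeps the CPU from reordering them on their way to the bus. */
void flash_send_command(volatile uint16_t *base, uint16_t cmd)
{
    base[FLASH_UNLOCK_ADDR1] = 0x00AAu;
    IO_BARRIER();
    base[FLASH_UNLOCK_ADDR2] = 0x0055u;
    IO_BARRIER();
    base[FLASH_UNLOCK_ADDR1] = cmd;
    IO_BARRIER();
}
```

The first and third writes hit the same address, so if the window were
cacheable the AA could be overwritten in the cache and never reach the
device at all; the barrier alone does not help in that case.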
I had previously explained an alignment issue. It sounds like this is
more like the OP's problem, as the code in question doesn't even
execute. However, it can also be timing, as this will shift code
and might change how the cache lines are fetched. If the compiler is
aligning all code to a cache line, then this is not the problem.
The other instance is just timing. If code is relying on things being
slow, and then a cache is enabled and speeds them up, an implicit delay
may no longer be sufficient. Some slower/older HW devices must have fixed
delays between accesses. It may also be that the code must be fast
enough, like kicking a watchdog.
In all cases, the best thing you can do is insert some sort of trace.
Like toggling a general purpose I/O connected to a scope. You can
alter the timing to provide information or use multiple lines to
encode some information. Multiple lines are better as they will
reduce the amount of code needed. This mechanism suffers in that
inserting the debug code can make the symptom appear/disappear.
An ICE, BDM, or JTAG debugger would also be useful. Let the system
crash and then look at the stack and PC. Use HW breakpoints to work
backwards from there.
The problem with using a traditional debugger with breakpoints is that
this alters the code flow (just like a printf). Hitting breakpoints
will definitely affect what is in the cache.
Once you find the problem, you have to look at the structure to know
what to do. For instance, it is often best to change the way a
hardware device is accessed. Like non-cacheable, write-through cache,
etc. Sometimes it is not just the cache, but eieio instructions might
be needed (or other PPC instructions like isync, sync, etc). Adding
cache flushes may work. It would be much better to understand why it
is crashing and then correct the problem. Just adding cache flushes
might be equivalent to the printfs. I.e., it just shifts the code
around.
fwiw,
Bill Pringlemeir.
- Posted by Bill Pringlemeir on June 29th, 2006
This is *unlikely* as the OP noted that adding un-executed code would
cause the problem. If the code is directly corrupting memory, this
would be unlikely to introduce the problem, especially if the added
code makes no allocations and no writes to memory. If simply
changing the cache on/off will cause the crash, I find it extremely
unlikely that it is memory corruption.
So there is a quick way to rule this out. Disable/enable the cache
with a crashing image. Often you can arrange the code so that the
size is the same, just a constant has changed to disable/enable the
cache.
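One way to keep the two images the same size is to make the cache
decision depend on a constant that the compiler cannot fold away, so
that both code paths are present in every build. A sketch in C; the
flag names and functions are hypothetical, and the real enable/disable
calls would be whatever the BSP provides (for example cacheEnable and
cacheDisable from cacheLib under vxWorks):

```c
/* Bit flags for the boot-time cache policy (hypothetical names). */
#define BOOT_ICACHE_ON 0x1u
#define BOOT_DCACHE_ON 0x2u

/* Only this initializer changes between the two test builds. The
 * volatile qualifier forces a real load at run time, so the compiler
 * cannot drop either branch of the code that tests it. */
const volatile unsigned boot_cache_flags = BOOT_ICACHE_ON | BOOT_DCACHE_ON;

/* Non-zero if the data cache should be enabled for this image. */
int boot_dcache_wanted(void)
{
    return (boot_cache_flags & BOOT_DCACHE_ON) != 0u;
}

/* Non-zero if the instruction cache should be enabled for this image. */
int boot_icache_wanted(void)
{
    return (boot_cache_flags & BOOT_ICACHE_ON) != 0u;
}
```

The boot code would then branch on boot_dcache_wanted() and call the
enable routine in one arm and the disable routine in the other. The
map files of the two builds should still be compared to confirm that
nothing moved.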
fwiw,
Bill Pringlemeir.
- Posted by mrfirmware on June 29th, 2006
Bill Pringlemeir wrote:
Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series,
use a weakly ordered view of memory so you have the potential for out
of order reads but never writes. The write buffer does not affect write
order, it just allows read-around-write.
Our AMCC PPC405 (strongly ordered view of memory) Flash driver started
failing when we used it on the 440. After reading up on weakly ordered
memory systems and ensuring that the Flash region was marked cached,
guarded, we proceeded to put msync brackets around all reads from I/O
devices with read side effects (real memory, or I/O devices without
read side effects, don't need them), e.g.
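    # uint16 read_flash_hword(uint16 *pFlashAddr);
    read_flash_hword:
        msync               # wait for all earlier storage accesses to complete
        lhz   r3,0(r3)      # load the halfword at pFlashAddr into r3 (return value)
        msync               # let the load complete before anything later issues
        blr                 # return to the caller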
The msync brackets ensured that the read could not be reordered with any
earlier or later read or write. However, you are guaranteed to have multiple
writes go in order to the device safely as long as your reads are
protected as above. Furthermore, the PowerPC architecture is smart
enough to execute RMW ops correctly on a given I/O address, e.g.
        lwz   r3,0(r4)
        ori   r3,r3,0x0040
        stw   r3,0(r4)
will result in the expected value written to the address pointed to by
r4, that is, the CPU will not perform the store before the load due to
register dependencies.
--
- Mark
- Posted by Bill Pringlemeir on June 29th, 2006
On 29 Jun 2006, mrfirmware@gmail.com wrote:
I am absolutely sure of nothing. I might have the wrong terminology
for the cache type. If multiple writes to the same location fit in
the cache, only the last value will be written to the memory device.
This makes perfect sense for SDRAM and is a very good operation.
Consider a frame pointer with some loop variables stored in one of
these lines. Constantly committing the data from cache to SDRAM would
seem to be a waste of time.
With AMD type flash, there are several writes to the same address. I
didn't have access to a logic analyzer to see what cycles the CPU was
performing on the flash. However, a straight 'C' implementation was
not sufficient. You need to add some assembler instructions.
I guess it is wrong to say "out of order". I should have said not at
all. AA and CMD are usually written to the same address. I did try
msync commands and this was not effective.
fwiw,
Bill Pringlemeir.
- Posted by LarryC on June 29th, 2006
After you reset following the hang, stop at the boot prompt and do an "e".
That might dump exception data if the previous hang was caused by an
exception.
lc
Bill Pringlemeir wrote:
- Posted by MetalHead on June 30th, 2006
Bill Pringlemeir wrote:
I have seen this happen in the past in this manner. By adding code into
the code segment, you move the relative position of stuff around. Even
if the code you added does not get executed, if the I/O drivers are at
the opposite end of the link map from the boot code, just increasing or
decreasing the relative separation of components can cause the
corruption to occur in a place that does not get executed during the
boot process or causes a different kind of problem. C libraries are
another good candidate for winding up at the far end of the link map. If
you are lucky, this will show up as an illegal instruction trap, and if
you are unlucky, it shows up as branches to nowhere or tight loops.
This would be a good first step. The OP sounded like he was fishing for
ideas, so I threw out a couple that I have run into in the past.
Bob
- Posted by Didi on June 30th, 2006
I would also bet on cache handling problems in the code. A forgotten
flush of the i-cache is something I have had to chase in my early
versions.
There is one more possibility I know of. If the processor is a 405,
check its errata sheet. I recently discovered (while considering the
device; I opted not to use it) a late-published erratum saying,
basically, that you may not use its cache in copyback mode because it
does not work. Use write-through....
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
MetalHead wrote:
- Posted by mrfirmware on June 30th, 2006
Didi wrote:
We haven't used write-through, ever, on the 405GPr and it has had nary
a problem with copy-back, at least for the past 4 years of the product
life (thousands of blade servers). Do you have an erratum number or doc
I could look at WRT this cache bug? If you are referring to CPU_213,
you need only set CCR0 as specified. Setting write-through mode is
simply too big a hammer (for us).
--
- Mark
- Posted by Jim Stewart on June 30th, 2006
eon_blue_80@verizon.net wrote:
Reading your post, it's not clear how many different
physical units you've tried this on. If the answer
is one, the problem could be a bad byte with a bad
bit of flash memory.
- Posted by Didi on June 30th, 2006
This is what I was referring to; apparently you have it under control.
It was enough to stop me from using the 405 (I opted for the 5200).
Dimiter
mrfirmware wrote:
- Posted by eon_blue_80@verizon.net on June 30th, 2006
Thank you everyone for all of your suggestions. These suggestions will
be a great help when troubleshooting future problems.
As far as the original problem goes, using I/O probing we were able to
successfully narrow the error down to a relatively large segment of the
BSP. Apparently there is a problem in the SCSI section of the BSP
(wild pointer or out of order type operation??) that causes the image
to hang when the bytes of the image are aligned in just the right way.
We have made a decision to disable SCSI support within the OS (which
has corrected the problem). Hopefully, if time ever becomes available,
we can look into the SCSI section of the BSP and find the exact bug.