Tech Support > Computer Hardware > Microprocessors > Interrupt latency
Interrupt latency
Posted by Stargazer on March 16th, 2008


Hi,

I am writing my own real-time kernel for x86. Now I face something
really strange (or may be rather it's not; it has been some time since
I was in the details of x86 microarchitecture).

I measured CPU clocks elapsed between the first assembly instruction
executed at interrupt's entry point in IDT and beginning of the C code
of user-defined interrupt handler and the result was a big
surprise :-) It took about 2500 cycles despite that I have only a
handful of assembly instructions before a call to user-supplied IRQ
handler.

A little more testing showed that almost all cycles (2300+) were spent
at access to a global variable (via ds:[] addressing). Nothing that
accesses stack memory (push, pop, call, mov) makes a noticeable
difference. Does anybody have an idea why this happens? I test on
Celeron 2.8G in protected mode set up for flat model with paging
disabled.

I can post the code of the interrupt's entry point (until the C entry
point is called), but it's rather trivial and not long.

Thanks,
D

Posted by Tim Wescott on March 17th, 2008


On Sun, 16 Mar 2008 16:02:18 -0700, Stargazer wrote:

Cache latency?

--
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html


Posted by Alexei A. Frounze on March 17th, 2008


On Mar 16, 4:02 pm, Stargazer <spamt...@crayne.org> wrote:
What are the min, max and average cycle counts (you need to repeat the
measurement many times)?
What are the numbers on other PCs?

I wonder if it's SMIs. On my Dell Latitude D610 notebook an SMI (or a
short burst of thereof) may take up to ~240K cycles, which is ~150
microseconds at 1.6 GHz; on the old Compaq Armada 7800 notebook it's
only 12K cycles or ~40 microseconds at 300 MHz. Hardware bugfixes and
control are moving into the CPU.

Alex


Posted by Didi on March 17th, 2008


Stargazer wrote:
I cannot help you but just want to thank you for the educating (for
me) post.
Demonstrates I have been right to stay away from x86 (I have
considered
it 2 or 3 times every 5 years or so, and have not done so last 8 years
IIRC).
The latency is probably compiler generated, of course, but this does
not make things any better :-).

Dimiter

------------------------------------------------------
Dimiter Popoff Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tg...7600228621276/


Posted by larwe on March 17th, 2008


On Mar 16, 7:02*pm, Stargazer <spamt...@crayne.org> wrote:

God knows. Cache line fill, maybe. Bus contention with a shared-memory
graphics adapter. Any one of a million things. x86 in V86 mode is a
nondeterministic architecture; design accordingly.


Posted by Cyril Novikov on March 17th, 2008


Stargazer wrote:
That's unlikely. 2300 cycles is about 1us, no memory is that slow. That
looks more like ISA bus access time. Do you have any I/O instructions or
MMIO accesses there, by any chance? How do you measure cycles, did you
remember to use synchronizing instructions with RDTSC?

That would be helpful.


Posted by Vladimir Vassilevsky on March 17th, 2008




Stargazer wrote:
Something is not right.

The biggest part of latency is created by the SDRAM Tpre + Tras + Trcd +
Tcl + Tburst. The ds:[] access also causes the shadow descriptor miss.
So in the worst case it translates to the two SDRAM bursts to flush the
dirty cache lines, and another two bursts to reload. I would expect the
delay at the order of hundreds of the CPU cycles.


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com


Posted by Wolfgang Kern on March 17th, 2008



Stargazer wrote:

....
A naked IRQ (just EOI and IRET) takes about 200 cycles, if measured in
an unproteced environment (ie: my OS where all code run with PL0).

One big cycle loss is in the PL-transition, especially if hardware
task-switches are in use.

If there is no WBINVD instruction (~2000 cycles) in your code
I can just guess what may happen here

user code runs PL=3...
IRQ : ->PL=0
user hooks : PL0->PL3
global access: PL3->PL0->PL3
end of hook : PL3->PL0
IRET : PL0->PL3

__
wolfgang



Posted by Wolfgang Kern on March 17th, 2008



Stargazer wrote:

....
A naked IRQ (just EOI and IRET) takes about 200 cycles, if measured in
an unproteced environment (ie: my OS where all code run with PL0).

One big cycle loss is in the PL-transition, especially if hardware
task-switches are in use.

If there is no WBINVD instruction (~2000 cycles) in your code
I can just guess what may happen here

user code runs PL=3...
IRQ : ->PL=0
user hooks : PL0->PL3
global access: PL3->PL0->PL3
end of hook : PL3->PL0
IRET : PL0->PL3

__
wolfgang



Posted by Stargazer on March 17th, 2008


On Mar 17, 2:38*am, "Alexei A. Frounze" <spamt...@crayne.org> wrote:
A weird thing is that the difference between min and max is about 10
cycles. That is, results are fairly accurate and consistent.
I didn't test on other PCs yet.

I don't know. It's a single instruction that accounts for over 2000
cycles, I can point the instruction but don't understand the reason.
It's a read-modify-write (INC ds:[xxx]) and it has to do something
with the nature of instruction being RMW. Actually I have a BT ds:
[xxx] (read) several instructions before it, which doesn't cause
anything abnormal.

It sounds weird if SMI would somehow be triggered on each and any
hardware interrupt.


D


Posted by Stargazer on March 17th, 2008


That's the first thing that jumps into mind, but... over 2000 cycles?
OTOH, may it be that L2 cache on Celeron (or on-board L3?) is
organized so badly that it read-modify-write to memory *always* causes
a cache miss? Interesting, I ought to test what happens if there is an
off-chip cache and if I can disable it.


D


Posted by (Patrick Klos) on March 17th, 2008


In article <7366de24-bc1c-47f1-9fc1-6de7d66d6e5a@s37g2000prg.googlegroups.com>,
Stargazer <spamtrap@crayne.org> wrote:
Is this a SINGLE memory access or several/many accesses?

Please feel free to post the code so there doesn't have to be any
unnecessary speculation.

Patrick


Posted by Stargazer on March 17th, 2008


On Mar 17, 6:00*am, Cyril Novikov <spamt...@crayne.org> wrote:
I use RTDSC twice and display difference between them. I didn't
serialize RDTSC, but what inaccuracy that would cause? I would accept
that it explains even the +/- 10 cycles in measurements, but not 2000.

The code is below. Only the relevant snippet is posted.
I hid most of irrelevant code (e.g. there is some logic following the
code that loops through possible maximum of 16 chained interrupts, but
I currently don't chain anything). "TEST_INT_RESPONSE" is defined
above the code and "_int_received" is a public label that marks two
doublewords. Comments "; Ts from here..." are put where I inserted the
three-instruction sequence (rdtsc and 2 moves) and the TS difference
that I got until rdtsc in inline assembly in the C code. It appears
that the "inc dword [_running_irq]" accounts for all the mess.

BTW, now I notice that there is not only read through ds:[] (bt), but
also two writes to ds:[] following the rtdsc that I use to store time
stamp for further comparison and neither causes any anomaly. So it
just strengthens my suspect that RMW instruction has some strange
effect on more distant caches. Does anybody have an idea?

Thanks,
D


------------ start of code -----------------
handle_int:
push eax
push ebx

%IFDEF TEST_INT_RESPONSE
;
; Test latency for interrupt.
; This code measures latency without itself. With the measurement code
the latency is slightly higher
;
push edx
rdtsc
mov [_int_received], eax
mov [_int_received+4], edx

; Dummy instructions that take about the same time as instructions
above
push 10h
jmp dummy1
dummy1:
push eax
push ebx

; Enough, go ahead
add esp, 12
pop edx
;
; End of test latency for interrupt
;
%ENDIF

mov ebx, [esp+8]

;
; Acknowledge interupt with PIC(s).
; After calling callbacks, to enable nesting of interrupt handlers
according to priorities
; TODO: check rotation priorities of 8259 and other options
;
mov al, CMD_EOI
out PIC_MASTER, al
cmp ebx, 8
jb fin
out PIC_SLAVE, al
fin:

;
; Ts from here to int handler is 2340-2350 cycles
;

bt [_int_callbacks], ebx
jnc return_int

;
; Ts from here to int handler is 2340-2350 cycles
;

; On stack saved already: EAX, EBX
; Don't trust C code, save all regs (except for EBP, ESP)
push ecx
push edx
push esi
push edi

;
; Ts from here to int handler is ~2340 (cycles)
;

%IF 0
push ebx ; int_no
call _call_int_callbacks
add esp, 4
%ELSE
;
; Implement _call_int_callbacks in assembly
;

;;;;; Just to test where cycles go
; push eax
; push edx
; rdtsc
; mov [_int_received], eax
; mov [_int_received+4], edx
; pop edx
; pop eax
;;;;;

inc dword [_running_irq]

;
; Ts from here to int handler is 105-115 (cycles)
;
mov eax, 10 ; MAX_IRQ_HANDLERS
mov ecx, eax

;
; Ts from here to int handler is 105-115 (cycles)
;

mul ebx
lea eax, [eax*4]

;
; Ts from here to int handler is 90-105 (cycles)
;

next_int_handler:
push eax

;
; Ts from here to int handler is 90-105 (cycles)
;

call dword [_isr_proc+eax]

[...]
%ENDIF
------------ end of code --------------


Posted by tbroberg_nospam@hifn.com on March 17th, 2008


On Mar 16, 8:00*pm, Cyril Novikov <spamt...@crayne.org> wrote:
Actually, PCI reads tend to take about 1us, and they are generally
implemented as memory reads.

Are you reading an interrupt status register to check for the
interrupt condition, and that's the memory read that eats a us?

What is this memory location?


Posted by Cyril Novikov on March 17th, 2008


Stargazer wrote:

These 2 OUT's are what takes 2000+ cycles. The reason you see it elsewhere is
most likely a bug in your measurement code.


Posted by Marven Lee on March 17th, 2008


Stargazer wrote:
Is the Celeron 2.8G related to the Pentium 4? I'm not clued up on
processors of the last few years.

Interrupts, int and iret took about 1900 to 2000 cycles on the P4
processors for some reason. I know it was significantly higher than the
earlier processors that took a few hundred cycles. I'm not sure
how many cycles each instruction takes on its own though.

I think someone, maybe Piotr ran some benchmarks of int/iret and
sysenter/sysexit on the P4 a few years ago and posted them in a.o.d.


--
Marv




Posted by Stargazer on March 18th, 2008


On Mar 17, 12:12*pm, "Wolfgang Kern" <spamt...@crayne.org> wrote:
I use "classical" monolithic embedded OS model: both OS and
application run on PL0, no PL switches, no address spaces (CR0.PG=0)
and TSS are not used.

200-300 cycles is what I get without that "inc" instruction.


D.


Posted by Stargazer on March 18th, 2008


On Mar 17, 3:08*pm, (Patrick Klos) <spamt...@crayne.org> wrote:
It's a single read-modify-write instruction (see "inc [xxx]" in the
posted code). I don't know how would you define it, just hope that
there isn't some fancy incrementer in memory controller :-) Normal
single memory accesses which I also do (see "bt" and 2 "mov"s after
"rdtsc") don't cause anything unexpected.


D


Posted by Stargazer on March 18th, 2008


On Mar 17, 8:44*pm, Cyril Novikov <spamt...@crayne.org> wrote:
Why would it take 2000+ cycles? interrupt controller is integrated in
a chipset which provides pretty fast access. Anyway, your assertion is
not true: if I just comment out the "inc dword [_running_irq]", I get
total latency of about 300 cycles.


D


Posted by Stargazer on March 18th, 2008


On Mar 17, 10:06*pm, "Marven Lee" <spamt...@crayne.org> wrote:
Yes, it's Celeron from Pentium-4 family. However, I don't include
processor's interrupt entry and IRET into timing. For now I want my
interrupt entry code to introduce minimal additional latency, not deal
with matters that I can't change anyway :-)

As I said, I know where the problem lies, but I don't understand the
reason so I can't yet eliminate it.


D



Similar Posts