Tech Support > Computer Hardware > Microprocessors > ISA Support for Multithreading
ISA Support for Multithreading
Posted by Eric on February 20th, 2007


Hi,

I'm looking for the minimum set of instructions that a processor must
implement to support some kind of simultaneous multithreading. I found
the MIPS MT extension. What do you think of the instructions/
mechanisms that it offers? What would you change? Do you know other
ISAs implementing other usefull features related to SMP?

Many thanks for your help.

Eric

Posted by Nick Maclaren on February 20th, 2007



In article <1171973342.839566.44410@s48g2000cws.googlegroups. com>,
"Eric" <nospam.eric@gmail.com> writes:
|>
|> I'm looking for the minimum set of instructions that a processor must
|> implement to support some kind of simultaneous multithreading. I found
|> the MIPS MT extension. What do you think of the instructions/
|> mechanisms that it offers? What would you change? Do you know other
|> ISAs implementing other usefull features related to SMP?

Well, the minimum number is generally zero. Your other questions
relate to what may be a good idea to provide. If you want a better
answer, you need to be clearer about your objective.


Regards,
Nick Maclaren.

Posted by slebetman@yahoo.com on February 21st, 2007


On Feb 20, 8:09 pm, "Eric" <nospam.e...@gmail.com> wrote:
For timeslice and/or preemptive multithreading/multitasking ONLY,
without other modern OS features like virtual memory etc, you
basically need to be able to arbitrarily save and restore the call
stack. Without it the only kind of multitasking you can implement is
cooperative multitasking.

Oh and obviously you need to be able to execute code outside the
context of the currently running code/application -- interrupt timer,
supervisory mode, ring 0, doesn't matter how you just need a
mechanism. Otherwise again the only kind of multitasking you can
implement is cooperative multitasking.


Posted by Nick Maclaren on February 21st, 2007



In article <1172023629.922879.240590@v33g2000cwv.googlegroups .com>,
"slebetman@yahoo.com" <slebetman@gmail.com> writes:
|>
|> For timeslice and/or preemptive multithreading/multitasking ONLY,
|> without other modern OS features like virtual memory etc, you
|> basically need to be able to arbitrarily save and restore the call
|> stack. Without it the only kind of multitasking you can implement is
|> cooperative multitasking.

Otherwise commonly known as coroutines. Yes. But the term "call
stack" is unhelpful, as that is generally in memory; it is actually
the local environment (registers, state etc.) that needs saving.

|> Oh and obviously you need to be able to execute code outside the
|> context of the currently running code/application -- interrupt timer,
|> supervisory mode, ring 0, doesn't matter how you just need a
|> mechanism. Otherwise again the only kind of multitasking you can
|> implement is cooperative multitasking.

Oh, no, you don't. That is almost true if you have only a single CPU,
but definitely isn't if you have more than one. Why shouldn't one
CPU run the 'kernel' and the other run ordinary processes?


Regards,
Nick Maclaren.

Posted by Paul A. Clayton on February 21st, 2007


NOTE: The following comments are from one who is not a programmer
much less a processor architect. Where there is disagreement
with the work of experienced architects, it should be assumed
that the author has an inadequate understanding of the tradeoffs
involved or of the architecture in question or is evaluating the
architecture based on different criteria.

While I perceive the MIPS MT-ASE as clean and professionally
architected, I disagree with some of the decisions made. I am
disappointed that there was no adoption of reduced contexts
beyond simply not including or sharing coprocessor state. While
the MIPS ISA makes this slightly awkward (since the multiplier
has state), it would allow for more contexts for a given area.
This would be especially powerful with the identification of
Shadow Register Sets with Thread Contexts. While I could see
user applications usefully spawning threadlets, it seems
especially useful for an interrupt context to be smaller than 31
registers (i.e., for there to be nearly twice or four times as
many contexts). (MIPS MT-ASE defines a Thread Context as
containing 38 to 42 registers [31 GPRs, 2-3 multiplier registers,
5-7 MT-ASE-specific CP0 registers, 0-1 CP0 register; a TC can
also contain additional registers such as those for Floating
Point], a Virtual Processing Element has all previously defined
CP0 registers (minimum 14?), and a Processor has 2-3
MT-ASE-specific CP0. A Shadow Register Set contains 31
registers.)

I also disagree with the choice to have only a RISC-y 'fork'
instruction (MIPS MT-ASE 'fork' loads the value in one GPR to the
spawned thread's PC [actually TCRestart], the value of another
GPR to the target register in the spawned thread, and updates the
Status register [or generates a Thread Exception of the Thread
Overflow type {or, of course, a Reserved Instruction Exception if
the instruction is unimplemented}].) While this provides the
greatest flexibility within the constraints of RISC, it prohibits
an application from taking advantage of the thread locality
(other than through a shared cache) and forces more heavyweight
thread spawning to pass most values through the memory system.
Using a RISC-y instruction does allow implementations to be
simpler, but it effectively prohibits the exploitation of the
fact that localized bandwidth is easier to implement and that
register copies can be virtualized. It might be desirable, e.g.,
to support forking directly to a generic procedure.

Of course, with the earlier complaint, it also effectively
prohibits thread splitting (i.e., one context generating two
contexts each with a little more than half the content as the
originating context, for which most of the data movement can be
virtualized even in a fairly simple in-order implementation).

I would also be highly tempted to provide a PC-related (like MIPS
jump-and-link) fork to slightly reduce the software overhead of
some simple threadlet spawning. ("MIPS MT Principles of
Operation" assumes that only one type of fork could be
implemented and concludes that a PC-relative or PC-related fork
would force an excessive burden in the case of distant targets
["forcing them to perform a double control transfer"].)

It might also be desirable to make hardware forking more flexible
(MIPS MT-ASE only allows forking within a VPE. "MIPS MT
Principles of Operation":
the objective is to minimize the "payload" of a FORK
operation, in part to minimize the hardware implementation
cost, but also in anticipation of multicore "remote" FORKs,
where the information would need to be transmitted between
processing elements.
This implies that either the fork instruction will be extended to
allow forking to a different VPE [it seems improbable that
multicore VPEs would be defined] or a new fork instruction will
be provided to provide such functionality.)

There is some attractiveness to allowing the fork operation to be
of variable 'aggressiveness'. E.g., a fork might fail if no TC
is available within the specified resource group (e.g., within
the threads sharing L2 cache) and within the TCs that have a
certain degree of execution resources available (failing, e.g.,
when the only free TCs have high run costs [as would be the case
if the only free TC was allocated to a second level of context
storage requiring the moving of a TC in the first level to the
second level] or when an independent execution pipeline is not
associated with the TC). This would seem to require at least one
additional register argument AND a return value (to indicate
failure and perhaps the reason for the failure); both conditions
which are non-RISC-y. It seems desirable to provide a mechanism
in the hardware interface to communicate failure of a fork (so
that the original thread can, e.g., choose a more expensive
forking such as a system call or even just follow a different
code path on failure). (Alternately, there could be a
lightweight mechanism to test if the appropriate resources are
available. This would allow the remote possibility of performing
a more expensive [or less beneficial] than desired fork if
another thread grabbed the only remaining TC that meets the
resource criteria. However, I think checking for failure would
be sufficiently common to justify a return value [i.e., a fast
mechanism to check for failure].)

It also seems appropriate that a TC could be allocated to a 'open
pool' such that when activated the TC could be associated with a
specific VPE.

It might also be appropriate to provide configuration control
(per VPE) to allow hardware-forking to other VPEs (presumably
with per VPE configuration that the VPE does not accept outside
spawned threads); this would make hardware forking more flexible.

(I also think the limit of 8 VPEs is inappropriate, but that can
be easily fixed because there are 13 adjacent reserved bits in
the TCBind register.)

The MIPS MT-ASE also leaves much as implementation specific.
While this provides maximum implementation flexibility, it might
be desirable to define one or more processor families with more
of the implementation-specific features defined.

(It might be desirable for there to be a yield instruction
variant that defines the restart condition via a single condition
value rather than using a bit vector [so more than 31 {63 in
64-bit versions???} conditions could be defined without requiring
an extra source register {the case of waiting on a single
condition would seem likely to be fairly common}]. It is not
clear to me whether each VPE defines the set of 31 conditions.
[I receive the impression that the 31 conditions are defined by
the Virtual MultiProcessor {"a collection of interconnected
VPEs"}, but I could be mistaken.] If each VPE defines its
conditions, the 31-condition limit might not be particularly
constraining.)

(MIPS' branch delay slots also cause some complications for
gating storage accesses.)

MIPS MT-ASE gating storage (and particularly InterThread
Communications Storage) is interesting, but I do not think I have
an adequate understanding to critique such.


Paul A. Clayton
just a technophile

Posted by Bill Todd on February 21st, 2007


Eric wrote:
Possibly just one instruction that says, "Give me your current internal
state for thread N and replace it with this one." As long as the OS was
told the number of thread slots that the hardware supported and
processor-generated interrupts included the applicable thread ID (as an
output) I suspect (having thought about it for less than a minute) that
the OS could potentially handle everything else.

Of course, this might not support anything remotely approaching
*optimal* SMT, but that wasn't what you asked.

- bill

Posted by Anne & Lynn Wheeler on February 21st, 2007



Bill Todd <billtodd@metrocast.net> writes:
in the mid-70s i designed some dispatching microcode for multiprocessor
project (that was eventually canceled and never shipped a product).

it basically moved part of the virtual machine dispatcher and
interrupt handler into the microcode. the status for the virtual
machine was already being kept ... so it required two queues ... one
of available stuff to execute ...and one of stuff that had finished
execution (and needed something from the kernel for one reason or
another). how many different real processors and therefor concurrently
executing virtual machines was somewhat transparent to the kernel
code. this is similar in concept to what was later done by 432i.

misc. past posts mentioning the project
http://www.garlic.com/~lynn/subtopic.html#bounce

later there was the "SIE" instruction ... which wasn't particularly
queue or multiprocessor oriented ... just packaged all the the stuff
needed to start/stop process operation into a single instruction.

SIE was somewhat part of migration of more and more support for
virtual machine operation into hardware of the machine. Major part of
that was virtual machine subset that appeared as "LPARS" (logical
partitions) ... where a subset of virtual machine operation was
provided directly by the hardware ... w/o requiring virtual machine
operating system. for a little drift ... post with old email discussing
some philosiphy differences between SIE implementation on the 3081
and 3090
http://www.garlic.com/~lynn/2006j.html#27 virtual memory

Posted by Nick Maclaren on February 21st, 2007



In article <1dWdneIifcqRLEHYnZ2dnUVZ_tadnZ2d@metrocastcablevi sion.com>,
Bill Todd <billtodd@metrocast.net> writes:
|> Eric wrote:
|> >
|> > I'm looking for the minimum set of instructions that a processor must
|> > implement to support some kind of simultaneous multithreading.
|>
|> Possibly just one instruction that says, "Give me your current internal
|> state for thread N and replace it with this one." ...

If the architecture keeps all state in memory, you don't need even
that! The ICL 1900 mapped its registers to memory, so that can work
even for register designs.


Regards,
Nick Maclaren.

Posted by Paul Keinanen on February 21st, 2007


On 21 Feb 2007 09:47:37 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

At least in processors with only a few registers, the interrupt
usually saves the registers into the current stack. If the interrupt
causes a task switch, you just have to save the stack pointer into the
current task data area. Then load the stack pointer for the task to be
activated (which was suspended by a previous hardware or software
interrupt with registers on the stack) and issue a return from
interrupt instruction, which will load the saved registers from the
stack into the active registers.

In a small system with 3-5 tasks, scanning the task list in the ISR
for the highest priority runnable task does not increase the interrupt
latency very much.

Paul


Posted by CBFalconer on February 22nd, 2007


Nick Maclaren wrote:
A stack machine normally has about 4 registers not visible to the
programmer - PC, SP, localbase, globalbase. Everything else is
already in memory. With memory management globalbase can be taken
as 0 and replaced by a pointer to the appropriate page tables. So
scheduling is a matter of changing 4 hidden registers and
returning.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discover of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews



Posted by Bill Todd on February 22nd, 2007


Nick Maclaren wrote:
With no special SMT instructions, how do you tell the processor that you
want to replace a specific thread that it is currently executing (e.g.,
one that is about to wait for some lock) with another? Or, for that
matter, give it multiple threads to begin with?

- bill

Posted by robertwessel2@yahoo.com on February 22nd, 2007


On Feb 21, 3:47 pm, Paul Keinanen <keina...@sci.fi> wrote:

That's a particularly bad idea if you expect there to be any isolation
between user mode tasks and the OS. A user mode stack is always
unsafe to use in an ISR.

That obviously does not apply to small embedded system where you know
that each task's stack always has enough room to service an interrupt,
but any other system must switch to a safe stack when taking an
interrupt.


Posted by Grant Edwards on February 22nd, 2007


On 2007-02-22, Bill Todd <billtodd@metrocast.net> wrote:
You save the processor context in RAM (on the current thread's
stack is common), and load the processor context from the
thread you want to execute. Then you execute it.

You don't "give a processor multiple threads". You write a
scheduler that knows about threads and knows how to switch
between them.

--
Grant Edwards grante Yow! Yow! Are we wet yet?
at
visi.com

Posted by Bill Todd on February 22nd, 2007


Grant Edwards wrote:
Are you sure that you understand what SMT is? It certainly appears that
one of us doesn't.

- bill


Posted by Nick Maclaren on February 22nd, 2007



In article <gfOdnex-vp5Kb0HYnZ2dnUVZ_qXinZ2d@metrocastcablevision.com> ,
Bill Todd <billtodd@metrocast.net> writes:
|>
|> With no special SMT instructions, how do you tell the processor that you
|> want to replace a specific thread that it is currently executing (e.g.,
|> one that is about to wait for some lock) with another? Or, for that
|> matter, give it multiple threads to begin with?

Special memory locations? Ordinary control instructions with a different
set of operands? Control register contents? Anything else in common
use I have forgotten?

There's more than one way to kill a cat.


Regards,
Nick Maclaren.

Posted by Andrew Reilly on February 22nd, 2007


On Wed, 21 Feb 2007 21:33:10 -0500, Bill Todd wrote:

What makes you think that an SMT processor needs to be in any way
distinguishable (including instruction set) from a pair (or whatever
ordinality) of processors that share memory? You don't need special
instructions to schedule processes or threads, other than the usual sort
that have been available in almost every processor since the 68010 (and
many before that, of course).

Cheers,

--
Andrew


Posted by Gil Hamilton on February 22nd, 2007


Bill Todd <billtodd@metrocast.net> wrote in
news:W6OdnR557MixrkDYnZ2dnUVZ_sWdnZ2d@metrocastcab levision.com:
In particular, Bill appears to be referring to something like the TCs
(thread contexts) supported by the MIPS MT ASE (MultiThreading
Application Specific Extension). This is a step beyond mere multiple
hardware threads, each of which appears to be a separate processor (such
as Intel's HyperThreading or what MIPS calls VPEs [Virtual Processing
Elements]). If interested, download the programmer's reference from this
page (registration required):
http://www.mips.com/products/resourc...ials/MIPS_Arch
itecture.php

I can't claim to have a complete understanding, but it appears that you
can actually allocate and schedule TCs from user space. That is, you can
create additional threads of execution and switch amongst them without
making a system call (though OS support is required to initially enable
and configure the TC mechanism and, presumably, to manage them
appropriately when exceptions or interrupts occur). There is also
apparently some limited support for inter-thread communication:
essentially, block until some data value is written / condition satisfied
by another thread.

GH

Posted by Tim McCaffrey on February 22nd, 2007


In article <m3bqjnnrcr.fsf@garlic.com>, lynn@garlic.com says...
created custom microcode for the HP21MX mini computer that did something
similar. Every interrupt, the microcode was invoked, the current task
was put at the tail of the ready-to-run Queue (for its priority level),
then the device was serviced (if possible) (timers & serial devices were
handled directly for most interrupts), and/or a ISR was called. Afterwards,
the microcode picked the highest priority ready-to-run task and started it.

SRI wrote alot of microcode....


- Tim



Posted by Tim McCaffrey on February 22nd, 2007


In article <1171973342.839566.44410@s48g2000cws.googlegroups. com>,
nospam.eric@gmail.com says...
The perpherial processors on the CDC Cyber series had 10 18 bit accumulators
and 10 program counters. The processor switched between them every cycle. I
suppose you could call that multithreading support, of course your limited to
10 threads (both max and min).

- Tim


Posted by Walter Banks on February 22nd, 2007




Tim McCaffrey wrote:

We implemented threads support in the RS08 and S08 and are extending this support to
all the other micros Byte Craft supports ( ad off technology begins ).

The dispatcher groups the threads into three categories INTERRUPT,SOFTWARE and RESET

RESET are one shot threads that only ever execute once
..
INTERRUPT are threads that require an outside event as part of the execution conditioning.

SOFTWARE are threads that whose conditioning is entirely system status variables and flags.
When all the open software threads have been serviced the processor can sleep.

Priority is linear within each group. On small processors each thread is run to completion. (It may spawn other threads)

The biggest advantage we have seen is lower RAM requirements and execution cycles. Overall ROM usage is similar or marginally less than other implementations.

Our surprise was how effective it is in coding small state machine tasks often found in single purpose micros for example a quadrature decoder

Walter Banks
--
1 519 888 6911
Byte Craft Limited
A2-490 Dutton Drive,
Waterloo, Ontario N2L 6H7
Canada
http://www.bytecraft.com
email walter@bytecraft.com



Similar Posts