PCI bus-master and large contiguous memory buffers
Posted by v_mirgorodsky@yahoo.com on February 9th, 2006


Hello, ALL!

I am developing a PCI bus-master device that generates a lot of data
during its lifetime. On average the device generates about 85-100MB of
data per second. Since the amount of data is huge, I need a relatively
large 32MB contiguous memory buffer, which is a big problem in a modern
OS. After getting the data from the device, the driver needs to perform
some small processing, like adding headers, cross-references, etc. The
driver does not allow more than a single application to use it at a time,
at least on the data interface.

I cannot follow the standard DMA-like procedure when talking to the device,
since it requires extremely low latency in data handling. I cannot
use several small contiguous memory buffers, since the data comes out
in big bursts and the device may already have overwritten the data in such
a buffer before the OS has managed to invoke my DPC for ISR.

I would like to adopt the following solution for this task. First of
all, the user application allocates a buffer of any size it desires and
notifies the driver of its starting address. The driver rounds the
buffer start address to a physical page boundary (the page size may
even be a run-time configuration parameter), maps the entire buffer into
kernel memory space and protects it from being paged out by the memory
paging system. Then the driver allocates a small amount of contiguous
non-paged memory, 64kB or so, and fills it with the 4-byte long
addresses of the physical pages of memory that make up the user buffer.
That is all. As soon as the device needs the next portion of physical
memory to store data, it just reads its address from this catalog. At
the end of the buffer it wraps to the beginning. As soon as a new portion
of data is ready, the device generates an interrupt to notify the driver.
The buffer may be allocated once and used for as long as it stays locked
in kernel memory space. At the end of the session the driver deconfigures
the device, unmaps the user buffer, etc.
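
The sizing of such a page catalog can be checked with a little arithmetic. The following is only a minimal sketch in portable C, not driver code: it assumes 4 KB pages, and the names `pages_spanned`, `catalog_bytes` and `next_entry` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u  /* assumption: 4 KB pages, as on x86 */

/* Number of physical pages touched by a buffer that starts at an
 * arbitrary (possibly unaligned) address. */
size_t pages_spanned(uintptr_t start, size_t length)
{
    assert(length > 0);
    uintptr_t first = start / PAGE_SIZE_BYTES;
    uintptr_t last = (start + length - 1) / PAGE_SIZE_BYTES;
    return (size_t)(last - first + 1);
}

/* Bytes needed for a catalog holding one fixed-size entry per page. */
size_t catalog_bytes(size_t pages, size_t entry_size)
{
    return pages * entry_size;
}

/* Index of the catalog entry the device would read next,
 * wrapping back to the start at the end of the buffer. */
size_t next_entry(size_t current, size_t pages)
{
    assert(pages > 0);
    return (current + 1) % pages;
}
```

Under these assumptions a page-aligned 32MB buffer spans 8192 pages, so the catalog takes 32 KB with 4-byte entries and 64 KB with 8-byte entries. An unaligned buffer of the same length touches one extra page.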

Are there any serious pitfalls with this approach that I am not aware of?
Is there a documented way to get the list of pages a user buffer consists
of? Will the OS try to exchange one page for another while the buffer is
locked in memory?

With best regards,
Vladimir S. Mirgorodsky

Posted by Brian on February 9th, 2006


You should read up on the MDL (Memory Descriptor List). Windows already does
what you are proposing. You are really attempting to support scatter/gather.
Your design is correct, but you're reinventing the wheel.

Way back at the turn of the century, I supported several devices that
recorded data almost as fast (up to 64 MB/s), but back then the equipment was
much slower too (32-bit 33 MHz bus, CPU speeds in the MHz range).

Brian

Posted by v_mirgorodsky@yahoo.com on February 10th, 2006


Hello Brian,

Thanks, Brian, for pointing this out. Although I am familiar with
x86 CPU memory management structures, I never tried to dig into the Windows
implementation of those details. I always treat the MDL as a black box and
work with it only through the provided kernel API. Sure, I can dig into
its structure and research all the way down to the CPU structures,
but that code will not be compatible with upcoming Windows versions and
64-bit CPUs. Is there any elegant way to figure out the physical
structure of the memory buffer?

What about buffer consistency in the long run? Is it possible to
guarantee that its physical structure remains unchanged for an arbitrarily
long period of time while it is locked in memory?

I am going to use the PCI MEMORY_WRITE_INVALIDATE command. The documentation
states that this command should automatically maintain cache coherency.
Do I need to call KeFlushIoBuffers() and/or FlushAdapterBuffers() at
the end of the transfer operation? Since the system does not know
anything about my device's internal DMA, a call to FlushAdapterBuffers()
seems to be redundant. What about KeFlushIoBuffers()? The system does not
know which command I am using to transfer data, so it cannot decide whether
it is necessary to flush the CPU caches.

What is the worst-case latency in calling my DPC for ISR? I read that
the CD-ROM driver may keep the system at elevated IRQL for seconds,
effectively preventing all other DPCs from executing. Sure enough, I
can discourage using CD-ROMs while my device bursts its data, but maybe
there are other bad citizens like that in the driver community?

With best regards,
Vladimir S. Mirgorodsky


Posted by Mark Roddy on February 10th, 2006



You misunderstand. The MDL is one of the key data structures for
supporting DMA operations and direct memory transfers from device to
user buffers, but you do not poke around with the internal pieces of
the MDL, you use the defined APIs for manipulating MDL objects. MDLs
are just one piece of the DMA architecture. The next piece is the
DMA_ADAPTER object obtained via a call to IoGetDmaAdapter. From the
DMA_ADAPTER you obtain a set of bus specific DMA methods that you can
call. In particular you want to read up on GetScatterGatherList and
PutScatterGatherList, and once you figure out how these work, you also
might want to understand BuildScatterGatherList and friends.

The scattergather list methods provide a vastly simplified mechanism
for doing scatter gather dma using data buffers described by MDLs.
Almost all of the OS specific housekeeping operations are performed
for you if you use Get/PutScatterGather - all you have to do is setup
your device's DMA registers and hit the 'go' bit.

You are?

Cache coherency from the perspective of the host system is guaranteed
by the platform.

If you use the scattergather methods you do not have to flush
anything. PutScatterGather does whatever is required to guarantee that
the data is available.

Indeterminate.

Note that the version 2 DMA_OPERATIONS support mechanisms to build
your own scattergather list objects (BuildScatterGatherList,
CalculateScatterGatherList) and that you may be able to architect your
device/driver interface such that you can tolerate a fair amount of
latency by having queues of DMA operations ready to go, with your
device moving asynchronously onto the next operation while your driver
is waiting for its DPC routine to run.
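
How much latency a queue of ready DMA operations can absorb is simple arithmetic. The sketch below is a rough estimate only: it uses the 32MB and 85-100MB/s figures from the opening post, treats 1 MB as 2^20 bytes, and the names `tolerable_latency_ms` and `bytes_produced` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)  /* assumption: 1 MB means 2^20 bytes */

/* Milliseconds the driver may be delayed before the device runs out of
 * queued buffer space, given a steady data rate. */
uint64_t tolerable_latency_ms(uint64_t queued_bytes, uint64_t bytes_per_second)
{
    assert(bytes_per_second > 0);
    return queued_bytes * 1000u / bytes_per_second;
}

/* Bytes the device produces while the driver is delayed for latency_ms. */
uint64_t bytes_produced(uint64_t bytes_per_second, uint64_t latency_ms)
{
    return bytes_per_second * latency_ms / 1000u;
}
```

With 32MB queued, this gives 320 ms of slack at 100MB/s and about 376 ms at 85MB/s. Bursts above the average rate would shrink that margin.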

Not sure about seconds, but there continue to be badly behaved drivers
out there. On an MP system one device behaving badly should not
prevent other devices from being serviced on other CPUs.

platform configuration. You have to be able to tolerate a modest
amount of latency. Probably your users should not be burning CDs while
collecting data from your device.


=====================
Mark Roddy DDK MVP
Windows Vista/2003/XP/2000 Consulting
Device and Filesystem Drivers
Hollis Technology Solutions 603-321-1032
www.hollistech.com

Posted by v_mirgorodsky@yahoo.com on February 10th, 2006


Hello Mark,

The problem is that I don't want to call the PutScatterGather() function
every time I get some data from my device. Rebuilding the list and
reprogramming the device each time introduces more latency to the process
than I am able to afford. The whole idea is to create this scatter/gather
list once and use it for a long time, notifying the user application that a
portion within the buffer is valid for processing. The application must
process the data as fast as the device generates it. As soon as the device
reaches the end of the buffer it starts again from the beginning, using
computer memory as a huge external FIFO.

Sure, I am developing both the PCI adapter and the device driver, so it is
under my control.

It does, but I need more comprehensive knowledge of the issue. I cannot
afford reinitializing DMA on my device after every transfer. That
is why I would need to perform some of the work of PutScatterGather()
myself. So I still need to understand whether it is necessary to call
the cache coherency management functions mentioned above.

With best regards,
Vladimir S. Mirgorodsky




Posted by Brian on February 10th, 2006


Vladimir,

Wow, this is really sounding familiar. I supported a digital recording
system. I built my scatter gather list in SRAM that was on my device, but I
could have done it in system memory had I needed to. The list described a
circular queue of "buffers". When recording started, the device would
interrupt when a buffer was filled; the application would save the buffer to
disk and then update a register in SRAM so that the device would not overrun
it as it came around. Since we were recording in realtime we couldn't afford
to overrun either. Technically we couldn't throttle anything either, but
fortunately the design never needed to.
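
The bookkeeping for such a circular queue can be sketched in a few lines. This is a hypothetical model in plain C, not the code from the recording system described above: `ring_t` and its functions are invented names, and the `filled` counter stands in for the register the application updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a circular queue of DMA buffers. */
typedef struct {
    size_t count;      /* number of buffers in the ring */
    size_t write_idx;  /* next buffer the device will fill */
    size_t read_idx;   /* next buffer the application will consume */
    size_t filled;     /* buffers filled but not yet consumed */
} ring_t;

void ring_init(ring_t *r, size_t count)
{
    assert(count > 0);
    r->count = count;
    r->write_idx = 0;
    r->read_idx = 0;
    r->filled = 0;
}

/* Device side: returns false when filling another buffer would
 * overrun data the application has not released yet. */
bool ring_device_fill(ring_t *r)
{
    if (r->filled == r->count)
        return false;
    r->write_idx = (r->write_idx + 1) % r->count;
    r->filled++;
    return true;
}

/* Application side: releases the oldest filled buffer back to the
 * device; returns false when there is nothing to release. */
bool ring_app_release(ring_t *r)
{
    if (r->filled == 0)
        return false;
    r->read_idx = (r->read_idx + 1) % r->count;
    r->filled--;
    return true;
}
```

In a real design the two sides run concurrently (device hardware and application), so the shared state would need whatever synchronization the hardware interface provides. The sketch ignores that.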

I did use documented MDL functions (at the time the ones Mark mentions
weren't available yet) and was able to make the transition from NT to WDM
without having to change my DMA engine. My "solution" was that at the
beginning of the recording I made a device IO control call to my driver. I
did not complete the IO until the end of the recording, which could be an
indeterminate amount of time. The memory that I was using was locked in
place and accessible by the application, and most importantly it worked. I
don't know if this is what the kernel developers had in mind, but I stayed
within the documented functionality of the time.

good luck,

brian

Posted by Alberto. on February 10th, 2006


Our driver does something very similar, and it has been working for quite a
while. One advantage of the method is, it is written in such a way that the
code is compatible across OS's - our driver runs on Windows, Linux, Solaris
and HP-UX, and the code to set up the scatter-gather list is in the common
OS-independent part.

Posted by v_mirgorodsky@yahoo.com on February 11th, 2006


Hello, ALL!

Thank you for all your comments and suggestions. I think I am pretty
clear now about this idea. Your practical experience confirms the viability
of this solution, which is really great for me.

Thanks to all of you again,

With best regards,
Vladimir S. Mirgorodsky

Posted by Maxim S. Shatskih on February 11th, 2006


Notification must be done by a Read/Write/DeviceIoControl call, which stays
pending until the driver is done with this area.

No need. Just make your driver DO_DIRECT_IO and use METHOD_xxx_DIRECT IOCTL
codes - in these cases, the things will be done automatically by the OS and you
will have the ready MDL at Irp->MdlAddress.

Mapping it to the kernel memory is not needed at all in your case.

Correct, but please use ->AllocateCommonBuffer for this.

Very good idea, called "chain DMA". Lots of hardware works this way, like
the popular IDE/1394/USB controllers and also the aic78xx SCSI controller.

But note: "the 4-bytes long addresses of physical pages of memory" must be
obtained via ->MapTransfer or ->GetScatterGatherList, NOT via
MmGetPhysicalAddress (if I were MS, I would make Verifier spit warnings on
its use) and NOT by meddling directly with the MDL tail.

The way itself is absolutely good and popular, and used in lots of hardware.
Just use the correct APIs for all of this - the Windows DMA APIs. Otherwise,
your code will die on PAE, on > 4GB machines, on some specific laptop chipsets
etc.

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com


Posted by Maxim S. Shatskih on February 11th, 2006


MDL is "struct _MDL", followed by the array of physical page numbers (physical
addresses / PAGE_SIZE).
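
The relation between a page number and a physical address, and why a 4-byte field is not always enough, can be shown with plain arithmetic. This is only an illustration assuming 4 KB pages, with hypothetical helper names. It is not a way to read the MDL tail; a driver should get its addresses from the DMA APIs.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_BITS 12  /* assumption: 4 KB pages */

/* Physical address = page number * page size + offset within the page. */
uint64_t pfn_to_physical(uint64_t pfn, uint32_t byte_offset)
{
    assert(byte_offset < (1u << PAGE_SHIFT_BITS));
    return (pfn << PAGE_SHIFT_BITS) | byte_offset;
}

/* Non-zero when the address can be stored in a 4-byte catalog entry. */
int fits_in_32_bits(uint64_t physical)
{
    return physical <= UINT32_MAX;
}
```

Page number 0xFFFFF is the last one whose addresses fit in 32 bits. Page number 0x100000 starts at exactly 4GB and would be truncated in a 4-byte entry.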

For Read/WriteFile, use DO_DIRECT_IO, and the OS will build the MDL from the
app's buffer itself and provide it for you at Irp->MdlAddress.

For IOCTLs, use METHOD_xxx_DIRECT codes for the same effect.

But please do not touch the MDL tail yourself - for instance, on PAE, the MDL
tail entries are 64bit and not 32! The only valid uses of the MDL are:

- map it to kernel addresses using MmGetSystemAddressForMdlSafe (this call
will use the MDL tail to fill the PTEs; conceptually, it is 2 calls:
MmAllocateMappingAddress allocates the system PTE range, and then
MmMapLockedPagesWithReservedMapping fills these PTEs according to the MDL
tail).
- create a sub-MDL from it using IoBuildPartialMdl
- pass the MDL to the DMA APIs.

The DMA APIs are implemented as methods of the "adapter object" provided by the
lower driver in the PnP stack (or the PDO itself). The adapter object
implementation in this lower driver (usually pci.sys or acpi.sys) knows all of
the details about bounce buffers, discontig memory, 4GB-PCI-limit on >4GB PAE
machines, and so on.

APIs like ->MapTransfer or ->GetScatterGatherList will transform the MDL tail
according to all these details and return you the results as Start/Length pairs
( ->MapTransfer must be called in a loop and returns 1 such pair per call).
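
To give a feel for what a list of Start/Length pairs looks like, here is a small user-mode model that merges physically adjacent pages into runs. It is a sketch under stated assumptions, not what the Windows adapter object does: the real implementation may also substitute bounce buffers or translated addresses. `sg_element_t` and `build_runs` are invented names, and pages are assumed to be 4 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KB pages */

typedef struct {
    uint64_t start;   /* start address of the run */
    uint32_t length;  /* length of the run in bytes */
} sg_element_t;

/* Merge adjacent pages into start/length runs.
 * Returns the number of elements written, or 0 if `out` is too small
 * (or if there were no pages at all). */
size_t build_runs(const uint64_t *page_addrs, size_t pages,
                  sg_element_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < pages; i++) {
        if (n > 0 && out[n - 1].start + out[n - 1].length == page_addrs[i]) {
            /* This page directly follows the previous run: extend it. */
            out[n - 1].length += PAGE_BYTES;
        } else {
            if (n == out_cap)
                return 0;
            out[n].start = page_addrs[i];
            out[n].length = PAGE_BYTES;
            n++;
        }
    }
    return n;
}
```

Six pages at 0x1000, 0x2000, 0x5000, 0x6000, 0x7000 and 0x9000 collapse into three pairs, which is the kind of list the device's chain would then be programmed with.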

Do not use MmAllocateContiguousMemory, this call is intended for DMA adapter
object implementors only - to implement ->AllocateCommonBuffer.

->AllocateCommonBuffer is smarter and knows the specifics of the machine and
device (PAE or not, is your PCI device 64bit or not, is it Dual-Address-Cycle
or not, and such). It chooses the upper limit and alignment for such allocation
correctly according to these details.

The first (KeFlushIoBuffers) is a no-op on x86.

The second (FlushAdapterBuffers) is mandatory; it flushes the temporary
bounce buffers if the adapter object decided to use them in ->MapTransfer or
->GetScatterGatherList.

Major. I think that 10ms is normal.

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com


Posted by Maxim S. Shatskih on February 11th, 2006


Send several overlapped IRPs from the app, and assemble 1 huge DMA chain in the
driver. Then, when the DpcForIsr fires, temporarily stop the DMA engine on the
device, analyze which chain entries were completed and which were not, and
complete the necessary IRPs.
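
The "analyze which entries were completed" step could look roughly like the following. This is a simplified user-mode model, not kernel code: requests are represented by integer ids instead of IRPs, each request's entries are assumed to be contiguous in the chain, and the device is assumed to report how many entries it has finished, in order. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int request_id;  /* hypothetical id of the request owning this entry */
} chain_entry_t;

/* Collect the ids of requests whose every chain entry is finished.
 * `finished` is the number of leading chain entries the device reports
 * as done. Returns how many ids were written to `done`. */
size_t completed_requests(const chain_entry_t *chain, size_t total,
                          size_t finished, int *done, size_t done_cap)
{
    assert(finished <= total);
    size_t n = 0;
    size_t i = 0;
    while (i < finished) {
        int id = chain[i].request_id;
        size_t end = i;
        /* Find the end of this request's run of entries. */
        while (end < total && chain[end].request_id == id)
            end++;
        /* Complete only if the whole run lies inside the finished part. */
        if (end <= finished && n < done_cap)
            done[n++] = id;
        i = end;
    }
    return n;
}
```

A request whose entries are only partly filled stays pending until a later interrupt reports the rest of its entries as done.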

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com


