Old 04-03-06, 08:22 AM   #25
JaXXoN
Registered User
 
Join Date: Jul 2005
Location: Munich
Posts: 910
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

@ZANDER

I finally figured out what causes the high latencies when starting and
stopping glxgears: it is indeed a "wbinvd". This single instruction can
cause latencies of up to 160 microseconds measured so far.

Depending on whether NV_CPA_NEEDS_FLUSHING is defined, nv_flush_caches()
either uses nvidia's own cache flushing mechanism cache_flush(), or calls
global_flush_tlb(), which in turn calls flush_kernel_map().

Either way, both paths end in a wbinvd instruction, which causes the latencies.
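
For reference, here is a minimal sketch of the call structure as I understand it (my own simplification, not the actual nv-vm.c code; both branches end in wbinvd):

Code:
/* Simplified sketch, not the actual driver source: both branches of
 * nv_flush_caches() eventually execute a wbinvd instruction. */
static void nv_flush_caches_sketch(void)
{
#if defined(NV_CPA_NEEDS_FLUSHING)
    cache_flush();        /* nvidia's own mechanism, ends in wbinvd */
#else
    global_flush_tlb();   /* calls flush_kernel_map(), which also issues wbinvd on i386 */
#endif
}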

From what I can see so far, nv_flush_caches() is called from these
functions:

* nv_vm_malloc_pages()
* nv_vm_free_pages()
* nv_vmap_vmalloc()

These functions are called when starting and stopping glxgears.

I don't yet understand in detail why a full cache flush would
actually be necessary when allocating and freeing memory, but
from what I have dug out so far, the reserved pages are
somehow prepared for DMA in the allocation routine, and in order
to guarantee memory coherence when the nvidia chip accesses
the prepared pages, the cache is flushed.

This is IMHO a "brute force" method! It would be way more elegant
and less time consuming to specifically flush only the pages
affected (using flush_cache_page(), flush_cache_vmap(), etc.)
rather than simply flushing the whole cache.
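
Just to illustrate the idea, a conceptual sketch (the helper name is made up, and whether these interfaces actually do anything useful depends on the architecture):

Code:
/* Conceptual only: flush just the kernel mapping of the affected range
 * instead of the whole cache (hypothetical helper, made-up name). */
static void nv_flush_page_range(unsigned long start, unsigned long size)
{
    /* on architectures with fine-grained cache ops, only the lines
     * backing [start, start + size) would be written back */
    flush_cache_vmap(start, start + size);
}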

Another alternative might be to allocate the memory uncached,
if only to check whether it makes a difference concerning the real-time
capabilities; using uncached DMA pages typically leads
to some performance penalty. However, depending on how big
the performance loss is, it might be an acceptable solution.

Any feedback is highly appreciated!

regards

Bernhard
JaXXoN is offline   Reply With Quote
Old 04-03-06, 10:09 AM   #26
JaXXoN
Registered User
 
Join Date: Jul 2005
Location: Munich
Posts: 910
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

Quote:
Originally Posted by JaXXoN
This is IMHO a "brute force" method! It would be way more elegant
and less time consuming to specifically flush only the pages
affected (using flush_cache_page(), flush_cache_vmap(), etc.)
rather than simply flushing the whole cache.
I have just learned from the Linux source code that flush_cache_page()
and friends are not available on i386 (I'm more accustomed to PPC, where
these functions do a great job concerning DMA throughput), and from the
nvidia glue code sources I learned that DMA pages are uncached
(they actually need to be that way if flushing individual cache lines or
pages is not possible):

nv-vm.c, lines 294 - 300:
Code:
 * when allocating new pages, we convert the kernel mapping from cached to
 * uncached to avoid cache aliasing. one problem with this is that cpus can 
 * still contain data cached from these pages, in addition to stale ptes that
 * are cached and think the pages are still cached. normally, the cpu's self
 * snoop (SS) capability would catch this between cpus, but if the pages are
 * mapped through the agp aperture, SS is not capable of detecting these
 * conflicts.
Not sure how we could address this issue without using a wbinvd
instruction. Earlier, I suggested reading in big chunks of bulk memory
in a loop to make sure that only a small number of cache lines are out
of sync with main memory before calling wbinvd (so that only a small number
of cache lines actually need to be flushed when wbinvd is called).
When I have time, I will implement this and let you know the results.
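
The "read bulk memory first" idea would look roughly like this (sketch only; buffer and line sizes are assumptions that would have to match the actual cache):

Code:
/* Sketch of the idea: walk a buffer at least as large as the last-level
 * cache so that ordinary reads evict most dirty lines beforehand, leaving
 * wbinvd with only a few lines to write back (sizes are assumptions). */
#include <asm/system.h>   /* wbinvd() on i386 */

#define EVICT_SIZE  (2 * 1024 * 1024)   /* >= L2 cache size, assumption */
#define LINE_SIZE   64                  /* cache line size, assumption  */

static char evict_buf[EVICT_SIZE];

static void nv_precharge_then_flush(void)
{
    volatile char sink = 0;
    unsigned long i;

    for (i = 0; i < EVICT_SIZE; i += LINE_SIZE)
        sink += evict_buf[i];   /* pulls in fresh lines, evicting dirty ones */

    wbinvd();                   /* now has little left to write back */
    (void)sink;
}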


Maybe another solution would be to use the PCI DMA kernel interface:

nv-vm.c, lines 392 - 396:

Code:
 * the official kernel interface for allocating dma memory is to use
 * pci_alloc_consistent. this will abstract many of the 32-bit issues from us
 * mentioned above and will allow us to clean our code up quite a bit. we're
 * migrating towards this interface, but it's not quite stable in all cases
 * yet. em64t is one such current case.
Zander, does a patch exist that you could share with us?

I think the PCI DMA kernel interface doesn't require wbinvd when
allocating uncached memory - at least, I haven't yet noticed
high latencies, e.g. when network DMA buffers are allocated.
So an nvidia driver using the PCI DMA kernel interface
could potentially solve the latency issue.
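
For reference, generic usage of that interface looks roughly like this (plain kernel API usage, not nvidia driver code; pdev would be the driver's own struct pci_dev):

Code:
/* Generic pci_alloc_consistent() usage (2.6-era PCI DMA API), for reference. */
#include <linux/pci.h>

static void *alloc_dma_page(struct pci_dev *pdev, dma_addr_t *bus_addr)
{
    /* returns a kernel mapping the CPU can use; *bus_addr is the address
     * the device DMAs to/from - no wbinvd or manual flush by the caller */
    return pci_alloc_consistent(pdev, PAGE_SIZE, bus_addr);
}

static void free_dma_page(struct pci_dev *pdev, void *cpu_addr, dma_addr_t bus_addr)
{
    pci_free_consistent(pdev, PAGE_SIZE, cpu_addr, bus_addr);
}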

regards

Bernhard
JaXXoN is offline   Reply With Quote
Old 04-03-06, 12:13 PM   #27
zander
NVIDIA Corporation
 
 
Join Date: Aug 2002
Posts: 3,740
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

@JaXXoN: last I checked, the interface you referred to above still didn't actually respect the PCI devices' DMA masks, which renders it unusable for older hardware with a limited address space. In any case, the choice of the allocation interface is orthogonal to this problem; the need for cache flushes would persist even if a different interface were used.

The reason for this is that change_page_attr() is used to update the kernel mappings of memory that's mapped into the AGP aperture or that's mapped into user space with the WC memory type, and that any cache lines and TLB entries for the old cached mappings need to be evicted before the (WC AGP/user mappings of the) memory can be used safely. For these kinds of memory, the GPU's DMA engine won't snoop the CPU caches (note that most AGP implementations aren't cache coherent). Many of the Xid problems reported with shipping kernels are the indirect result of problems where the kernel mappings aren't updated or the caches aren't properly flushed.
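
For illustration, the typical 2.6 usage pattern of that kernel interface is roughly the following (generic kernel API, not driver code; error handling omitted):

Code:
/* Typical 2.6-era change_page_attr() pattern (illustration only). */
#include <linux/mm.h>
#include <asm/cacheflush.h>
#include <asm/pgtable.h>

static void make_uncached(struct page *page, int count)
{
    change_page_attr(page, count, PAGE_KERNEL_NOCACHE); /* rewrite kernel PTEs */
    global_flush_tlb();   /* drop stale TLB entries and cached lines (wbinvd on i386) */
}

static void make_cached_again(struct page *page, int count)
{
    change_page_attr(page, count, PAGE_KERNEL);
    global_flush_tlb();
}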

I thought that change_page_attr() tried to be smarter than to just issue the wbinvd instruction, at least on Linux/x86-64, but looking at the kernel source tree, that optimization was actually disabled.

I think that without the ability to allocate from a pool of memory that has an uncached kernel mapping by default, you'd only get away without the change_page_attr()/wbinvd logic if you were using an Intel processor (though not by design; creating two mappings of the same piece of memory with conflicting memory types is illegal). On PCI-E systems, disabling PAT support and the change_page_attr()/wbinvd logic would probably also be safe with current NVIDIA Linux graphics driver releases, but that would incur a (potentially significant) performance penalty.
zander is offline   Reply With Quote
Old 04-03-06, 06:14 PM   #28
JaXXoN
Registered User
 
Join Date: Jul 2005
Location: Munich
Posts: 910
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

Zander,

thanks for your quick but nevertheless detailed response - it sounds pretty
plausible, but confuses me even more:

I have compared how the linux kernel (in dma_alloc_consistent) and the
nvidia driver (in nv_vm_malloc_pages) allocate memory for DMA
transfers:


Both use __get_free_pages(), but the kernel additionally applies the
GFP_DMA flag (which will allocate and map the pages uncached, as far
as I understand), while the nvidia module uses change_page_attr() afterwards
(which requires wbinvd to get the post-allocation changes to the PTEs fixed).

I adopted the kernel variant for the nvidia kernel module in nv-vm.c
(please find attached the appropriate patch):

1. added "GFP_DMA" to "NV_GFP_DMA32" (applied to __get_free_pages)
2. ifdef'ed out all occurrences of change_page_attr()
3. commented out all occurrences of nv_flush_cache()
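
Schematically, the change boils down to something like this (sketch only, not the attached patch; names and the original definition of NV_GFP_DMA32 are paraphrased):

Code:
/* Schematic sketch of steps 1-3, not the attached patch. */
#include <linux/gfp.h>
#include <linux/mm.h>

/* 1. also draw the pages from the 16MB DMA zone */
#define NV_GFP_DMA32 (GFP_KERNEL | GFP_DMA)

static unsigned long nv_alloc_page_sketch(void)
{
    unsigned long virt = __get_free_pages(NV_GFP_DMA32, 0);

    /* 2. + 3. change_page_attr() and nv_flush_cache() are no longer
     * called here - the assumption under test is that DMA-zone pages
     * can be handed to the GPU without the attribute change/flush */
    return virt;
}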

The result is that the high latencies while starting/stopping glxgears have disappeared!!

The highest latency measured under heavy load ("glxgears", "find /" and
"ping -s 6000 -f <ip of some other pc in my network>" in parallel) was
37 microseconds.

However, i have observed two side effects:

A)
When the glxgears window is located on display 1 or 2 (twinview configuration
on the first 7800GTX), the frame rate drops to 13600 FPS (without the patch it
is about 16000 FPS), but when the window is moved to display 3 (second 7800GTX),
the frame rate recovers to the usual 16000 FPS.
This might have something to do with the fact that nv_rm_malloc_pages()
actually differentiates between cached and uncached allocations, while
I'm currently forcing all pages to be uncached. The cleaner method would be
to check whether a cached or uncached allocation is desired and to pass an appropriate
GFP flag as an additional parameter to NV_GET_FREE_PAGES(), which is then applied
to __get_free_pages() accordingly. I will implement that and post whether it makes any
difference.

B)
Once during the experimentation, the screen froze with symptoms
known from the infamous "screen freezes but mouse pointer moves" bug.
However, in my case, glxgears was still running on the screen. Xorg and
glxgears each used 100% CPU time (one per core of the X2 Athlon).
I attached gdb from a remote machine: the X server was looping somewhere
in nvidia_drv.o (only "??" symbols in the stack backtrace around there, so no
basis for speculation about what the problem could be), so it's probably a different
situation than the nanosleep()/gettimeofday() issues discussed earlier.
I couldn't reproduce the problem and I'm not sure whether it might also
happen on a system where the nowbinvd patch has not been applied
(however, the problem hasn't yet shown up within hours of testing).


dmetz99,

would you mind testing the attached patch and checking whether the
screen freezes for your setups?

Patch order:

NVIDIA_kernel-1.0-8178-U012206.diff.txt
NVIDIA_kernel-1.0-8178-1491837.diff.txt
patch-nv-1.0-8178-U012206-1491837-2.6.16-rt11
patch-nv-1.0-8178-U012206-1491837-2.6.16-rt11-nowbinvd


Any feedback is highly appreciated!

regards

Bernhard
JaXXoN is offline   Reply With Quote
Old 04-03-06, 06:40 PM   #29
zander
NVIDIA Corporation
 
 
Join Date: Aug 2002
Posts: 3,740
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

@JaXXoN: the only way to switch the kernel mappings' memory type to UC is to use the change_page_attr() kernel interface; you'll need to do this no matter how you obtain ownership of the memory. The GFP_DMA flag doesn't control the cache attributes, it selects the zone the kernel allocates from (GFP_DMA instructs the kernel to allocate memory from the first 16MB of physical memory (for ISA devices); the new (as of Linux 2.6.15) GFP_DMA32 instructs the kernel to allocate memory from the first 4GB of physical memory). The problem you describe in B) is most likely the result of your changes to not use change_page_attr().
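
Schematically (plain kernel calls, illustration only):

Code:
/* The zone modifiers only select where the pages come from; the kernel
 * mapping is cached in every case (illustration, error handling omitted). */
#include <linux/gfp.h>

static void zone_modifier_example(void)
{
    unsigned long isa = __get_free_pages(GFP_KERNEL | GFP_DMA, 0);   /* phys < 16MB */
    unsigned long low = __get_free_pages(GFP_KERNEL, 0);             /* kernel logical ("low") memory */
#ifdef CONFIG_X86_64
    unsigned long dma32 = __get_free_pages(GFP_KERNEL | GFP_DMA32, 0); /* phys < 4GB, Linux >= 2.6.15 */
    free_pages(dma32, 0);
#endif
    /* to obtain an uncached mapping, change_page_attr() would still be needed */
    free_pages(isa, 0);
    free_pages(low, 0);
}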
zander is offline   Reply With Quote
Old 04-03-06, 07:27 PM   #30
JaXXoN
Registered User
 
Join Date: Jul 2005
Location: Munich
Posts: 910
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

Quote:
Originally Posted by zander
the new (as of Linux 2.6.15) GFP_DMA32 instructs the kernel to allocate memory from the first 4GB of physical memory)
Thanks very much for this hint!

My biggest disconnect here was that on the non-x86 hardware I have dealt
with so far (caches and MMUs on PPC, MIPS and ARM), there only exists
a "Normal" (all cached) zone, because these architectures have instructions to
easily flush the individual cache lines affected by cache attribute changes.

I now understand that x86 uses the first 16 MByte as a DMA zone, uncached
by default, and if you need more uncached memory (as necessary for
3D graphics), then you need to take it from the Normal zone, mark it
uncached and call wbinvd. The latter because x86 doesn't provide fine-grained
cache manipulation instructions - bummer.

So the solution would now be to either a) extend the 16 MByte
"DMA zone" and apply GFP_DMA for uncached allocations,
or b) introduce a "DMA32 zone" and apply "GFP_DMA32".


I guess variant a) probably only requires changing some hard-coded
value in the kernel sources (or does a boot option even exist?),
but it would most certainly cause problems with old ISA hardware (not
applicable in my case) or when using the floppy disk (avoidable in my case).

I think version b) is the cleaner variant, but I read in the announcement
that the "DMA32 zone" has only been implemented for x86_64.
However, I can imagine it's only a couple of lines to add another zone.
In this context, the kernel log output says:

Code:
127MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f5980
On node 0 totalpages: 262128
  DMA zone: 4096 pages, LIFO batch:0
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 225280 pages, LIFO batch:31
  HighMem zone: 32752 pages, LIFO batch:7
Looks like there is already some provision for a "DMA32 zone" in 2.6.16.
Maybe there even exists a kernel boot option to define the size of this
zone (default is 4GB?)? I will check the kernel sources.

Anyway, how big should such a DMA32 zone actually be for 3D graphics
hardware? Probably depends on the application?!

regards

Bernhard
JaXXoN is offline   Reply With Quote
Old 04-03-06, 07:35 PM   #31
dmetz99
Registered User
 
Join Date: Mar 2005
Posts: 84
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

Just got back here from doing other stuff. The backtrace from an interrupted "3-second sticky glxgears" episode is listed below. Looks like this may be called from a non-OS part of the code....

Program received signal SIGINT, Interrupt.
0x423eefe5 in nanosleep () from /lib/i686/libc.so.6
(gdb) bt
#0 0x423eefe5 in nanosleep () from /lib/i686/libc.so.6
#1 0x42418a6c in usleep () from /lib/i686/libc.so.6
#2 0xa74bed10 in _nv001743x ()
from /usr/X11R6/lib/modules/nvidia-drv.so
#3 0x00000000 in ?? ()
(gdb) c

JaXXoN - I'll try your latest variant and let you know what happens here in a few..
dmetz99 is offline   Reply With Quote
Old 04-03-06, 07:39 PM   #32
JaXXoN
Registered User
 
Join Date: Jul 2005
Location: Munich
Posts: 910
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

Quote:
Originally Posted by zander
The problem you describe in B) is most likely the result of your changes to not use change_page_attr().
Agreed :-) Apart from the easily reproducible high latencies with glxgears,
I sometimes observed other high latencies (discussed in another thread).
These are very likely caused by harder-to-catch calls to nv_rm_malloc_pages().
Since only 16 MBytes are reserved for DMA operations when my recent patch
is applied, and since this is very likely not sufficient, it's quite obvious why
the screen might freeze (rather than just causing high latencies).

In this context: Do you think this could also be a problem for other
nvidia users when their systems start running out of memory?

regards

Bernhard
JaXXoN is offline   Reply With Quote

Old 04-03-06, 07:52 PM   #33
JaXXoN
Registered User
 
Join Date: Jul 2005
Location: Munich
Posts: 910
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

Quote:
Originally Posted by dmetz99
Backtrace from an interrupted "3-second sticky glxgears" episode is listed below.
Thanks for reporting. Looks like this time the X server loops through
nanosleep(), invoked by the function _nv001743x() in the nvidia_drv.so
X-server driver module. Zander, could you please check whether this function
may include code that could react allergically to changes in the (now more
fine-grained) timing behaviour of nanosleep()? I can imagine that it is
"historically" possible that nanosleep() is called with only a few nanoseconds
as argument, but nanosleep() always waited at least a full system tick cycle
(10ms on 2.4 or 4ms on 2.6), whereas now with -rt, nanosleep() may return too fast,
confusing the nvidia driver?!
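
A quick userspace sanity check of this theory might look like the following (generic measurement sketch, nothing nvidia-specific; build with something like gcc -O2 test.c -lrt):

Code:
/* Measure how long nanosleep() really blocks for a tiny request: on a
 * tick-based kernel it is at least one tick, on -rt it can be far less. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec req = { 0, 1000 };   /* request: 1 microsecond */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("slept for %.3f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    return 0;
}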

Quote:
Originally Posted by dmetz99
JaXXoN - I'll try your latest variant and let you know what happens here in a few..
Forget about it, it won't work!! :-)
At least not yet - for details, please refer to my recent post.

regards

Bernhard
JaXXoN is offline   Reply With Quote
Old 04-03-06, 07:59 PM   #34
zander
NVIDIA Corporation
 
 
Join Date: Aug 2002
Posts: 3,740
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

@JaXXoN: the GFP_DMA zone isn't uncached by default, and neither is the GFP_DMA32 zone; both zones merely reflect the location in physical memory. Note that a GFP_DMA32 zone isn't necessary on Linux/x86, since GFP_KERNEL is already limited to "low" (i.e. kernel logical) memory (you need to allocate "high" memory explicitly). The problems you're seeing when not calling change_page_attr() aren't due to the GFP_DMA zone being exhausted (though that would explain performance problems; the < 16MB are woefully inadequate), they're caused by the fact that the pages' kernel mappings continue to be cached.
zander is offline   Reply With Quote
Old 04-03-06, 08:07 PM   #35
zander
NVIDIA Corporation
 
 
Join Date: Aug 2002
Posts: 3,740
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

@JaXXoN: the above stack trace is insufficient to diagnose the problem; it's unclear to me at this point what's causing it. I suspect it may be related to how the driver yields, but I'll have to check.
zander is offline   Reply With Quote
Old 04-03-06, 08:09 PM   #36
dmetz99
Registered User
 
Join Date: Mar 2005
Posts: 84
Default Re: [PATCH, REALTIME] nvidia-1.0-8178 and Linux-2.6.16-rt11

@JaXXoN - You're right - it did not change anything. Got a backtrace identical to the previous one.

@zander - what other info would be helpful?
dmetz99 is offline   Reply With Quote