Old 11-09-02, 11:22 AM   #1
ScoobyDoo
Registered User
Join Date: Nov 2002
Posts: 11
Linux/FreeBSD VBLANK (vertical sync) Implementation

Hello,



I am hoping someone from Nvidia will take a few moments to read this and maybe shed some light on this issue for me and others.

I am writing a desktop application in OpenGL. One of the main priorities for a multitasking desktop application is to co-operate with the other applications running on the system.

I feel the Linux/FreeBSD drivers have a broken vertical sync implementation, which means desktop applications cannot be developed successfully for this card.

Under the Windows drivers, I can turn on vertical syncing and this allows me to only render the frames I need to. Whilst the application is "waiting" for the next vertical sync to occur, the SwapBuffers() call blocks the application/thread so no CPU time is used unnecessarily. Basically, this means that the application uses < 3% CPU time overall, as most of its life is spent "waiting" in the SwapBuffers() call.

Under the Linux/FreeBSD drivers, if I turn on vertical syncing, the SwapBuffers() call busy-loops whilst waiting for the next vertical sync to occur. This means the application uses 100% of the CPU unnecessarily, even when doing very little, thanks to the inefficiency of the SwapBuffers() call. An application that always uses 100% of the CPU is awful for a desktop environment and degrades the performance of the whole system.

Now, this was not always the case, so I have been told! Apparently, once upon a time, one could successfully block the application/thread (so no CPU time is used) while waiting for the vertical sync by issuing a poll() on "/dev/nvidia0". Basically this would do the job:

pollVerticalSync.fd = verticalSyncFD;
pollVerticalSync.events = 0xffff;
pollVerticalSync.revents = 0xffff;
verticalSyncFD = open("/dev/nvidia0", O_RDONLY);
poll(&pollVerticalSync, 1, -1);


Unfortunately, this does not work, and has not for a long time. The poll() returns instantly, instead of blocking until the next vertical sync.

So, please Nvidia, how are we supposed to write desktop applications that sync to the vertical refresh of the monitor (important for some video work etc.) without any efficient way of handling the vertical sync? The Windows implementation works fine; this is a major setback for Linux/FreeBSD workstation use!

Please reply,

Jamie Burns.
Dynamic Expression Ltd., UK.
Old 11-09-02, 02:53 PM   #2
bwkaz
Registered User
 
Join Date: Sep 2002
Posts: 2,262

How are you "turning on the vertical sync", out of curiosity?

Have you bothered to check the return value of poll() in your test program, so you know whether it returns because the conditions are met, or because of an error in the parameters? $20 says it's returning a negative value, and setting errno to EBADF (no, this isn't actually any kind of wager, but you understand what I'm trying to get across, don't you?)...

A simple check for error would be to find out if the return value is less than zero, and if so, call perror("poll"); or something similar. And a check of the manpage for poll would tell you what the specific error value you're getting, if any, means. You may also find the select() syscall's manpage helpful. The reason I'm thinking about negative return values is that it looks like your file descriptor is invalid when you call poll() -- you should call open() first.

You're also calling poll() with the "event to poll for" field set to (unsigned short)-1. Why? Again according to the manpage, the only valid values are currently 1, 2, and 4, or sums of those (so in other words, the only valid bits that can be set are bits 0, 1, and 2, and you're setting all bits up to 15). I therefore also wouldn't be surprised if it's returning with an error because you're passing an invalid set of events to wait for...
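
For example, the kind of sanity check I have in mind is just the generic pattern from the poll(2) manpage -- a rough sketch, nothing specific to the nvidia device, and the helper name is obviously just for illustration:

Code:
/* Rough sketch: open the device first, ask only for legal event bits,
 * and actually report whatever error poll() hands back. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int wait_on_device(const char *path)
{
    struct pollfd pfd;
    int fd, ret;

    fd = open(path, O_RDONLY);      /* open FIRST, so the fd is valid */
    if (fd < 0) {
        perror("open");
        return -1;
    }

    pfd.fd      = fd;
    pfd.events  = POLLIN | POLLPRI; /* legal bits, not 0xffff */
    pfd.revents = 0;

    ret = poll(&pfd, 1, -1);
    if (ret < 0)
        perror("poll");             /* EBADF, EINVAL, whatever it is */
    else
        printf("poll returned %d, revents = 0x%x\n",
               ret, (unsigned)pfd.revents);

    close(fd);
    return ret;
}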

Even if you can get these issues resolved and the poll() syscall does then work, there's no guarantee that it will in the future. As far as I (as a non-nVidia person) am concerned, the nvidia* device file interface is and should be completely opaque; in other words, there's nothing stopping nVidia from changing this behavior at will and then changing their OpenGL driver (libGL.so.<version> and the matching gl* and glX* functions) to match, in the future. A better solution than poll()ing the device file would be using the OpenGL library to wait for vsync for you, by using a flag or a shell variable on the OpenGL function call that does the displaying (like glXSwapBuffers()'s default behavior with no special flags or shell setup, for example). That way, you would always be waiting for retrace (unless the __GL_SYNC_TO_VBLANK shell environment variable isn't set; take a look at the nVidia README file, in the section regarding the SYNC_TO_VBLANK shell variable for more info).

In other words, to make a long post slightly shorter, I think you should be enabling vblank by using the __GL_SYNC_TO_VBLANK shell variable (and perhaps a setenv() in your program, if you absolutely can't live with the user's setting and have to make it override the user's preferences -- generally not a good idea in the land of Unix, but it would work), and the glXSwapBuffers function. If you are doing both of those things, and it isn't working, only then would I start to report bugs, as millions of other people (or so) are using these drivers and those environment variables, and aren't complaining. It goes along with part of ESR's "How to Ask Questions" paper -- it is possible that you're the only one that's using the drivers and env. variables that way, and therefore the only person that's seeing these problems, but it's about a thousand to one chance. So yeah -- I'm asking if this is the way you're doing it or not.
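
If you do go the setenv() route, the shape of it is something like this -- a sketch only; I'm assuming the variable has to be set before libGL reads it, i.e. before you create your GLX context:

Code:
/* Sketch: force sync-to-vblank from inside the program.  Set the
 * variable before creating the GLX context, since that is presumably
 * when the driver reads it.  Overriding the user's environment is
 * generally bad manners, as noted above. */
#include <stdlib.h>

void force_vblank_sync(void)
{
    setenv("__GL_SYNC_TO_VBLANK", "1", 1 /* overwrite */);

    /* ...now create the X window and GLX context, and end each
     * frame with glXSwapBuffers() as usual... */
}
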
Last edited by bwkaz; 11-09-02 at 02:57 PM.
Old 11-09-02, 03:25 PM   #3
ScoobyDoo
Registered User
Join Date: Nov 2002
Posts: 11

Hi, thanks for your thoughts

Yes, in order to get vertical sync going I am using the:

__GL_SYNC_TO_VBLANK

environment variable.

This does as it says, just in an unacceptably inefficient fashion!

It certainly does make calls to:

glXSwapBuffers()

wait for the vertical retrace, but while it is waiting, the CPU is at 100%. In the Windows drivers, the CPU would be at 0%.

Does this make sense? Same functionality, but one uses 100% CPU and one doesn't. This is probably OK if you are running a game (where you generally don't care if other tasks get access to the CPU), but for a desktop application it is not good.

You are quite right about the poll() code - this was only ever a hack (albeit one that had been reported to work at one point on Linux) which I thought might trigger some people's memories of this sort of thing.

What I would love is just for the code which is behind the:

__GL_SYNC_TO_VBLANK

to actually work like its Windows counterpart - not as a hack, but as a reliable feature.

To see if it really is just me (!) please do compile up a simple OpenGL test program, and make sure it sits in a tight loop, with only glXSwapBuffers() to slow it down (no sleep() etc.).

Run this with vertical sync enabled and you should see that on Linux/FreeBSD it consumes 100% of the CPU whilst running. On Windows, it will consume only a few % (depending on your setup) to do the same task.
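
A test that small is something like this (a rough sketch with no error checking; build with something like "cc test.c -o test -lGL -lX11"):

Code:
/* Minimal GLX test: clear and swap in a tight loop, nothing else.
 * Rough sketch, no error checking.  Run it with __GL_SYNC_TO_VBLANK
 * set and watch the CPU usage in top/ps. */
#include <GL/gl.h>
#include <GL/glx.h>
#include <X11/Xlib.h>

int main(void)
{
    Display *dpy;
    XVisualInfo *vi;
    XSetWindowAttributes swa;
    Window win;
    GLXContext ctx;
    int attribs[] = { GLX_RGBA, GLX_DOUBLEBUFFER, None };

    dpy = XOpenDisplay(NULL);
    vi  = glXChooseVisual(dpy, DefaultScreen(dpy), attribs);

    swa.colormap   = XCreateColormap(dpy, RootWindow(dpy, vi->screen),
                                     vi->visual, AllocNone);
    swa.event_mask = ExposureMask;

    win = XCreateWindow(dpy, RootWindow(dpy, vi->screen), 0, 0, 300, 300,
                        0, vi->depth, InputOutput, vi->visual,
                        CWColormap | CWEventMask, &swa);
    XMapWindow(dpy, win);

    ctx = glXCreateContext(dpy, vi, NULL, GL_TRUE);
    glXMakeCurrent(dpy, win, ctx);

    for (;;) {
        glClear(GL_COLOR_BUFFER_BIT);
        glXSwapBuffers(dpy, win);   /* the only thing slowing the loop down */
    }

    return 0;
}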

I think the problem is, most people use OpenGL for games, and so are used to their CPU running at 100%. Games in general suck as much CPU power as possible.

Jamie Burns.
Old 11-09-02, 05:13 PM   #4
bwkaz
Registered User
 
Join Date: Sep 2002
Posts: 2,262

When I run glxgears with the VBLANK variable set, yes, it does go at 60Hz (which is my monitor refresh), but it doesn't use 100% of the CPU. It's closer to 80%, and the Distributed Folding client that I'm running in the background at the same time gets 80% as well (according to ps, which should be a one-shot deal, right? I'm wondering how it can add up to >100%... either way, though, they're getting equal CPU).

Have you tried the changes I was suggesting for the poll() alternative?
Old 11-09-02, 07:13 PM   #5
ScoobyDoo
Registered User
Join Date: Nov 2002
Posts: 11

Hi again,

Thanks again for your thoughts, I do appreciate it.

=============================================

"either way, though, they're getting equal CPU"

But even this statement shows the futility of the driver implementation. The point is glxgears should not be consuming equal processor time! It should be consuming hardly any! When you run ps you don't expect every process listed to consume the same processor time as each other - you expect a handful of processes to be using *some*, with possibly a couple of processes really going for it (with good reason). The reason for this is that processes "sleep" when they don't need the CPU.

The glxgears process, when it is waiting for a vsync interrupt, does not need the CPU at all, so it should sleep until the interrupt comes! When the scheduler offers glxgears some CPU time, instead of being greedy and grabbing as much as it can, it should say "no thank you, I don't need any right now".

Instead of doing this (like any nice Unix program should), it sits in a "busy-loop" which I can only imagine is something like:

while (1) {
    if (vsync == true) break;
}

And this just hammers the CPU, grabbing as much time as it can get its greedy hands on.

It almost reminds me of the way people used to code for DOS in the old days, when you didn't care about using up CPU time because no other programs would be running anyway. This is 2002! We use multitasking operating systems now!

As I have said, the Windows drivers do not seem to do this. They correctly sleep until the interrupt arrives.

=============================================

Here is some of the output from my box, running glxgears with vertical sync enabled (correctly runs at 60fps):

USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 150 93.7 2.2 76564 7212 p0 R+ 1:45AM 1:37.48 /usr/X11R6/bin/glxgears
root 97 0.0 0.3 1264 960 v0 Is 1:42AM 0:00.03 login -p root
root 98 0.0 0.2 1324 792 v0 I 1:42AM 0:00.03 -csh (csh)
root 101 0.0 0.4 1512 1244 v0 I 1:42AM 0:00.01 bash
root 102 0.0 0.1 636 444 v0 I+ 1:42AM 0:00.01 /bin/sh /usr/X11R6/bin/startx
root 112 0.0 0.3 1748 1108 v0 I+ 1:42AM 0:00.01 xinit /root/.xinitrc -- -nolisten tcp
root 118 0.0 1.0 4756 3268 v0 S 1:43AM 0:00.74 /usr/X11R6/bin/wmaker
root 119 0.0 0.9 3992 2984 v0 S 1:43AM 0:00.12 xterm

So, we can see that even with vertical syncing turned on, the CPU usage is > 90% (not quite my 100%, but close enough to grind a Unix multiuser system to a halt).

I added a small program that just spins in a loop. In a real life situation this could be an mp3 encoder, video player, or whatever. I reniced glxgears to -10 to give it priority.

The following shows how such a processor intensive task co-operates with an OpenGL application using Nvidia drivers under Linux or FreeBSD:

USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 150 71.8 2.2 76564 7212 p0 R<+ 1:45AM 5:02.06 /usr/X11R6/bin/glxgears
root 200 28.9 0.2 1256 732 p1 R+ 1:50AM 0:06.08 ./a.out
root 97 0.0 0.3 1264 960 v0 Is 1:42AM 0:00.03 login -p root
root 98 0.0 0.2 1324 792 v0 I 1:42AM 0:00.03 -csh (csh)
root 101 0.0 0.4 1512 1244 v0 I 1:42AM 0:00.01 bash
root 102 0.0 0.1 636 444 v0 I+ 1:42AM 0:00.01 /bin/sh /usr/X11R6/bin/startx
root 112 0.0 0.3 1748 1108 v0 I+ 1:42AM 0:00.01 xinit /root/.xinitrc -- -nolisten tcp
root 118 0.0 1.0 4756 3268 v0 S 1:43AM 0:01.13 /usr/X11R6/bin/wmaker
root 119 0.0 0.9 3992 2984 v0 S 1:43AM 0:00.14 xterm


So, here we see that a processor-intensive task (a.out) can only grab about 30% of the CPU time. Also note that glxgears dropped down to about 40 FPS. The way it should be is that the processor-intensive task gets more than 90% and glxgears gets less than 10% (yet runs at full speed - it is hardly doing anything).

Again, why do we get the worse scenario above, and not the better one (like Windows offers)?

The glxgears application should be "waiting" inside glXSwapBuffers() for most of its life doing *nothing* (as it would on Windows). Instead it spends most of its life inside glXSwapBuffers() grabbing as much precious CPU time as it can get its greedy mitts on, dragging the system to a halt with it.

If I hadn't reniced glxgears, the FPS would have shot down to about 20-30 FPS.

This situation is not good for multiuser/multitasking Operating Systems! It goes completely against some of the design philosophy of Unix.

This means that any time an OpenGL application that needs video timing is run with the Nvidia drivers on a multiuser/multitasking system (Linux/FreeBSD), no other processes have any hope of running concurrently at speed! Yet on Windows they will run fine!

=============================================

Can you see how this is a bad thing for any Unix-like OS?

The crux of the matter is I am trying to write an OpenGL desktop a bit like what Apple has done with Aqua/Jaguar. You cannot have the GUI taking up 90% of the system resources, leaving only 10% for all other applications!

I actually started to write my code on Windows and was very happy with the vertical sync performance. The CPU usage was < 3%, which is about what I would expect, especially pre-optimisation. When I ported it to Linux the CPU usage was > 90% no matter how much I tried to optimise it.

Somebody help!!

Don't make me go and spend 3000 on a really slow Macintosh and send FreeBSD/Linux to the dustbin!!

Jamie.

ps.

With regard to the poll, it does not return < 0 in my test program. It simply returns immediately, without waiting. I am not sure which events to look for; this was just a piece of code a guy said he used to use a year or so ago, but it stopped working when he updated his drivers/kernel.

Last edited by ScoobyDoo; 11-09-02 at 07:39 PM.
Old 11-09-02, 10:52 PM   #6
bwkaz
Registered User
 
Join Date: Sep 2002
Posts: 2,262

OK. First, let me say that Windows does a lot of crap behind users' backs. I wouldn't be surprised if nVidia cards didn't actually raise an interrupt when a monitor vblank happened. It wouldn't surprise me if Windows kept track of the vblank itself, and made known a way (possibly not even a true, exported Windows API call, but something that's guaranteed not to change, and made known only to driver developers) to set up a callback to track it. A lot like an interrupt, but entirely on the software side. That would make writing the drivers that way infinitely easier.

Saying that the Windows drivers can do something that the Linux drivers can't doesn't necessarily mean that it's a bug in anything; that's all I'm saying. Go ahead and ask the nVidia people to fix it, but don't be surprised if it isn't possible. Don't be surprised if it is possible, but never happens. My opinion of all this is that you are one program among at least thousands. If it's really hard to do, but you're the only person asking for it, I wouldn't be surprised if it doesn't get done. Even worse, if adding that feature must break (for some arcane reason that none of us can see, because the drivers are still closed-source) the other 99.9% of programs out there, then I really wouldn't be surprised to never see it happen. Not that I'm saying this is the case (again, I don't know any more about the hardware or the closed-source part of the driver than you do), but it is a possibility, however remote.

Since you're writing a program that acts one heck of a lot like a window manager, why not just look at what other people have already done with window managers that look like Aqua? There are quite a few Aqua look-alike skins for normal window managers... and if you want it to act like Aqua too, why do you NEED OpenGL? Why not just use the X RENDER extension, which accelerates 2D graphics in X11? I wouldn't think it'd be too hard to do most of what I've seen Aqua do with just 2D graphics anyway... but hey, I'll admit that I haven't seen it do much at all.

Quote:
With regard to the poll, it does not return < 0 on my test program. It simply returns immediately, without waiting.
OK, very well. But I was seeing at least two different things that could have been wrong with the program, that could have caused it to return an error, and no indication whatsoever that you had ever checked for any of those errors. I assume that you have tried changing the order of open() and the setup of the poll() struct's fd member around, with no change? I also assume you've tried reading the man page for poll, and tried changing the event member of the struct to be at least legal (here's a hint that should be blindingly obvious from that man page: the legal value for what you appear to be trying to do is 7, for POLLIN, POLLPRI, and POLLOUT all set), with no change? And you aren't saying how you determine the function to be coming back immediately -- how are you doing that?

Once you at least make this poll() stuff into valid C syscall requests that still don't do what the guy said they used to, then OK, look into asking for the feature (and yes, "asking for a feature" is all that you are doing; you are not reporting a bug, you are not reporting a "broken implementation", nothing as melodramatic as all that -- a difference between the Windows and Linux drivers is not anything like a difference between the documented Linux behavior and the actual Linux behavior). But frankly, I want to see that you've tried making it work yourself first. I see that you've tried glXSwapBuffers(), OK, very well. But bringing up this hack, having it be completely wrong from at least two different angles, and then saying that it doesn't work -- well, that makes perfect sense: by all rights that code shouldn't ever have worked in the first place! See if fixing it helps.
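
To be concrete, the version I'd want to see tested is something like this -- still only a sketch, because whether /dev/nvidia0 reports any of these events at all is exactly the open question -- with the call timed so there's no guesswork about whether it blocked:

Code:
/* The hack with the two fixes applied: open() before filling in the
 * struct, and legal event bits (the "7" -- POLLIN|POLLPRI|POLLOUT)
 * instead of 0xffff.  The call is timed so you can see whether it
 * blocked for ~16667 us (one 60 Hz frame) or came straight back. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfd;
    struct timeval before, after;
    long usec;
    int fd, ret;

    fd = open("/dev/nvidia0", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    pfd.fd      = fd;
    pfd.events  = POLLIN | POLLPRI | POLLOUT;
    pfd.revents = 0;

    gettimeofday(&before, NULL);
    ret = poll(&pfd, 1, -1);
    gettimeofday(&after, NULL);

    if (ret < 0) {
        perror("poll");
        return 1;
    }

    usec = (after.tv_sec - before.tv_sec) * 1000000L
         + (after.tv_usec - before.tv_usec);
    printf("poll: ret=%d revents=0x%x, blocked for %ld us\n",
           ret, (unsigned)pfd.revents, usec);

    close(fd);
    return 0;
}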

Just as one more little thing:

Quote:
If I hadnt have reniced glxgears, the FPS would have shot down to about 20-30FPS.
As it should have! You are running two different processes that are both trying to take all of the CPU -- one supposedly legitimately, one supposedly not -- both at the same priority. Is it any wonder that they split the CPU evenly? Is it any wonder that, of the vblanks that actually occur, you miss half of them? Which drops your framerate down to 30fps, or even 20. See, when you're syncing to vblank, the framerate must be a divisor of your refresh rate -- 60fps, 30fps, 20fps, 15fps, 12fps, 10fps, 8.57142fps, 7.5fps, and so on down. Basically, the framerate must be, at all times, 60fps divided by some integer. If you just miss one vertical refresh, you will stall until the next, which drops the framerate down one more step. Assuming the unthinkable (or at least the very, very unlikely) happens, and you miss 60 times in a row, you'll get 1fps for that frame.

--------

Then, last of all before I just hit submit and say screw it (I've been typing and editing for waaaay too long), a couple of things I'd think about trying in your program:

Do the waiting yourself. You're writing a window manager for X, basically, so you should have access to the display's refresh rate. So make it a habit on every frame to call usleep() or a similar function for the majority of the refresh period, then call glXSwapBuffers() with vsync enabled.
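
In very rough form, assuming a 60 Hz refresh and assuming usleep() actually wakes you up in time (which is the weak point of the whole idea), that's just:

Code:
/* Very rough sketch of "do the waiting yourself": draw, sleep away most
 * of the ~16.7 ms frame, and only then call the (vsynced) swap, so the
 * wait inside it burns a couple of ms instead of the whole frame. */
#include <unistd.h>
#include <GL/glx.h>

void draw_one_frame(Display *dpy, GLXDrawable win)
{
    /* ...glClear(), drawing calls, etc. go here... */

    usleep(12000);              /* sleep ~12 ms of the ~16.7 ms frame */
    glXSwapBuffers(dpy, win);   /* the vsynced swap finishes the frame */
}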

Or, use a BogoMIPS-style calibration loop in your program. Fork your execution stream (but don't use fork() itself -- use clone() or pthread_create() or something like that), do all your GL rendering in one thread, and set a global (but mutexed) variable at the end. In the other thread, keep checking that global variable (under the mutex) and increment a counter in the meantime; when the variable is set, join() with the rendering thread. From the counter, figure out how much of the vblank period you haven't used, usleep() for a little less than that, and then call glXSwapBuffers().

Something like this pseudocode:

Code:
// initialization: calibrate how many counter increments fit in one frame

maxcounter = 0;

do {
    maxcounter++;
} until(exactly one second has passed since maxcounter was set to zero);

maxcounter /= get_refresh_rate();   // increments per refresh period

// in your rendering loop:

counter = 0;
clone();

if(I'm the child thread) {
    do all rendering;

    mutex_get(mtx);
        global_var = 1;
    mutex_release(mtx);
}
else {    // parent: count while the child renders
    do {
        mutex_get(mtx);
            local = global_var;
        mutex_release(mtx);

        counter++;
    } until(local == 1);

    usleep(a number of microseconds slightly less than the time that
           (maxcounter - counter) more counter++'s would have taken);

    join(child_thread);   // just to clean up after it; it's already done

    glXSwapBuffers(whatever is needed here);
}
Last edited by bwkaz; 11-09-02 at 10:58 PM.
Old 11-10-02, 07:50 AM   #7
ScoobyDoo
Registered User
Join Date: Nov 2002
Posts: 11

Well, yes, Windows does a lot of crap, but at the very least it can run my app smooth as silk using < 3% CPU. The proof of the pudding is in the eating.

Well, I don't want to get too much into why I need OpenGL, but it is the technology of choice as far as I am concerned. Modern windowing systems make use of what Apple would call a compositing layer. I have read that Apple expects Microsoft to use a similar approach (DirectX, no doubt) for their windowing system over the next five years. As usual, Linux/FreeBSD, and X11 for that matter, will be way behind, because most of their coders are kernel hackers and can't grasp these concepts, or the need for certain driver features.

I pasted the poll() code out of order. In my application I open /dev/nvidia0 in a separate function, and I pasted it in the wrong place in the original post. The code I have did once work - reread my earlier posts; I mentioned it several times. In fact I had a reply from Nvidia today and they said:

Quote:
Yes, at one point the driver did poll rather than busy wait, though there were various problems with this (there were some scheduling issues, if I remember correctly). Nvidia.
So this is not a problem with the poll code I was using.

The guy from Nvidia didn't explicitly say what the scheduling problem was, but I heard elsewhere that it could often miss vsyncs, as the Linux scheduler could not get back to the application in time (in less than 1/100th of a second).

Quote:
As it should have!
No no no. It shouldn't! This is my whole point. It should still run at full speed with only a couple of % of CPU time. This isn't a difficult concept to grasp now, is it? If the drivers were perfect, glxgears should have used 3% of the CPU with that nice value, the a.out program should have had 97%, and glxgears should have had priority. Anything else shows the margin of error (about 90% wasted CPU time).

Any option using usleep() is prone to problems, as this call is very much subject to the Linux scheduler. The average latency of the Linux kernel scheduler is 10,000 microseconds, but it can be as high as 100,000 microseconds under load. So if you call usleep() there is a good chance the scheduler never gets back to you in time (60 Hz == 16,667 microseconds) and you will miss the vsync (this is why Nvidia didn't slip a usleep() into their glXSwapBuffers() function themselves).

I was encouraged by Nvidia's response, however! And I am thankful they took the time to read and reply to my email (the first portion of this thread). They replied:

Quote:
Consider your request received; we'll investigate better solutions
for a future release.
Regards,

Jamie.
Old 11-10-02, 09:51 AM   #8
bwkaz
Registered User
 
Join Date: Sep 2002
Posts: 2,262

Quote:
Consider your request received; we'll investigate better solutions for a future release.
Well, OK, at least they're thinking about it.

Maybe setitimer() would be another possibility, rather than usleep()? (and actually, re-reading the usleep() manpage, I see that whoever wrote it is saying to use nanosleep() instead. I wonder if there are the same scheduling issues with that...)
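
For reference, the nanosleep() form of the same idea would be something like this (a sketch; whether it gets around the scheduler granularity you mentioned is exactly the question):

Code:
/* nanosleep() variant -- sketch only.  Whether it avoids the ~10 ms
 * scheduler granularity that plagues usleep() is the open question. */
#include <time.h>

void sleep_most_of_a_frame(void)
{
    struct timespec ts;

    ts.tv_sec  = 0;
    ts.tv_nsec = 12 * 1000 * 1000;   /* ~12 ms of a ~16.7 ms (60 Hz) frame */
    nanosleep(&ts, NULL);
}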

Good luck in any case...

Old 07-26-03, 02:49 PM   #9
dlyne
Registered User
 
Join Date: Jul 2003
Posts: 4

I tried the poll() code out (given in the first posting in this thread).

It pretty much seems to work. However, I get slight tearing at the top of the screen. It seems to be pretty smooth and uses about 95% less CPU time, though.
Old 01-16-07, 10:00 AM   #10
bram
Registered User
 
Join Date: Nov 2003
Posts: 17
Re: Linux/FreeBSD VBLANK (vertical sync) Implementation

Was support for this dropped? This no longer works with newer drivers, like 87.76 and 97.46.

Bram
Old 01-16-07, 11:52 AM   #11
AaronP
NVIDIA Corporation
Join Date: Mar 2005
Posts: 2,487
Re: Linux/FreeBSD VBLANK (vertical sync) Implementation

This thread is ancient. __GL_SYNC_TO_VBLANK=1 glxgears on my machine uses roughly 2% CPU time on my Athlon 64 3000+. Waiting for vblank on the CPU is a broken design anyway, so I'm going to close this thread. If you're having trouble with sync to vblank, please start a new thread.