Old 11-12-09, 05:02 PM   #1
uau
Registered User
 
Join Date: Sep 2009
Posts: 45
Default VDPAU API and implementation issues

Below are some problems I've come across while implementing a better VDPAU driver for MPlayer than the svn version has. General API issues are listed first, then implementation problems.



Functions to get vsync interval and time

As discussed previously in the MPlayer thread, since the VDPAU implementation refuses to switch frames more than once per display refresh there should be a method to get the current display refresh interval. Otherwise applications can't be sure they're not trying to queue frames more rapidly than VDPAU is willing to show them. Additionally, once you're doing vsync-aware timing, it would also be useful to have a function to directly get the timestamp of a recent vsync (to calculate approximate times of future vsyncs as this plus a multiple of the interval). Once some frames have been displayed you can get vsync times by querying their status, but it would be nicer to have a correct value from the start.
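To make the request concrete, here is a minimal sketch of the arithmetic an application could do if it had both values. It is written in Python for brevity and is not VDPAU code; the function names are hypothetical, and the timestamps are assumed to be integer nanoseconds on the presentation queue clock.

```python
def next_vsync_at_or_after(t_ns, last_vsync_ns, interval_ns):
    """Estimate the first vsync at or after t_ns from one known vsync.

    Times before the known vsync are clamped to that vsync.
    """
    if t_ns <= last_vsync_ns:
        return last_vsync_ns
    # Whole refresh intervals needed to reach t_ns, rounded up (integer math).
    k = -(-(t_ns - last_vsync_ns) // interval_ns)
    return last_vsync_ns + k * interval_ns


def all_on_distinct_vsyncs(timestamps_ns, last_vsync_ns, interval_ns):
    """True if every queued timestamp maps to a different display refresh."""
    shown = [next_vsync_at_or_after(t, last_vsync_ns, interval_ns)
             for t in timestamps_ns]
    return len(set(shown)) == len(shown)
```

With these two helpers an application can check, before queuing, that it is not asking for two frame switches within the same refresh.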


BlockUntilSurfaceIdle: no non-blocking way to wait for event

In general it's not nice that the only way to wait for an event is blocking and can't be integrated in an event loop. This hasn't been a practical problem in my use yet as I haven't queued frames far ahead, but I think it would be more of an issue with multiple queued frames. This could be implemented as an fd that becomes readable when status changes, either with a message about the change or just a dummy byte that means you should recheck the status of the surface(s) you're interested in.


Bad documentation of RenderOutputSurface blend_state parameter

The documentation says "The blend math is the familiar OpenGL blend math: dst.a = equation(blendFactorDstAlpha * dst.a, blendFactorSrcAlpha * src.a)". This should have more details or at least a pointer to the OpenGL documentation - having to go search for the details elsewhere when working on VDPAU is annoying. What the current documentation does say is also wrong: the MIN and MAX equations ignore the blend factors. I at least did not remember that from OpenGL and wondered why my code didn't work until I looked up the applicable OpenGL documentation and checked the details there.
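For reference, a minimal sketch of the per-component OpenGL blend equations as I understand them, in Python rather than VDPAU code. The equation labels mirror the suffixes of the VDPAU blend equation names; the function name and the clamping to [0, 1] (which OpenGL applies for fixed-point buffers) are assumptions of this sketch.

```python
def blend_component(equation, src, dst, src_factor, dst_factor):
    """Blend one colour or alpha component the way OpenGL defines it."""
    if equation == "ADD":
        result = src_factor * src + dst_factor * dst
    elif equation == "SUBTRACT":
        result = src_factor * src - dst_factor * dst
    elif equation == "REVERSE_SUBTRACT":
        result = dst_factor * dst - src_factor * src
    elif equation == "MIN":
        # The blend factors are ignored for MIN and MAX.
        result = min(src, dst)
    elif equation == "MAX":
        result = max(src, dst)
    else:
        raise ValueError("unknown blend equation: %r" % (equation,))
    # Clamp to the representable range of a fixed-point surface.
    return max(0.0, min(1.0, result))
```

Note that passing zero factors still yields a non-zero result for MIN and MAX, which is the behavior the current documentation formula gets wrong.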



The following issues are more implementation-related. I used a 9500GT with 185.18.36 drivers; I haven't yet tested whether any of them have been fixed in the latest drivers.

Trying to queue more than 8 frames for display seems to block. If there is meant to be a limit lower than the number of surfaces you can allocate then this should be documented.

Queuing up to 8 frames worked if doing nothing else, but trying to do other operations like upload video surfaces while having two or more unshown surfaces queued for display caused the driver to use a lot of CPU (could have been a busyloop until there was only one yet undisplayed surface). This at least looks like a clear implementation problem.

The BlockUntilSurfaceIdle documentation says it will block indefinitely if queried about the most recent surface added to the queue (presumably until a surface is added from another thread, if ever). However in my tests it seems to return with an error, with the corresponding error message being "A catch-all error, used when no other error code applies.".

The VSYNC timestamps returned by QuerySurfaceStatus and BlockUntilSurfaceIdle are quite accurate when using overlay, but when overlay is disabled the accuracy drops a lot. In a test I saw the delta between timestamps returned for frames shown on consecutive display refreshes vary from 11055232 ns to 12650112 ns on an 85 Hz display. Other activity in X seems to increase the variation. This instability confused the algorithm I used to estimate VDPAU display FPS (which worked fine with overlay).
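One way to make such an estimate less sensitive to outliers is to take the median of the deltas instead of trusting any single one. This is a sketch only, not the algorithm MPlayer uses; the function names are hypothetical, and it assumes nanosecond timestamps taken from frames shown on consecutive refreshes.

```python
import statistics


def estimate_refresh_interval_ns(vsync_times_ns):
    """Median delta between timestamps of consecutive display refreshes."""
    deltas = [b - a for a, b in zip(vsync_times_ns, vsync_times_ns[1:])]
    if not deltas:
        raise ValueError("need at least two timestamps")
    return statistics.median(deltas)


def interval_to_hz(interval_ns):
    """Convert a refresh interval in nanoseconds to a rate in Hz."""
    return 1e9 / interval_ns
```

A median only helps when most samples are close to the true interval; if frames are sometimes shown several refreshes apart, the deltas would first have to be divided by the number of refreshes they span.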

I was unable to see ANY difference between bob deinterlacing and the higher modes. The higher modes were slower though so it did look like they were at least enabled properly. I also tested this with plain svn MPlayer to check I hadn't broken anything, but didn't see differences with that either.
Old 11-13-09, 03:52 AM   #2
cehoyos
FFmpeg developer
 
Join Date: Jan 2009
Location: Vienna, Austria
Posts: 467
Default Re: VDPAU API and implementation issues

Quote:
Originally Posted by uau View Post
I was unable to see ANY difference between bob deinterlacing and the higher modes.
Differences between Bob and temporal de-interlacing are clearly visible with different drivers (including 190.42) and different graphics cards (feature set B, C and D) on latest MPlayer svn.
Differences between temporal de-interlacing and temporal-spatial de-interlacing are more difficult to see;-)

Carl Eugen
Old 11-13-09, 11:16 AM   #3
Stephen Warren
Moderator
 
Join Date: Aug 2005
Posts: 1,327
Default Re: VDPAU API and implementation issues

Functions to get vsync interval and time:

Philosophically, the presentation queue is intended to be used in a "feedback mode" rather than a pre-calculated mode. I think that's where most of our disconnect is.

In other words, we intended applications to queue up N frames at the start of presenting a stream, then monitor the actual display times of those frames as they get displayed/idled in the presentation queue, then make decisions regarding whether to skip the display of future frames based on whether the display of previous frames lagged at all.

It sounds like you're attempting to use a model where you calculate/predict when a future VSYNC will occur, and adjust future scheduling of frames in terms of that.

For the issue you mention of knowing which frequency of frames to queue up when pre-loading the presentation queue when beginning to present a stream, I think that XF86VidMode should allow a close enough approximation that there will be no issue.
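As a rough sketch of that approximation: XF86VidModeGetModeLine reports a dot clock in kHz plus the horizontal and vertical totals of the current mode, and the refresh rate follows from those. The Python below only shows the arithmetic; the function names are hypothetical, interlace and doublescan flags are ignored, and the figures in the usage note are the standard VESA 1024x768 @ 85 Hz timings, used purely as an illustration.

```python
def refresh_rate_hz(dotclock_khz, htotal, vtotal):
    """Refresh rate of a mode: pixels per second / pixels per frame."""
    return dotclock_khz * 1000.0 / (htotal * vtotal)


def refresh_interval_ns(dotclock_khz, htotal, vtotal):
    """Approximate time between vsyncs in nanoseconds."""
    return round(1e9 / refresh_rate_hz(dotclock_khz, htotal, vtotal))
```

For example, refresh_rate_hz(94500, 1376, 808) comes out just under 85 Hz, which is close enough to choose the spacing of the first few queued frames.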

Unfortunately, I'm not sure if it would be possible to implement a "tell me when the/a most recent VSYNC occurred" API. I'll file a feature request to investigate this, although I certainly am not committing to implementing it. You may be able to simulate this by presenting a dummy surface and querying when it gets presented.

One other thing to note: In the NVIDIA implementation, the clock used to scan out pixels is not locked to the presentation queue timestamp clock; they may slowly drift (hopefully very slowly). This is another argument to use a feedback mode of operation, based on actual rather than pre-calculated VSYNC times.


BlockUntilSurfaceIdle: no non-blocking way to wait for event:

We envisaged applications that needed this functionality would use multiple threads. The thread that blocks inside BlockUntilSurfaceIdle could itself signal back to the main thread using a pipe/select mechanism. Would that work for you?
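A minimal sketch of that pattern, in Python for brevity and for Unix-like systems only. The blocking call is passed in as a parameter, and fake_block_until_idle is a stand-in that just sleeps; neither is a real VDPAU binding, and all names here are hypothetical.

```python
import os
import select
import threading
import time


def start_waiter(blocking_wait, surface_id, write_fd):
    """Run the blocking wait in a thread, then signal through the pipe."""
    def run():
        blocking_wait(surface_id)  # stands in for BlockUntilSurfaceIdle
        os.write(write_fd, bytes([surface_id & 0xFF]))
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def fake_block_until_idle(surface_id):
    """Placeholder for the real blocking call; sleeps for 50 ms."""
    time.sleep(0.05)


def wait_for_idle(read_fd, timeout_s):
    """Event-loop side: return the signalled surface id, or None on timeout."""
    readable, _, _ = select.select([read_fd], [], [], timeout_s)
    if not readable:
        return None
    return os.read(read_fd, 1)[0]
```

The read end of the pipe can be added to whatever fd set the main loop already polls, so the main thread never blocks inside the driver.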



Inability to queue more than a couple surfaces into the presentation queue while performing other operations:

Did this occur in both the overlay- and blit-based presentation queues? To investigate this, it'd be easiest if we could reproduce the issue using your code. Can you provide an application that reproduces this? Thanks.


Presentation timestamps jitter:

Unfortunately, the timestamps will jitter a lot more in the blit-based presentation queue. Yes, this may be affected by CPU/GPU load. It's unlikely this will change in the near term, or possibly even long term.


De-interlacing algorithm performance:

The most obvious difference between bob and better algorithms should be increased vertical resolution in the output. In some cases, whether this is noticeable will depend on the exact image being displayed.


Various documentation issues:

I'll add a few more notes to vdpau.h that should help.
Old 11-13-09, 01:32 PM   #4
uau
Registered User
 
Join Date: Sep 2009
Posts: 45
Default Re: VDPAU API and implementation issues

Quote:
Originally Posted by Stephen Warren View Post
It sounds like you're attempting to use a model where you calculate/predict when a future VSYNC will occur, and adjust future scheduling of frames in terms of that.
Yes that's what I'm doing, and I think it does allow a better end result. If you implement a frame rate limiting mechanism then you may as well synchronize it with the real display updates to achieve the most accurate possible results. In addition to lagging caused by display frame rate limits, there's also another kind of undesirable behavior that can be fixed by explicit synchronization with display refreshes, namely jitter caused by times randomly falling a bit before or a bit after a vsync boundary. This occurs when for example playing 24 FPS content on a 72 Hz display. Ideally you'd show each frame for 3 display refreshes, but if the queued times happen to be near vsync boundaries and VDPAU shows each frame at the next vsync after the timestamp it was queued with then "random" variation in timing can cause alternating 2- and 4- refresh frames. With vsync-aware timing this can be fixed by adjusting the timestamps to avoid unwanted changes from just-after-vsync to just-before-vsync or vice versa. In principle VDPAU could do similar adjustments internally by taking into account previously queued timestamps, but I'm not sure whether such "smart" adjustments would really be appropriate at this level.
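A minimal sketch of the kind of adjustment meant here, in Python rather than the actual MPlayer code. It assumes integer nanosecond timestamps, a known recent vsync time and refresh interval, and that a frame is shown at the first vsync at or after its queued timestamp; the function names are hypothetical.

```python
def shown_at_vsync_index(timestamp_ns, last_vsync_ns, interval_ns):
    """Index of the first vsync at or after the timestamp (assumed model)."""
    return -(-(timestamp_ns - last_vsync_ns) // interval_ns)


def snap_to_mid_interval(target_ns, last_vsync_ns, interval_ns):
    """Move a timestamp to the middle of the interval before its vsync.

    The intended vsync is the one nearest to target_ns. Queuing half an
    interval before it keeps small timing noise from crossing a boundary.
    """
    half = interval_ns // 2
    k = (target_ns - last_vsync_ns + half) // interval_ns
    return last_vsync_ns + k * interval_ns - half
```

With an interval of 1000 and targets of 2995, 6005, 8995 and 12005 (24 FPS content on a 72 Hz display with slight noise), the unadjusted timestamps land on vsyncs 3, 7, 9 and 13, while the snapped ones land on 3, 6, 9 and 12.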

Quote:
Unfortunately, I'm not sure if it would be possible to implement a "tell me when the/a most recent VSYNC occurred" API. I'll file a feature request to investigate this, although I certainly am not committing to implementing it. You may be able to simulate this by presenting a dummy surface and querying when it gets presented.
This is not a major issue; any inaccuracies caused by not having the correct timing from the start are unlikely to be really visible. Just nice to have.

Quote:
One other thing to note: In the NVIDIA implementation, the clock used to scan out pixels is not locked to the presentation queue timestamp clock; they may slowly drift (hopefully very slowly). This is another argument to use a feedback mode of operation, based on actual rather than pre-calculated VSYNC times.
If by "drift" you only mean that intervals between VSYNCs are not necessarily a constant multiple of some queue timestamp amount then that's not a problem. I don't extrapolate timestamps arbitrarily far into the future from a single VSYNC time, but rather always use the latest available VSYNC time from the surface status functions as the base. So as long as there's no significant deviation in the interval from the last queried surface to the next frame being processed to be queued there's no problem.

Quote:
BlockUntilSurfaceIdle: no non-blocking way to wait for event:

We envisaged applications that needed this functionality would use multiple threads. The thread that blocks inside BlockUntilSurfaceIdle could itself signal back to the main thread using a pipe/select mechanism. Would that work for you?
Yes it should work, though adding thread management does make things clumsier.

Quote:
Inability to queue more than a couple surfaces into the presentation queue while performing other operations:

Did this occur in both the overlay- and blit-based presentation queues?
I initially noticed it with overlay enabled, but IIRC I did test that disabling overlay made no difference.
Quote:
To investigate this, it'd be easiest if we could reproduce the issue using your code. Can you provide an application that reproduces this. Thanks.
I initially noticed significant increased CPU use in a test version of MPlayer that queued multiple frames ahead, then tested it with a version that queued frames at now + 20 seconds while manually stepping through one frame at a time. I don't have any of the test code left though. I can recreate a version later.

Quote:
Unfortunately, the timestamps will jitter a lot more in the blit-based presentation queue. Yes, this may be affected by CPU/GPU load. It's unlikely this will change in the near term, or possibly even long term.
Couldn't the driver "fake" better values? Even if the real time when things are processed varies, it could calculate the ideal vsync time and use that as the timestamp.
Old 11-16-09, 10:20 AM   #5
uau
Registered User
 
Join Date: Sep 2009
Posts: 45
Default Re: VDPAU API and implementation issues

I recreated a version of the code that displays the frame queuing problem.
The following should work to check out and build the modified version of MPlayer:

$ git clone git://repo.or.cz/mplayer-build.git
$ cd mplayer-build
$ ./init --shallow
$ make -j 6
$ cd mplayer
$ git checkout origin/vdpau_problem_test
$ make
$ VDPAU_TRACE=1 ./mplayer -quiet /tmp/test.mkv

The test version always queues frames at time now + 10 seconds. On my machine the output looks like this:

<snip>
vdp_presentation_queue_display(3, 7, 1592, 1080, 1258378599251773728)
-> 0
Queued surface number 2 for display at now + 10 seconds.
vdp_video_surface_put_bits_y_cb_cr(14, 1, {0x7fafa3ab9a20, 0x7fafa3a2de98, 0x7fafa4a0be98}, {1952, 976, 976}, )

and then it blocks there for about 10 seconds.

I've now tested that the behavior is the same with driver version 190.42.
Old 11-16-09, 12:42 PM   #6
Stephen Warren
Moderator
 
Join Date: Aug 2005
Posts: 1,327
Default Re: VDPAU API and implementation issues

Quote:
Originally Posted by uau View Post
The test version always queues frames at time now + 10 seconds. On my machine the output looks like this:
Ah, that explains it.

The typical use-case for the presentation queue is to queue a few frames ahead *at 24-60Hz*. The reason for queueing frames ahead is to cover up any, typically short-term, latency in the system, application, GPU rendering, etc.

Queueing 10s ahead is an unusual situation (in fact, we treat anything more than 0.5s ahead as unusual; that is the threshold we picked).

Due to some implementation details in the blit-based presentation queue, events such as window movements "accelerate" any queued frames such that they are displayed immediately. When queueing e.g. 24-60Hz video, this will be noticed as a temporal glitch in the display for a short portion of time (e.g. 1/3 s with max 8 frames queued at 24Hz). If the frames being queued are a significant time in the future, the effect will be significantly more noticeable; if the app queues at 0.1Hz, then you might end up with frames displayed at t=10, t=10.1 (should be t=20), t=30. To avoid this situation, any time we see a frame that should be displayed more than 0.5s in the future, the driver blocks until 0.5s before the timestamp before queueing the frame into the HW, to avoid significantly early display. The blocking itself is not a busy wait, but I suppose it could cause other parts of the driver, or other drivers, to busy-wait.
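As a rough model of the behaviour described above (a sketch of the description only, not driver code), the blocking time and the worst-case glitch length work out as follows. The 0.5 s threshold is the value quoted above; the function names are hypothetical and times are in nanoseconds.

```python
EARLY_LIMIT_NS = 500_000_000  # the 0.5 s threshold quoted above


def block_time_before_queueing(now_ns, target_ns, limit_ns=EARLY_LIMIT_NS):
    """How long the queue call would block under the described rule."""
    return max(0, target_ns - now_ns - limit_ns)


def max_glitch_seconds(queued_frames, fps):
    """Video time covered by frames that get shown early all at once."""
    return queued_frames / fps
```

Under this model a frame queued 10 s ahead blocks for 9.5 s, a frame queued 0.1 s ahead does not block at all, and 8 frames queued at 24 Hz cover 1/3 s.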

We could solve your issue simply by removing this "wait until close to the timestamp" behaviour. Would that be better than blocking here?
Old 11-16-09, 02:06 PM   #7
uau
Registered User
 
Join Date: Sep 2009
Posts: 45
Default Re: VDPAU API and implementation issues

That 0.5 s threshold alone doesn't explain all the problems I saw. The reason I tested with that obvious-to-see 10 s limit at all was that I already saw problems with shorter intervals. Also queuing ONE frame 10 s ahead seems to cause no such issues.

I just tried changing the code to queue the frames 0.1 s ahead instead of 10 s, and there are still problems: playback is slow (even with enough output surfaces that block_until_surface_idle should not limit the playback speed in this case), the mplayer process uses 100% CPU, and VDPAU_TRACE shows that the process still spends most time blocking (likely in a busyloop based on the CPU use) in vdp_video_surface_put_bits_y_cb_cr.
Old 11-16-09, 02:21 PM   #8
Stephen Warren
Moderator
 
Join Date: Aug 2005
Posts: 1,327
Default Re: VDPAU API and implementation issues

What exactly do you mean by "N seconds ahead"; is it:

a) Using original timestamps from the stream, with 0.1s added to each?
b) Creating fake timestamps, each 0.1s apart?

Either way, what modification do I need to make to your git tree to repro the problem without manipulated timestamps?

Thanks.
Old 11-16-09, 03:29 PM   #9
uau
Registered User
 
Join Date: Sep 2009
Posts: 45
Default Re: VDPAU API and implementation issues

I certainly don't do b) - the timestamps are not 0.1 s apart from each other. I'm not sure whether you mean the same thing with a) - what does "added to each" mean when the absolute value is pretty much meaningless? Basically the test code queues each frame to be shown at vdp_presentation_queue_get_current_time() + 0.1 seconds (0.1 seconds into the future from the moment it's queued), and MPlayer tries to queue new frames at the natural rate the video would play at.

The main difference the test code has to "natural" playback that tries to queue frames 0.1 s ahead of their intended display time is that when VDPAU blocks and the queuing happens late as a result, it still queues the frames 0.1 s into the future from the queuing time (instead of the original time which is now less than 0.1 s away due to queuing running late). If "natural" code was slowed down by the blocking it'd queue frames less ahead, making the problem less obvious (queuing frames 0.1 s ahead causes slowdown -> queuing starts running late -> display time is now less than 0.1 s ahead of queuing time and there are no multiple frames queued simultaneously -> speed recovers). With "natural" code the main visible symptom of the problem is high CPU use.

I changed the branch to queue 0.1 s ahead of the queuing moment instead of 10 s. If you want a completely "natural" version I can add that; but I think the 0.1 s version should be "valid" behavior from VDPAU's point of view - VDPAU shouldn't be able to tell it's not just a normal video, and there's no excuse for the behavior it shows.
Old 11-24-09, 04:59 PM   #10
uau
Registered User
 
Join Date: Sep 2009
Posts: 45
Default Re: VDPAU API and implementation issues

I did some more testing of the queuing issue. It still happens with 195.22 drivers. I noticed one extra bit of information: if I use hardware decoding, AND there is no OSD/subtitle content displayed in MPlayer, then things seem to work fine even when queuing multiple surfaces ahead.

However using vdp_video_surface_put_bits_y_cb_cr() with software decoding triggers problems - VDPAU_TRACE shows VDPAU spending a lot of time blocking in that function as mentioned in my earlier post.

OSD/subtitle content also causes problems, but it seems to depend on the number of textures drawn after video mixer rendering. The problems also occur with non-changing libass subtitles that do not cause any reuploads of texture data. Basically it looks like the operations become a lot slower with frames queued ahead, and a much smaller amount of extra rendering starts causing slowdown.
Old 11-24-09, 06:42 PM   #11
Stephen Warren
Moderator
 
Join Date: Aug 2005
Posts: 1,327
Default Re: VDPAU API and implementation issues

Quote:
Originally Posted by uau View Post
The VSYNC timestamps returned by QuerySurfaceStatus and BlockUntilSurfaceIdle are quite accurate when using overlay, but when overlay is disabled the accuracy drops a lot.
This should be much better in 195.22, at least for windows that are not redirected.

I haven't had a chance to investigate/fix the other issues yet. Sorry.
Old 12-01-09, 12:50 PM   #12
Stephen Warren
Moderator
 
Join Date: Aug 2005
Posts: 1,327
Default Re: VDPAU API and implementation issues

Quote:
Originally Posted by uau View Post
Queuing up to 8 frames worked if doing nothing else, but trying to do other operations like upload video surfaces while having two or more unshown surfaces queued for display caused the driver to use a lot of CPU (could have been a busyloop until there was only one yet undisplayed surface). This at least looks like a clear implementation problem.
We've identified a fix for this issue. It will be included in a future driver release.