Originally Posted by gilboa
What I don't get is this: If indeed the driver fails in user-land context, how come the user-land application doesn't get stuck (be that in the first function, or the proceeding poll attempts) in an un-interruptible sleep? (Zombie)
If I understood you correctly, a user-land application would become a
"zombie" (strictly speaking, stuck in uninterruptible sleep, state D) if
it can't return from its kernel-space part (i.e. in this case, somewhere
within the ioctl() call).
However, the user-land application (i.e. X) can be killed. So this is my
blind guess:
The user-land application does some kind of 'poll' on the hardware, back
and forth within a loop. Each 'poll' waits for a definite, very small
amount of time and then returns, whether or not the result is
acceptable. The user-land application then checks the 'poll' result,
finds it unacceptable, and decides to keep looping, fully confident
that an acceptable result will arrive very soon.
But the acceptable result never comes.
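
To make the guess above concrete, here is a minimal user-space sketch of that kind of loop. It is not nvidia's code: plain poll() on a file descriptor stands in for the driver ioctl(), and poll_until_ready() and its max_attempts bound are names I made up for illustration (the loop I am guessing at has no such bound, which is exactly the problem).

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only.
 * Polls fd with a short timeout, at most max_attempts times.
 * Returns the attempt number on which fd became readable,
 * or -1 if it never did. */
int poll_until_ready(int fd, int timeout_ms, int max_attempts)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        /* Sleeps interruptibly for at most timeout_ms, then returns. */
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc > 0 && (pfd.revents & POLLIN))
            return attempt;     /* acceptable result */

        /* rc == 0 is a timeout, rc < 0 an error such as EINTR.
         * Either way the caller is back in user land and just retries. */
    }
    return -1;                  /* the acceptable result never came */
}
```

Because every call returns to user land after the timeout, the process is never parked in uninterruptible sleep, so it can still be killed even though it makes no progress.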
Each time the user-land application calls 'poll/ioctl', a resched must
happen. On one hand, I guess this user-land application (X) must have
called 'sched_setscheduler' to switch to some real-time scheduling
policy (round-robin or FIFO) in order to offer services to many
X Window applications. On the other hand, when the user-land
application uses ioctl() to poll the hardware, a resched happens, and
that's why we can still ssh/telnet into the 'freezing' PC remotely.
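
The real-time part of this is only my guess, but it can be checked rather than assumed. A small sketch using sched_getscheduler(); the helper name is_realtime() is made up for illustration, and I have not verified what policy X actually runs under.

```c
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper, for illustration only.
 * Returns 1 if pid runs under SCHED_FIFO or SCHED_RR,
 * 0 if it runs under another policy, -1 on error.
 * pid 0 means the calling process. */
int is_realtime(pid_t pid)
{
    int policy = sched_getscheduler(pid);

    if (policy < 0)
        return -1;

#ifdef SCHED_RESET_ON_FORK
    /* Linux may OR this flag into the returned policy value. */
    policy &= ~SCHED_RESET_ON_FORK;
#endif

    return policy == SCHED_FIFO || policy == SCHED_RR;
}
```

Calling is_realtime() with the PID of X from the ssh session would show whether the guess holds; 'chrt -p PID' prints the same information from the shell.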
Basically, my guess is the same as yours. I would prefer that they add
some sort of flags to nvidia's driver/lib (maybe a big array that can
be accessed and logged by another program or module). I believe printk
is sometimes not that synchronous.
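
The kind of "big array" I have in mind could look roughly like this. It is a user-space sketch only, with made-up names (trace_buf, trace_add, trace_recent) and sizes; a real driver would need its own locking and some way to expose the array to the other program or module.

```c
#include <stdio.h>
#include <string.h>

#define TRACE_SLOTS   64   /* assumed size, pick whatever fits */
#define TRACE_MSG_LEN 48

/* Fixed-size ring of short messages; old entries get overwritten. */
struct trace_buf {
    unsigned long next;                     /* entries written so far */
    char msg[TRACE_SLOTS][TRACE_MSG_LEN];
};

/* Record one message in the next slot, truncating if too long. */
void trace_add(struct trace_buf *tb, const char *text)
{
    char *slot = tb->msg[tb->next % TRACE_SLOTS];

    snprintf(slot, TRACE_MSG_LEN, "%s", text);
    tb->next++;
}

/* Return the i-th most recent message (0 = newest),
 * or NULL if it was never written or is already overwritten. */
const char *trace_recent(const struct trace_buf *tb, unsigned long i)
{
    if (i >= tb->next || i >= TRACE_SLOTS)
        return NULL;
    return tb->msg[(tb->next - 1 - i) % TRACE_SLOTS];
}
```

The code paths under suspicion would call trace_add() at each step (for example on entering the ioctl and on every poll timeout), and the reader would walk trace_recent() from 0 upwards to see the last things that happened before the freeze.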