OK, I did some research on my own.
First I have to clarify that what I called a "freeze" is not actually a freeze. It is the kernel disabling the interrupt of the nvidia module. This shuts off all graphical output, which looks like a freeze.
But I am sometimes still able to ssh into my machine and do a remote restart.
And the SysRq keys always still work, so the system is not completely dead.
So, to the next step: Why is the kernel disabling the IRQ?
I am not a kernel hacker, so I am guessing a bit here:
At some point the kernel gets an IRQ (from the BIOS?) and tries to handle it. To do so, it calls the appropriate handler in __do_IRQ.
The handler is the function nv_kern_isr, defined in nv.c.
The handler must return a value (IRQ_HANDLED or IRQ_NONE) that indicates whether the IRQ was handled or not.
If the return value is IRQ_HANDLED then everything is OK. But if it is IRQ_NONE, then the IRQ was not handled, i.e. it was not meant for this handler.
The kernel tries to give the IRQ to the next handler. But in my case there is no other handler on this interrupt, so nobody cared for this interrupt.
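To make the hand-off between handlers concrete, here is a rough Python model of it. This is only a sketch of my understanding, not kernel code: as far as I can tell the kernel calls every handler registered on the line and combines their return values. The names dispatch_irq, nv_isr_declines and other_device_isr are made up for this illustration.

```python
# Simplified model of shared-IRQ dispatch; this is not kernel code.
IRQ_NONE = 0
IRQ_HANDLED = 1


def dispatch_irq(handlers):
    """Call every handler on the line and report whether any of them claimed the IRQ."""
    result = IRQ_NONE
    for handler in handlers:
        result |= handler()
    return result


def nv_isr_declines():
    # Stand-in for nv_kern_isr in the failing case: it says "not mine".
    return IRQ_NONE


def other_device_isr():
    # Hypothetical second device sharing the same interrupt line.
    return IRQ_HANDLED


# My situation: the nvidia handler is alone on the line and declines.
alone = dispatch_irq([nv_isr_declines])

# If another device shared the line and claimed the IRQ, it would count as handled.
shared = dispatch_irq([nv_isr_declines, other_device_isr])
```

With only the declining handler on the line the combined result stays IRQ_NONE, which is the "nobody cared" case.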
If that happens more than 99900 times, the kernel reports it as a bad IRQ in note_interrupt and disables the interrupt (and the kernel is kind enough to inform me about it).
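The counting can be modelled roughly like this in Python. Again this is a sketch and not the kernel source: the 99900 threshold is the number mentioned above, while the window of 100000 interrupts per check is an assumption from my reading of the 2.6 spurious-IRQ code and should be verified against your own kernel tree. The class IrqLine and the helper feed are invented for this example.

```python
# Simplified model of the kernel's unhandled-interrupt accounting; not kernel code.
IRQ_NONE = 0
IRQ_HANDLED = 1

WINDOW = 100000     # assumption: interrupts counted before each check
THRESHOLD = 99900   # from the post: unhandled count that is considered bad


class IrqLine:
    def __init__(self):
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.disabled = False

    def note_interrupt(self, action_ret):
        """Record one interrupt; disable the line if almost none were handled."""
        if action_ret != IRQ_HANDLED:
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < WINDOW:
            return
        # End of the window: decide, then start counting afresh.
        self.irq_count = 0
        if self.irqs_unhandled > THRESHOLD:
            self.disabled = True  # the real kernel would log the bad IRQ here
        self.irqs_unhandled = 0


def feed(line, result, times):
    """Deliver the same handler result to the line a number of times."""
    for _ in range(times):
        line.note_interrupt(result)
    return line


# A handler that never claims its interrupt gets the line disabled.
stuck = feed(IrqLine(), IRQ_NONE, WINDOW)
print(stuck.disabled)
```

In this model an occasional unhandled interrupt is harmless; only a line where nearly every interrupt in the window goes unclaimed gets switched off.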
The question is now: Why does the nvidia interrupt handler not handle the IRQ?
The handler's return value is determined by the return value of the function rm_isr.
And for me this is a dead end, because that function is implemented in the closed-source binary library.
So either nvidia releases a free open source version of its driver (which many Linux users would love to see), or the nvidia hackers themselves must find out why rm_isr does not handle the interrupt and so freezes my system!!!
Oh, and thanks for your tips, Wolf. I tried different memory settings in the BIOS and also your kernel parameters, but it did not help.
I hope that someone from nVidia reads this...