panics with 195.36.24


crsd
05-13-10, 05:40 AM
Hi,

I'm getting the following panics with nvidia-driver 195.36.24 on 9.0-CURRENT r207995 amd64:

with debug.witness.watch=1 (reproducible on xorg-server start):

blockable sleep lock (sleep mutex) select mtxpool @ sys/kern/sys_generic.c:1479

db:0:kdb.enter.panic> run lockinfo
db:1:lockinfo> show locks
db:1:locks> show alllocks
Process 1509 (xdm) thread 0xffffff005da09000 (100218)
exclusive sx user map (user map) r = 0 (0xffffff005d564b68) locked @ /home/yuri/src/FreeBSD/head/sys/vm/vm_map.c:2991
db:1:alllocks> show lockedvnods
Locked vnodes
db:0:kdb.enter.panic> show pcpu
cpuid = 3
dynamic pcpu = 0xffffff807f3e8780
curthread = 0xffffff005d9eeb40: pid 1511 "Xorg"
curpcb = 0xffffff8058913d40
fpcurthread = none
idlethread = 0xffffff000340a780: pid 11 "idle: cpu3"
curpmap = 0
tssp = 0xffffffff80e8cc38
commontssp = 0xffffffff80e8cc38
rsp0 = 0xffffff8058913d40
gs32p = 0xffffffff80e8ba70
ldt = 0xffffffff80e8bab0
tss = 0xffffffff80e8baa0
spin locks held:
db:0:kdb.enter.panic> bt
Tracing pid 1511 tid 100219 td 0xffffff005d9eeb40
kdb_enter() at kdb_enter+0x3d
panic() at panic+0x17b
witness_checkorder() at witness_checkorder+0x948
_mtx_lock_flags() at _mtx_lock_flags+0x78
selrecord() at selrecord+0x81
nvidia_dev_poll() at nvidia_dev_poll+0x57
devfs_poll_f() at devfs_poll_f+0x61
kern_select() at kern_select+0x4f2
select() at select+0x5d
syscall() at syscall+0x102
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (93, FreeBSD ELF64, select), rip = 0x8016c0ecc, rsp = 0x7fffffffe9e8, rbp = 0x6c2160 ---

with debug.witness.watch=0 (random):

mi_switch: switch in a critical section

db:0:kdb.enter.panic> run lockinfo
db:1:lockinfo> show locks
db:1:locks> show alllocks
db:1:alllocks> show lockedvnods
Locked vnodes
db:0:kdb.enter.panic> show pcpu
cpuid = 2
dynamic pcpu = 0xffffff807f3e1780
curthread = 0xffffff0005db2000: pid 1518 "Xorg"
curpcb = 0xffffff80588aad40
fpcurthread = 0xffffff0005db2000: pid 1518 "Xorg"
idlethread = 0xffffff000340a3c0: pid 11 "idle: cpu2"
curpmap = 0
tssp = 0xffffffff80e8cbd0
commontssp = 0xffffffff80e8cbd0
rsp0 = 0xffffff80588aad40
gs32p = 0xffffffff80e8ba08
ldt = 0xffffffff80e8ba48
tss = 0xffffffff80e8ba38
spin locks held:
db:0:kdb.enter.panic> bt
Tracing pid 1518 tid 100198 td 0xffffff0005db2000
kdb_enter() at kdb_enter+0x3d
panic() at panic+0x17b
mi_switch() at mi_switch+0x341
turnstile_wait() at turnstile_wait+0x243
_mtx_lock_sleep() at _mtx_lock_sleep+0xd6
_mtx_lock_flags() at _mtx_lock_flags+0xe1
selrecord() at selrecord+0x81
nvidia_dev_poll() at nvidia_dev_poll+0x57
devfs_poll_f() at devfs_poll_f+0x61
kern_select() at kern_select+0x4f2
select() at select+0x5d
syscall() at syscall+0x102
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (93, FreeBSD ELF64, select), rip = 0x8016c0ecc, rsp = 0x7fffffffe9e8, rbp = 0x801c20cc0 ---

zander
05-13-10, 10:59 AM
Is this specific to 195.36.24 (vs. e.g. 195.36.15) or FreeBSD 9? Or neither?

crsd
05-13-10, 11:20 AM
Another panic, with all drivers available for amd64 (195.22, 195.36.15, 195.36.24), after recent changes to the VM code (on X server start):

mutex page lock not owned at /home/yuri/src/FreeBSD/head/sys/vm/vm_page.c:1572

db:0:kdb.enter.panic> run lockinfo
db:1:lockinfo> show locks
db:1:locks> show alllocks
db:1:alllocks> show lockedvnods
Locked vnodes
db:0:kdb.enter.panic> show pcpu
cpuid = 1
dynamic pcpu = 0xffffff807f3da780
curthread = 0xffffff0005dcd780: pid 1518 "Xorg"
curpcb = 0xffffff80587c9d40
fpcurthread = none
idlethread = 0xffffff000340a000: pid 11 "idle: cpu1"
curpmap = 0
tssp = 0xffffffff80e8cb68
commontssp = 0xffffffff80e8cb68
rsp0 = 0xffffff80587c9d40
gs32p = 0xffffffff80e8b9a0
ldt = 0xffffffff80e8b9e0
tss = 0xffffffff80e8b9d0
spin locks held:
db:0:kdb.enter.panic> bt
Tracing pid 1518 tid 100153 td 0xffffff0005dcd780
kdb_enter() at kdb_enter+0x3d
panic() at panic+0x17b
_mtx_assert() at _mtx_assert+0xdc
vm_page_wire() at vm_page_wire+0x37
nv_alloc_system_pages() at nv_alloc_system_pages+0x21a
nv_alloc_pages() at nv_alloc_pages+0xdd
_nv020074rm() at _nv020074rm+0x7f


Otherwise 195.22 is stable (after commenting out the assertion in sys/vm/vm_page.c, which I did for the other driver versions as well).

The panics reported in my previous post are specific to FreeBSD 9 and occur with both 195.36.15 and 195.36.24.

arundel
05-13-10, 01:37 PM
This thread might be interesting in connection with your panic:

http://www.mail-archive.com/freebsd-current@freebsd.org/msg122234.html

Alan Cox is currently changing some of the VM code.

crsd
05-13-10, 01:43 PM
Yes, I started that thread; I should have linked it here too. :-) Thanks

crsd
06-23-10, 03:59 AM
Still getting a lot of random (?) panics with 256.35 on -CURRENT/amd64 r209358:
mi_switch: switch in a critical section

cpuid = 1
dynamic pcpu = 0xffffff807f399400
curthread = 0xffffff011c337000: pid 1952 "Xorg"
curpcb = 0xffffff80796f2d40
fpcurthread = none
idlethread = 0xffffff000351d000: tid 100005 "idle: cpu1"
curpmap = 0
tssp = 0xffffffff80eadd68
commontssp = 0xffffffff80eadd68
rsp0 = 0xffffff80796f2d40
gs32p = 0xffffffff80eacba0
ldt = 0xffffffff80eacbe0
tss = 0xffffffff80eacbd0
spin locks held:
db:0:kdb.enter.panic> bt
Tracing pid 1952 tid 100225 td 0xffffff011c337000
kdb_enter() at kdb_enter+0x3d
panic() at panic+0x17b
mi_switch() at mi_switch+0x341
turnstile_wait() at turnstile_wait+0x243
_mtx_lock_sleep() at _mtx_lock_sleep+0xd6
_mtx_lock_flags() at _mtx_lock_flags+0xe1
selrecord() at selrecord+0x81
nvidia_dev_poll() at nvidia_dev_poll+0x57
devfs_poll_f() at devfs_poll_f+0x61
kern_select() at kern_select+0x4f2
select() at select+0x5d
syscallenter() at syscallenter+0xf0
syscall() at syscall+0x4c
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (93, FreeBSD ELF64, select), rip = 0x8016d5ebc, rsp = 0x7fffffffe9e8, rbp = 0x801c217c0 ---

bigknife
07-07-10, 07:38 AM
I suspect these panics are caused by the use of spin mutexes for &filep->event_mtx. It is certainly not safe to hold a spin mutex across selrecord(). Given that the nvidia driver uses a regular interrupt handler (rather than a filter), it should be safe to simply convert event_mtx to a regular mutex. To do that, replace 'MTX_SPIN' with 'MTX_DEF' in the mtx_init() calls in nvidia_ctl.c and nvidia_dev.c, and replace all calls to mtx_lock_spin() and mtx_unlock_spin() with calls to mtx_lock() and mtx_unlock().
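
To illustrate why both panics above fit this diagnosis, here is a minimal userspace model in C. It is only a sketch: every name in it (model_thread, model_lock, model_selrecord, model_poll) is hypothetical and none of it is kernel or driver code. It assumes only what the backtraces show, namely that selrecord() takes a sleep mutex internally, and that holding a spin mutex puts the thread in a critical section.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum lock_kind { LOCK_SPIN, LOCK_SLEEP };

struct model_thread {
    int critnest;    /* > 0 while a spin mutex is held (critical section) */
    int violations;  /* sleep mutex acquisitions attempted inside one */
};

static void model_lock(struct model_thread *td, enum lock_kind kind)
{
    if (kind == LOCK_SPIN)
        td->critnest++;      /* a spin mutex enters a critical section */
    else if (td->critnest > 0)
        td->violations++;    /* where the real kernel would panic */
}

static void model_unlock(struct model_thread *td, enum lock_kind kind)
{
    if (kind == LOCK_SPIN) {
        assert(td->critnest > 0);
        td->critnest--;
    }
}

/* Stand-in for selrecord(): it takes a sleep mutex internally. */
static void model_selrecord(struct model_thread *td)
{
    model_lock(td, LOCK_SLEEP);
    model_unlock(td, LOCK_SLEEP);
}

/*
 * Poll path with event_mtx of the given kind held across selrecord().
 * Returns the number of locking violations observed.
 */
static int model_poll(enum lock_kind event_mtx_kind)
{
    struct model_thread td = { 0, 0 };

    model_lock(&td, event_mtx_kind);
    model_selrecord(&td);
    model_unlock(&td, event_mtx_kind);
    return td.violations;
}
```

In this model, model_poll(LOCK_SPIN) reports one violation and model_poll(LOCK_SLEEP) reports none, which mirrors the MTX_SPIN to MTX_DEF change suggested above. The model does not address whether the mutex should be held across selrecord() at all.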

zander
07-07-10, 10:12 AM
That's a good point. However, looking at that piece of code again, I don't think the mutex should be held across the call to selrecord(), even if it were of type MTX_DEF (see nv_kern_post()). I'll try to make some time to get this fixed soon.

bigknife
07-08-10, 08:14 AM
Yes, selrecord() and selwakeup() do have internal locking (albeit a single global mutex, ugh). One minor nit: if devfs_get_cdevpriv() fails for some reason in nvidia_ctl_poll() or nvidia_dev_poll(), the function should return a mask of 0 rather than the errno value from devfs_get_cdevpriv().
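
The return-value nit can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the driver's code: model_filep, model_get_cdevpriv, model_poll_buggy and model_poll_fixed are hypothetical names, and the stub simply fails with ENOENT when no private data is attached. The point is that a poll handler returns a mask of ready events, so a nonzero errno value leaking out would be read as event bits.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical per-open state; the real structure is not shown here. */
struct model_filep {
    int event_pending;
};

/* Stand-in for devfs_get_cdevpriv(): 0 on success, errno on failure. */
static int model_get_cdevpriv(struct model_filep *priv,
    struct model_filep **out)
{
    if (priv == NULL)
        return ENOENT;
    *out = priv;
    return 0;
}

/* Buggy shape: the errno value is returned as if it were an event mask. */
static int model_poll_buggy(struct model_filep *priv, int events)
{
    struct model_filep *filep;
    int error = model_get_cdevpriv(priv, &filep);

    if (error != 0)
        return error;
    return filep->event_pending ? (events & (POLLIN | POLLPRI)) : 0;
}

/* Suggested shape: report "no events ready" when the lookup fails. */
static int model_poll_fixed(struct model_filep *priv, int events)
{
    struct model_filep *filep;

    if (model_get_cdevpriv(priv, &filep) != 0)
        return 0;
    return filep->event_pending ? (events & (POLLIN | POLLPRI)) : 0;
}
```

Calling model_poll_fixed(NULL, POLLIN) yields 0, while model_poll_buggy(NULL, POLLIN) yields the nonzero ENOENT value, which a caller would misinterpret as ready events.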

zander
07-13-10, 11:46 AM
I'll fix that, too.

crsd
08-20-10, 07:47 PM
So everything looks stable again with 256.44, thanks!