nV News Forums (http://www.nvnews.net/vbulletin/index.php)
-   NVIDIA Linux (http://www.nvnews.net/vbulletin/forumdisplay.php?f=14)
-   -   K1000M on Thinkpad W530: card falls off the bus (http://www.nvnews.net/vbulletin/showthread.php?t=191780)

Godlikearg 09-14-12 03:45 PM

K1000M on Thinkpad W530: card falls off the bus
 
Hi,

I've been racking my brain trying to get Bumblebee to work. After some generous help on the #bumblebee IRC channel, I started doing some low-level testing and managed to narrow down the problem.

Distro is Gentoo. Kernel version is 3.2 (I can try with 3.5 or other kernels but I suspect the problem will persist). Driver versions I have tried are 304.22, 304.43 and 304.48.

Either doing this:

Code:

# modprobe nvidia
# nvidia-xconfig -query-gpu-info

or this:

Code:

# nvidia-xconfig -query-gpu-info
from a tty as soon as the system has started (no X running, though I suspect the results would be the same with X), yields this in dmesg:

Code:

[  48.990268] nvidia: module license 'NVIDIA' taints kernel.
[  48.990271] Disabling lock debugging due to kernel taint
[  49.030980] nvidia 0000:01:00.0: power state changed by ACPI to D0
[  49.030984] nvidia 0000:01:00.0: power state changed by ACPI to D0
[  49.030987] nvidia 0000:01:00.0: enabling device (0004 -> 0007)
[  49.030992] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[  49.030999] nvidia 0000:01:00.0: setting latency timer to 64
[  49.031003] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none
[  49.031104] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  304.22  Mon Jul  9 21:07:07 PDT 2012
[  54.728024] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[  54.728039] NVRM: os_pci_init_handle: invalid context!
[  54.728041] NVRM: os_pci_init_handle: invalid context!
[  54.728045] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[  54.728048] NVRM: os_pci_init_handle: invalid context!
[  54.728049] NVRM: os_pci_init_handle: invalid context!
[  54.971061] NVRM: RmInitAdapter failed! (0x26:0xffffffff:1181)
[  54.971071] NVRM: rm_init_adapter(0) failed
[  54.974870] NVRM: RmInitAdapter failed! (0x23:0x2f:675)
[  54.974873] NVRM: rm_init_adapter(0) failed
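
To pull just the failure lines out of a longer kernel log, a small filter along these lines might help. It is only a sketch: the function name nvrm_failures is made up for illustration, and the patterns are simply the message strings seen in the log above, so other driver versions may word things differently.

```shell
#!/bin/sh
# Print only the NVRM lines that indicate a failed GPU initialisation.
# Reads kernel log text on stdin, so it works on live dmesg output or a saved log.
# The patterns are copied from the messages in this thread (assumption: other
# driver versions may use different wording).
nvrm_failures() {
    grep -E 'NVRM: .*(fallen off the bus|invalid context|RmInitAdapter failed|rm_init_adapter\([0-9]+\) failed)'
}

# Usage (as root, after reproducing the problem):
#   dmesg | nvrm_failures
```

The function exits non-zero when nothing matches, so it can also be used as a quick yes/no check in a script.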

lspci output:

Code:

01:00.0 VGA compatible controller [0300]: nVidia Corporation Device [10de:0ffc] (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: nvidia
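
The "rev ff" and "Unknown header type 7f" values above are what lspci tends to show when reads of the device's PCI config space come back as all ones, i.e. the device is not answering. A rough way to test for that state in a script is sketched below; the function name pci_config_unreadable is hypothetical, and the check only looks at the lspci text rather than at the hardware itself.

```shell
#!/bin/sh
# Succeed (exit 0) if lspci text on stdin shows the all-ones pattern from the
# output above: "(rev ff)" or "Unknown header type 7f".
# This is a text heuristic only; it does not touch the device.
pci_config_unreadable() {
    grep -Eq '\(rev ff\)|Unknown header type 7f'
}

# Usage:
#   lspci -d 10de: -nn -v | pci_config_unreadable && echo "GPU is not responding"
```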

No Bumblebee component (bbswitch, bumblebeed, optirun) was running. As I said, I managed to narrow the problem down to just the commands above.

Any ideas?

Thanks in advance
Cheers
GODLiKE

Godlikearg 09-16-12 09:26 PM

Re: K1000M on Thinkpad W530: card falls off the bus
 
Forgot to attach nvidia-bug-report: http://www.vicarious.com.ar/~godlike...-report.log.gz (have patience, it's my home connection :))

rockob 09-17-12 11:06 PM

Re: K1000M on Thinkpad W530: card falls off the bus
 
Don't you need to put 'optirun' in front of any commands you want to run with the nvidia card? For instance, lspci tells you 'unknown header type 7f' because the card is off (i.e. in a low-power state), so if you do 'optirun lspci' you should see more useful information. If you run nvidia-settings, you also need to specify the X display to use, i.e. "optirun nvidia-settings -c :8".

And you shouldn't need to worry about nvidia-xconfig; just edit the config file that Bumblebee is using (e.g. on Ubuntu it puts this in /etc/bumblebee/xorg.conf.nvidia).

Godlikearg 09-18-12 07:41 AM

Re: K1000M on Thinkpad W530: card falls off the bus
 
optirun is only needed when you wish to run some application (e.g. a game) using the dedicated GPU. nvidia-xconfig / nvidia-smi and such commands do not need optirun as they work at a lower level.

Moreover, what optirun basically does is run whatever you put after "optirun" on another X server running on the dedicated GPU, and then draw the results back to the main display. "optirun lspci" does not make sense in this scenario.

rockob 09-18-12 09:05 AM

Re: K1000M on Thinkpad W530: card falls off the bus
 
Perhaps, but with Bumblebee you need to use optirun to enable the nvidia card and the nvidia libraries. Otherwise you're just using the Intel card and the Intel libraries, and the nvidia card is turned off.

This is why your lspci command couldn't get any details about the nvidia card, and why "optirun lspci" makes perfect sense. For instance, on my system:

Code:

optirun lspci -s 1:00.0  -v
which gives:

Code:

01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [GeForce GT 540M] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 050e
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at f0000000 (32-bit, non-prefetchable) [size=16M]
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        [virtual] Expansion ROM at f1000000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidia_current, nouveau, nvidiafb

whereas what you tried, which is:

Code:

lspci -s 1:00.0  -v
gives:

Code:

01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [GeForce GT 540M] (rev ff) (prog-if ff)
        !!! Unknown header type 7f


Godlikearg 09-18-12 11:12 AM

Re: K1000M on Thinkpad W530: card falls off the bus
 
I should have mentioned it before, but lspci errors out only after I get the "fallen off the bus" error (which happens whenever I try to use the GPU).

Below is what I get after a clean reboot with nothing loaded (no bbswitch, no nvidia module, nothing). I can also modprobe nvidia and run lspci afterwards, and the result is the same. I also tried modprobing both nvidia and bbswitch and manually power-cycling the card, which works. Only after doing something that actually requires use of the card (be it optirun, nvidia-xconfig, nvidia-smi, or a CUDA program) does the GPU fall off the bus, and the lspci output then looks as shown in my first post.

Code:

panther godlike # lspci -d 10de: -vvnn
01:00.0 VGA compatible controller [0300]: nVidia Corporation Device [10de:0ffc] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device [17aa:21f5]
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at f0000000 (32-bit, non-prefetchable) [disabled] [size=16M]
        Region 1: Memory at c0000000 (64-bit, prefetchable) [disabled] [size=256M]
        Region 3: Memory at d0000000 (64-bit, prefetchable) [disabled] [size=32M]
        Region 5: I/O ports at 5000 [disabled] [size=128]
        Expansion ROM at f1000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap:        MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl:        Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta:        CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap:        Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl:        ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta:        Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                        Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                        Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                        EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:        LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:        Fixed- WRR32- WRR64- WRR128-
                Ctrl:        ArbSelect=Fixed
                Status:        InProgress-
                VC0:        Caps:        PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:        Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:        Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status:        NegoPending- InProgress-
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] #19
        Kernel modules: nvidia

The difference in your case is surely that you had started Bumblebee before running those commands, and by default Bumblebee turns off the card. During my debug sessions I set Bumblebee to not turn off the card when loaded (which, in turn, made bbswitch keep the card on when I modprobed it).
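
For anyone wanting to confirm whether bbswitch currently has the card powered, the module exposes its state in /proc/acpi/bbswitch as a line like "0000:01:00.0 ON". The helper below is a minimal sketch for reading it; the name bbswitch_state is invented for illustration, and the optional path argument exists only so the function can be tried against a saved copy of the file.

```shell
#!/bin/sh
# Print ON or OFF from a bbswitch status line such as "0000:01:00.0 ON".
# $1 = status file (defaults to /proc/acpi/bbswitch, which only exists
#      while the bbswitch module is loaded).
bbswitch_state() {
    awk '{ print $2; exit }' "${1:-/proc/acpi/bbswitch}"
}

# Usage:
#   bbswitch_state
# Manual power-cycle (as root), using the same proc file:
#   echo OFF > /proc/acpi/bbswitch
#   echo ON  > /proc/acpi/bbswitch
```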

Godlikearg 09-18-12 08:44 PM

Re: K1000M on Thinkpad W530: card falls off the bus
 
I just booted an Ubuntu 12.04 x64 live CD and can confirm that the GPU works there. At least nvidia-xconfig -query-gpu-info now gives me something.

Godlikearg 09-19-12 02:07 AM

Re: K1000M on Thinkpad W530: card falls off the bus
 
Fixed it. I was missing these two kernel options:

Code:

CONFIG_NO_HZ:                                               
                                                           
This option enables a tickless system: timer interrupts will
only trigger on an as-needed basis both when the system is 
busy and when the system is idle.                           


CONFIG_RCU_FAST_NO_HZ:                                         
                                                               
This option causes RCU to attempt to accelerate grace periods 
in order to allow CPUs to enter dynticks-idle state more       
quickly.  On the other hand, this option increases the overhead
of the dynticks-idle checking, particularly on systems with   
large numbers of CPUs.

The second one depends on the first. After enabling both of those, I could query my GPU. I don't know why both are needed, but I'm guessing it's something to do with the interrupts. On my main desktop machine, only the first one is set.
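
To check whether a given kernel has these two options set, something like the following could be used. It is a sketch under a few assumptions: the helper name kconfig_state is made up, and the config path in the usage comment (/usr/src/linux/.config, the usual place on Gentoo) has to match the kernel that is actually running. If the kernel was built with CONFIG_IKCONFIG_PROC, the running config can be dumped from /proc/config.gz instead.

```shell
#!/bin/sh
# Print "y", "m" or "n" for a kernel option in a .config-style file.
# $1 = option name (e.g. CONFIG_NO_HZ), $2 = path to the config file.
# An option that is absent or "is not set" is reported as "n".
kconfig_state() {
    val=$(sed -n "s/^$1=\([ym]\)\$/\1/p" "$2" | head -n 1)
    printf '%s\n' "${val:-n}"
}

# Usage (path is an assumption; adjust to your kernel source tree):
#   for opt in CONFIG_NO_HZ CONFIG_RCU_FAST_NO_HZ; do
#       printf '%s=%s\n' "$opt" "$(kconfig_state "$opt" /usr/src/linux/.config)"
#   done
# Or, with CONFIG_IKCONFIG_PROC enabled:
#   zcat /proc/config.gz > /tmp/running.config
#   kconfig_state CONFIG_NO_HZ /tmp/running.config
```

The pattern anchors on "NAME=" so that CONFIG_NO_HZ does not accidentally match longer option names that start with the same prefix.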

Anyway, I'm off to sleep. Hope this helps somebody.

Godlikearg 09-19-12 02:32 PM

Re: K1000M on Thinkpad W530: card falls off the bus
 
One more thing: after doing more testing at the request of the Bumblebee guys, I could see that the IOMMU kernel configuration has an impact too. Without this option compiled in:

Code:

CONFIG_CALGARY_IOMMU:                                     
                                                           
Support for hardware IOMMUs in IBM's xSeries x366 and x460 
systems. Needed to run systems with more than 3GB of memory
properly with 32-bit PCI devices that do not support DAC   
(Double Address Cycle). Calgary also supports bus level   
isolation, where all DMAs pass through the IOMMU.  This   
prevents them from going anywhere except their intended   
destination. This catches hard-to-find kernel bugs and     
mis-behaving drivers and devices that do not use the DMA-API
properly to set up their DMA buffers.  The IOMMU can be   
turned off at boot time with the iommu=off parameter.     
Normally the kernel will make the right choice by itself. 
If unsure, say Y.

I would not get "has fallen off the bus", but I would still get the subsequent messages (rminitcontext etc.), and the card was unusable.


All times are GMT -5.