#1
Registered User
Hello all,
I am working on two identical 1U GPU systems running RHEL6. Each has a pair of M2075 GPUs installed. If I run either a CUDA app (deviceQuery from the SDK, for example) or "nvidia-smi -a", the system becomes unresponsive for about 5 seconds. I cannot type anything, move windows, click anything, etc. I also reproduced this in a plain text console (no X): if I run it in a tty, I cannot switch ttys or type anything until the process completes.

I have tried several other configurations without a problem: Gentoo with a 3.0.6 kernel, openSUSE 11.4 (2.6.37 based), and a custom 3.1.4 kernel on RHEL6 all worked fine. The problem only exists when running on the RHEL6 2.6.32 series kernel. I originally tested with 2.6.32-71.29.1.el6. After performing updates, I currently have the latest kernel update from Red Hat, 2.6.32-220.el6.x86_64.

I have tried the following NVIDIA driver versions, all with the same behavior: 290.10, 280.13, and 275.09.07. nouveau has been disabled both with "blacklist nouveau" in /etc/modprobe.d/blacklist.conf and with "rdblacklist=nouveau" as a kernel parameter.

Are there known problems running on RHEL6's kernels?

Thanks
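For anyone retracing the nouveau setup described above, here is a minimal sketch of how the two settings could be double-checked. The helper names `has_kernel_param` and `is_blacklisted` are made up for this illustration; only the file path, the "blacklist nouveau" line, and the "rdblacklist=nouveau" parameter come from the post.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any NVIDIA or Red Hat tooling.

# Succeeds if the kernel command line in $1 contains the exact parameter $2.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads modprobe configuration text on stdin and succeeds if it contains
# an uncommented "blacklist <module>" line for the module named in $1.
is_blacklisted() {
    grep -Eq "^[[:space:]]*blacklist[[:space:]]+$1([[:space:]]|\$)"
}

# Possible use on a live system:
#   has_kernel_param "$(cat /proc/cmdline)" rdblacklist=nouveau && echo "kernel parameter set"
#   is_blacklisted nouveau < /etc/modprobe.d/blacklist.conf && echo "module blacklisted"
#   lsmod | grep -q '^nouveau ' && echo "nouveau is still loaded"
```

If the last check reports that nouveau is still loaded, one of the two settings is probably not taking effect.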
__________________
Use the source!
#2
Registered User
Join Date: Dec 2011
Posts: 1
Hey Quesar,
Can you post the log from nvidia-bug-report.sh? The script is installed with the display driver. Also, what output do you get for:

$ time nvidia-smi -q

Thanks,
Robert
#3
Registered User
Join Date: Feb 2005
Location: Paris, France
Posts: 129
I don't think rdblacklist=nouveau really matters, only nouveau.modeset=0 is important.
#4
Registered User
Quote:
[microway@rhel6-server ~]$ time nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Tue Dec 13 11:32:35 2011
Driver Version : 290.10

Attached GPUs : 2

GPU 0000:05:00.0
    Product Name : Tesla M2075
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0322711096879
    GPU UUID : GPU-03c10c64efb3e2cb-631fec73-fd4d75d6-74383501-219d3814fb5a7c6ad1efae72
    VBIOS Version : 70.10.46.00.02
    Inforom Version
        OEM Object : 1.1
        ECC Object : 2.0
        Power Management Object : 4.0
    PCI
        Bus : 0x05
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x109410DE
        Bus Id : 0000:05:00.0
        Sub System Id : 0x088810DE
        GPU Link Info
            PCIe Generation
                Max : 2
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
    Fan Speed : N/A
    Performance State : P12
    Memory Usage
        Total : 5375 MB
        Used : 9 MB
        Free : 5365 MB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
        Aggregate
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
    Temperature
        Gpu : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 34.69 W
        Power Limit : 225 W
    Clocks
        Graphics : 50 MHz
        SM : 101 MHz
        Memory : 135 MHz
    Max Clocks
        Graphics : 573 MHz
        SM : 1147 MHz
        Memory : 1566 MHz
    Compute Processes : None

GPU 0000:06:00.0
    Product Name : Tesla M2075
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0322711097238
    GPU UUID : GPU-49cd0d0f20862797-3bf20766-0ffbb2f4-4522ef91-7f7198555512afb63f2795cb
    VBIOS Version : 70.10.46.00.02
    Inforom Version
        OEM Object : 1.1
        ECC Object : 2.0
        Power Management Object : 4.0
    PCI
        Bus : 0x06
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x109410DE
        Bus Id : 0000:06:00.0
        Sub System Id : 0x088810DE
        GPU Link Info
            PCIe Generation
                Max : 2
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
    Fan Speed : N/A
    Performance State : P12
    Memory Usage
        Total : 5375 MB
        Used : 9 MB
        Free : 5365 MB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
        Aggregate
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
    Temperature
        Gpu : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 38.66 W
        Power Limit : 225 W
    Clocks
        Graphics : 50 MHz
        SM : 101 MHz
        Memory : 135 MHz
    Max Clocks
        Graphics : 573 MHz
        SM : 1147 MHz
        Memory : 1566 MHz
    Compute Processes : None

real    0m21.054s
user    0m0.002s
sys     0m20.892s

Quote:
Combined, the two blacklist settings make sure the nouveau driver doesn't get loaded at all. nouveau.modeset=0 makes sure that even if the driver gets loaded, it doesn't try to change the mode. However, as long as you have the module blacklisted in both locations, it shouldn't be needed at all.

I also have another piece of info now. Based on a tip from a colleague experiencing the same problem, I ran "nvidia-smi --loop=5". The system went through the 10-20 second unresponsive pause again, but after that, as long as the nvidia-smi loop kept running, system response was normal. CUDA jobs would run and return very quickly, as expected. I'm not sure what this indicates, but it seems like potentially valuable info for debugging this problem.
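One thing the `time` output quoted above does show is that almost the whole wait is sys time, meaning CPU time spent in kernel mode rather than in nvidia-smi's own code. The small sketch below makes that ratio explicit; the helper names `to_seconds` and `sys_share` are invented for this example, and it assumes the bash `time` format (for example 0m21.054s).

```shell
#!/bin/sh
# Hypothetical helpers for reading the output of bash's `time`.

# Convert a field such as "0m21.054s" into seconds.
to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print the percentage of wall-clock (real) time that was spent in
# kernel mode (sys). $1 is the real field, $2 is the sys field.
sys_share() {
    real=$(to_seconds "$1")
    sys=$(to_seconds "$2")
    LC_ALL=C awk -v r="$real" -v s="$sys" 'BEGIN {
        if (r > 0) printf "%.1f\n", 100 * s / r
        else print "n/a"
    }'
}

# Values taken from the timing in this post.
sys_share 0m21.054s 0m20.892s
```

For the numbers in this post, that works out to roughly 99% of the elapsed time being spent inside the kernel.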
__________________
Use the source!