12-08-11, 03:00 PM   #1
QuesarVII
Registered User
 
Join Date: Sep 2003
Location: MA
Posts: 4
System temporarily unresponsive (slows/hangs) when running nvidia-smi or CUDA apps on RHEL6

Hello all,

I am working on two identical 1U GPU systems running RHEL6. Each has a pair of M2075 GPUs installed.

If I run either a CUDA app (deviceQuery from the SDK, for example) or "nvidia-smi -a", the system becomes unresponsive for about 5 seconds. I cannot type anything, move windows, click anything, etc. I also reproduced this in a plain text console (no X): if I run it in a tty, I cannot switch ttys or type anything until the process completes.

I have tried several other configurations without a problem: Gentoo with a 3.0.6 kernel, openSUSE 11.4 (2.6.37-based kernel), and a custom 3.1.4 kernel on RHEL6 all worked fine. The problem only exists when running on the RHEL6 2.6.32 series kernel.

I originally tested with 2.6.32-71.29.1.el6. After performing updates, I now have the latest kernel update from Red Hat, 2.6.32-220.el6.x86_64. I have tried the following NVIDIA driver versions, all with the same behavior: 290.10, 280.13, and 275.09.07. nouveau has been disabled both with "blacklist nouveau" in /etc/modprobe.d/blacklist.conf and with "rdblacklist=nouveau" as a kernel parameter.

Are there known problems running on RHEL6's kernels?

Thanks
__________________
Use the source!
12-12-11, 12:20 PM   #2
ralexander
Registered User
 
Join Date: Dec 2011
Posts: 1
Re: system temporarily unresponsive slows/hangs running nvidia-smi or cuda apps on RH

Hey Quesar,

Can you post the log from nvidia-bug-report.sh? This will be installed with the display driver.

Also what output do you get for:
$ time nvidia-smi -q

Thanks,
Robert
12-12-11, 01:02 PM   #3
kwizart
Registered User
 
Join Date: Feb 2005
Location: Paris, France
Posts: 129
Re: system temporarily unresponsive slows/hangs running nvidia-smi or cuda apps on RH

I don't think rdblacklist=nouveau really matters, only nouveau.modeset=0 is important.
12-13-11, 10:42 AM   #4
QuesarVII
Registered User
 
Join Date: Sep 2003
Location: MA
Posts: 4
Re: system temporarily unresponsive slows/hangs running nvidia-smi or cuda apps on RH

Quote:
Originally Posted by ralexander View Post
Hey Quesar,

Can you post the log from nvidia-bug-report.sh? This will be installed with the display driver.

Also what output do you get for:
$ time nvidia-smi -q

Thanks,
Robert

Here is the nvidia-bug-report.log.gz.

[microway@rhel6-server ~]$ time nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Tue Dec 13 11:32:35 2011

Driver Version : 290.10

Attached GPUs : 2

GPU 0000:05:00.0
Product Name : Tesla M2075
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322711096879
GPU UUID : GPU-03c10c64efb3e2cb-631fec73-fd4d75d6-74383501-219d3814fb5a7c6ad1efae72
VBIOS Version : 70.10.46.00.02
Inforom Version
OEM Object : 1.1
ECC Object : 2.0
Power Management Object : 4.0
PCI
Bus : 0x05
Device : 0x00
Domain : 0x0000
Device Id : 0x109410DE
Bus Id : 0000:05:00.0
Sub System Id : 0x088810DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P12
Memory Usage
Total : 5375 MB
Used : 9 MB
Free : 5365 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Temperature
Gpu : N/A
Power Readings
Power Management : Supported
Power Draw : 34.69 W
Power Limit : 225 W
Clocks
Graphics : 50 MHz
SM : 101 MHz
Memory : 135 MHz
Max Clocks
Graphics : 573 MHz
SM : 1147 MHz
Memory : 1566 MHz
Compute Processes : None

GPU 0000:06:00.0
Product Name : Tesla M2075
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322711097238
GPU UUID : GPU-49cd0d0f20862797-3bf20766-0ffbb2f4-4522ef91-7f7198555512afb63f2795cb
VBIOS Version : 70.10.46.00.02
Inforom Version
OEM Object : 1.1
ECC Object : 2.0
Power Management Object : 4.0
PCI
Bus : 0x06
Device : 0x00
Domain : 0x0000
Device Id : 0x109410DE
Bus Id : 0000:06:00.0
Sub System Id : 0x088810DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P12
Memory Usage
Total : 5375 MB
Used : 9 MB
Free : 5365 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Temperature
Gpu : N/A
Power Readings
Power Management : Supported
Power Draw : 38.66 W
Power Limit : 225 W
Clocks
Graphics : 50 MHz
SM : 101 MHz
Memory : 135 MHz
Max Clocks
Graphics : 573 MHz
SM : 1147 MHz
Memory : 1566 MHz
Compute Processes : None


real 0m21.054s
user 0m0.002s
sys 0m20.892s
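
For what it's worth, the timing above says almost all of the wall-clock time was spent in kernel space rather than in nvidia-smi itself. A minimal sketch of that arithmetic follows; the helper names to_seconds and sys_share are made up for this example and are not part of any NVIDIA tool.

```shell
# Hypothetical helpers (names invented for this sketch).

# Convert a bash `time` value such as 0m21.054s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print sys time as a percentage of real time.
# $1 = real (e.g. 0m21.054s), $2 = sys (e.g. 0m20.892s)
sys_share() {
    real=$(to_seconds "$1")
    sys=$(to_seconds "$2")
    awk -v r="$real" -v s="$sys" 'BEGIN {
        if (r == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * s / r
    }'
}

# Values taken from the run above.
sys_share 0m21.054s 0m20.892s   # prints 99.2
```

So roughly 99% of the 21 seconds is sys time, while user time (0.002s) is negligible.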


Quote:
Originally Posted by kwizart View Post
I don't think rdblacklist=nouveau really matters, only nouveau.modeset=0 is important.

rdblacklist=nouveau prevents the initrd image from loading the nouveau driver, and "blacklist nouveau" in /etc/modprobe.d/blacklist.conf prevents the booted OS from loading it afterward. Combined, they make sure the driver doesn't get loaded at all. nouveau.modeset=0 makes sure that even if the driver gets loaded, it doesn't try to change the mode. As long as the module is blacklisted in both locations, though, it shouldn't be needed at all.
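
A rough way to confirm that both blacklist entries are actually in place is sketched below. The function name nouveau_blacklist_status is invented for this example, and the default paths are the ones mentioned in this thread plus /proc/cmdline; treat it as a sketch, not a definitive check.

```shell
# Hypothetical helper (name invented for this sketch): report whether
# nouveau is blacklisted both for modprobe and on the kernel command line.
# $1 = modprobe blacklist file, $2 = file holding the kernel command line.
nouveau_blacklist_status() {
    modprobe_conf=${1:-/etc/modprobe.d/blacklist.conf}
    cmdline_file=${2:-/proc/cmdline}

    # Look for an uncommented "blacklist nouveau" line.
    if [ -r "$modprobe_conf" ] &&
       grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau([[:space:]]|$)' "$modprobe_conf"; then
        echo "modprobe: blacklisted"
    else
        echo "modprobe: NOT blacklisted"
    fi

    # Split the command line into one parameter per line and match exactly.
    if [ -r "$cmdline_file" ] &&
       tr ' ' '\n' < "$cmdline_file" | grep -qx 'rdblacklist=nouveau'; then
        echo "initrd: blacklisted"
    else
        echo "initrd: NOT blacklisted"
    fi
}
```

Calling nouveau_blacklist_status with no arguments checks the live system; passing two file paths lets you check copies of the files instead.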



I also have another piece of info now. Based on a tip from a colleague experiencing the same problem, I ran "nvidia-smi --loop=5". The system went through the 10-20 second unresponsive pause again, but after that, as long as the nvidia-smi loop kept running, system response was normal. CUDA jobs would run and return very quickly, as expected. I'm not sure what this indicates, but it seems like potentially valuable info for debugging this problem.
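
In case anyone wants to repeat that observation, here is a minimal sketch. It only uses the two nvidia-smi invocations already mentioned in this thread (--loop=5 and -q); the SMI variable, the function name time_query_with_loop, and the 30-second default wait are assumptions made for this example.

```shell
# SMI is an assumption made for this sketch: it allows a stub to be
# substituted on a machine without the NVIDIA driver installed.
SMI=${SMI:-nvidia-smi}

# Time one query while a looping nvidia-smi runs in the background.
# $1 = seconds to wait for the initial pause to pass (default 30, a guess).
time_query_with_loop() {
    settle=${1:-30}

    # Start the looping client and remember its PID.
    "$SMI" --loop=5 > /dev/null 2>&1 &
    loop_pid=$!
    sleep "$settle"

    # Measure a second query with whole-second resolution.
    start=$(date +%s)
    "$SMI" -q > /dev/null 2>&1
    end=$(date +%s)

    # Stop the background loop again.
    kill "$loop_pid" 2>/dev/null
    wait "$loop_pid" 2>/dev/null

    echo "query took $((end - start))s with loop running"
}
```

Running time_query_with_loop and comparing the result with a plain "time nvidia-smi -q" (no loop in the background) should show whether the long pause only happens when nothing else is using the driver.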