Old 10-20-11, 10:59 AM   #1
echevreau
Registered User
 
Join Date: Oct 2011
Posts: 3
NVRM Xid 26 error

Hello,

What does this error mean?
NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000

Could it be a hardware error?

Thanks
Attached Files
File Type: gz nvidia-bug-report.log.gz (27.7 KB, 29 views)
Old 10-20-11, 12:37 PM   #2
artem
Registered User
 
Join Date: Jun 2006
Posts: 710
Re: NVRM Xid 26 error

It can be anything. When does it happen for you? Do you do something specific when it triggers?
Old 10-21-11, 03:28 AM   #3
echevreau
Registered User
 
Join Date: Oct 2011
Posts: 3
Re: NVRM Xid 26 error

The context is this:

An MPI+GPU job runs for a good half hour and then hangs. The call stacks show every process waiting in an MPI_Alltoall except one: the process on node bullx002, which is always in the GPU part of the code.

The customer killed all the MPI processes and then tried to run "deviceQuery" on bullx002. It stays blocked, and the log shows the message with Xid code 26.
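
For anyone checking nodes for this symptom, here is a minimal sketch, assuming Python 3 is available on the node, of running a diagnostic such as deviceQuery under a timeout so that a blocked run is reported instead of tying up the shell. The function name run_with_timeout and the 60-second default are choices made for this sketch, not anything from the CUDA SDK.

```python
import subprocess

def run_with_timeout(cmd, timeout_s=60):
    """Run cmd and classify the result as 'ok', 'failed' or 'hung'.

    Returns a (status, output) tuple. stdin is connected to /dev/null so a
    program that waits for ENTER at the end sees end-of-file and exits.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None or bytes depending on the Python version.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "hung", out
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

A call would look like run_with_timeout(["./deviceQuery"], timeout_s=120). One caveat: on timeout, subprocess.run kills the child and waits for it, so if the process is stuck inside the driver in a state that ignores signals, the wrapper itself may block as well.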
Old 10-21-11, 03:55 AM   #4
lexa2
Registered User
 
Join Date: Jul 2011
Location: Moscow, Russian Federation
Posts: 58
Re: NVRM Xid 26 error

It might be either a driver bug or a hardware fault. If you want the NVIDIA dev team to be able to help you, please provide complete details. Most HW/SW configuration details are available in the nvidia-bug-report log, but you haven't shed any light on the exact GPU usage pattern. From your answer I assume you're doing some kind of parallel computing using both multicore CPUs and GPUs. That raises the question of which low-level library you use to compute on the GPU: is it CUDA or OpenCL? Next, AFAIR, MPI is a high-level standard that has been implemented in many different libraries from different vendors. Which one do you use? How does your MPI library gain access to the GPU hardware? Does it use CUDA and/or OpenCL directly, or does it go through some middleware?

Your best bet would be to try to reproduce the GPU hang and write a small, standalone test case for it. That would greatly help the NVIDIA dev team reproduce and debug the problem.
Old 10-21-11, 07:18 AM   #5
echevreau
Registered User
 
Join Date: Oct 2011
Posts: 3
Re: NVRM Xid 26 error

Here are the details.

The system is a bullx B505 blade with 2 Tesla M1060 GPUs and 2 Xeon E5540 CPUs.

Nvidia driver version is 270.41.19:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16 23:32:08 PDT 2011

We use CUDA 4.0 (the standard NVIDIA release).

For the MPI part we use Intel MPI 4.0 with the shm and ofa communication layers, and the InfiniBand network is the one provided in the bullx blade center.
We use one GPU per MPI process, so we use only 2 of the 8 cores per node.
All the MPI work is handled by the CPU, i.e. we don't do things like moving data directly from GPU memory to the network device.

I launched a run this morning, and the computation is now stuck in the same state I described before.
nvidia-smi shows that GPU 0:83:0 (I suppose that is the PCI device ID) is running at 100%, but
the Xid error has not appeared. One strange thing is that nvidia-smi does not report the GPU temperature here,
while it does report it for the NVS290 and C2070 boards attached to my workstation.

I then killed the MPI processes, and the Xid errors appeared several seconds later:
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
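
The eight lines above differ only in the intr value (04400000 five times, 00400000 three times). When there are more of them, a small script can do the grouping. This is a rough sketch, assuming Python 3; the regular expression is written against the lines quoted here, and the group names are just labels for this sketch, since the thread does not say what Ch, M, D or intr stand for.

```python
import re
from collections import Counter

# Pattern for the NVRM Xid lines as quoted in this thread. The group names
# are labels chosen for this sketch, not documented field meanings.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"Ch (?P<ch>[0-9a-fA-F]+) M (?P<m>[0-9a-fA-F]+) "
    r"D (?P<d>[0-9a-fA-F]+) intr (?P<intr>[0-9a-fA-F]+)"
)

def parse_xid_lines(text):
    """Return one dict of captured fields per Xid line found in text."""
    return [match.groupdict() for match in XID_RE.finditer(text)]

def summarize(events):
    """Count events per (PCI address, Xid code, intr value)."""
    return Counter((e["pci"], e["xid"], e["intr"]) for e in events)
```

Feeding it the kernel log text (for example the output of dmesg saved to a file) and printing summarize(parse_xid_lines(text)) gives one count per distinct combination.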

And if I try a deviceQuery, I get the following error:
> -bash-3.2$ ./deviceQuery
> [deviceQuery] starting...
> ./deviceQuery Starting...
>
> CUDA Device Query (Runtime API) version (CUDART static linking)
>
> cudaGetDeviceCount returned 10
> -> invalid device ordinal
> [deviceQuery] test results...
> FAILED
>
> Press ENTER to exit...

The following error messages also appear in the log:
NVRM: RmInitAdapter failed! (0x26:0xffffffff:1050)
NVRM: rm_init_adapter(1) failed


-bash-3.2$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Fri Oct 21 13:03:40 2011

Driver Version : 270.41.19

Attached GPUs : 2

GPU 0:2:0
Product Name : Tesla M1060
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-e463ac18a737100e-dcfb3fd4-823b9b82-a65f1811-f0f496036439b2124dc7c090
Inforom Version
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
PCI
Bus : 2
Device : 0
Domain : 0
Device Id : 5E710DE
Bus Id : 0:2:0
Fan Speed : N/A
Memory Usage
Total : 4095 Mb
Used : 64 Mb
Free : 4031 Mb
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Temperature
Gpu : N/A
Power Readings
Power State : P0
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Clocks
Graphics : 612 MHz
SM : 1296 MHz
Memory : 792 MHz
GPU 0:83:0
Product Name : Tesla M1060
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-ddee45f6f0d255c2-d748202f-3aa7f5e7-a4a382f4-46eee3c1b9dbb4b9d9d14cda
Inforom Version
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
PCI
Bus : 83
Device : 0
Domain : 0
Device Id : 5E710DE
Bus Id : 0:83:0
Fan Speed : N/A
Memory Usage
Total : 4095 Mb
Used : 74 Mb
Free : 4021 Mb
Compute Mode : Default
Utilization
Gpu : 100 %
Memory : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : N/A
Temperature
Gpu : N/A
Power Readings
Power State : P0
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Clocks
Graphics : 612 MHz
SM : 1296 MHz
Memory : 792 MHz
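
To spot the busy GPU without reading the whole dump, something like the following could be used. It is a minimal sketch, assuming Python 3 and the "Key : Value" layout of this driver's nvidia-smi -q output as pasted above; other driver versions may format the report differently. The function name gpu_utilization is made up for this sketch.

```python
def gpu_utilization(smi_text):
    """Map each GPU bus id to its 'Utilization / Gpu' percentage.

    Lines without ' : ' are treated as section headers, which is how the
    'Gpu' entry under Utilization is told apart from the one under Temperature.
    """
    result = {}
    gpu = None       # bus id from the latest "GPU <bus id>" header
    section = None   # latest header line that carries no value
    for raw in smi_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " : " not in line:
            if line.startswith("GPU "):
                gpu = line[len("GPU "):]
            section = line
            continue
        key, value = (part.strip() for part in line.split(" : ", 1))
        if gpu and section == "Utilization" and key == "Gpu" and value.endswith("%"):
            result[gpu] = int(value.rstrip("% "))
    return result
```

On the report above this would give 0 for GPU 0:2:0 and 100 for GPU 0:83:0.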


For more information, here is the deviceQuery output:

##############
-bash-3.2$ ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla M1060"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4096 MBytes (4294770688 bytes)
(30) Multiprocessors x ( 8) CUDA Cores/MP: 240 CUDA Cores
GPU Clock Speed: 1.30 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 512-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla M1060"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4096 MBytes (4294770688 bytes)
(30) Multiprocessors x ( 8) CUDA Cores/MP: 240 CUDA Cores
GPU Clock Speed: 1.30 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 512-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Tesla M1060, Device = Tesla M1060
[deviceQuery] test results...
PASSED

Press ENTER to exit...
###############
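
A note on matching devices between the tools: the Xid lines and nvidia-smi identify the failing GPU as bus 83, while deviceQuery lists PCI bus 131 for Device 1. These appear to be the same number, assuming nvidia-smi prints the bus in hexadecimal as the kernel log does, since 0x83 = 131. A small sketch of the conversion, assuming Python 3 (the function name is made up here):

```python
def smi_bus_to_decimal(bus_id):
    """Return the decimal bus number from a 'domain:bus:device' id.

    Assumes the fields are hexadecimal, as in '0:83:0' or '0000:83:00'.
    """
    _domain, bus, _device = bus_id.split(":")
    return int(bus, 16)

print(smi_bus_to_decimal("0:83:0"))  # 131
print(smi_bus_to_decimal("0:2:0"))   # 2
```

So under that assumption, GPU 0:83:0 in nvidia-smi corresponds to Device 1 in deviceQuery, and GPU 0:2:0 to Device 0.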