nV News Forums

 
 


echevreau 10-20-11 10:59 AM

NVRM Xid 26 error
 
1 Attachment(s)
Hello,

What is the meaning of this error:
NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000

Could it be a hardware error?

Thanks

artem 10-20-11 12:37 PM

Re: NVRM Xid 26 error
 
It could be anything. When does it happen for you? Are you doing something specific when it triggers?

echevreau 10-21-11 03:28 AM

Re: NVRM Xid 26 error
 
Here is the context:

An MPI+GPU job runs for a good half hour and then gets stuck. Looking at the call stacks, all processes are waiting in an MPI_Alltoall except one, on node bullx002, which is always inside a GPU kernel.

The customer killed all the MPI processes and then tried to run "deviceQuery" on bullx002, but it stays blocked. Looking at the log, there is the message with Xid code 26.
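
Since deviceQuery stays blocked on the affected node, it can help to run such checks under a timeout so the shell or a monitoring script does not hang with it. The following is a minimal sketch, not something taken from the thread: the helper name run_with_timeout is hypothetical, and a stand-in command that just sleeps is used in place of ./deviceQuery. One caveat: if the process is stuck in an uninterruptible driver call, the kill issued on timeout may not take effect and the helper itself could block.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, output).

    status is "ok" (exit code 0), "failed" (non-zero exit code)
    or "timeout" (the command did not finish within timeout_s seconds).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # a "Press ENTER to exit..." prompt gets EOF
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout


# Stand-in for a blocked deviceQuery: a child process that sleeps for 5 s.
# On the node one would pass ["./deviceQuery"] instead (path is an assumption).
blocked_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
status, _ = run_with_timeout(blocked_cmd, timeout_s=0.5)
print(status)
```

A "timeout" status on one node while the same check returns "ok" elsewhere would point at that node's GPU or driver state rather than at the application.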

lexa2 10-21-11 03:55 AM

Re: NVRM Xid 26 error
 
It might be either a driver bug or a hardware fault. If you want the NVIDIA dev team to be able to help you, please provide complete details. Most HW/SW configuration details are available in the nvidia-bug-report log, but you haven't shed any light on the exact GPU usage pattern. From your answer I assume that you're doing some kind of parallel computing using both multicore CPUs and GPUs. This leads to the question of which low-level library you use to compute on the GPU: is it CUDA or OpenCL? Next, AFAIR, MPI is a high-level standard that has been implemented in many different libraries from different vendors. Which one do you use? How does the MPI library you use gain access to the GPU hardware? Does it use CUDA and/or OpenCL directly, or does it go through some middleware?

Your best bet would be to try to reproduce the GPU hang you are experiencing and write a small, self-contained test case for it. That would greatly help the NVIDIA dev team reproduce and debug the problem.

echevreau 10-21-11 07:18 AM

Re: NVRM Xid 26 error
 
Here are the details.

The system is a bullx B505 blade with 2 Tesla M1060 GPUs and 2 Xeon E5540 CPUs.

Nvidia driver version is 270.41.19:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16 23:32:08 PDT 2011

We use CUDA 4.0 (Nvidia standard release)

For the MPI part we use Intel MPI 4.0 with the shm and ofa communication layers, and the InfiniBand network is the one provided in the bullx bladecenter.
We use one GPU per MPI process, so we use only 2 of the 8 cores per node.
All the MPI work is handled by the CPU, i.e. we don't do things like moving data directly from GPU memory to the network device.
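
The post does not say how each MPI process picks its GPU. As an illustration only, a common convention is to take the node-local rank modulo the number of GPUs on the node. The sketch below assumes that convention; the function names are hypothetical, and how the node-local rank is obtained depends on the MPI implementation and is left out.

```python
def gpu_for_local_rank(local_rank, gpus_per_node=2):
    """Map a node-local MPI rank to a GPU index using the modulo convention.

    This is an assumed convention, not the mapping used in the thread.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return local_rank % gpus_per_node


def ranks_per_gpu(ranks_on_node, gpus_per_node=2):
    """Count how many node-local ranks end up on each GPU."""
    counts = [0] * gpus_per_node
    for local_rank in range(ranks_on_node):
        counts[gpu_for_local_rank(local_rank, gpus_per_node)] += 1
    return counts


# Two ranks on a node with two GPUs, as in the setup described above:
# each GPU is used by exactly one process.
print(ranks_per_gpu(ranks_on_node=2, gpus_per_node=2))
```

With one process per GPU, a hang on a single GPU stalls exactly one rank, which would fit the picture of every other rank waiting in the collective.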

I launched a run this morning, and the computation is now stuck in the same state I described before.
Using nvidia-smi, I see that GPU 0:83:0 (I suppose that is the PCI device ID) is running at 100%, but the Xid error has not appeared. A strange thing is that nvidia-smi does not report the GPU temperature, while it does report it on the NVS290 and C2070 boards attached to my workstation.

I have now killed the MPI process, and the Xid error appeared several seconds later:
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
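
The eight lines above are identical except for the last field. A small parser makes that easy to see. This is only a sketch: the regular expression is written to match the lines as they appear in this log, the field names (ch, m, d, intr) are copied from the log text, and no meaning is assigned to them here.

```python
import re
from collections import Counter

XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"Ch (?P<ch>[0-9a-fA-F]+) M (?P<m>[0-9a-fA-F]+) "
    r"D (?P<d>[0-9a-fA-F]+) intr (?P<intr>[0-9a-fA-F]+)"
)


def parse_xid_lines(log_text):
    """Return one dict of raw string fields per Xid line found in log_text."""
    return [match.groupdict() for match in XID_RE.finditer(log_text)]


# The eight lines from the log above, rebuilt from their only varying field.
SAMPLE = "\n".join(
    "> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr " + intr
    for intr in ["04400000", "04400000", "00400000", "04400000",
                 "00400000", "04400000", "00400000", "04400000"]
)

records = parse_xid_lines(SAMPLE)
intr_counts = Counter(record["intr"] for record in records)
distinct = sorted(int(value, 16) for value in intr_counts)
differing_bits = distinct[0] ^ distinct[-1]

print(dict(intr_counts))
print("bits that differ between the intr values: 0x%08x" % differing_bits)
```

All eight records share the same PCI address, Xid number, Ch, M and D values; only intr alternates between two values that differ in a single bit.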

And if I try a deviceQuery, I get the following error:
> -bash-3.2$ ./deviceQuery
> [deviceQuery] starting...
> ./deviceQuery Starting...
>
> CUDA Device Query (Runtime API) version (CUDART static linking)
>
> cudaGetDeviceCount returned 10
> -> invalid device ordinal
> [deviceQuery] test results...
> FAILED
>
> Press ENTER to exit...

The following error message also appears in the log:
NVRM: RmInitAdapter failed! (0x26:0xffffffff:1050)
NVRM: rm_init_adapter(1) failed

-bash-3.2$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Fri Oct 21 13:03:40 2011

Driver Version : 270.41.19

Attached GPUs : 2

GPU 0:2:0
    Product Name : Tesla M1060
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-e463ac18a737100e-dcfb3fd4-823b9b82-a65f1811-f0f496036439b2124dc7c090
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 2
        Device : 0
        Domain : 0
        Device Id : 5E710DE
        Bus Id : 0:2:0
    Fan Speed : N/A
    Memory Usage
        Total : 4095 Mb
        Used : 64 Mb
        Free : 4031 Mb
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : N/A
    Power Readings
        Power State : P0
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : 612 MHz
        SM : 1296 MHz
        Memory : 792 MHz
GPU 0:83:0
    Product Name : Tesla M1060
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-ddee45f6f0d255c2-d748202f-3aa7f5e7-a4a382f4-46eee3c1b9dbb4b9d9d14cda
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 83
        Device : 0
        Domain : 0
        Device Id : 5E710DE
        Bus Id : 0:83:0
    Fan Speed : N/A
    Memory Usage
        Total : 4095 Mb
        Used : 74 Mb
        Free : 4021 Mb
    Compute Mode : Default
    Utilization
        Gpu : 100 %
        Memory : 0 %
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : N/A
    Power Readings
        Power State : P0
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : 612 MHz
        SM : 1296 MHz
        Memory : 792 MHz

For more Information:
deviceQuery Output:

##############
-bash-3.2$ ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla M1060"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4096 MBytes (4294770688 bytes)
(30) Multiprocessors x ( 8) CUDA Cores/MP: 240 CUDA Cores
GPU Clock Speed: 1.30 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 512-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla M1060"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4096 MBytes (4294770688 bytes)
(30) Multiprocessors x ( 8) CUDA Cores/MP: 240 CUDA Cores
GPU Clock Speed: 1.30 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 512-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Tesla M1060, Device = Tesla M1060
[deviceQuery] test results...
PASSED

Press ENTER to exit...
###############
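
The three tools in this thread name the same board differently: the kernel log says 0000:83:00, nvidia-smi says 0:83:0, and deviceQuery says "131 / 0". Assuming the bus field in the first two is hexadecimal (as is usual for PCI addresses) and the deviceQuery value is decimal, they agree, since 0x83 is 131. A minimal sketch of that check, with a hypothetical helper name:

```python
def pci_bus_from_address(pci_address):
    """Return the bus number as an int from 'domain:bus:device' text.

    Assumes the bus field is written in hexadecimal.
    """
    fields = pci_address.split(":")
    if len(fields) != 3:
        raise ValueError("expected domain:bus:device, got %r" % pci_address)
    return int(fields[1], 16)


xid_bus = pci_bus_from_address("0000:83:00")   # from the NVRM Xid lines
nvsmi_bus = pci_bus_from_address("0:83:0")     # from the nvidia-smi report
devicequery_bus = 131                          # "Device 1" in deviceQuery

print(xid_bus, nvsmi_bus, xid_bus == nvsmi_bus == devicequery_bus)
```

Under that assumption, the GPU reporting Xid 26 is the one deviceQuery lists as Device 1 and the one nvidia-smi shows at 100% utilization.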


All times are GMT -5.