10-21-11, 08:18 AM   #5
echevreau
Re: NVRM Xid 26 error

Here are the details.

The system is a bullx B505 blade with 2 Tesla M1060 GPUs and 2 Xeon E5540 CPUs.

The NVIDIA driver version is 270.41.19:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16 23:32:08 PDT 2011

We use CUDA 4.0 (the standard NVIDIA release).

For the MPI part we use Intel MPI 4.0 with the shm and ofa communication layers, and the InfiniBand network is the one
provided in the bullx bladecenter.
We use one GPU per MPI process, so we use only 2 of the 8 cores per node.
All the MPI communication is handled by the CPU, i.e. we don't do things like moving data directly from GPU memory to the network device.
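
To make the "one GPU per MPI process" layout concrete, here is a minimal sketch in Python. It is not the code from our application. It assumes the launcher gives each process a node-local rank (passed in here as a plain argument), and the helper names select_gpu and pin_process_to_gpu are made up for illustration. CUDA_VISIBLE_DEVICES is the standard CUDA environment variable and has to be set before the CUDA runtime is initialized in that process.

```python
import os


def select_gpu(local_rank, gpus_per_node):
    """Return the GPU index for a node-local rank (one process per GPU)."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("more processes on this node than GPUs")
    return local_rank


def pin_process_to_gpu(local_rank, gpus_per_node, env=None):
    """Restrict this process to a single GPU via CUDA_VISIBLE_DEVICES.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    if env is None:
        env = os.environ
    gpu = select_gpu(local_rank, gpus_per_node)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu


if __name__ == "__main__":
    # Hypothetical node with 2 GPUs and 2 MPI processes, as on our blades.
    for rank in range(2):
        fake_env = {}
        print(rank, pin_process_to_gpu(rank, 2, fake_env), fake_env)
```

With 2 GPUs per node this gives exactly 2 processes per node, which is why only 2 of the 8 cores are busy.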

I launched a run this morning, and the computation is now stuck in the same state I described before.
Using nvidia-smi I see that GPU 0:83:0 (I suppose that is the PCI device ID) is running at 100%, but
the Xid error has not appeared. A strange thing is that nvidia-smi does not report the GPU temperature,
while it does report it on the NVS290 and C2070 boards attached to my workstation.

I then killed the MPI process, and the Xid errors appeared several seconds later:
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
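
Since these lines repeat with only the last field changing, a small script helps to count them per device. This is a rough sketch: the regular expression is written against the lines quoted above only, and the group names (ch, m, d, intr) just mirror the labels in the log without claiming what they mean.

```python
import re
from collections import Counter

# Pattern derived from the quoted log lines; not an official format description.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"Ch (?P<ch>[0-9a-fA-F]+) M (?P<m>[0-9a-fA-F]+) "
    r"D (?P<d>[0-9a-fA-F]+) intr (?P<intr>[0-9a-fA-F]+)"
)


def parse_xid_lines(lines):
    """Return one dict per line that contains an Xid report."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def summarize(events):
    """Count events per (PCI address, Xid number, intr value)."""
    return Counter((e["pci"], int(e["xid"]), e["intr"]) for e in events)


SAMPLE = [
    "> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000",
    "> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000",
    "> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000",
]

if __name__ == "__main__":
    for key, count in summarize(parse_xid_lines(SAMPLE)).items():
        print(key, count)
```

The same functions can be fed the output of dmesg or the lines of the kernel log file.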

And if I try deviceQuery, I get the following error:
> -bash-3.2$ ./deviceQuery
> [deviceQuery] starting...
> ./deviceQuery Starting...
>
> CUDA Device Query (Runtime API) version (CUDART static linking)
>
> cudaGetDeviceCount returned 10
> -> invalid device ordinal
> [deviceQuery] test results...
> FAILED
>
> Press ENTER to exit...

The following error messages also appear in the log:
NVRM: RmInitAdapter failed! (0x26:0xffffffff:1050)
NVRM: rm_init_adapter(1) failed


-bash-3.2$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Fri Oct 21 13:03:40 2011

Driver Version : 270.41.19

Attached GPUs : 2

GPU 0:2:0
    Product Name : Tesla M1060
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-e463ac18a737100e-dcfb3fd4-823b9b82-a65f1811-f0f496036439b2124dc7c090
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 2
        Device : 0
        Domain : 0
        Device Id : 5E710DE
        Bus Id : 0:2:0
    Fan Speed : N/A
    Memory Usage
        Total : 4095 Mb
        Used : 64 Mb
        Free : 4031 Mb
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : N/A
    Power Readings
        Power State : P0
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : 612 MHz
        SM : 1296 MHz
        Memory : 792 MHz
GPU 0:83:0
    Product Name : Tesla M1060
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-ddee45f6f0d255c2-d748202f-3aa7f5e7-a4a382f4-46eee3c1b9dbb4b9d9d14cda
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 83
        Device : 0
        Domain : 0
        Device Id : 5E710DE
        Bus Id : 0:83:0
    Fan Speed : N/A
    Memory Usage
        Total : 4095 Mb
        Used : 74 Mb
        Free : 4021 Mb
    Compute Mode : Default
    Utilization
        Gpu : 100 %
        Memory : 0 %
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : N/A
    Power Readings
        Power State : P0
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : 612 MHz
        SM : 1296 MHz
        Memory : 792 MHz
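
Because the label "Gpu" appears both under Utilization and under Temperature, picking the stuck GPU out of this log by eye is error-prone. Below is a minimal sketch of a parser that reports which GPUs sit at or above a utilization threshold. It assumes only the "key : value" layout of the nvidia-smi -q output pasted above, ignores indentation, and the function name busy_gpus is made up for illustration.

```python
def busy_gpus(smi_text, threshold=100):
    """Return the IDs of GPUs whose 'Utilization / Gpu' value is >= threshold."""
    gpu = None       # ID taken from the most recent "GPU <id>" heading
    section = None   # most recent heading line (any line without " : ")
    busy = []
    for raw in smi_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(" : ")
        if not sep:
            # Heading line such as "GPU 0:83:0", "Utilization", "Temperature".
            if line.startswith("GPU "):
                gpu = line[4:].strip()
            section = line
            continue
        if gpu and section == "Utilization" and key.strip() == "Gpu":
            pct = value.strip().rstrip("%").strip()
            if pct.isdigit() and int(pct) >= threshold:
                busy.append(gpu)
    return busy


SAMPLE = """\
GPU 0:2:0
Utilization
Gpu : 0 %
Memory : 0 %
Temperature
Gpu : N/A
GPU 0:83:0
Utilization
Gpu : 100 %
Memory : 0 %
Temperature
Gpu : N/A
"""

if __name__ == "__main__":
    print(busy_gpus(SAMPLE))
```

Running it periodically while the job is stuck would show whether the same board is always the one pegged at 100%.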

For more information, here is the deviceQuery output when it passes:

##############
-bash-3.2$ ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla M1060"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4096 MBytes (4294770688 bytes)
(30) Multiprocessors x ( 8) CUDA Cores/MP: 240 CUDA Cores
GPU Clock Speed: 1.30 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 512-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla M1060"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4096 MBytes (4294770688 bytes)
(30) Multiprocessors x ( 8) CUDA Cores/MP: 240 CUDA Cores
GPU Clock Speed: 1.30 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 512-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Tesla M1060, Device = Tesla M1060
[deviceQuery] test results...
PASSED

Press ENTER to exit...
###############
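
One detail when comparing the outputs: the second board shows up as 0:83:0 in nvidia-smi, as 0000:83:00 in the Xid lines, and as bus 131 in deviceQuery. These agree if the first two print the bus number in hexadecimal and deviceQuery prints it in decimal, since 0x83 = 131. A small sketch of that conversion follows; the helper name bus_number is made up, and the hexadecimal reading is my interpretation of the outputs above.

```python
def bus_number(bus_id):
    """Extract the bus field from 'domain:bus:device' and read it as hex.

    Accepts both the short form ('0:83:0') and the padded form ('0000:83:00').
    """
    parts = bus_id.replace(".", ":").split(":")
    if len(parts) < 3:
        raise ValueError("expected domain:bus:device, got %r" % bus_id)
    return int(parts[1], 16)


if __name__ == "__main__":
    for bus_id in ("0:2:0", "0:83:0", "0000:83:00"):
        print(bus_id, "->", bus_number(bus_id))
```

So deviceQuery's Device 1 (bus 131) is the same board that nvidia-smi shows at 100% and that the Xid messages point to.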