|
|
#1 | |
|
Registered User
Join Date: Oct 2011
Posts: 3
|
Hello,
What does this error mean?
NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
Could it be a hardware error? Thanks
|
|
|
|
|
|
#2 | |
|
Registered User
Join Date: Jun 2006
Posts: 678
|
It could be anything. When does it happen for you? Are you doing something specific when it triggers?
|
|
|
|
|
|
|
#3 |
|
Registered User
Join Date: Oct 2011
Posts: 3
|
Here is the context:
An MPI+GPU job runs for a good half hour and then hangs. The call stacks show that all processes are waiting in an MPI_Alltoall except one, on node bullx002, which is always a core driving a GPU. The customer killed all the MPI processes and then tried to run "deviceQuery" on bullx002; it stayed blocked, and the log shows the message with Xid code 26.
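Since a blocked "deviceQuery" is the symptom here, a probe with a timeout can tell a hung node from one that merely fails. The following is a minimal sketch, not a vendor tool: `probe_gpu` and its return labels are hypothetical names, and it assumes the probe command exits non-zero on failure, which may not hold for every build of deviceQuery.

```python
import subprocess


def probe_gpu(cmd, timeout_s=30.0):
    """Run a GPU probe command and classify the outcome.

    Returns "ok", "failed", "hung", or "missing" (hypothetical labels).
    """
    try:
        # stdin is /dev/null so a "Press ENTER to exit..." prompt cannot block us.
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return "missing"
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait afterwards: a process
        # stuck inside the driver may not exit even after SIGKILL.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "failed"
```

For example, `probe_gpu(["./deviceQuery"], timeout_s=60)` could be run on each node before a job starts; a result of "hung" on bullx002 would match the behaviour described above.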
|
|
|
|
|
#4 | |
|
Registered User
|
Might be either a driver bug or a hardware fault. If you want the NVIDIA dev team to be able to help you, please be kind enough to provide complete details. Most HW/SW configuration details are available in the nvidia-bug-report log, but you haven't shed any light on the exact GPU usage pattern. From your answer I assume that you're doing some kind of parallel computing using both multicore CPUs and GPUs. This raises the question of which low-level library you use to compute on the GPU: is it CUDA or OpenCL? Next, AFAIR, MPI is a high-level standard that has been implemented in many different libraries from different vendors. Which one do you use? How does your MPI library gain access to the GPU hardware? Does it use CUDA and/or OpenCL directly, or does it go through some middleware?
Your best bet would be to try to reproduce the GPU hang and write a small, separate, specialized test case for it. That would greatly help the NVIDIA dev team reproduce and debug the problem.
|
|
|
|
|
|
#5 |
|
Registered User
Join Date: Oct 2011
Posts: 3
|
Here are the details.
The system is a BullX B505 blade with 2 M1060 and 2 E5540. The NVIDIA driver version is 270.41.19:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16 23:32:08 PDT 2011
We use CUDA 4.0 (NVIDIA standard release). For the MPI part we use Intel MPI 4.0 with the shm and ofa communication layers, and the InfiniBand network is the one provided in the bullx bladecenter. We use one GPU per MPI process, so we use only 2 of the 8 cores per node. All the MPI traffic is handled by the CPU, i.e. we don't do things like moving data directly from GPU memory to the network device.
I launched a run this morning, and the computation is now stuck in the same state I described before. With nvidia-smi I see that GPU 0:83:0 (I suppose that is the PCI device ID) is running at 100%, but the Xid error has not appeared. A strange thing is that nvidia-smi does not report the GPU temperature, while it does report it for the NVS290 and C2070 boards attached to my workstation.
I have now killed the MPI process, and the Xid error appeared several seconds later:
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 00400000
> NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr 04400000
And if I try a deviceQuery I get the following error:
> -bash-3.2$ ./deviceQuery
> [deviceQuery] starting...
> ./deviceQuery Starting...
>
> CUDA Device Query (Runtime API) version (CUDART static linking)
>
> cudaGetDeviceCount returned 10
> -> invalid device ordinal
> [deviceQuery] test results...
> FAILED
>
> Press ENTER to exit...
The following error message also appears in the log:
NVRM: RmInitAdapter failed! (0x26:0xffffffff:1050)
NVRM: rm_init_adapter(1) failed

-bash-3.2$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Fri Oct 21 13:03:40 2011
Driver Version : 270.41.19
Attached GPUs : 2

GPU 0:2:0
    Product Name : Tesla M1060
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-e463ac18a737100e-dcfb3fd4-823b9b82-a65f1811-f0f496036439b2124dc7c090
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 2
        Device : 0
        Domain : 0
        Device Id : 5E710DE
        Bus Id : 0:2:0
    Fan Speed : N/A
    Memory Usage
        Total : 4095 Mb
        Used : 64 Mb
        Free : 4031 Mb
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : N/A
    Power Readings
        Power State : P0
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : 612 MHz
        SM : 1296 MHz
        Memory : 792 MHz

GPU 0:83:0
    Product Name : Tesla M1060
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-ddee45f6f0d255c2-d748202f-3aa7f5e7-a4a382f4-46eee3c1b9dbb4b9d9d14cda
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 83
        Device : 0
        Domain : 0
        Device Id : 5E710DE
        Bus Id : 0:83:0
    Fan Speed : N/A
    Memory Usage
        Total : 4095 Mb
        Used : 74 Mb
        Free : 4021 Mb
    Compute Mode : Default
    Utilization
        Gpu : 100 %
        Memory : 0 %
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : N/A
    Power Readings
        Power State : P0
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : 612 MHz
        SM : 1296 MHz
        Memory : 792 MHz

For more information, here is the deviceQuery output:
##############
-bash-3.2$ ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla M1060"
  CUDA Driver Version / Runtime Version  4.0 / 4.0
  CUDA Capability Major/Minor version number:  1.3
  Total amount of global memory:  4096 MBytes (4294770688 bytes)
  (30) Multiprocessors x ( 8) CUDA Cores/MP:  240 CUDA Cores
  GPU Clock Speed:  1.30 GHz
  Memory Clock rate:  800.00 Mhz
  Memory Bus Width:  512-bit
  Max Texture Dimension Size (x,y,z)  1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers  1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:  65536 bytes
  Total amount of shared memory per block:  16384 bytes
  Total number of registers available per block:  16384
  Warp size:  32
  Maximum number of threads per block:  512
  Maximum sizes of each dimension of a block:  512 x 512 x 64
  Maximum sizes of each dimension of a grid:  65535 x 65535 x 1
  Maximum memory pitch:  2147483647 bytes
  Texture alignment:  256 bytes
  Concurrent copy and execution:  Yes with 1 copy engine(s)
  Run time limit on kernels:  No
  Integrated GPU sharing Host Memory:  No
  Support host page-locked memory mapping:  Yes
  Concurrent kernel execution:  No
  Alignment requirement for Surfaces:  Yes
  Device has ECC support enabled:  No
  Device is using TCC driver mode:  No
  Device supports Unified Addressing (UVA):  No
  Device PCI Bus ID / PCI location ID:  2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla M1060"
  CUDA Driver Version / Runtime Version  4.0 / 4.0
  CUDA Capability Major/Minor version number:  1.3
  Total amount of global memory:  4096 MBytes (4294770688 bytes)
  (30) Multiprocessors x ( 8) CUDA Cores/MP:  240 CUDA Cores
  GPU Clock Speed:  1.30 GHz
  Memory Clock rate:  800.00 Mhz
  Memory Bus Width:  512-bit
  Max Texture Dimension Size (x,y,z)  1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers  1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:  65536 bytes
  Total amount of shared memory per block:  16384 bytes
  Total number of registers available per block:  16384
  Warp size:  32
  Maximum number of threads per block:  512
  Maximum sizes of each dimension of a block:  512 x 512 x 64
  Maximum sizes of each dimension of a grid:  65535 x 65535 x 1
  Maximum memory pitch:  2147483647 bytes
  Texture alignment:  256 bytes
  Concurrent copy and execution:  Yes with 1 copy engine(s)
  Run time limit on kernels:  No
  Integrated GPU sharing Host Memory:  No
  Support host page-locked memory mapping:  Yes
  Concurrent kernel execution:  No
  Alignment requirement for Surfaces:  Yes
  Device has ECC support enabled:  No
  Device is using TCC driver mode:  No
  Device supports Unified Addressing (UVA):  No
  Device PCI Bus ID / PCI location ID:  131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Tesla M1060, Device = Tesla M1060
[deviceQuery] test results...
PASSED

Press ENTER to exit...
###############
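To compare the repeated Xid lines and tie them to a device, the log can be split into fields mechanically. This is only a sketch: the regular expression is inferred from the lines quoted in this thread, the function names are hypothetical, and the `Ch`, `M`, `D` and `intr` values are kept as raw strings because their meaning is not given here.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official format description).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+), "
    r"Ch (?P<ch>[0-9A-Fa-f]+) M (?P<m>[0-9A-Fa-f]+) "
    r"D (?P<d>[0-9A-Fa-f]+) intr (?P<intr>[0-9A-Fa-f]+)"
)


def parse_xid_lines(text):
    """Return one dict of raw string fields per Xid line found in text."""
    return [match.groupdict() for match in XID_RE.finditer(text)]


def pci_bus_decimal(pci):
    """Convert the hex bus part of e.g. '0000:83:00' to a decimal integer."""
    return int(pci.split(":")[1], 16)


# The eight intr values in the order they were posted above.
INTR_SEQUENCE = [
    "04400000", "04400000", "00400000", "04400000",
    "00400000", "04400000", "00400000", "04400000",
]
SAMPLE_LOG = "\n".join(
    "NVRM: Xid (0000:83:00): 26, Ch 00000000 M 00000860 D 00023780 intr " + v
    for v in INTR_SEQUENCE
)

events = parse_xid_lines(SAMPLE_LOG)
intr_counts = Counter(event["intr"] for event in events)
bus = pci_bus_decimal(events[0]["pci"])
```

On the lines posted above, `intr_counts` shows two alternating values, and `bus` evaluates to 131, which is the decimal form of hex 83 and matches the "131 / 0" PCI bus ID that deviceQuery prints for Device 1.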
|
|
|
|
|