NVIDIA today released version 3.2 of its popular CUDA Toolkit, which boasts all-around performance improvements and several new features. The release includes a new sparse matrix library, 'CUSPARSE', to complement the common CUBLAS and CULAPACK libraries that excel at dense matrices. There is also a new GPU-accelerated random-number library, 'CURAND'. GPU-accelerated random numbers may seem a bit pointless at first glance, but random number entropy is a big deal in large-scale crypto, so I'm sure certain government labs will love that feature. And that's not all: NVIDIA has added some nice cluster management features (such as letting admins lock processes to certain GPUs, a necessary feature in queue-driven clusters) as well as support for 64-bit memory addressing, which opens up the 6GB of memory available on the Quadro 6000.
In addition, NVIDIA has just announced Parallel Nsight 1.5, which adds compatibility with Microsoft Visual Studio 2010. The new version offers a 'Dual GPU' mode that enables the Compute Debugger on a single system with two suitable GPUs, a capability previously reserved for network debugging or Multi-OS SLI systems. It also adds support for the new Fermi hardware (GTS460 and such) and all of the features of CUDA 3.2.
For those of you in the GPU compute space, however, the big news may be the new 'TCC' driver. For a while now, NVIDIA has offered a special 'Tesla Compute Cluster' driver that enables CUDA and GPU support without dragging in the Windows display subsystem. It was initially intended to overcome some problems with Windows' strange requirements for hardware access when using Remote Desktop and in cluster systems like HPC Server: the driver loads the Tesla card (or Quadro card, if you really want to) not as a display device, but as an additional compute card installed in the system.
Although it was not the original goal, NVIDIA found some interesting side effects in how Windows deals with a card in this mode. When working through the Windows display system and the WDDM (Windows Display Driver Model), you are required to bundle all of your kernels together before you load them to the card, with each kernel taking approximately 30 microseconds. If you instead go through the Windows Driver Model (WDM), you can load kernels when convenient, and each one takes only approximately 2.5 microseconds. For a complex situation requiring 10 compute kernels, that means:
- WDDM: 30 microseconds * 10 kernels = 300 microseconds
- WDM: 2.5 microseconds * 10 kernels = 25 microseconds
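To see how the gap scales beyond 10 kernels, here is a minimal Python sketch of the same arithmetic. It assumes every kernel pays a fixed per-launch cost equal to the approximate figures quoted above; the constant and function names are made up for illustration, and real overheads will vary with hardware and driver version.

```python
# Approximate per-kernel overheads quoted in the article, in microseconds.
WDDM_LAUNCH_US = 30.0  # going through the Windows Display Driver Model
WDM_LAUNCH_US = 2.5    # going through the TCC driver (Windows Driver Model)


def launch_overhead_us(num_kernels, per_launch_us):
    """Total overhead, assuming a fixed cost per kernel (a simplification)."""
    return num_kernels * per_launch_us


# Print the total overhead for a few hypothetical workload sizes.
for n in (10, 1_000, 1_000_000):
    wddm = launch_overhead_us(n, WDDM_LAUNCH_US)
    wdm = launch_overhead_us(n, WDM_LAUNCH_US)
    print(f"{n:>9} kernels: WDDM {wddm:.1f} us, WDM {wdm:.1f} us, "
          f"difference {wddm - wdm:.1f} us")
```

Under this fixed-cost assumption the ratio between the two paths stays at 12x regardless of kernel count, so the absolute time lost grows linearly with the number of launches.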
For people doing very heavy GPU computation, this adds up fast. However, users found themselves having to make a choice: load up the TCC driver and lose all display support, or load up the display driver and deal with the slightly degraded performance.
No more: the new driver adds a run-time switch that toggles between display mode and TCC mode. Now you can take your dual Quadro system and run in graphics SLI mode for superior performance, then switch one of your Quadros to TCC mode and run your compute codes faster. Granted, it's not a situation many people find themselves in, but for the few that do, it's a welcome change.
Parallel Nsight will be available next week, on September 22nd (conveniently, at GTC).