|Lightspeed Memory Architecture|
NVIDIA has coined the name Lightspeed Memory Architecture for a series of enhancements developed to increase the efficiency of data transfer between the graphics processing unit and graphics memory. A key point NVIDIA made during our affiliate meeting regarding memory bandwidth is that they have maxed out the bandwidth offered by today's high-speed DDR memory.
The advances in graphics memory technology have not been able to keep pace with the increased execution capability of the GeForce. This might be part of the reason the core clock speed of the GeForce3 was lowered to 200MHz compared to 250MHz on the GeForce2 Ultra. At this time, memory access remains the single largest bottleneck in making significant advances in computational throughput for rendering traditional real-time 3D graphics.
The key components of Lightspeed Memory Architecture consist of:
- Crossbar Memory Architecture
- Occlusion Detection
- Z-Buffer Compression
|Crossbar Memory Architecture|
Crossbar Memory Architecture has been around for quite some time. Originally used on mainframe computers to increase memory bandwidth in multi-processor systems, the technology has since been brought down to server and workstation platforms by companies such as Unisys, SGI, and Sun. Crossbar switches have also been designed to link entire computer systems.
NVIDIA is taking graphics memory architecture to the next level, employing a crossbar to maximize the efficiency of data transfer between the graphics processing unit and the graphics memory on the GeForce3.
A memory crossbar can eliminate bottlenecks associated with existing memory architectures because it replaces the conventional shared-bus design. Instead of sharing a bus, the processor and memory communicate over dedicated connections.
Vector IRAM System Floorplan
Image courtesy of University of California, Berkeley
Traditional memory controllers transfer information to the frame buffer in large blocks of data. However, there are cases when smaller amounts of data pass through the memory controller. If the amount of data being transferred is less than the memory controller can handle, only a portion of the potential memory bandwidth is utilized. On average, the legacy memory controller on the GeForce delivers around 50% of the peak memory bandwidth to the frame buffer.
In order to maximize the utilization of data moving between the memory controller and the frame buffer, NVIDIA has implemented a crossbar memory architecture on the GeForce3. The 256-bit memory controller on the GeForce3 has been partitioned into four 64-bit memory controllers, each of which communicates with the others and with the load and store units on the graphics processing unit.
* Note that although the memory controller on the GeForce3 is physically 128 bits wide, the use of Double Data Rate memory effectively increases the flow of data to 256 bits per clock cycle. DDR makes the bus behave like one that runs twice as fast rather than one that is twice as wide, but the net bandwidth benefit is similar.
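The benefit of partitioning can be sketched with some simple arithmetic. The figures below are illustrative assumptions (the clock speed is a hypothetical example, not a quoted GeForce3 specification); the point is that a small request wastes most of a wide controller's transfer slot but fills a narrow one completely.

```python
# Why partitioning the memory controller helps: illustrative model only.
BUS_WIDTH_BITS = 128     # physical bus width
DDR_FACTOR = 2           # DDR moves data on both clock edges
CLOCK_MHZ = 200          # hypothetical memory clock, for illustration

# Effective bytes moved per clock: 128 bits * 2 edges = 256 bits = 32 bytes
bytes_per_clock = BUS_WIDTH_BITS * DDR_FACTOR // 8

def utilization(request_bytes, controller_bytes):
    """Fraction of a controller's transfer slot a request actually fills.
    A request occupies a whole slot, so small requests waste the remainder."""
    return min(request_bytes, controller_bytes) / controller_bytes

# An 8-byte (64-bit) request on one monolithic 256-bit (32-byte) controller:
mono = utilization(8, 32)        # only a quarter of the slot carries data
# The same request on one of four 64-bit (8-byte) controllers:
partitioned = utilization(8, 8)  # the slot is fully used

print(f"peak bandwidth: {bytes_per_clock * CLOCK_MHZ / 1000:.1f} GB/s")
print(f"monolithic utilization: {mono:.0%}")
print(f"partitioned utilization: {partitioned:.0%}")
```

Under these assumptions the monolithic controller delivers 25% of peak on a small request, while the partitioned design delivers 100%.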
GeForce3 Crossbar Memory Architecture
Since the load and store units on the graphics processing unit are capable of issuing a memory load or store on every cycle, the memory controllers should be capable of processing these requests at the same rate. This technique reduces the load on the bus and ensures sufficient capacity for memory accesses. Because each bank of memory has a dedicated bus to the crossbar, the crossbar can run at a higher frequency than the memory bus and transfer data at full speed with no contention.
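A toy cycle-count model illustrates the contention argument. This is an assumption-laden simplification, not the GeForce3's actual arbitration logic: it just shows that a shared bus serializes all requests, while a crossbar with a dedicated path per memory partition lets requests to different partitions proceed in the same cycle.

```python
from collections import Counter

def shared_bus_cycles(requests):
    # A shared bus services one request per cycle, whatever bank it targets.
    return len(requests)

def crossbar_cycles(requests, partitions=4):
    # Requests to different partitions proceed in parallel; requests to the
    # same partition still serialize behind one another.
    per_partition = Counter(addr % partitions for addr in requests)
    return max(per_partition.values(), default=0)

# Four loads that happen to hit four different partitions:
reqs = [0, 1, 2, 3]
print(shared_bus_cycles(reqs))   # 4 cycles on a shared bus
print(crossbar_cycles(reqs))     # 1 cycle through the crossbar
```

In the worst case, when every request targets the same partition, the crossbar degrades to shared-bus behavior, which is why request distribution across the four controllers matters.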
|What You Can't See Won't Hurt You|
I'm not sure who originally coined that phrase, but there are many instances where it just isn't true. Think of the last time you were playing Counter-Strike and, while going about your business, suddenly found yourself lying on the ground dead: the dreaded sniper attack. Ironically, the same phrase applies to rendering graphics, except in this case what you can't see will hurt memory bandwidth.
The second memory bandwidth-saving feature NVIDIA has implemented in the GeForce3 is referred to as Occlusion Culling. In its definitive form, the word cull means to select from a group.
When rendering a 3D scene, portions of the scene are hidden from view because some objects appear in front of others. Prior to the GeForce3, graphics data for every object that makes up a scene was processed through the rendering pipeline, even if some of those objects are not visible in the final scene. The rendering associated with these hidden objects is often referred to as overdraw.
On average, today's games have a depth complexity of two, meaning each pixel on screen is drawn about twice: for every pixel that ends up visible, another hidden pixel is also rendered. It's been estimated that Quake3 Arena generates around 33% overdraw. While Quake3 Arena takes place primarily in indoor settings, imagine the amount of overdraw in games that render densely populated areas such as a city or a forest. In these cases, the amount of overdraw can be significant.
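The cost of overdraw is easy to quantify. As a rough sketch (the resolution below is just an example), with a depth complexity of d, each screen pixel is shaded and written about d times, so fill and memory traffic scale with d while only 1/d of that work survives to the final image.

```python
# Rough overdraw cost estimate; resolution and depth complexity are
# illustrative assumptions, not measurements of any particular game.

def wasted_fill_fraction(depth_complexity):
    """Fraction of rendered pixels that end up invisible in the final frame."""
    return 1 - 1 / depth_complexity

pixels = 1024 * 768   # example resolution
d = 2                 # average depth complexity

rendered = pixels * d # total pixels pushed through the pipeline per frame
print(rendered)                 # ~1.57M pixels rendered...
print(wasted_fill_fraction(d))  # ...of which half are overwritten
```

At a depth complexity of two, half of all color and Z traffic is spent on pixels the viewer never sees, which is exactly the bandwidth occlusion culling tries to reclaim.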
Image courtesy of University of North Carolina
This illustration, from a submarine auxiliary machine room, demonstrates a practical use of occlusion culling. This particular 3D model contains a total of 632,252 polygons. In this view, over 80% of the model (the parts drawn in red) is occluded by the parts drawn in blue. If the designer were interested in viewing only the parts drawn in red, over 125,000 polygons could be culled away with occlusion culling.
Objects that exist in the 3D world are located with X, Y, and Z coordinates. The Z coordinate indicates the depth associated with an object and is stored in the Z-buffer of graphics memory. The idea behind occlusion culling is to make use of data in the Z-buffer to render only those objects which are visible. An efficient occlusion culling algorithm should detect hidden objects early in the graphics pipeline in order to avoid unnecessary processing.
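The Z-buffer mechanism described above can be sketched in a few lines. This is a simplified software model, not the GeForce3's actual hardware pipeline: a fragment is written only if it is nearer than the depth already stored for its pixel, and detecting the occluded case early means the expensive shading and color write never happen.

```python
# Minimal per-pixel Z test over a tiny 4x4 framebuffer (simplified model).
W, H = 4, 4
FAR = float("inf")
zbuffer = [[FAR] * W for _ in range(H)]      # cleared to "infinitely far"
framebuffer = [[0] * W for _ in range(H)]

def draw_fragment(x, y, z, color):
    """Return True if the fragment passed the depth test and was written."""
    if z < zbuffer[y][x]:          # nearer than what is already there?
        zbuffer[y][x] = z
        framebuffer[y][x] = color  # only now do we pay for the color write
        return True
    return False                   # occluded: skip all further work

drew_near = draw_fragment(1, 1, z=5.0, color=0xFF0000)   # drawn
drew_far = draw_fragment(1, 1, z=9.0, color=0x00FF00)    # behind it: rejected
print(drew_near, drew_far)
print(hex(framebuffer[1][1]))
```

The order of submission matters here: if the far fragment had arrived first, both would have been shaded, which is why hardware occlusion detection pays off most when nearer geometry is already in the Z-buffer.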
An additional occlusion technique available to developers is the Occlusion Query. An application can ask the graphics processor to render a test region and report its visibility. If the graphics processing unit determines that the region will be occluded, all the geometry and rendering work for that region can be skipped.
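The occlusion-query pattern can be sketched as follows. The function name and interface here are invented for illustration, not an actual NVIDIA or graphics-API interface: the application tests a cheap proxy (such as a bounding rectangle at the object's nearest depth) against the depth buffer, and submits the real geometry only if some pixel of the proxy would be visible.

```python
# Hedged sketch of an occlusion query against a software depth buffer.
def occlusion_query(zbuffer, region, region_depth):
    """Count pixels in `region` where `region_depth` would pass the Z test.
    `region` is (x0, y0, x1, y1), half-open like Python ranges."""
    x0, y0, x1, y1 = region
    return sum(
        1
        for y in range(y0, y1)
        for x in range(x0, x1)
        if region_depth < zbuffer[y][x]
    )

# A 4x4 depth buffer already filled with nearby geometry at depth 1.0:
zbuf = [[1.0] * 4 for _ in range(4)]

visible_pixels = occlusion_query(zbuf, (0, 0, 4, 4), region_depth=5.0)
if visible_pixels == 0:
    pass  # the object is fully hidden: skip submitting its geometry
print(visible_pixels)
```

Because the query only reads the Z-buffer, it costs far less than rendering the object, and a result of zero lets the application bypass the entire draw.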
|Z-Buffer Compression|
The final component of the GeForce3's Lightspeed Memory Architecture is referred to as Lossless Z Compression. Recall that the Z-buffer contains data representing the depth of objects and is typically read and written for every pixel rendered.
To minimize Z-buffer traffic, the GeForce3 uses a 4:1 lossless data compression scheme as a further method of reducing memory bandwidth consumption. The compression is implemented in hardware and is transparent to applications. Both compression and decompression take place in real time with no reduction in image quality or precision.
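Why is depth data so compressible without loss? Across a planar surface, Z varies linearly, so per-pixel deltas within a scanline or tile are small and repetitive. The delta-plus-run-length sketch below is an assumption chosen for illustration; NVIDIA has not published the GeForce3's actual algorithm, only that it is lossless.

```python
# Illustrative lossless depth compression: delta-encode a row of Z values,
# then run-length-encode the deltas. Not NVIDIA's actual scheme.

def compress(row):
    """Return [[delta, run_length], ...] for a row of integer Z values."""
    deltas = [row[0]] + [b - a for a, b in zip(row, row[1:])]
    runs = []
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    return runs

def decompress(runs):
    """Invert compress(): expand runs, then integrate the deltas."""
    out, acc = [], 0
    for delta, count in runs:
        for _ in range(count):
            acc += delta
            out.append(acc)
    return out

# A 16-pixel scanline crossing a planar surface: constant depth slope of 3.
row = [1000 + 3 * i for i in range(16)]
packed = compress(row)
assert decompress(packed) == row   # lossless round trip
print(packed)                      # 2 symbols stand in for 16 values
```

The round-trip assertion is the defining property: unlike texture compression, depth compression must reproduce every bit, since the Z test's correctness depends on exact values.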
Next Page: High Resolution Antialiasing