contention among multiple cards (poor texture-loading performance)?
Hi. I have a multi-screen Dell XPS710 2-cpu EM64T system
with two graphics cards -- currently a Dell Anonymous 10de:0400
and a GeForce 7300 GT. I'm trying to load 2-D RGB textures into the
cards as fast as possible and draw each texture once.
I've tried various sorts of optimizations -- texture size (it's faster to load
a bunch of 256x256 textures than a smaller number of 1024x1024 ones, say),
pixel-buffer objects (they slow things down in this case), pixel formats
(on some cards BGR is faster than RGB, though on these it doesn't matter).
If I just run this on one graphics card at a time, the best speed I've seen is
about 250 Mpixel/sec (of 24-bit textures) on the faster card. That's about 750 MB/sec -- much slower than (a) system memory (~2.5GB/sec according to the STREAM Copy benchmark) or (b) what I'd expect for a 16x PCIe path
(lspci -vv says "2.5 Gb/sec").
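To make the arithmetic behind that comparison explicit, here is a small sketch in C. It assumes that the "2.5 Gb/sec" reported by lspci is the per-lane PCIe 1.x signalling rate and that the link uses 8b/10b encoding; the function names are made up for illustration, and protocol overhead is ignored, so the link figure is only an upper bound.

```c
#include <assert.h>

/* Convert a texture upload rate in Mpixel/s to MB/s for a given pixel size.
 * For 24-bit RGB textures, bytes_per_pixel is 3. */
double mpix_to_mbytes_per_sec(double mpix_per_sec, int bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return mpix_per_sec * bytes_per_pixel;
}

/* Theoretical one-direction bandwidth of a PCIe 1.x link in MB/s.
 * Assumption: 2.5 Gbit/s raw per lane, 8b/10b line coding (8 payload bits
 * per 10 bits on the wire), no packet or protocol overhead counted. */
double pcie1_link_mbytes_per_sec(int lanes)
{
    assert(lanes > 0);
    double raw_mbit_per_lane = 2500.0;
    double payload_mbit_per_lane = raw_mbit_per_lane * 8.0 / 10.0;
    return payload_mbit_per_lane / 8.0 * lanes;   /* bits -> bytes */
}
```

Under those assumptions, mpix_to_mbytes_per_sec(250.0, 3) gives the 750 MB/sec quoted above, while pcie1_link_mbytes_per_sec(16) comes out at roughly 4000 MB/sec for a 16x link.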
This is without using either Twinview or Xinerama, so that I get two separate screens, one per card.
The test program's inner loop calls glBindTexture, glTexSubImage2D
on a pre-set chunk of memory, and then (optionally) draws a textured quad.
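For reference, the measurement has roughly the following shape. This is a simplified sketch rather than the actual test program: the upload step is passed in as a callback so that the timing code stands alone, and the names upload_fn, count_calls and measure_mpix_per_sec are invented for this example. In the real loop the callback would be the glBindTexture / glTexSubImage2D pair (plus the optional quad), as noted in the comment.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical callback type: uploads (and optionally draws) one texture.
 * In a GL program the body would be something like:
 *   glBindTexture(GL_TEXTURE_2D, tex);
 *   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
 *                   GL_RGB, GL_UNSIGNED_BYTE, pixels);
 * GL calls may be queued, so a glFinish() before the final timestamp is
 * probably needed for the timing to mean anything. */
typedef void (*upload_fn)(void *user);

/* Stand-in callback that only counts calls; lets the harness run
 * without a GL context. */
void count_calls(void *user)
{
    int *counter = user;
    (*counter)++;
}

/* Monotonic wall-clock time in seconds. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Call `upload` `count` times for textures of width x height pixels and
 * return the achieved rate in Mpixel/s (0 if no time elapsed). */
double measure_mpix_per_sec(upload_fn upload, void *user,
                            int width, int height, int count)
{
    assert(upload && width > 0 && height > 0 && count > 0);
    double start = now_seconds();
    for (int i = 0; i < count; i++)
        upload(user);
    double elapsed = now_seconds() - start;
    if (elapsed <= 0.0)
        return 0.0;
    return (double)width * height * count / elapsed / 1e6;
}
```

With the stand-in callback, a call such as measure_mpix_per_sec(count_calls, &n, 256, 256, 100) only exercises the loop and timer; the number it returns says nothing about the graphics card until a real upload callback is substituted.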
Noting that the program is using about 100% of one CPU, I also tried running
two copies of it: one on each screen (i.e. one on each graphics card).
Result on this system: it's MUCH SLOWER, at least 2x, on two cards than
it is on one.
E.g. running just on card A (:0.0) gives 250 Mpix/s; running just on B gives
about 230 Mpix/s.
But running one process each on A and B concurrently gives, at best,
about 110 Mpix/s. This is only a little faster than running both processes
concurrently on the same screen, where I'd expect lots of degradation
due to context switching. But with two cards and two processes on two CPUs, should I be suffering from that?
I've also gotten to try this on a Quadroplex system with a pair of
Quadro 4500 FX2 cards running in sync, for a total of 8 screens.
On those, the peak texture load speed is lower (~150 Mpix/s), but multiple instances running on different screens get better overall performance,
until it saturates at about 4-5 instances at 480 Mpix/s. That's ~1.5GB/s,
*much* more reasonable for a bus limitation.
But I'd really like to understand why there's so much interference between
two different graphics cards... Can I get better than this?