I don't think a large proportion of the T&L can be done on the CPU. If all of it were, every vertex would have to be transformed on the CPU, and that just wouldn't be fast enough.
It may be that they do a few more matrix setups, or have to do more DMA setup on the CPU, which uses slightly more CPU time. But the hardware T&L does work. Otherwise you would get the speed of a TNT2, which was the last NVIDIA card that left all the T&L to the CPU.
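
To put a rough number on the "wouldn't be fast enough" point, here is a back-of-envelope sketch in Python. The scene size and frame rate are made-up figures for illustration, not measurements from any real game or card, and the function names are hypothetical. It only counts the 4x4 matrix transform per vertex; lighting and clipping would add more work on top.

```python
def transform_vertex(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vertex (the 'T' in T&L)."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


# One 4x4 * 4-vector multiply costs 16 multiplies and 12 adds.
FLOPS_PER_TRANSFORM = 16 + 12


def required_flops(vertices_per_frame, frames_per_second):
    """Floating-point ops per second the CPU would need for the transform alone."""
    return vertices_per_frame * frames_per_second * FLOPS_PER_TRANSFORM


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
print(transform_vertex(identity, [1.0, 2.0, 3.0, 1.0]))

# Assumed workload: 100,000 vertices per frame at 60 frames per second.
print(required_flops(100_000, 60))
```

Even before lighting, that hypothetical workload comes to well over a hundred million floating-point operations per second just for the transform, all competing with game logic for the same CPU.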