Your understanding of instruction shuffling is not correct.
The traditional (read: NV20, NV25, NV30, ...) architectures work on Vec4s. The R300 works on Vec3s and on scalars at the same time.
That improves performance whenever a Vec3 operation and a scalar operation can be issued in parallel.
All they're doing is saying "Do this before that instead of after that" - nothing more. The reordering produces no image quality (IQ) difference, and the shader will still work in all cases.
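To make the point concrete, here is a rough toy model in Python. It is not how the R300 or ATI's driver actually schedules anything. The `Op` class, the greedy pairing rule in `count_issue_slots`, and the four-instruction "shader" are all made up for illustration. The only assumption taken from the hardware is that one Vec3 op and one scalar op can share an issue slot when neither depends on the other.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Op:
    """One hypothetical shader instruction in this toy model."""
    unit: str                 # "vec3" or "scalar": which ALU executes it
    dst: str                  # register written
    srcs: Tuple[str, ...]     # registers read
    fn: Callable              # what the instruction computes


def count_issue_slots(program):
    """Greedy in-order pairing (an assumed rule, not the real scheduler):
    two adjacent ops share a slot if they use different units and
    neither one touches the other's destination register."""
    slots, i = 0, 0
    while i < len(program):
        a = program[i]
        if i + 1 < len(program):
            b = program[i + 1]
            independent = (a.dst not in b.srcs and b.dst not in a.srcs
                           and a.dst != b.dst)
            if a.unit != b.unit and independent:
                i += 2        # both ops issue together
                slots += 1
                continue
        i += 1                # op issues alone
        slots += 1
    return slots


def run(program, inputs):
    """Execute the ops one after another and return the final registers."""
    regs = dict(inputs)
    for op in program:
        regs[op.dst] = op.fn(*(regs[s] for s in op.srcs))
    return regs


def vmul(p, q):
    return tuple(x * y for x, y in zip(p, q))


def vadd(p, q):
    return tuple(x + y for x, y in zip(p, q))


v0 = Op("vec3", "v0", ("a", "b"), vmul)
v1 = Op("vec3", "v1", ("v0", "c"), vadd)
s0 = Op("scalar", "s0", ("x", "y"), lambda p, q: p * q)
s1 = Op("scalar", "s1", ("s0", "z"), lambda p, q: p + q)

original = [v0, v1, s0, s1]     # as written by the shader author
reordered = [v0, s0, v1, s1]    # same ops, shuffled to interleave units

inputs = {"a": (1, 2, 3), "b": (4, 5, 6), "c": (1, 1, 1),
          "x": 2, "y": 3, "z": 4}

if __name__ == "__main__":
    print(count_issue_slots(original), count_issue_slots(reordered))
    print(run(original, inputs) == run(reordered, inputs))
```

In this model the shuffled order needs fewer issue slots, while both orders leave identical values in every register. The reorder only moves independent instructions past each other, so the output cannot change.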
IMO, this is a perfectly valid optimization, and ATI is really only removing it to make sure people who don't know what they're talking about don't spread BS about them cheating.