OK, I was wrong; what I said is true for my gcc version. Here are two useful links:
Here is a discussion of the subject:
Look for the "MMX/3Dnow!/SSE/SSE2 compilers" thread. As you can see there, SSE2 instructions may take longer than FPU instructions.
And here is an illustration of what you can expect:
I realize the x86 architecture is not well suited to number crunching (look at the great results of the humble 300 MHz G3 Apple iBook).