Floating Point Arithmetic


I have been experimenting to understand floating-point arithmetic. I have a 64-bit processor. I have asked MATLAB to use format longE, which should display floating-point numbers with double precision.

I see that $$3.16229-3.16228=1.000000000006551e-05$$ while $$0.316229\times 10^3-0.316228\times 10^3=9.999999999763531e-04$$

I am not able to understand the difference. Is it because these numbers are converted to a binary representation for the calculation and then truncated to 52 bits? If so, why do the two expressions above give different answers?


There is 1 best solution below


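Yes, that is essentially correct. Neither 3.16229 nor 0.316229 can be represented exactly in binary floating point, so each literal is rounded to the nearest double (a 53-bit significand, of which 52 bits are stored explicitly). Subtracting two nearly equal numbers cancels the leading digits, which makes that rounding error visible in the difference. Because the factor 10 is not a power of 2, the binary representations in the second formula differ from those in the first, so the rounding errors, and hence the errors in the subtraction, are different as well.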
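To see this concretely, here is a minimal sketch that repeats the two subtractions and prints the exact values that are actually stored. It is written in Python rather than MATLAB, on the assumption that both use IEEE 754 double precision for these operations. The variable names are only illustrative, and math.ulp needs Python 3.9 or later.

```python
import math
from decimal import Decimal

# First expression: operands near 3.16
a, b = 3.16229, 3.16228
d1 = a - b

# Second expression: operands near 316 (the multiplication rounds once more)
c, d = 0.316229 * 10**3, 0.316228 * 10**3
d2 = c - d

# Decimal(x) shows the exact binary value stored for x, without display rounding
for name, x in [("a", a), ("b", b), ("c", c), ("d", d)]:
    print(name, "=", Decimal(x))

# The spacing between neighbouring doubles grows with the operand magnitude
print("spacing near a:", math.ulp(a))   # 2**-51
print("spacing near c:", math.ulp(c))   # 2**-44

# Each difference is a whole number of spacings, so it cannot land exactly
# on 1e-05 or 1e-03; it lands on the nearest reachable grid point instead.
print("d1 =", d1, "=", d1 / math.ulp(a), "spacings")
print("d2 =", d2, "=", d2 / math.ulp(c), "spacings")
```

The operands of the second expression are about 100 times larger, so their spacing is 128 times coarser, and the absolute error left in the difference is correspondingly larger.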