I wanted to explain some notions about floating point and thought using Integer>>asFloat in Squeak as example. Unfortunately, I found the code was both not that clear and not that efficient and some of the comments were inaccurate (I can tell so without any diplomacy since it's my own production).

So I decided to change it once again.

First, let's see the comment. Three lines should be enough. Note that I didn't mention nearest Float, because in case of overflow we will answer Float infinity, while the nearest Float would be Float fmax.

Integer>>asFloat

"Answer a Float that best approximates the value of the receiver.

This algorithm is optimized to process only the significant digits of a LargeInteger.

And it does honour IEEE 754 round to nearest even mode in case of excess precision (see details below)."

"Answer a Float that best approximates the value of the receiver.

This algorithm is optimized to process only the significant digits of a LargeInteger.

And it does honour IEEE 754 round to nearest even mode in case of excess precision (see details below)."

Then, it might be useful to detail what round to nearest even means exactly. It's quite long and I'm not sure this belongs to the method or if it should just be replaced by a link... But it does also help understanding the code below.

"How numbers are rounded in IEEE 754 default rounding mode:

A shift is applied so that the highest 53 bits are placed before the floating point to form a mantissa.

The trailing bits form the fraction part placed after the floating point.

This fractional number must be rounded to the nearest integer.

If fraction part is 2r0.1, exactly between two consecutive integers, there is a tie.

The nearest even integer is chosen in this case.

Examples (First 52bits of mantissa are omitted for brevity):

2r0.00001 is rounded downward to 2r0

2r1.00001 is rounded downward to 2r1

2r0.1 is a tie and rounded to 2r0 (nearest even)

2r1.1 is a tie and rounded to 2r10 (nearest even)

2r0.10001 is rounded upward to 2r1

2r1.10001 is rounded upward to 2r10

Thus, if the next bit after floating point is 0, the mantissa is left unchanged.

If next bit after floating point is 1, an odd mantissa is always rounded upper.

An even mantissa is rounded upper only if the fraction part is not a tie."

Then we can come with a cleaner implementation by just naming the variables more explicitly to fit the comments:

| mantissa shift sum excess mask trailingBits nextBitIsSet tie |

mantissa := self abs.

"Check how many bits excess the maximum precision of a Float mantissa."

excess := mantissa highBit - Float precision.

excess > 0

ifTrue:

["Remove the excess bits"

mask := (1 bitShift: excess) - 1.

trailingBits := mantissa bitAnd: mask.

"trailingBits isZero ifFalse: [Inexact signal]."

mantissa := mantissa bitShift: excess negated.

shift := excess.

"But care to honour IEEE 754 round to nearest even mode."

nextBitIsSet := trailingBits highBit = excess.

(nextBitIsSet and: [mantissa odd or:

[tie := trailingBits isPowerOfTwo.

tie not]])

ifTrue: [mantissa := mantissa + 1]]

ifFalse: [shift := 0].

"Now that mantissa has no more excess precision, the following floating point operations will be exact."

sum := 0.0.

1 to: mantissa digitLength do:

[:byteIndex |

sum := sum + ((mantissa digitAt: byteIndex) asFloat timesTwoPower: shift)

shift := shift + 8].

^ self positive

ifTrue: [sum]

ifFalse: [sum negated]

Note that in this case, the rounding is performed entirely in Smalltalk and the algorithm will round to nearest even whatever the hardware rounding mode is, because all successive floating point operations will be exact.

The hardware inexact flag thus won't be set. We could replace this by a soft Inexact signal, but this is commented out because we don't care of and don't have such exception in Smalltalk until now. It's just a Smalltalkish comment.

But this code still performs many bit operations on LargeIntegers in case of excess precision. We could try alternatives that won't produce intermediate LargeIntegers.

excess > 0

ifTrue:

["Take care to honour IEEE 754 round to nearest even mode."

mustRoundUpper := (mantissa bitAt: excess) > 0 "The bit after floating point is set"

and: [(mantissa bitAt: 1 + excess) > 0 "The mantissa is odd"

or: [excess > 1 and: [self anyBitOfMagnitudeFrom: 1 to: excess - 1 "This is not a tie"]]].

"Remove the excess bits"

mantissa := mantissa bitShift: excess negated.

shift := excess.

mustRoundUpper ifTrue: [mantissa := mantissa + 1]]

But see how we must assist the code with many comments... Not that good. And not that faster...

We can still write more efficient code: we just have to let the hardware perform the rounding by itself. In effect, if we just add one excess bit, we are sure that there will be at most one inexact operation, and thus that we won't cumulate round off errors. But in this case, it is necessary to explain in preamble:

"Algorihm details:

Floating point hardware will correctly handle the rounding by itself with a single inexact operation if mantissa has one excess bit of precision.

Except in the last case when extra bits are present after an even mantissa, we must round upper by ourselves.

Note 1: the inexact flag in floating point hardware must not be trusted because it won't take into account the bits we truncated by ourselves.

Note 2: the floating point hardware is presumed configured in default rounding mode."

mantissa := self abs.

"Check how many bits excess the maximum precision of a Float mantissa."

excess := mantissa highBit - Float precision.

excess > 1

ifTrue:

["Remove the excess bits but one"

mantissa := mantissa bitShift: 1 - excess.

shift := excess - 1.

"Handle the case of extra bits truncated after an even mantissa."

((mantissa bitAnd: 2r11) = 2r01 and: [self anyBitOfMagnitudeFrom: 1 to: shift])

ifTrue: [mantissa := mantissa + 1]]

ifFalse: [shift := 0].

"Now that mantissa has at most 1 excess bit of precision, let floating point operations perform the final rounding."

sum := 0.0.

1 to: mantissa digitLength do:

[:byteIndex |

sum := sum + ((mantissa digitAt: byteIndex) asFloat timesTwoPower: shift).

shift := shift + 8].

^ self positive

ifTrue: [sum]

ifFalse: [sum negated]

This is shorter and faster, and a bit more fragile because now depending on hardware settings... But we never change these settings in Smalltalk, so we don't mind. Eventually, we could also come with a version that will always honour hardware rounding mode in the future - only if we can control it from within Smalltalk. The most annoying thing is that we trade some didactic properties now hidden into hardware black box. Now I must decide wether to commit immediately or wait for further comments...

## No comments:

## Post a Comment