Friday, January 18, 2013

Update on Squeak LargeInteger 32 bits speed-up

I have connected the 32-bits digits implementation of LargeIntegersPlugin to a COG VM, and published my experiments on http://smalltalkhub.com/#!/~nice/NiceVMExperiments.

Version VMMaker.oscog-nice.271 is quite safe on a little endian machine, Kernel tests are passing (you need to merge the Kernel version which is provided in same repository along with tests to update the montgomeryDigitLength to 32 instead of 8 as currently hardcoded in raisedTo:modulo:).

But don't ever think of using a modified StackInterpreter VM on a big-endian machine, nor on an image saved from a big endian machine! The Big-Endian bootstrap is not ready yet. It will require:
  • an additional bit in image format for specifying that LargeIntegers have digits stored in 32-bits-bigEndian format.
  • a modified VM initialization phase to swap or unswap the instance of LargeIntegers (I began looking at updateObjectsPostByteSwapFrom:to:, but above bit must be first allocated).
  • an image side modification of digitAt: and digitAt:put: to call the primitives that swap bytes to mimic 8-bit digit order.
Note that the byte swapping is complex: 32 bits VM currently swap all 32 bits words if image endianness differs from VM endianness. Then it undo the swapping for every variable byte instance. We would have to add an exception of instance of LargeIntegers. But we must take the bit indicating whether LargeIntegers are saved BigEndian, which might differ from the one indicating that image was saved BigEndian for backward compatibility...

There is room for further improvment because the normalization step copies the result if ever its length is not a multiple of 4. Instead, we would just have to let allocated size always be multiple of 4, and connect digitLength to a specialized primitive for handling the leading zeroes.

The micro benchmark is now a bit modified: 

#(100 1000 10000) do: [:n |
    | p1 p2 |
    p1 := 1<<n - 1.
    p2 := p1 sqrtFloor*p1+1.
    Transcript cr; show:  n printString , ' bits'.
    Transcript cr; show: '+' , ' -> '; show: [p1+p2] bench.
    Transcript cr; show: '-' , ' -> '; show: [p2-p1] bench.
    Transcript cr; show: '*' , ' -> '; show: [p1*p2] bench.
    Transcript cr; show: '//' , ' -> '; show: [p2//p1] bench.
    Transcript cr; show: '\\' , ' -> '; show: [p2\\p1] bench.
    Transcript cr; show: 'quo:' , ' -> '; show: [p2 quo: p1] bench.
    Transcript cr; show: 'rem:' , ' -> '; show: [p2 rem: p1] bench.
    Transcript cr; show: 'quo: 10' , ' -> '; show: [p1 quo: 10] bench.
    Transcript cr; show: 'rem: 10' , ' -> '; show: [p1 rem: 10] bench.
    Transcript cr; show: '<<53' , ' -> '; show: [p1 << 53] bench.
    Transcript cr; show: '>>35' , ' -> '; show: [p2 >> 35] bench.
    Transcript cr.
].


I experienced a problem of repeatability of timings on these micro-benchmarks since my last post. The reference VM was COG 2640 . Now with COG 2672, +, - ,<< and >> apparently have severe lost of performance. It might be some kind of artefact, because 2672 does not experience any noticeable slowdown in macro benchmarks... So we will only consider 3rd column containing improvment ratio.

Number of operations per second in 8-bit vs 32-bit Large Integer Plugin

Macro benchmark won't change significantly, unless they are oriented toward LargeInteger crunching...
But for LargeInteger crunching, speed-up is noticeable:

[ArbitraryPrecisionFloatTest suite run] timeToRun.
 -> 29250 COG 2640 (eem.238)
-> 29835 COG 2672 (eem.255)
-> 20823 modified nice.271


or:

[(Smalltalk allClasses
    select: [:e |  (e category beginsWith: 'KernelTests-Numbers')
        and: [e inheritsFrom: TestCase]])
    do: [:e | e suite run]] timeToRun.

 -> 3851 COG 2640
-> 3642 COG 2672
-> 1948 modified nice.271


Same kind of ratio with Shootout piDigits...
The speedup might also be very interesting in cryptography.