Saturday, December 29, 2012

Accelerating LargeInteger in Squeak

A quick picture of current implementation:

The LargeIntegers in Squeak are accelerated by using a multi-layer mechanism:
  • The first layer is in VM Integer primitives (for example primitive 29 for LargePositiveInteger>>#*). If the Integer parameters fit on 64 bits and so does the result, then they are handled quite fast at this level, especially in COG VM, the most common primitives being directly inlined into JIT.
  • The second layer is using the LargeIntegersPlugin (named LargeIntegers) - see Integer>>#* and  #digitMultiply:neg: for example which uses primDigitMultiplyNegative.
  • The third layer when above two failed is to fall back to Smalltalk code iterating on 8 bits digits (found at end of Integer>>#digitMultiply:neg: for example)
Currently, the 3rd layer  is not used, because the plugin is integrated internally in every distributed VM.
But it suggests that the LargeIntegersPlugin is also performing operations on 8-bits digits which is quite dated... Most arithmetic units handle at least 32 or 64 bits nowadays.

A new 32-bit implementation:

So what performance could we gain by using 32-bits digits?
To do so, I thought of two options:
  • add a new Integer variableWordSubclass: #LargePositiveInteger32, and its Negative32 counterpart, plus a dedicated LargeInteger32Plugin with new primitives
  • change the LargeIntegersPlugin to operate on 32-bits digits
The first option is completed by adding byteAt: byteAt:put: and byteLength primitives to preserve a 8-bits digit based fallback code (this is required because fallback must operate on SmallInteger).
The second option would be possible, because the allocated size is always a multiple of 4 bytes, even for Byte Arrays. But it has two drawbacks:
  • LargeInteger are necessary for a modern Squeak image to work, so each mistake would cost me a lot of debugging pain, and I would first have to learn how to use the VM simulator...
  • It would not be a great option on big-endian machines due to  necessary swapping of each word (unless of course we use native word ordering and cheat for digitAt: and digitAt:put: fallback code).
So I opted for  a new LargeInteger32Plugin solution tested in a 32bits VM. All my operations will be defined using unsigned int, those requiring 64 bits (* + - quo:) will use unsigned long long. Note that 64 bits operations are emulated by gcc which does not use native 64 bits capabilities of ALU in this configuration.


Micro-benchmark results:

So here are the first results on my old 2.26GHz Intel core 2 duo Mac-mini for a hacked COG VM derivated from version 2640. They are presented in term of number of operations executed per second (8bits on left, 32bits on middle, percentage of improvement on right column) for various operands bit lengths.

Number of operations per second in 8-bit vs 32-bit Large Integer Plugin


The main improvement is for *,  around 7x at kilobit is already something. Note the primitive still implements a naïve O(N²) product, we know from other experiments that a simple Karatsuba algorithm programmed at image side already improves things a bit above kilobit length in a COG VM (see optimizing-product-of-prime-powers).

Then come the score of rem: but it is not fair... I cheated to use the same primitive as quo: which in fact computes both quotient and remainder. It is not used in 8-bit version because performing operations with Integer primitives in case of 64 bits or less is much much faster than using the LargeIntegersPlugin. Otherwise, we would have something similar to \\ score, which is already interesting thanks to usage of *...

The improvment of + and - are low, those are only O(N), probably we mainly measure the overhead because internal loops are too simple...


Micro-benchmark source code:

Here are the gory details:

#(100 1000 10000) collect: [:n |
    p1 := 1<<n - 1.
    p2 := p1 sqrtFloor*p1+1.
    q1 := LargePositiveInteger32 digitLength: p1 digitLength.
    1 to: p1 digitLength do: [:i | q1 digitAt: i put: (p1 digitAt: i)].
    q2 := LargePositiveInteger32 digitLength: p2 digitLength.
    1 to: p2 digitLength do: [:i | q2 digitAt: i put: (p2 digitAt: i)].
    n printString , ' bits' -> {
    '+' -> {
        [p1+p2] bench.
        [q1+q2] bench}.
    '-' -> {
        [p2-p1] bench.
        [q2-q1] bench}.
    '*' -> {
        [p1*p2] bench.
        [q1*q2] bench}.
    '//' -> {
        [p2//p1] bench.
        [q2//q1] bench}.
    '\\' -> {
        [p2\\p1] bench.
        [q2\\q1] bench}.
    'quo:' -> {
        [p2 quo: p1] bench.
        [q2 quo: q1] bench}.
    'rem:' -> {
        [p2 rem: p1] bench.
        [q2 rem: q1] bench}.
    'quo: 10' -> {
        [p1 quo: 10] bench.
        [q1 quo: 10] bench}.
    'rem: 10' -> {
        [p1 rem: 10] bench.
        [q1 rem: 10] bench}.
    '<<53' -> {
        [p1 << 53] bench.
        [q1 << 53] bench}.
    '>>35' -> {
        [p2 >> 35] bench.
        [q2 >> 35] bench}.
}].


This benchmark is very naive, because distribution of 1-bit and 0-bit used above is not fair. Since the probability of having a null 8-bit digit is higher than that of having a null 32-bit digit, and since primDigitMultiplyNegative eliminates N loops at each null digits, this is probably unfair for the 8-bit version, but for a first idea of performance, we don't care - only 1 byte out of 256 is null without prior knowledge of distribution.

Basic unit tests:

If results are false, then the benchmark is worth nothing, so I performed some quick sanity tests with involved pairs of Integers.
Since I can't mix the 8-bit and 32-bit Integers (no interest by now), the tests are quite low level.
I disabled logDebuggerStackToFile Preferences, I don't use traditional assert:, and only evaluate from a Workspace, because a Debugger trying to print the 32-bit LargeIntegers may involve such mixed arithmetic and crash the image.
IMO, the Squeak Debugger is much too clever and performs too many message sends on the debuggee...

assert :=
    [p digitLength = q digitLength ifFalse: [self halt: 'different length'].
    1 to: (p digitLength min: q digitLength) do:

        [:i |
        (p digitAt: i) = (q digitAt: i) ifFalse: [self halt: 'different digit at ' , i printString]]].
#(100 1000 10000) do: [:n |
    p1 := 1<<n - 1.
    p2 := p1 sqrtFloor*p1+1.
    q1 := LargePositiveInteger32 digitLength: p1 digitLength.
    1 to: p1 digitLength do: [:i | q1 digitAt: i put: (p1 digitAt: i)].
    q2 := LargePositiveInteger32 digitLength: p2 digitLength.
    1 to: p2 digitLength do: [:i | q2 digitAt: i put: (p2 digitAt: i)].
 

    p := p1. q := q1. assert value.
    p := p2. q := q2. assert value.
    p := p1*3. q := q1*3. assert value.
    p := p1+5. q := q1+5. assert value.
    p := p1-3. q := q1-3. assert value.
    p := p1+p2. q := q1+q2. assert value.
    p := p1-p2. q := q1-q2. assert value.
    p := p2-p1. q := q2-q1. assert value.
    p := p1*p2. q := q1*q2. assert value.
    p := p2<<53. q := q2<<53. assert value.
    p := p2>>35. q := q2>>35. assert value.
    p := p2 quo: p1. q := q2 quo: q1. assert value.
    p := p2 rem: p1. q := q2 rem: q1. assert value.
    p := p2//p1. q := q2//q1. assert value.
    p := p2\\p1. q := q2\\q1. assert value].

Source code and other details:

I went thru complications for either adding my new LargePositive/NegativeInteger32 classes to Smalltalk specialObjectsArray or using addGCRoot: mechanism.
I finally used specialObjectsArray for the bench to be fair.
Note that unlike 8-bit LargeInteger, my 32-bit LargeInteger classes are not compact, but I have no idea of the influence on above micro-benchmark.

The LargePositiveInteger32 and LargeInteger32Plugin code is unpublished yet. I have to clean it up and increase testing before publishing. And I don't really know yet where to publish this stuff. But it will of course be available as MIT.

The volume is also too high for commenting it in this too long post, but this might feed a few more posts if there is any interest...