Smallissimo: 2013

Thursday, August 8, 2013

COG VM: Correctly reading Image Segment with/without native order 32 bits digits Large Integers

This post is about one of the last thing that were missing in my own Squeak/Pharo COG Virtual Machine hack which use native order 32 bits digits Large Integers: that is image segment loading/saving support.

On little endian machines, this is not a problem because native 32 bits word order matches Large Integers byte order. So loading image segment on little endian machines just work when the segment has been saved from a little endian machine, or from an old VM without native order 32 bits digits Large Integers support.

This is only a problem on big endian machines.
The only hurdle was to decide how to mark the segment so that it can be loaded correctly.
But let's explain a bit what's going on with byte order at load time...

Swapping byte order at image segment load time:

Loading an image segment from a different endianness requires byte-swapping of byte objects. This is because the primitive for loading image segment first swap every 32 bits word in the segment (by invoking ObjectMemory>>reverseBytesFrom:to:) as if every object were 32 bits word-array or composed exclusively of 32 bits object oriented pointers. But this is not the case for byte-arrays, and must be undone for those objects.

This is further processed by invoking method ObjectMemory>>byteSwapByteObjectsFrom:to:flipFloatsIf:
The last parameter of this method is for swapping the two words composing a Float (64 bits IEEE 754 double precision format) if ever the segment comes from an old image. Indeed, in old images, Float were always stored in Big Endian format, whatever VM endianness...

This byte swapping is triggered by a test of image segment version which is stored into first 32 bits word data, versus current ObjectMemory imageSegmentVersion.

Let's see this method:

imageSegmentVersion
    | wholeWord |
    "a more complex version that tells both the word reversal and the endianness of the machine it came from. Low half of word is 6502. Top byte is top byte of #doesNotUnderstand: on this machine. ($d on the Mac or $s on the PC)"

    wholeWord := self longAt: (self splObj: SelectorDoesNotUnderstand) + BaseHeaderSize.
        "first data word, 'does' "
    ^ self imageFormatVersion bitOr: (wholeWord bitAnd: 16rFF000000)

So the image segment version is 16rXXYYYYYY, where most significant byte 16rXX indicates endianness (16r73 the code of $s for little endian, 16r64 the code of $d for big endian) , and the rest indicates image version...

Well, except that ObjectMemory>>loadImageSegmentFrom:outPointers: only interpret low 16 bits as image version, so the used bits are more something like 16rXX00YYYY.

Hacking the image segment version

Since we are in COG, we will hack NewObjectMemory>>imageSegmentVersion. We will use one of the high bits of unused byte (the second most significant), the 7th bit starting at 1, (weight 64), that is the 22nd one in the whole word.
Note that we already used the 7th bit of image header flags (the 7th 32 bits word of the image), so our VM hack is somehow self consistent.

imageSegmentVersion
    | wholeWord flagLargeIntsAreStoredInBigEndian |
    "a more complex version that tells both the word reversal and the endianness of the machine it came from. Low half of word is 6502. Top byte is top byte of #doesNotUnderstand: on this machine. ($d on the Mac or $s on the PC)"

    wholeWord := self longAt: (self splObj: SelectorDoesNotUnderstand) + BaseHeaderSize.
        "first data word, 'does' "
    flagLargeIntsAreStoredInBigEndian := self vmEndianness << 22.
    ^coInterpreter imageFormatVersion bitOr: (wholeWord bitAnd: 16rFF000000) + flagLargeIntsAreStoredInBigEndian

The method vmEndianness answers 0 for little endian, 1 for big endian, so we only set the 22nd bit on segment saved from big endian machine 16rXX40YYYY. This let room for increasing image version up to 1<<22 if we want to.

Effectively swapping bytes

We add a parameter to already existing ObjectMemory method

byteSwapByteObjectsFrom: startOop to: stopAddr butLargeIntIf: skipLargeInt flipFloatsIf: flipFloatWords
    "Byte-swap the words of all bytes objects in a range of the
    image, including Strings, ByteArrays, and CompiledMethods.
    This returns these objects to their original byte ordering
    after blindly byte-swapping the entire image. For compiled
    methods, byte-swap only their bytecodes part. For Floats
    swap their most and least significant words if required."
    | oop fmt temp wordAddr |
    oop := startOop.
    [self oop: oop isLessThan: stopAddr] whileTrue:
        [(self isFreeObject: oop) ifFalse:
            [fmt := self formatOf: oop.
            fmt >= 8 ifTrue:
                [(skipLargeInt
                    and: [(self compactClassIndexOf: oop) = ClassLargePositiveIntegerCompactIndex
                    or: [(self compactClassIndexOf: oop) = ClassLargeNegativeIntegerCompactIndex]])
                    ifFalse:
                        ["oop contains bytes; unswap"
                        wordAddr := oop + BaseHeaderSize.
                        fmt >= 12 ifTrue: "compiled method; start after methodHeader and literals"
                            [wordAddr := wordAddr + ((self literalCountOf: oop) + LiteralStart * BytesPerOop)].
                        self reverseBytesFrom: wordAddr to: oop + (self sizeBitsOf: oop)]].
            fmt = 6 ifTrue: "Bitmap, Float etc"
                [(self compactClassIndexOf: oop) = ClassFloatCompactIndex
                    ifTrue:
                        [flipFloatWords ifTrue:
                            [temp := self longAt: oop + BaseHeaderSize.
                            self longAt: oop + BaseHeaderSize put: (self longAt: oop + BaseHeaderSize + 4).
                            self longAt: oop + BaseHeaderSize + 4 put: temp]]
                    ifFalse:
                        [BytesPerWord = 8 ifTrue:
                            ["Object contains 32-bit half-words packed into 64-bit machine words."
                            wordAddr := oop + BaseHeaderSize.
                            self reverseWordsFrom: wordAddr to: oop + (self sizeBitsOf: oop)]]]].
            oop := self objectAfter: oop]

We also add a method for swapping large integers only, if ever the VM is big endian and the segment is also big endian with old little endian order large integers format.

wordSwapLargeIntsFrom: startOop to: stopAddr
    "Swap the bytes of LargeIntegers in a range of the image."
    | oop wordAddr |
    oop := startOop.
    [self oop: oop isLessThan: stopAddr] whileTrue:
        [(self isFreeObject: oop) ifFalse:
            [((self formatOf: oop) = 8
            and: [(self compactClassIndexOf: oop) = ClassLargePositiveIntegerCompactIndex
                or: [(self compactClassIndexOf: oop) = ClassLargeNegativeIntegerCompactIndex]]) ifTrue:
                [wordAddr := oop + BaseHeaderSize.
                self reverseBytesFrom: wordAddr to: oop + (self sizeBitsOf: oop)]].
        oop := self objectAfter: oop]

We could have factored this loop with Float swapping, but this will be an optimization for a later time (though this would have saved a stupid copy paste bug in version 319).

Deciding when to swap the bytes

Now we have three possible format for Large Integers:

little endian for segments saved from little endian VM (16r7300YYYY)
little endian for segments saved from old big endian VM (16r6400YYYY)
big endian for segments saved from new big endian VM (16r6440YYYY)

We want to swap large integer bytes in those cases:

VM is big endian, and segment is big endian with old little endian large integers
VM is big endian and segment is little endian
VM is little endian and segment is big endian with new big endian large integers

In all cases, this is when 22nd bit of image segment data does not match that of current NewObjectMemory imageSegmentVersion. So we modify these lines of NewObjectMemory>>loadImageSegmentFrom:outPointers: handling the byte reversal:

    "Reverse the Byte type objects if the data is from opposite endian machine.
    Revese the words in Floats if from an earlier version with different Float order.
    Test top byte. $d on the Mac or $s on the PC. Rest of word is equal."
    swapLargeInts := (self vmEndianness << 22) ~= (data bitAnd: 1 << 22).
    (data >> 24) = (self imageSegmentVersion >> 24)
        ifTrue:
            "Need to swap floats if the segment is being loaded into a little-endian VM from a version
            that keeps Floats in big-endian word order as was the case prior to the 6505 image format."
            [(self isPlatformFloatOrderVersion: (data bitAnd: 16rFFFF "low 2 bytes")) ifFalse:
                [self vmEndianness ~= 1 "~= 1 => little-endian" ifTrue:
                    [segOop := self oopFromChunk: segmentWordArray + BaseHeaderSize + BytesPerWord.
                    self wordSwapFloatsFrom: segOop to: endSeg + BytesPerWord]].
            "Need to swap large integers if both segment and vm are big endian, but segment did not use native 32 bit word order for large integers"
            swapLargeInts ifTrue: [ self wordSwapLargeIntsFrom: segOop to: endSeg + BytesPerWord]]
        ifFalse: "Reverse the byte-type objects once"
            ["Change of endianness: need to swap large integers, except if segment is big endian but did not use 32 bit word order for large integers"
            segOop := self oopFromChunk: segmentWordArray + BaseHeaderSize + BytesPerWord.
                "Oop of first embedded object"
            self byteSwapByteObjectsFrom: segOop
                    to: endSeg + BytesPerWord
                    butLargeIntIf: swapLargeInts "don't unswap already swapped large integers"
                    flipFloatsIf: (self isPlatformFloatOrderVersion: (data bitAnd: 16rFFFF "low 2 bytes"))].

Last precision, when we want to effectively swap large integers, then we just omit to unswap them, because the save/load mechanism already swapped them.

Code for COG can be found at http://smalltalkhub.com/#!/~nice/NiceVMExperiments/versions/VMMaker.oscog-nice.320.

Similar code for interpreter VM is at http://smalltalkhub.com/#!/~nice/NiceVMExperiments/versions/VMMaker-nice.324.

Lat thing, this code is currently untested on big endian machines: they are becoming quite rare nowadays and I don't own any.

Wednesday, June 19, 2013

Got my first censored answer on SO!

Yeah, I'm proud of this one, http://stackoverflow.com/a/17180087/1396822 because it was a good joke, not insulting, completely on topic, and cultural.

For the few having less than 10k reputations, the content was
"And here is how it was resolved: http://en.wikipedia.org/wiki/Drawing_Hands"

I have to confess that I violated one rule from http://stackoverflow.com/help/deleted-answers "barely more than a link to an external site", but I fell like some touch of spirit is lacking in this case. And a longer answer would obviously violate one of our beloved rules "Small is beautiful".

Saturday, April 6, 2013

Still hacking LargeIntegerPlugins in interpreter VM

First thing, apologies for my bogus link of last post: it seems that
http://smalltalkhub.com/#!/~nice/NiceVMExperiments/versions/VMMaker-nice.315
triggers a bad bug in CCodeGenerator...
Indeed, the Smalltalk slang code:

    largeClass := isNegative
                   ifTrue: [objectMemory classLargeNegativeInteger]
                   ifFalse: [objectMemory classLargePositiveInteger].

generates this kind of C sentence:

    (test) ? statement1 , statement2 : statement3 , statement4;

which is apparently parsed as:

    ((test) ? statement1 , statement2 : statement3) , statement4;

Since the same slang is working perfectly in COG VM (different code generation) this bug stole me 2 hours of sleep last night, and I only understood the thing while posting an issue to vm-dev this evening.

So I provide a corrected package here if any one want to experiment:
http://smalltalkhub.com/#!/~nice/NiceVMExperiments/versions/VMMaker-nice.316

While at it last night, I could not resist and gave a try to the little hack which avoids allocation and copy of a new large integer and prefers an in place modification when the normalized integer fits in same number of words.
Since every object is allocated in 32-bit word boundaries in a 32bit VM, it's indeed possible to just modify some header bits which specify the size and leave the LargeInteger data unchanged. To make it short:

the byteSize rounded up to next word boundary is stored in a header word
the excess byte length (which is to be removed from this rounded byteSize) is stored in bits 8 & 9 of the previous header word (baseHeader >> 8 bitAnd: 3).
so we just need to change those 2 bits if (newByteSize+3 bitOr: 3)=(oldByteSize+3 bitOr: 3) - which means unchanged byteSize rounded up to next 4-bytes word boundary.

A similar technique should work for 64-bit word boundary in 64-bit VM which should be even more interesting (but I did not write a portable hack).

Of course such hack is fragile. If ever the object format happens to change (and it surely will), we would have to re-implement or remove it, but it's just for fun.

The code can be found at:
http://smalltalkhub.com/#!/~nice/NiceVMExperiments/versions/VMMaker-nice.317

As explained in commit comment, it is necessary to hack platform independent files platforms/Cross/vm/sqVirtualMachine.[ch] to register a new interpreterProxy function fixLast2BitsofByteLengthOf(). The function itself is implemented in ObjectMemory and generated in src/vm/interp.c.

This changes my micro-benchmarks a tiny bit:

Number of LargeInt operations per seconds for VM 4.10.10 vs hacked version 317

Monday, April 1, 2013

32-bit word LargeIntegers backport in Interpreter VM

For 3 months, I'm using a modified COG VM with LargeIntegersPlugin v2.0. The plugin is stable and smooth behaved - no crash.

This plugin is a hack that handles LargeInteger as natively ordered 32-bit digits on VM side, while the class is still seen as 8-bit digits on image side.

Yesterday, I wanted to check how easy it would be to backport this version to an Interpreter VM. Normally, the answer should be very, because most of the plugins code is shared between VMs. Well, most is shared, but some dust will inevitably jam the cogs.

The first thing is that COG still provides some class variables that have disappeared from the Interpreter, and of course the plugins uses some of these (VMBIGENDIAN, BytesPerWord, BaseHeaderSize, ...). So we have to modify a bit (simple enough, vmEndianness, bytesPerWord and baseHeaderSize messages are available)...

Then, Eliot Miranda has corrected a lot of C code generation quirks in the COG branch, and these are percolating back into the interpreter branch very slowly. LargeIntegersPlugin v2.0 uses 64 bit integers to store the results of operations on 32 bits words, and then split the results with bit operations, bitAnd: 16rFFFFFFFF, and bitShift: -32 (>> 32) all along the code. But old code generator cast every right shifted operand to an unsigned int in order to avoid Undefined Behavior of C with right shifted signed ints (Tsss!). But usqInt is 32 bits long in a 32 bits VM, so this cast is wrong for 64 bits ints. LargeIntegersPlugin v2.0 requires a backport of this specific change.

But that's not all. The SqueakVMUNIXPATHS.xcodeproj project used to compile on Mac lacks a settings for operating on 64 bits ints:


Missing Xcode Project Setting

I think that's all I had to do to make it work, so here are the first results of largeIntegerPlugins v2.0 (right column), compared to a 4.10.10 VM (left column) compiled on same old MacMini computer.

Micro benchmark on basic LargeInteger operations (# ops per seconds)

The micro-benchmark shows a poor performance on +. As we can see, the operations which should be theoretically proportional to bit-length, is not. Which means that most time is spent in primitive overhead. The v2.0 plugin has more overhead, because it operates on 32-bit words, and the final word is generally too large (has more than 8 leading zero bits). So a final normalization requires one more allocation and copy in a smaller LargeInteger, which spoils efficiency by a factor 2. Theoretically, we could avoid the copy and just hack the header of the LargeInteger to modify it's length but this part of object model is unbelievably complex, so I avoided hacking it so far.

Another dumb benchmark, running Squeak 4.5 KernelTests-Number takes 6.2 seconds with 2.0 plugins versus 7.8 seconds for VM 4.10.10 (2.0 seconds vs 3.6 seconds in COG).

Source code can be found in my SmalltalkHub repository http://smalltalkhub.com/#!/~nice/NiceVMExperiments, at VMMaker-nice.311 or more to date VMMaker-nice.315.

There are still a few items on the TODO list, all for the BigEndian VM cases:

implement a decently fast 8-bit digit at: and at:put: primitives both on COG and interpreter;
check about image segments (they might require byte swapping too);
handle the primitives that copy bytes (at least abort them).