Saturday, March 31, 2018

Slow HDF5 progress

I've been struggling with DLLCC, and found another bizarre behavior that I reported on S.O., http://stackoverflow.com/questions/49564024/isnt-pointer-type-checking-disabled-in-dll-c-connect-and-is-that-ok. I then took some time to reread the F... manual.

I think that I have deeper understanding of DLLCC now and published a new patch using double dispatching on the public store. This enables passing a Smallapack.CArrayAccessor as argument to the external function call. What's this thing (the name is not so nice)? It's a proxy to some C data (on heap), somehow like a CPointer (a CType plus a pointer on C data), but with two differences: 
  1. it carries a length, and thus enables safe access of contents from within the Smalltalk world. If you think of it two minutes, every pointer should be bounded, that's clearly the way to go. Alas, in 2018 we still are at assembler level with those C API, so the pointer coming from external world still are unbounded (a pity for safety...).
  2. it is using 1-based index and enables handling from within Smalltalk world like any other collection without too much mindstorm.
With the time spent on DLLCC, the progress was slow on HDF5, I just corrected the H5TString transfer. The documentation was not clear about fixed size null terminated strings. I first thought (or read?) that the terminating null was mandatory. But then the type H5T_C_S1 - a C type string of fixed size 1 - makes no sense! So I speculated that an extra byte was allocated for the terminator... This was wrong! If not enough place is allocated, then the null terminator is omitted. The difference of NullTerminated and NullPadded HDF5 Strings is then germane: the former ends before first null, the later at last non null, but in neither case null is mandatory...

Along with the CArrayAccessor change, I can now explore the contents of a HDF5 file created by Matlab save -v7.3. a Small step in the right direction.

Thursday, March 29, 2018

HDF5 interface requires a patch in DLLCC

Apologies to those who would have tried the HDF5 interface (if any): I completely forgot that I had to patch argument coercion in DLL/C-Connect to let it work.

I suspect that this is a bug of DLLCC, see http://stackoverflow.com/questions/49544642/why-cant-i-pass-an-uninterpretedbytes-to-a-void-thru-dll-c-connect

I have published a new bundle with the override. Alternate solution would be to either pretend that the function can take an _oopref* buffer argument, or to go thru copyToHeap copyFromHeap: complications that I absolutely do not want (HDF5 data can be big, thus I do not wish any extra/unecessary copy).

Monday, March 26, 2018

Interfacing Smalltalk with HDF5

I am going to speak about connecting Smalltalk with external (scientific) data again.

This time, it's not with Matlab, but HDF5, the hierarchical data format supported by http://www.hdfgroup.org 

Every language targetting science/engineering niche must have an interface to HDF5. That is the case of Python and Matlab to only cite two. And you know that I'd like to promote the usage of Smalltalk in this area too. So lets do it: here comes the HDF5 bundle in Cincom public store.

If interfacing with Matlab mat-file format was a piece of cake, HDF5 is much more involved. First, because HDF5 is like a file system. There are recursive named Group of data, like directories. And groups are not necessarily arranged as a tree, but can form arbitrary graphs (circular) thanks to links. Second, because HDF5 comes with a type system. It can hold arbitrary types, whether atomic (integer, floating point,bit fields) or composite (structure, arrays). The types can even be references to other objects, which means that it is sufficiently general to describe heterogeneous collections of dynamically typed data.

There are two kind of way to store data: the first is Dataset which are named entries in Groups (a bit like files in directories). a Dataset is a rectilinear multiple dimension array (like the MultipleDimensionArray that I recently promoted in Visualworks public store and Squeak STEM). The second is Attributes. All named entries (Group, Dataset and named types) can have attributes, which are kind of string-key arbitrary-value pairs. Attributes generally hold meta data. They lack the ability to perform the read/write of sub-regions of data. We somehow can compare attributes to the property list of Morphic.

For additional complexity from the interfacing point of view, arrays can be of variable length. Thus the buffer for holding complex data cannot be preallocated from within Smalltalk before the data transfer, but has to be allocated on the fly by the HDF5 library. Thats potentially means either memory leaks or dangling references to freed memory, or the two if the programmer worked too late at night.

HDF5 comes with lot of documentation including reference, user guide, tutorial and examples if you want to learn more.

Among the many features of HDF5, let's focus of some of the most essential:
  • the ability to perform type transformations during transfer operations. This enables language interoperability, since the type system is flexible enough to accomodate many languages (if not all thru C interface).
  • the ability to read/write sub-regions of dataset. This enables handling of huge data, the whole dataset does not have to reside in memory.
  • it scales well (or at least it can if used adequately).
That being said, this comes with a price: HDF5 is complex, and like the manipulated data, the API is bigger than big too.

It's clear that the target languages that HDF5 creators had in mind were statically typed. So the mapping to dynamically types objects is not straightforward, nor optimized, especially for the compound type (structure). The transfer necessarily involves two steps: the transfer or raw data, followed by a transformation to Smalltalk data for read, et vice et versa for write. For handling all cases, including references, a visitor pattern will be necessary (references can be cyclic too).

For huge data, the way to go is to use proxy to HDF5 objects. That means giving minimal behavior to the proxys, at least for storing/retrieving whole data or subregions. Application specific behavior should better lie in application specific objects, and those objects would use a generic HDF5 proxy. Since data structure is hierarchical (recursive), an application specific visitor should construct the application specific object graph. The early implementation that I just published is not yet there. It does not even use a visitor, but hardcode an arbitrary visit in the HDF5 proxies. It's more a minimal proof of concept at this stage, but a usable proof of concept though.

So, how do we use it? For creating a HDF5 file, first do this:

out:= H5File create: 'foo.h5'.
H5Dataset createPath: 'float1' parent: out rootGroup value: 1.3e0.
H5Dataset createPath: 'double2' parent: out rootGroup value: Double pi.
H5Dataset createPath: 'int3' parent: out rootGroup value: -357.
out close.


The close operation is not strictly necessary, because I implemented a registry of opened hdf5 objects with auto-close facility when the entries are reclaimed. But forcing a close fushes the file, otherwise the close will be delayed until all opened HDF5 entities have been reclaimed by garbage collector.

For reading, it's like this:

in:= H5File readOnly: 'foo.h5'.
float1 := (in / 'float1') value.
double2 := (in / 'double2') value.

int3 := (in / 'int3') value.
in close.


This example is a bit poor. We can also create/query groups, attributes and handle more complex data, thanks to MultipleDimensionArray, RawArray for atomic data types, the CArrayAccessor of Smallapack for arbitrary data. 

For writing Smalltalk objects on HDF5 files, one must define these 3 essential selectors:
  • h5mem returns a buffer containing the data to transfer and usable by HDF5 API (must be compatible with DLL-C-Connect interface)
  • h5type gives a HDF5 type description of buffer contents
  • h5space gives  a HDF5 description of buffer layout (dimensions of the dataset)
Last note: I've used HDF5 version 1.8.10 for the interface. Unfortunately, HDF5 use macros for enabling backward compatibility, but the way DLL/C-connect works, we will have to subclass the H5Interface in order to support various versions.

That's all for this post, I may add more implementation details later on.

Sunday, March 25, 2018

MatFileReader is ported to Squeak

I've ported the MatFileReader package initially written in Visualworks to Squeak. It should work unchanged in Pharo but I did not test. Usage in Squeak is simply:

    (MatFileReader on: (StandardFileStream readOnlyFileNamed: 'test.mat') binary) decode.

This will answer a Dictionary with workspace variable names as keys and MxArray as values which are essentially stubs by now. With a visitor pattern or by subclassing MatFileReader, it is possible to map the values to other Smalltalk classes.

Code repository is on squeaksource, http://www.squeaksource.com/STEM.html

    MCHttpRepository
        location: 'http://www.squeaksource.com/STEM'
        user: ''
        password: ''.

STEM stands for Smalltalk Tools for Engineering and Mathematics (or Squeak Tools if you're tainted). The name is short, close to the main acronym for Science Technology Engineering Mathematics, and otherwise have quite many meanings (http://en.wikipedia.org/wiki/Stem) . I think I will use it by now.

I've not assembled any MonticelloConfigurationMap nor prepared any Metacello ConfigurationOfSTEM, so by now you'll have to install all the packages in this repository by yourself.

Note that STEM is not a concurrent of Polymath (http://github.com/PolyMathOrg/PolyMath). Polymath is for Pharo, and I don't yet use Pharo, I'm far more fluent in Squeak. License is MIT, and volunteers for porting/integrating into Polymath are welcome if the library sound interesting enough. I just ask to have a little respect for authorship, if ever the environment still permits it.

Thursday, March 15, 2018

Importing Mat-file in Smalltalk

A very long time since I did not write anything here...

I've been busy, but finally found time to release a small utility in Cincom Smalltalk public store: it's a package named MatFileReader.

The package is capable of importing data from the .mat files produced by Matlab, into Visualworks. Nothing spectacular, but it may be useful for scientific applications. Usage is fairly simple, just do-it:

     (MatFileReader on: 'foobar.mat' asFilename readStream) decode.

I had developped the package around 2005, but it was messy and full of personnal dependencies. The dependencis are now reduced to the BinaryStreamRawArray and MultipleDimensionArray packages.
  • RawArray are collections holding bits of data efficiently in space - bits are translated into/from Smalltalk objects with at: and at:put: operations like a ByteArray or a WordArray.
  • BinaryStream is just a Stream wrapper with convenient messages for reading chunks of binary data, especially into RawArray.
  • MultipleDimensionArray is just a SequenceableCollection wrapper which let the single dimension raw storage be viewed with multiple dimensions. The layout is that of FORTRAN and Matlab, column-wise (first index varies first).
Also two other dependencies are the unzip from PNG readers and EphemeralValueDictionary from Glorp which have been used for reading compressed data and automatically reclaiming classes respectively (classes are created on the fly for holding Matlab struct data). When Ephemerons will be ready in Squeak, this package can provide a good test.

The package would be more useful if data were imported into some application specific classes (like Smallapack for example), but for that one just have to either subclass MatFileReader, or post-process with a tree visitor.

License is M.I.T.

When I'll have time, I'll also handle the export to mat-file, and try to port into Squeak/Pharo. Such package could be useful for Polymath (https://github.com/PolyMathOrg/PolyMath).