[MLton] PackWord to/from nonsense
Matthew Fluet
fluet at tti-c.org
Wed Jul 8 08:26:37 PDT 2009
On Tue, 7 Jul 2009, Wesley W. Terpstra wrote:
> In our networking code, I worked around this by using _prim
> "Word8Array_subWordX" if MLton is used. This avoids the two C calls casting
> in and out of a 64-bit word for every word written into the data stream.
A number of 64-bit operations can (and should) be implemented by the
native x86 codegen, to avoid the C calls. This should help even in the
presence of conversion optimizations.
> I
> recently ran into trouble on a 64-bit machine because SeqIndex.int is not
> int, and I got a PrimApp error. As a stop-gap measure, I'm open to
> suggestions of an Int/Word type that must match SeqIndex.
You can use the same technique that the Basis Library uses. There is an
(undocumented) MLB path variable SEQINDEX_INT which expands to either
"int32" or "int64", depending on the size of indices of the target
platform. You can nicely package it up in a .mlb file as follows:
** seqindex.mlb
local
$(SML_LIB)/basis/basis.mlb
in
seqindex-$(SEQINDEX_INT).sml
end
** seqindex-int32.sml
structure SeqIndex = Int32
** seqindex-int64.sml
structure SeqIndex = Int64
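With that in place, client code can be written against SeqIndex and compile
unchanged on both 32- and 64-bit targets. A minimal sketch (the Stream
structure and byteAt function are illustrative, not part of the thread):

```sml
(* Sketch: SeqIndex.int is Int32.int or Int64.int depending on the
   target, so both Int32 and Int64 provide toInt/fromInt. *)
structure Stream =
struct
  (* Index a byte array with a target-width sequence index. *)
  fun byteAt (a : Word8Array.array, i : SeqIndex.int) : Word8.word =
      Word8Array.sub (a, SeqIndex.toInt i)
end
```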
> It would be nice to have 'unsafe' versions without the LargeWord baggage
> available somewhere, so _prim isn't needed. Armed with 'unsafe' PackWord, it
> would be easy to implement faster string/Word8Array copies, as discussed
> before.
I'm not sure why you call them "unsafe" versions. Your proposed PACK_WORD
signature (with the "type word" specification) wouldn't be unsafe in any
way.
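For concreteness, the proposed signature presumably amounts to something like
the following: the Basis Library's PACK_WORD signature with an abstract "type
word" specification in place of LargeWord.word (a sketch, not the actual
proposal text):

```sml
(* Sketch: as in the Basis PACK_WORD, but with a monomorphic word type,
   so e.g. a PackWord32Big matching would set word = Word32.word and no
   toLarge/fromLarge conversions are needed at the use site. *)
signature PACK_WORD =
sig
  type word
  val bytesPerElem : int
  val isBigEndian : bool
  val subVec  : Word8Vector.vector * int -> word
  val subVecX : Word8Vector.vector * int -> word
  val subArr  : Word8Array.array * int -> word
  val subArrX : Word8Array.array * int -> word
  val update  : Word8Array.array * int * word -> unit
end
```

Nothing here is unsafe: the bounds checks of the Basis versions are unchanged;
only the conversion through LargeWord is dropped.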
> I'll also note that PackWord represents yet another case where the basis
> library expects MLton to optimize fromLarge o toLarge to nothing.
> ...
> If that conversion optimization were placed before commonArg and knownCase I
> think Int8.fromFixed o Int8.toFixed would even become a no-op with overflow
> checking:
>
> x_1 = ...
> x_2 = WordU8_sextdToWord64 x_1
> x_3 = WordU64_sextdToWord8 x_2
> (* from iwconv0 bounds checking: *)
> x_4 = WordU8_sextdToWord64 x_3
> x_5 = Word64_eq (x_2, x_4)
> raise Overflow exception if x_5 is false
>
> First, comes the new optimization:
> x_3 = x_1
> Then comes commonArg/commSubexp
> x_4 and x_3 are replaced by x_2 and x_1 respectively
> Then comes knownCase:
> Word64_eq (x_2, x_2) is never false -> exception never raised
>
> Am I correct in this assessment?
In general, yes, conversion optimization should be a win. However, the
"clean-up" optimizations aren't commonArg and knownCase. The SSA shrinker
(ssa/shrink.fun) will perform the necessary simplifications:
* copy propagation of x_3 = x_1 (replace all uses of x_3 by x_1 and
eliminate the x_3 variable)
* prim-app folding of Word64_eq (x_2, x_2) to true
* case simplification of a manifest discriminant
knownCase handles case simplification when the discriminant is only
manifest on some of the incoming edges. That is, the SSA shrinker will
get:
L_1:
x_10 = true
case x_10 of true => L_11 | false => L_12
while knownCase will get:
L_1():
x_10 = true
L_4(x_10)
L_2():
x_20 = false
L_4(x_20)
L_3():
x_30 = Word64_eq (x_1, x_2)
L_4(x_30)
L_4(x_40):
case x_40 of true => L_11 | false => L_12
transforming it to:
L_1():
x_10 = true
L_11()
L_2():
x_20 = false
L_12()
L_3():
x_30 = Word64_eq (x_1, x_2)
L_4(x_30)
L_4(x_40):
case x_40 of true => L_11 | false => L_12
It is likely that then the SSA shrinker will be able to eliminate the use
of x_10 and x_20 as unused variables, perform the jump chaining to replace
transfers to L_1 by L_11 and L_2 by L_12, and combine the L_3 and L_4
blocks (assuming that now L_3 is the only predecessor of L_4).
> If so, that's a pretty serious speed-up: 5
> C calls and a potential branch turned into a no-op. Compared to 4 conversions
> in/out of an IntInf, things look even better!