[MLton-user] more optimization questions

brian denheyer briand@aracnet.com
Sat, 19 Nov 2005 10:27:47 -0800


On Nov 18, 2005, at 11:46 PM, Stephen Weeks wrote:

>
>> I'm implementing my nested loops using something like :
> ...
>> So the loops are implemented as for(1, n (fn (i)=> (for (1, m, (fn
>> (j) =>   etc...
>>
>> Any reason why this might cause slowdowns ?
>
> No.  That looks like a fine way to implement for loops to me.
>
That's what I figured - just checking :-)

>> Surprise, surprise, my counting profiling says I'm spending a LOT of
>> time in my array sub and index routines :
>>
>>      fun index (A3{n1, n2, n3, ...}, i, j, k) =
>>          (* 2 mults/2 adds for each index count *)
>>          k + n3 * (j + n2 * i)
>>
>>      fun sub (arr as A3{elems, ...}, i, j, k) =
>>          Array.sub(elems, index(arr, i, j, k))
>
> These also look fine.
>
Good to know.

> To confirm that there is no record construction and no boxing, you
> could do allocation profiling.  Even though time profiling fails,
> allocation profiling should still work because it uses a different
> mechanism.
>
> If there's not much expected allocation other than some arrays, you
> could do tests with various sizes and measure total program allocation
> using @MLton gc-summary -- to verify that the amount of memory
> allocated is roughly what you would expect.

Here's the results.  I think they confirm what is expected :

3,602,876 bytes allocated (5,148 bytes by GC)
                              function                                 
cur
-------------------------------------------------------------------  
-----
Array3.array  array3d.sml: 70                                        
99.3%
<main>                                                                
0.3%
<gc>                                                                  
0.1%
Char0.memoize  <basis>/text/char0.sml: 60                             
0.1%

GC type         time ms  number           bytes       bytes/sec
-------------   ------- ------- --------------- ---------------
copying              10       2       2,070,232     207,023,200
mark-compact          0       0               0               -
minor                 0       0               0               -
total GC time: 19 ms (0.4%)
max pause: 19 ms
total allocated: 3,608,068 bytes
max live: 2,059,164 bytes
max semispace: 18,522,112 bytes
max stack size: 1,080 bytes
marked cards: 0
minor scanned: 0 bytes
minor skipped: 0 bytes

Total memory allocation is pretty much exactly what I get counting up  
array space in the program.


>> Also, is it possible that the integer calculations (for the index)
>> might cause any problems ?  Like maybe they are boxed or converting
>> back and forth to tagged and un-tagged or something similar.
>
> Because you are using the C codegen and because you're using PowerPC,
> which hasn't received anything like the amount of tuning we've done on
> x86, you may see different results there than what you or others have
> seen on x86.
>

That may be, but as Mathew Fluet pointed out on c.l.f, there is not  
much to optimize.  This is a finite difference code.  Loop, do fp,  
loop some more.  No transcendental functions or any complex logic, or  
even bit-twiddling.

That's what makes this so confusing.  IME, optimizing compilers like  
mlton and gambit-c for scheme do very, very well on simple loop/float  
constructs like what I am doing.

As for power-pc optimization, I'm really interested in helping with  
that.  Although with the mac bonehead decision to go to intel I can't  
see that anyone is going to be very motivated to optimize anyting for  
power pc.

> It is possible to turn off overflow detection in the entire program by
> compiling with -detect-overflow false.  If this shows that the index
> computation is your bottleneck, and you don't rely on overflow
> detection there (presumably because you do your own bounds checking on
> each index), then you could replace the index computation's Int
> arithmetic with Word arithmetic, because that doesn't detect
> overflow.

I did that and can see the difference.  It's fairly minor, about  
15%.  However this is a really good suggestion !  There is no way any  
of my integer calcs can overflow, so the checks are competely  
superfluous.

I'll code up some simple examples to see if I can understand the  
intermediate files a little better.  Also smaller examples might help  
me narrow in on what is going on.  There must be something going on -  
the difference is just too big.

Brian