[MLton-user] more optimization questions
brian denheyer
briand@aracnet.com
Sat, 19 Nov 2005 10:27:47 -0800
On Nov 18, 2005, at 11:46 PM, Stephen Weeks wrote:
>
>> I'm implementing my nested loops using something like :
> ...
>> So the loops are implemented as for(1, n (fn (i)=> (for (1, m, (fn
>> (j) => etc...
>>
>> Any reason why this might cause slowdowns ?
>
> No. That looks like a fine way to implement for loops to me.
>
That's what I figured - just checking :-)
>> Surprise, surprise, my counting profiling says I'm spending a LOT of
>> time in my array sub and index routines :
>>
>> fun index (A3{n1, n2, n3, ...}, i, j, k) =
>> (* 2 mults/2 adds for each index count *)
>> k + n3 * (j + n2 * i)
>>
>> fun sub (arr as A3{elems, ...}, i, j, k) =
>> Array.sub(elems, index(arr, i, j, k))
>
> These also look fine.
>
Good to know.
> To confirm that there is no record construction and no boxing, you
> could do allocation profiling. Even though time profiling fails,
> allocation profiling should still work because it uses a different
> mechanism.
>
> If there's not much expected allocation other than some arrays, you
> could do tests with various sizes and measure total program allocation
> using @MLton gc-summary -- to verify that the amount of memory
> allocated is roughly what you would expect.
Here's the results. I think they confirm what is expected :
3,602,876 bytes allocated (5,148 bytes by GC)
function
cur
-------------------------------------------------------------------
-----
Array3.array array3d.sml: 70
99.3%
<main>
0.3%
<gc>
0.1%
Char0.memoize <basis>/text/char0.sml: 60
0.1%
GC type time ms number bytes bytes/sec
------------- ------- ------- --------------- ---------------
copying 10 2 2,070,232 207,023,200
mark-compact 0 0 0 -
minor 0 0 0 -
total GC time: 19 ms (0.4%)
max pause: 19 ms
total allocated: 3,608,068 bytes
max live: 2,059,164 bytes
max semispace: 18,522,112 bytes
max stack size: 1,080 bytes
marked cards: 0
minor scanned: 0 bytes
minor skipped: 0 bytes
Total memory allocation is pretty much exactly what I get counting up
array space in the program.
>> Also, is it possible that the integer calculations (for the index)
>> might cause any problems ? Like maybe they are boxed or converting
>> back and forth to tagged and un-tagged or something similar.
>
> Because you are using the C codegen and because you're using PowerPC,
> which hasn't received anything like the amount of tuning we've done on
> x86, you may see different results there than what you or others have
> seen on x86.
>
That may be, but as Mathew Fluet pointed out on c.l.f, there is not
much to optimize. This is a finite difference code. Loop, do fp,
loop some more. No transcendental functions or any complex logic, or
even bit-twiddling.
That's what makes this so confusing. IME, optimizing compilers like
mlton and gambit-c for scheme do very, very well on simple loop/float
constructs like what I am doing.
As for power-pc optimization, I'm really interested in helping with
that. Although with the mac bonehead decision to go to intel I can't
see that anyone is going to be very motivated to optimize anyting for
power pc.
> It is possible to turn off overflow detection in the entire program by
> compiling with -detect-overflow false. If this shows that the index
> computation is your bottleneck, and you don't rely on overflow
> detection there (presumably because you do your own bounds checking on
> each index), then you could replace the index computation's Int
> arithmetic with Word arithmetic, because that doesn't detect
> overflow.
I did that and can see the difference. It's fairly minor, about
15%. However this is a really good suggestion ! There is no way any
of my integer calcs can overflow, so the checks are competely
superfluous.
I'll code up some simple examples to see if I can understand the
intermediate files a little better. Also smaller examples might help
me narrow in on what is going on. There must be something going on -
the difference is just too big.
Brian