From henry.cejtin at sbcglobal.net Wed Jan 7 10:31:20 2009
From: henry.cejtin at sbcglobal.net (Henry Cejtin)
Date: Wed Jan 7 10:31:54 2009
Subject: [MLton] Re: [MLton-user] How to write performant network code
Message-ID: <895747.66489.qm@web82408.mail.mud.yahoo.com>

I certainly am not arguing against calling memcpy(), but note that for
small chunks of memory, the overhead of the extra call to C (including
passing parameters on the stack) could slow things down. Still, probably
the way to go. For large copies, it can often be worthwhile to check the
alignment and then move a word at a time.

I don't have any idea on the hashing problem.

From wesley at terpstra.ca Thu Jan 15 07:52:35 2009
From: wesley at terpstra.ca (Wesley W. Terpstra)
Date: Thu Jan 15 07:53:13 2009
Subject: [MLton] Re: [MLton-user] How to write performant network code
In-Reply-To: <162de7480901141324x722066f7v16e6accc153cf352@mail.gmail.com>
References: <162de7480901070749y1cbe1411gea793f7ee66300a6@mail.gmail.com>
	<162de7480901141324x722066f7v16e6accc153cf352@mail.gmail.com>
Message-ID: <162de7480901150752h5f90be2eu60bf3e26b1ccd2a8@mail.gmail.com>

(moved from mlton-user)

On Wed, Jan 14, 2009 at 10:24 PM, Wesley W. Terpstra wrote:
> Have you noticed that calling Word32.fromLarge o
> PackWord32Little.subVec will generate this:
>         call    WordU32_extdToWord64
>         call    WordU64_extdToWord32

In general, 64-bit Words/Ints suck in MLton 32-bit because it just
passes the work to a C call. Wouldn't it make more sense to implement
a Word64 using Word32 * Word32 and do the arithmetic in the basis
library? The conversion to/from LargeWord would then be automatically
detected by the optimizer as being useless. Then we would just pick to
use the real Word64 for 64-bit machines and the fake Word64 on 32-bit.
The problem with my proposal is of course that tuples are not
FFI-friendly.
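
As a rough illustration of the Word32 * Word32 idea, here is a minimal
sketch, written in C rather than SML because C makes the carry handling
explicit. The names w64_pair, w64_from_u32, w64_to_u32, w64_add and
w64_demo are made up for this example; they are not MLton or Basis
Library identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "fake Word64": two 32-bit halves, as one might use on a
   32-bit target. Not an MLton data structure. */
typedef struct {
    uint32_t hi; /* upper 32 bits */
    uint32_t lo; /* lower 32 bits */
} w64_pair;

/* Zero-extend a 32-bit word (the analogue of WordU32_extdToWord64). */
static w64_pair w64_from_u32(uint32_t x)
{
    w64_pair r = { 0u, x };
    return r;
}

/* Truncate to the low 32 bits (the analogue of WordU64_extdToWord32). */
static uint32_t w64_to_u32(w64_pair x)
{
    return x.lo;
}

/* 64-bit addition built from two 32-bit additions plus a carry. */
static w64_pair w64_add(w64_pair a, w64_pair b)
{
    w64_pair r;
    r.lo = a.lo + b.lo;                 /* wraps modulo 2^32 */
    r.hi = a.hi + b.hi + (r.lo < a.lo); /* carry out of the low half */
    return r;
}

/* Small self-check: carry propagation and the extend/truncate round trip. */
static void w64_demo(void)
{
    w64_pair max_lo = { 0u, 0xFFFFFFFFu };
    w64_pair sum = w64_add(max_lo, w64_from_u32(1u));
    assert(sum.hi == 1u && sum.lo == 0u);
    assert(w64_to_u32(w64_from_u32(0xDEADBEEFu)) == 0xDEADBEEFu);
}
```

With this representation, extending and then truncating is just the
identity on the low half, which is why an optimizer that can see through
the pair would be able to drop the round trip.
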

I looked into the ssa directory to see how to implement an
optimization pass that detects and simplifies these cases:

  x_1107: word64 = WordU8_extdToWord64 (x_1108)
  x_1106: word32 = WordU64_extdToWord32 (x_1107)

  x_1227: word32 = Word8Vector_subWord32 (x_1072, x_1074)
  x_1226: word64 = WordU32_extdToWord64 (x_1227)
  x_1225: word32 = WordU64_extdToWord32 (x_1226)

I'm not really sure how to do this. It seems fairly easy to detect two
lines one after another that can be combined (like the above example),
but I don't know how to be sure that x_122[56] are used nowhere else.

Also, this approach wouldn't result in nearly the same performance
gains as the Word64 = Word32 * Word32 approach. One could also
implement the Word64_ primapps in the x86 codegen to avoid some of the
overhead (seems fairly straight-forward).

> How about I just add a MLton.Socket.Address.toVector which simply
> exposes the underlying Word8Vector.vector in network byte order?

The (completely trivial) patch is attached. Ok to commit?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: addressToVecctor.patch
Type: text/x-patch
Size: 739 bytes
Desc: not available
Url : http://mlton.org/pipermail/mlton/attachments/20090115/2b55d95f/addressToVecctor.bin

From fluet at tti-c.org Thu Jan 15 20:43:12 2009
From: fluet at tti-c.org (Matthew Fluet)
Date: Thu Jan 15 20:46:52 2009
Subject: [MLton] Re: [MLton-user] How to write performant network code
In-Reply-To: <162de7480901141324x722066f7v16e6accc153cf352@mail.gmail.com>
References: <162de7480901070749y1cbe1411gea793f7ee66300a6@mail.gmail.com>
	<162de7480901141324x722066f7v16e6accc153cf352@mail.gmail.com>
Message-ID:

(moved from mlton-user)

On Wed, 14 Jan 2009, Wesley W. Terpstra wrote:
> On Mon, Jan 12, 2009 at 5:13 AM, Matthew Fluet wrote:
>> Does memcpy (or memmove, since the *Array{,Slice}.copy functions needs to
>> work with potentially overlapping regions) do anything more than a
>> word-by-word copy?
>
> Yes.
> memcpy is usually hand-crafted and extremely fast assembler. It
> uses SSE and other tricks. Is it safe to also modify Word8Array.vector
> to use memcpy?

You would want to modify the implementation of Word8Array.vector to
create an uninitialized array, memcpy into the new array, and then cast
from array to vector. So, yes, that would be safe.

> What about polymorphic Array.vector?

That gets a bit trickier. You want to be careful about using memcpy on
polymorphic arrays. The issue is that it constrains the to and from
arrays to be (permanently) of the same type. For example, if you have
two "(int * bool) array"s and copy from the first into the second, but
the second never uses the bool component, then under the
element-by-element copy, MLton could drop the bool component of the
second array and compensate during the element-by-element copy by only
writing the int component. But, if you require a memcpy, then the src
and dst need to be of exactly the same type. This applies as well to the
Word8Array case, but it seems less likely that you copy from a
Word8Array.array to a Word8Array.array and never use the destination
Word8Array.array. On the other hand, with a polymorphic array
instantiated with an abstract type, there seem to be a lot more
opportunities for pruning unused components. So, I would limit it to
WordArray{,Slice} for now.

Another difficulty with polymorphic arrays is that it isn't until late
in compilation that you know the size of the array elements. The memcpy
needs that information to know how much to copy.
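
To make the two constraints above concrete (the copy must tolerate
overlapping regions, and it has to know the element size), here is a
minimal sketch in C. seq_copy and its parameters are hypothetical names
chosen for this example; this is not code from the MLton runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `count` elements of `elem_size` bytes each
   from src[src_off..] to dst[dst_off..]. Lengths and offsets are counted
   in elements, so no interior pointer is ever handed to the caller. */
static void seq_copy(const void *src, size_t src_len, size_t src_off,
                     void *dst, size_t dst_len, size_t dst_off,
                     size_t count, size_t elem_size)
{
    /* Bounds checks written so that they cannot overflow. */
    assert(src_off <= src_len && count <= src_len - src_off);
    assert(dst_off <= dst_len && count <= dst_len - dst_off);

    /* memmove rather than memcpy: src and dst may be the same array. */
    memmove((char *)dst + dst_off * elem_size,
            (const char *)src + src_off * elem_size,
            count * elem_size);
}
```

The byte count is count * elem_size, which is exactly the piece of
information that is not available for a polymorphic array until the
element representation has been chosen.
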

BTW, since we don't support interior pointers, the copy needs to have
types like:

  Word8Array_copy : (Word8.t array (* src *) *
                     SeqIndex.t (* src offset *) *
                     Word8.t array (* dst *) *
                     SeqIndex.t (* dst offset *) *
                     SeqIndex.t (* count *)) -> unit
  Word8Vector_copy : (Word8.t vector (* src *) *
                      SeqIndex.t (* src offset *) *
                      Word8.t array (* dst *) *
                      SeqIndex.t (* dst offset *) *
                      SeqIndex.t (* count *)) -> unit

It might be worth adding these as primitives, though it isn't clear that
we can optimize much with regards to them. If, for instance, the
destination array is never read from, then we could drop the copy. But,
that seems unlikely to arise in realistic code.

>> A while ago, I added a primitive (structural) polymorphic hash:
>> http://mlton.org/cgi-bin/viewsvn.cgi?view=rev&rev=6352
>> It would seem to suit your purposes: you can use it to hash any value,
>> including datatypes.
>
> This is very nice and I didn't know about it. Unfortunately, it's not
> enough because I need a universal hash function (one that takes a
> 'seed' with the value to hash).

For a given program, MLton.hash is a function (that is, it always
returns the same hash value for structurally equivalent inputs). So, why
can't you take the result of MLton.hash and munge it with your 'seed'?
Or, better, you can always use (fn x => MLton.hash (seed, x)), so that
you hash your seed along with the structure of interest.

> Also, one still needs to be able to
> serialize a network address out to the network in some situations. (eg
> to say: send reply message to X, not me)

Fair enough. Though, in that situation, isn't it better to go through
the Basis Library functions? Blast writing a struct sockaddr to the
network might not be blast readable by another arch/os unless you
guarantee that the sizes, alignment, and padding are all the same.
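
One way to sidestep the layout concern above is to write each field out
byte by byte in a fixed order instead of copying the struct. Here is a
minimal sketch in C for an IPv4 address and port. It assumes a made-up
6-byte wire format (4 address bytes, then 2 port bytes, both
big-endian); ENDPOINT_WIRE_SIZE, put_endpoint and get_endpoint are
hypothetical names, not part of MLton or the sockets API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: 4 address bytes followed by 2 port bytes. */
enum { ENDPOINT_WIRE_SIZE = 6 };

/* Encode a host-order IPv4 address and port as big-endian bytes. The
   result does not depend on the sender's struct layout or endianness. */
static void put_endpoint(uint8_t out[ENDPOINT_WIRE_SIZE],
                         uint32_t addr, uint16_t port)
{
    assert(out != NULL);
    out[0] = (uint8_t)(addr >> 24);
    out[1] = (uint8_t)(addr >> 16);
    out[2] = (uint8_t)(addr >> 8);
    out[3] = (uint8_t)addr;
    out[4] = (uint8_t)(port >> 8);
    out[5] = (uint8_t)port;
}

/* Decode the same format back into host-order values. */
static void get_endpoint(const uint8_t in[ENDPOINT_WIRE_SIZE],
                         uint32_t *addr, uint16_t *port)
{
    assert(in != NULL && addr != NULL && port != NULL);
    *addr = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
          | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    *port = (uint16_t)(((unsigned)in[4] << 8) | (unsigned)in[5]);
}
```

Because only shifts and masks are used, the same bytes come out on
little-endian and big-endian hosts alike.
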

From fluet at tti-c.org Thu Jan 15 21:02:11 2009
From: fluet at tti-c.org (Matthew Fluet)
Date: Thu Jan 15 21:05:50 2009
Subject: [MLton] Re: [MLton-user] How to write performant network code
In-Reply-To: <162de7480901150752h5f90be2eu60bf3e26b1ccd2a8@mail.gmail.com>
References: <162de7480901070749y1cbe1411gea793f7ee66300a6@mail.gmail.com>
	<162de7480901141324x722066f7v16e6accc153cf352@mail.gmail.com>
	<162de7480901150752h5f90be2eu60bf3e26b1ccd2a8@mail.gmail.com>
Message-ID:

On Thu, 15 Jan 2009, Wesley W. Terpstra wrote:
> (moved from mlton-user)
>
> On Wed, Jan 14, 2009 at 10:24 PM, Wesley W. Terpstra wrote:
>> Have you noticed that calling Word32.fromLarge o
>> PackWord32Little.subVec will generate this:
>>         call    WordU32_extdToWord64
>>         call    WordU64_extdToWord32
>
> In general, 64-bit Words/Ints suck in MLton 32-bit because it just
> passes the work to a C call.

A lot of the 64-bit ops could be done by the codegen. It does 64-bit
add/andb/neg/notb/orb/sub/xorb natively. The comparisons and extensions
should really be done by the codegen as well.

> Wouldn't it make more sense to implement
> a Word64 using Word32 * Word32 and do the arithmetic in the basis
> library? The conversion to/from LargeWord would then be automatically
> detected by the optimizer as being useless. Then we would just pick to
> use the real Word64 for 64-bit machines and the fake Word64 on 32-bit.
> The problem with my proposal is of course that tuples are not
> FFI-friendly.

I think the FFI-unfriendliness is a show stopper. Position.int is 64-bit
(even on a 32-bit platform), and gets passed back and forth across the
FFI for I/O.

> I looked into the ssa directory to see how to implement an
> optimization pass that detects and simplifies these cases:
>
>   x_1107: word64 = WordU8_extdToWord64 (x_1108)
>   x_1106: word32 = WordU64_extdToWord32 (x_1107)
>
>   x_1227: word32 = Word8Vector_subWord32 (x_1072, x_1074)
>   x_1226: word64 = WordU32_extdToWord64 (x_1227)
>   x_1225: word32 = WordU64_extdToWord32 (x_1226)
>
> I'm not really sure how to do this. It seems fairly easy to detect two
> lines one after another that can be combined (like the above example),
> but I don't know how to be sure that x_122[56] are used nowhere else.

Well, clearly x_1225 is being used somewhere else --- it is a pure
operation, so would be dropped (by the removeUnused pass, if not by the
shrink sub-pass (that is run as a cleanup sub-pass of all the
optimization passes)) if it were unused. It is true that x_1226 might or
might not be unused. But, you can always introduce dead code and allow
one of the aforementioned passes to clean up. That is, with regards to
the second example, it suffices to transform it to:

  x_1227: word32 = Word8Vector_subWord32 (x_1072, x_1074)
  x_1226: word64 = WordU32_extdToWord64 (x_1227)
  x_1225: word32 = x_1227

This local change hasn't changed the meaning of the program, so you can
be confident that any uses of x_1226 and x_1225 are unaffected. If it
turns out that there are no longer any uses of x_1226, then removeUnused
(or shrink) will drop it from the program.

Similarly, in the first example, it suffices to transform it to:

  x_1107: word64 = WordU8_extdToWord64 (x_1108)
  x_1106: word32 = WordU8_extdToWord32 (x_1108)

And, it is likely that x_1107 will be unused and subsequently dropped.

> Also, this approach wouldn't result in nearly the same performance
> gains as the Word64 = Word32 * Word32 approach. One could also
> implement the Word64_ primapps in the x86 codegen to avoid some of the
> overhead (seems fairly straight-forward).
>
>> How about I just add a MLton.Socket.Address.toVector which simply
>> exposes the underlying Word8Vector.vector in network byte order?
>
> The (completely trivial) patch is attached. Ok to commit?

Looks fine. The MLton.Socket interface is supposedly deprecated (a
holdover from the pre Basis 2002 days when the Basis Library networking
modules weren't finalized), but there doesn't seem to be a particular
need to purge it.

From tuulos at gmail.com Mon Jan 19 17:39:06 2009
From: tuulos at gmail.com (Ville Tuulos)
Date: Mon Jan 19 17:39:09 2009
Subject: [MLton] Is it safe to use an alternative malloc with Mlton programs?
Message-ID: <12fdabad0901191739l351201c7ofae65ec077edb726@mail.gmail.com>

Hi

I'm working on a small internal project which uses MLton. We have some
extensions written in C which use Judy arrays
(http://judy.sourceforge.net/). After handling large amounts of data
with Judy, the process' memory space gets really fragmented which slows
down any subsequent malloc() calls (this is not MLton-related per se).

It seems that TCMalloc by Google
(http://goog-perftools.sourceforge.net/doc/tcmalloc.html) handles small
allocations, which Judy does all the time, much better than glibc's
standard malloc.

Based on quick grepping of MLton's sources, it appears that malloc()
is not used in many places - I assume that internal memory handling is
done by mmap().

If this is the case, is it safe to link MLton / C code against TCMalloc?

Ville Tuulos
Nokia Research
Palo Alto

From vesa.a.j.k at gmail.com Tue Jan 20 04:03:53 2009
From: vesa.a.j.k at gmail.com (Vesa Karvonen)
Date: Tue Jan 20 04:03:56 2009
Subject: [MLton] Is it safe to use an alternative malloc with Mlton programs?
In-Reply-To: <12fdabad0901191739l351201c7ofae65ec077edb726@mail.gmail.com>
References: <12fdabad0901191739l351201c7ofae65ec077edb726@mail.gmail.com>
Message-ID: <9e43b9a0901200403x379e39ebj7dab16cea6ef930f@mail.gmail.com>

On Tue, Jan 20, 2009 at 3:39 AM, Ville Tuulos wrote:
[...]
> Based on quick grepping of MLton's sources, it appears that malloc()
> is not used in many places - I assume that internal memory handling is
> done by mmap().

AFAIK that is the case. MLton's GC uses mmap to allocate memory for
the ML heap.

> If this is the case, is it safe to link MLton / C code against TCMalloc?

I don't see how it could be unsafe, but I don't know MLton's GC
implementation very well. The only potential problem that comes to
mind might be that if you incrementally allocate lots of memory from
both the "C heap" (allocated with malloc (whether glibc malloc or
TCMalloc)) and the ML heap, then, if/when a machine has a small
address space relative to the total amount of memory used, you might
fragment the address space so that MLton's GC fails to allocate (via
mmap) a sufficiently large contiguous block of memory for the ML heap
(when the heap needs to grow (or shrink)).

-Vesa Karvonen

From fluet at tti-c.org Tue Jan 20 19:57:30 2009
From: fluet at tti-c.org (Matthew Fluet)
Date: Tue Jan 20 20:01:28 2009
Subject: [MLton] Is it safe to use an alternative malloc with Mlton programs?
In-Reply-To: <9e43b9a0901200403x379e39ebj7dab16cea6ef930f@mail.gmail.com>
References: <12fdabad0901191739l351201c7ofae65ec077edb726@mail.gmail.com>
	<9e43b9a0901200403x379e39ebj7dab16cea6ef930f@mail.gmail.com>
Message-ID:

On Tue, 20 Jan 2009, Vesa Karvonen wrote:
> On Tue, Jan 20, 2009 at 3:39 AM, Ville Tuulos wrote:
> [...]
>> Based on quick grepping of MLton's sources, it appears that malloc()
>> is not used in many places - I assume that internal memory handling is
>> done by mmap().
>
> AFAIK that is the case. MLton's GC uses mmap to allocate memory for
> the ML heap.

Correct. Profiling and hash-consing will use malloc to allocate memory
outside the ML heap (and unmanaged by the MLton GC).

>> If this is the case, is it safe to link MLton / C code against TCMalloc?
>
> I don't see how it could be unsafe, but I don't know MLton's GC
> implementation very well.

I agree that, if TCMalloc's malloc provides the same behavior as glibc's
malloc, then it should be safe.

> The only potential problem that comes to
> mind might be that if you incrementally allocate lots of memory from
> both the "C heap" (allocated with malloc (whether glibc malloc or
> TCMalloc)) and the ML heap, then, if/when a machine has a small
> address space relative to the total amount of memory used, you might
> fragment the address space so that MLton's GC fails to allocate (via
> mmap) a sufficiently large contiguous block of memory for the ML heap
> (when the heap needs to grow (or shrink)).

That is an issue, but as you note it is an issue with any C-side
allocation.

-Matthew

From spoons at cmu.edu Thu Jan 29 05:15:39 2009
From: spoons at cmu.edu (Daniel Spoonhower)
Date: Fri Jan 30 15:01:39 2009
Subject: [MLton] Re: Signal Handlers in multiMlton
In-Reply-To: <498135B0.20800@cs.purdue.edu>
References: <496BF4E8.3000501@cs.purdue.edu> <496C77CE.7040307@cmu.edu>
	<49802289.4040309@cs.purdue.edu> <498031CB.5020000@cmu.edu>
	<498135B0.20800@cs.purdue.edu>
Message-ID: <4981ABFB.4070605@cmu.edu>

Hi, MLton developers. KC and I have been having a conversation that
started out pretty specific to my initial multiprocessor implementation,
but has drifted toward something that might be of more general interest.
Start from the bottom if you want to follow along.

KC: just one small comment, inline, below.

--spoons

Sivaramakrishnan KC wrote:
> Daniel Spoonhower wrote:
>> My (limited) understanding of signals leads me to think that the signal
>> should be delivered to exactly one thread.
>>
> That is correct.
>
>> In my version of things,
>> there is no way to set signals differently for different pthreads (as
>> pthreads are not exposed to the programmer) so it wouldn't matter which
>> thread handled it. Perhaps you have different needs, however.
>>
> I might. For parallelizing CML, I would like to run preemptive
> schedulers for threadlets on top of every pthread.
> Apparently, the interval timers are shared between the pthreads of a
> process. So I am planning to deliver the alrm signal to every gcstate
> when I get it.
>
>> So as for how to find the correct gcState, I'm not sure. One strategy
>> would be to set the signalIsPending flag for all pthreads and set the
>> limit pointer to zero for those with atomicState == 0. Each thread
>> would then try to enter() and wait for the global runtime lock. (A new
>> check would be needed in ensureHasHeapBytesFreeAndOrInvariantForMutator
>> to force threads to do an enter() when a signal is pending.) The first
>> thread through the barrier that was waiting for that signal could then
>> set up a handler to be run after leave(). This is the easiest way I can
>> see to accomplish this, but perhaps it is too expensive.
>>
> Yes. Since signals are delivered to exactly one thread in C (and not
> necessarily to the thread that generated it), I could just look up
> signalsInfo.signalsHandled for every gcState and find a thread that is
> registered to receive the signal and set its flag and limit. That way
> we'd have the same behavior as C.
>

That sounds reasonable to me. There is, however, already a layer of
multiplexing happening in basis-library/mlton/signal.sml, so this would
add an additional (and perhaps necessary) layer.

>> Having written all of that, it occurs to me that I've never tried to
>> call pthread_getspecific inside of a signal handler. Maybe that will
>> work? If it does, you could zero the limit of only that thread, and
>> then look for signals in maybeSatisfyAllocationRequestLocally (and avoid
>> the big runtime lock).
>>
> Sadly, calling pthread_getspecific() inside the signal handler does not
> get the correct gcState.
>
>> --spoons
>>
> - kc.
>> Sivaramakrishnan KC wrote:
>>
>>> I've made some headway into getting signal handling in multicore mlton.
>>> I have made the global data structures in signals.sml, threads.sml and
>>> Posix/signal.c, local for each thread. But I haven't been able to get
>>> around the problem of getting the correct GC_state after the signal is
>>> handled. One possible work around is to deliver the signal to all the
>>> pthreads that have registered to receive the signal and are not
>>> currently masking it. This is not as bad as it sounds because when
>>> multiple threads are registered for a particular signal in a process,
>>> the signal is delivered to any one of the threads. We'd just be
>>> delivering the signal to all of the threads. Or we could deliver the
>>> signal to any one such thread.
>>>
>>> So is there a way to get the correct GC_state after a signal is handled?
>>>
>>> And, the regression tests that had been failing because of signals, now
>>> work on a single core.
>>>
>>> Daniel Spoonhower wrote:
>>>
>>>> The short answer is that I didn't need signals, and I was trying to
>>>> make the smallest set of changes (especially since I was learning how
>>>> the MLton runtime worked).
>>>>
>>>> One obstacle to getting signals working is that basis/Posix/Signal.c
>>>> uses the symbol gcState frequently. In the trunk of MLton, this refers
>>>> to a single, global data structure. In multiMLton, there is one
>>>> GC_state structure per pthread (accessed via thread-local state) so
>>>> some work would have to be done to get a handle on the correct
>>>> GC_state after a signal is handled.
>>>>
>>>> Another problem is that some of that code assumes it is run in a
>>>> critical section. In the trunk this is accomplished using
>>>> MLton.Thread.atomic*. In multiMLton, these functions only guarantee
>>>> atomicity on a given pthread, not global atomicity. In multiMLton,
>>>> global critical sections are created in the runtime using enter and
>>>> leave (or the macros ENTER* and LEAVE*).
>>>>
>>>> Beyond that I haven't looked too much into signals.
>>>> If you can figure out what needs to happen, I can help you make it
>>>> work in multiMLton.
>>>>
>>>> --spoons
>>>>
>>>> Sivaramakrishnan KC wrote:
>>>>
>>>>> Hey
>>>>>
>>>>> Is there a reason why signal handlers are not yet supported in
>>>>> multiMlton? What would have to be done to support signal handlers?
>>>>>
>>>>> Thanks
>>>>> kc
>>>>>
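
The "flag every registered thread" strategy discussed in this thread can
be sketched in C as follows. This is a simplified model under stated
assumptions, not multiMLton code: struct thread_state stands in for the
per-pthread GC_state, limitIsZero stands in for zeroing the limit
pointer, and NUM_STATES, MAX_SIGNUM, register_signal and
mark_signal_pending are names invented for this example. Only the field
names signalIsPending, atomicState and signalsHandled come from the
messages above.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

enum { NUM_STATES = 4, MAX_SIGNUM = 32 }; /* hypothetical sizes */

/* Simplified stand-in for a per-pthread GC_state; layout is invented. */
struct thread_state {
    volatile sig_atomic_t signalIsPending; /* checked later by the mutator */
    volatile sig_atomic_t limitIsZero;     /* models "limit pointer set to zero" */
    volatile sig_atomic_t atomicState;     /* non-zero inside an atomic section */
    unsigned long signalsHandled;          /* bit n set: state handles signal n */
};

static struct thread_state states[NUM_STATES];

/* Record that state i wants signal signum (called outside any handler). */
static void register_signal(size_t i, int signum)
{
    assert(i < NUM_STATES && signum > 0 && signum < MAX_SIGNUM);
    states[i].signalsHandled |= 1UL << signum;
}

/* What a C-level handler could do: flag every registered state, and zero
   the limit only where the thread is not in an atomic section. It only
   writes sig_atomic_t flags, and returns how many states were flagged. */
static int mark_signal_pending(int signum)
{
    int flagged = 0;
    if (signum <= 0 || signum >= MAX_SIGNUM)
        return 0;
    for (size_t i = 0; i < NUM_STATES; i++) {
        if ((states[i].signalsHandled & (1UL << signum)) == 0)
            continue;
        states[i].signalIsPending = 1;
        if (states[i].atomicState == 0)
            states[i].limitIsZero = 1;
        flagged++;
    }
    return flagged;
}
```

Each mutator would then notice its own flag at its next allocation check,
so the handler never needs to know which pthread it happened to run on.
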