With the latest release of Core, I've had occasion to think about how
our libraries differ from INRIA's. One difference
that's been there from the very beginning is
in the List module: INRIA's list functions are not tail-recursive, and ours are.
The obvious reason to prefer tail-recursive solutions is that they are able to handle lists of arbitrary length, but they have a downside as well. INRIA's list implementation is not tail recursive largely because of performance, as described by Xavier here. All that heap allocation you get by mapping and then reversing the list really costs.
The key tradeoff that Xavier points to is the choice between running fast on small-to-medium lists, and running at all on long lists. A few different people have opined that lists don't make sense for large datasets anyway, which argues in favor of the choice made in the standard library of using the tail-recursive version. But I've never bought this argument. It's hard to predict what your code is going to be used for, and you don't want to have landmines where your code simply blows up (rather than degrading gracefully in performance) when run on an unexpectedly large dataset. Using a non-tail-recursive list implementation leads to such brittle behavior.
So, it occurred to me, can we have the best of both worlds? As Xavier
points out, there are "magic" implementation that you can do that use
the dreaded Obj.magic, but Obj.magic is notoriously difficult to
reason about, and is really the same thing as hacking the compiler.
Among other things, it leaves you with no guarantees when new versions
of the compiler are released.
ExtLib takes this approach,
but it's never been something we've been comfortable with.
But what if we write a version that detects when we're dealing with a big list, and dynamically switches implementations? Here's a simple version.
open Core.Std let rec count_map ~f l ctr = match l with | [] -> [] | hd :: tl -> f hd :: (if ctr < 5000 then count_map ~f tl (ctr + 1) else List.map ~f tl) let map ~f l = count_map ~f l 0
This works a lot better. It's a little bit slower than the standard
List.map for small lists, and about the same as the tail-recursive
List.map for large lists. But we can do better still. There are
two more optimizations I played with. The first is to do a little
loop unrolling on the recursion, and the second is to to deal with the
large list case going through arrays, as suggested in a post by
Christophe
Troestler.
Here's the resulting code:
open Core.Std let list_array_map ~f l = Array.to_list (Array.map ~f (Array.of_list l)) let rec count_map ~f l ctr = match l with | [] -> [] | [x] -> [f x] | [x;y] -> [f x; f y] | [x;y;z] -> [f x; f y; f z] | x :: y :: z :: w :: tl -> f x :: f y :: f z :: f w :: (if ctr > 500 then list_array_map ~f tl else count_map ~f tl (ctr + 1)) let map ~f l = count_map ~f l 0
This implementation does better still. It's actually faster than the
standard implementation on short lists, and only a little slower on
long lists. Here are some very rough benchmarks (done on an x86-64
box). Here, the mean and standard deviations are of the ratio of the
implementation versus the implementation in the standard library.
"core" is the implementation currently in Core, and "fast" is the
above implementation.
## list length 0 ##
core: mean 1.648838, nstdev 0.043502
fast: mean 0.717259, nstdev 0.043177
## list length 1 ##
core: mean 2.113085, nstdev 0.075585
fast: mean 0.596140, nstdev 0.049489
## list length 2 ##
core: mean 1.989603, nstdev 0.044707
fast: mean 0.636450, nstdev 0.003376
## list length 5 ##
core: mean 2.003528, nstdev 0.043638
fast: mean 0.821950, nstdev 0.024802
## list length 10 ##
core: mean 1.428904, nstdev 0.016536
fast: mean 0.729491, nstdev 0.018766
## list length 20 ##
core: mean 1.443628, nstdev 0.062703
fast: mean 0.743741, nstdev 0.018308
## list length 100 ##
core: mean 1.301089, nstdev 0.019097
fast: mean 0.898968, nstdev 0.017839
## list length 1000 ##
core: mean 1.719725, nstdev 0.025758
fast: mean 0.950799, nstdev 0.018624
## list length 25000 ##
core: mean 1.721275, nstdev 0.044541
fast: mean 1.188690, nstdev 0.031437
The performance improvement seems worth the ugly code given how common map is. I suspect some hack like this will make it's way to a future version of Core.
Comments
Error in first snippet?
The first count_map never applies
ftohd.Good catch
It's fixed.
tail allocation
Probably the right thing to do is hack Ocaml to do tail-allocation. We did this in TIL precisely for this reason -- you get a fast, tail-recursive map. The only downside is coordination with the garbage collector, since you end up needing write-barriers. (But maybe this is what ExtLib does using magic?)
tail allocation
This is exactly what ExtLib does. I don't like playing around with
Obj.magicunnecessarily, though, which is why we didn't go that way. Also, there's a runtime cost to the write-barrier, which makes me suspect that my solution above is faster than ExtLib's. What I really should do is add the ExtLib implementation to this benchmark and see how well it performs.F# implements map similarly
F# implements map similarly under the hood, and I think the F# guys got it right.
Compromise
There is a compromise between the "no tail-recursion" implementation and the List.rev/Array.of_list workaround. You can process the result by "chunks" : partition the list into sublists of length K, recursively, compute "map" on them them accumulate the results. If K is small enough, you can use the fast non-tailrec method to map each sublist, and the call depth of your function is divided by K (compared to the non-tailrec map wich process elements one by one) so with a K large enough you avoid overflow on large lists.
You can also add your "standard behaviour on small list" trick if necessary :
(I used chunk_size for the "count_map" limit but it could be an independent parameter)
The maximum call stack depth of chunk_map on a list of size N (and K = chunk_size) is K + N / K. With K = 5000, it effectively protects from overflow on systems with a reasonable call stack limit. The "count_map" trick adds another K to the call depth, and may not be worth it.
According to my measurements, this method (wich I originally developped on fold_right for thelema on #ocaml) is faster than both List.rev or Array.of_list. It is not surprising as it does not allocate anything.
A more interesting phenomenon is that is also outperforms the non-tailrec List.map on medium to large lists. I think this has to do with the GC : when the non-tailrec List.map or map_prefix iterate through a List, they have to keep every element in memory; when the list is a bit large, this has a non-neglectible cost (I have not tried but I think you could see sensible improvements by tweaking the GC parameters to have a bigger minor GC heap, that sort of things). What chunk_map does is that it ignores the first K items, then process the tail : after the tail result is processed, the tail elements are no longer useful and can be collected. In effect, there are at most K items alive during a chunk_map run, instead of all the elements of the list. When K is too big (I suppose bigger than the minor heap), performances will degrade.
Edit : I forgot to mention that this algorithm relies on the fact that we can compute the map of the tail before the map on the first K elements. If one wanted to enforce a left-to-right order of applications of the mapping function (as is done in the stdlib List.map, but isn't with the List.rev workaround), she would have to modify the function to first compute the map of the K elements, then the tail, then concatenate the two. I expect this solution to be less efficient than the Array.of_list workaround, wich also allows a correct order (beauties of arbitrary index access).
If the order of application is unimportant, it is also possible to reuse my (similar) implementation of fold_right, to just have :
let map f li = fold_right (fun a tl -> f a :: tl)It is a less efficient than the inlined version (I measure a ~15% slowdown, but that's still faster than List.rev/Array.fold_right), but allows to localize the "code ugliness" in fold_right.