<div dir="ltr"><div>Hi all,</div><div>For image based Karatsuba on Spur64 VM, the breakpoint is at about 500 bytes (4k-bits), but it's a very flat transition rather than a break.</div><div>This is all implemented in Squeak trunk if you want to play with.</div><div>Note that I already gained a huge speed up for naive multiplication by using 32-bits limbs instead of 8-bits limbs in the VM.</div><div>I'm curious, what was the limb size on HPS giving the 384 bits breakpoint?</div><div>For VM implementation of Karatsuba, we can do horrible things like reusing pre-allocated memory...<br></div><div><br></div><div>I had fun with more divide and conquer methods for / (Burnikel Ziegler) and the most beautiful is sqrt (Zimmerman) which beats squared for huge numbers. See posts in this serie: <a href="http://smallissimo.blogspot.com/2019/05/tuning-large-arithmetic-thresholds.html">http://smallissimo.blogspot.com/2019/05/tuning-large-arithmetic-thresholds.html</a>.</div><div><br></div><div>It would be nice to make those accelerated arithmetic optional, and it would be nice too if other pairs of eyes could review and improve code in this area.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Le sam. 7 déc. 2019 à 07:11, Andres Valloud via Cuis-dev <<a href="mailto:cuis-dev@lists.cuis.st">cuis-dev@lists.cuis.st</a>> a écrit :<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I played with these things quite a bit over the years. The identity <br> F(a+b) = F(a+1) F(b) + F(a) F(b-1) can be used to speed up calculations. <br> In particular, the idea is to split the index n = a+b so that a >= b <br> and a = 2^k. Then you can cache F(2^k + 1), F(2^k), and F(2^k - 1) to <br> speed things up (once you have two in any of these triplets, the other <br> one is essentially free). Since these indices are known in advance, no <br> dictionary is necessary. Note this approach requires very little <br> storage space, as opposed to the more naive way of breaking n into e.g. <br> (n / 2) ceiling and (n / 2) floor.<br> <br> I saw this about 20 years ago and implemented it in Squeak 1.x or 2.x. <br> Another bit that came from that work: look at F(5^n) mod 100 and 1000. <br> Fascinating, no?<br> <br> Calculating Fn using that recursion plus caching was a bit faster than <br> Dijkstra's recursion (which, by virtue of its shape, is not as cache <br> friendly). However, when starting from scratch, Dijsktra's recursion <br> was a bit faster. I did some of this around 2008.<br> <br> A very important implementation strategy for this kind of work is to <br> implement Karatsuba's multiplication. I did that in the HPS world, and <br> it paid off once you got past 384 bit x 384 bit multiplications or so. <br> Very quickly, Karatsuba's multiplication complexity of about n^1.6 <br> soundly beats the usual elementary n^2 multiplication algorithm, even if <br> Karatsuba is implemented in the image while the n^2 algorithm is a VM <br> primitive (that, in the HPS case, I also (re)wrote). I have no idea <br> what the breakeven point is in the Cog world.<br> <br> By the way, the simple grade school multiplication algorithm is just <br> fine as a default VM implementation because the usual multiplication <br> case involves relatively small large integers, and in that case the <br> simple algorithm wins because of its simplicity.<br> <br> As a suggestion, you can reimplement LargeInteger>>* to do a size test <br> then delegate to the usual primitives if the sizes are small enough, <br> otherwise you do Karatsuba to get smaller integers then try again. <br> However, I think it's cleaner to implement something like <br> Integer>>karatsubaTimes: (refining it in the subclasses appropriately), <br> then you can send that new message from the places you know matter.<br> <br> A VM implementation of Karatsuba's algorithm comes with subtle problems <br> because it needs to allocate multiple intermediate objects. This means <br> such multiplications will tax the memory manager. It's possible that <br> allocations will fail at any point, and having to stop the calculation <br> in the middle will induce stability problems (e.g. Karatsuba can almost <br> fill memory completely and then fail --- now the image doesn't have <br> enough space to even react to the memory emergency, and your system <br> crashes). It's not that trivial to make a robust VM implementation. In <br> contrast, the grade school multiplication has optimal memory allocation <br> behavior, so if it fails due to an out of memory condition then you know <br> why and what to do. Moreover, it's not that hard to implement Karatsuba <br> in the image, and again the big performance gain comes from n^1.6 vs <br> n^2. Soon enough even the constants won't matter.<br> <br> On 12/6/19 20:26, Agustín Sansone via Cuis-dev wrote:<br> > Hi all,<br> > <br> > I was exploring the /mathematical functions/ category of the class <br> > /Integer/ and a I didn't find the very well-known Fibonacci function <br> > ("/f(n)=f(n-1)+f(n-2)"/) , so I decided to implement it. I think this <br> > will be interesting due to the complexity I found (wich is O(log(n))!!).<br> > <br> > * A simple way to implement it is by writing the direct recursive<br> > mathematical recurrence relation. This would be exponential.<br> > * An iterative solution wouldn't do all this repeated work, therefore<br> > it would be lineal.<br> > * I used a dynamic-programming-divide-and-conquer-aproach wich<br> > computes the n-th Fibonacci number in O(log(n)). I think from this<br> > comment you can see why it works:<br> > <a href="https://math.stackexchange.com/questions/1834242/derivation-of-fibonacci-logn-time-sequence" rel="noreferrer" target="_blank">https://math.stackexchange.com/questions/1834242/derivation-of-fibonacci-logn-time-sequence</a>.<br> > <br> > Here are some running times:<br> > exponential --> [1 to: 40 do: [:elem | elem fibonacciNumber]] timeToRun <br> > 38536<br> > lineal --> [1 to: 10000 do: [:elem | elem fibonacciNumber]] <br> > timeToRun. 25558<br> > logarithmic --> [1 to: 10000 do: [:elem | elem fibonacciNumber]] <br> > timeToRun. 1471<br> > <br> > I leave you all three implementations, let me know what you think,<br> > Agustín<br> > <br> > <br> > <br> > <br> -- <br> Cuis-dev mailing list<br> <a href="mailto:Cuis-dev@lists.cuis.st" target="_blank">Cuis-dev@lists.cuis.st</a><br> <a href="https://lists.cuis.st/mailman/listinfo/cuis-dev" rel="noreferrer" target="_blank">https://lists.cuis.st/mailman/listinfo/cuis-dev</a><br> </blockquote></div>