The previous post described a performance-critical pricer for American options. On reflection, the optimized Mathematica code had introduced many redundant computations in order to vectorize, because vectorization offers considerable performance improvements in such a slow language. Our straightforward translation of the optimized Mathematica had inherited those redundant computations instead of removing them.

This new optimized and parallelized F# solution is now a whopping *960× faster than the original Mathematica* code from Sal Mangano's Mathematica Cookbook:

```fsharp
let americanPut kk r sigma tt =
  let sqr x = x * x
  let ikk, a, nn, mm, tt0 = 1.0 / kk, 5.0, 1000, 200, sqr sigma * tt / 2.0
  let k, h, s = 2.0 * r / sqr sigma, 2.0 * a / float nn, tt0 / float mm
  let alpha, k0, k1 = s / sqr h, 0.5 * (k - 1.0), 0.25 * sqr (k + 1.0)
  let x i = float i * h - a
  let ex = Array.init (nn+1) (fun i -> exp(x i))
  let ek0x = Array.init (nn+1) (fun i -> exp(k0 * x i))
  let pp0 i = max 0.0 (kk - kk * ex.[i])
  let ek0a, k1s, a' = exp(k0 * a), k1 * s, 2.0 * alpha - 1.0
  let mutable u = Array.init (nn+1) (fun i -> ek0x.[i] * pp0 i * ikk)
  let mutable u' = Array.create (nn+1) 0.0
  for j=0 to mm-1 do
    let ek1sj = exp(k1s * float j)
    let k2 = alpha * ikk * ek0a * ek1sj
    u'.[0] <- alpha * u.[1] - a' * u.[0] + k2
              |> max (ek0x.[0] * ek1sj * (1.0 - ex.[0]))
    let inline get (u: _ []) i m =
      max m (alpha * (u.[i+1] + u.[i-1]) - a' * u.[i])
    for i=1 to nn/2 do
      u'.[i] <- get u i (ek0x.[i] * ek1sj * (1.0 - ex.[i]))
    for i=nn/2 to nn-1 do
      u'.[i] <- get u i 0.0
    let u'' = u
    u <- u'
    u' <- u''
  let k3 = kk * exp(-k1 * tt0)
  Seq.mapi (fun i u -> kk * ex.[i], k3 / ek0x.[i] * u) u

Array.Parallel.init 1000 (fun i -> americanPut (float(i+1)) 0.05 0.14 1.0)
```
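For readers who don't read F#, the time-stepping scheme above can be sketched as a line-for-line Python transliteration. This is my own sketch, not code from the post; variable names follow the F# and the boundary handling is reproduced verbatim:

```python
from math import exp

def american_put(kk, r=0.05, sigma=0.14, tt=1.0, a=5.0, nn=1000, mm=200):
    # Same change of variables as the F#: Black-Scholes PDE -> heat equation.
    tt0 = sigma * sigma * tt / 2.0
    k = 2.0 * r / (sigma * sigma)
    h = 2.0 * a / nn                  # space step
    s = tt0 / mm                      # time step
    alpha = s / (h * h)               # explicit-scheme stability number
    x = [i * h - a for i in range(nn + 1)]
    ex = [exp(xi) for xi in x]
    ek0x = [exp(0.5 * (k - 1.0) * xi) for xi in x]
    # Transformed payoff, ek0x * max(0, 1 - e^x); the strike kk cancels here.
    payoff = [e0 * max(0.0, 1.0 - e) for e0, e in zip(ek0x, ex)]
    u = payoff[:]
    for j in range(mm):
        ek1sj = exp(0.25 * (k + 1.0) ** 2 * s * j)
        v = [0.0] * (nn + 1)          # v[nn] stays 0, as in the F#
        # Left boundary update with the same source term as the F# (k2).
        v[0] = (alpha * u[1] - (2.0 * alpha - 1.0) * u[0]
                + alpha * exp(0.5 * (k - 1.0) * a) * ek1sj / kk)
        for i in range(1, nn):        # interior explicit update
            v[i] = alpha * (u[i + 1] + u[i - 1]) - (2.0 * alpha - 1.0) * u[i]
        # Early-exercise constraint: max with the moving obstacle
        # (payoff is 0 for x > 0, matching the F#'s max with 0.0 there).
        u = [max(vi, pi * ek1sj) for vi, pi in zip(v, payoff)]
    k3 = kk * exp(-0.25 * (k + 1.0) ** 2 * tt0)
    # Undo the change of variables: (spot, put price) pairs.
    return [kk * e for e in ex], [k3 * ui / e0 for ui, e0 in zip(u, ek0x)]
```

This reproduces the structure of the algorithm (explicit grid plus obstacle maxing), not its speed; the F#'s buffer swapping and parallel strike loop are deliberately omitted.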

Note that the hardcoded constants nn and mm have been increased, sigma has been decreased to keep alpha just below 0.5, and strikes from 1 to 1,000 are computed in order to make the F# code take a significant fraction of a second (the Mathematica takes several minutes!).
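As a quick sanity check on that claim, the stability number alpha for these settings can be computed directly (a few lines of Python; the names follow the code above, and alpha is independent of the strike):

```python
# Grid parameters hard-coded in the pricer.
sigma, tt, a, nn, mm = 0.14, 1.0, 5.0, 1000, 200

tt0 = sigma ** 2 * tt / 2    # rescaled time horizon
h = 2 * a / nn               # space step
s = tt0 / mm                 # time step
alpha = s / h ** 2           # stability number: 0.49, just under the 1/2 bound
```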

## 11 comments:

This is much better than your previous post (and your comments at the bottom), however some of your claims are wrong again.

“… parallelized F# solution is now a whopping 960× faster than the original Mathematica code”

You didn’t parallelize your code at all. You are just RUNNING it 1000 times in parallel. The original M code isn’t run in parallel to begin with. Of course things get faster when you run it in parallel a few times. Although an F# version is, of course, faster than a M version, which is something we all knew before, your claim that it’s faster than the original M *code* (emphasis added by me) is wrong again. The original *code* was not run in parallel, and your code isn’t parallelized. You just run it in parallel.

If you want to show that running it in parallel in F# is faster than running it in parallel kernels in M, say so. But you say something different. And this would also not be anything new to anyone, because a kernel is a relatively “heavy” process compared to a thread (as it can do much more than a thread). As the thread is the lowest level of scheduling, it’s no surprise that things that require only a thread, and not a full kernel, are faster in threads. Tell us something new.

Unrelated to the speed argument, just setting a to 5, N to 1000 and M to 200 is pure nonsense in terms of the actual mathematics. That means alpha is 4, and as I wrote in the text accompanying the recipe, alpha has to be ≤ ½ for all 100% explicit grid schemes. What happens when you violate this constraint can be seen on my webpage www.lauschkeconsulting.com/explicit.jsp. Just plot the output you are generating with alpha = 4 as a function of stock price (spot), and you can see the garbage you are producing. Sometimes it helps to understand the actual subject matter one is dealing with to prevent utter nonsense.

I generally recommend choosing only a and N, or a and M, independently of each other, and computing the other as the value that brings alpha as close as possible to ½ but never greater, as maximum efficiency is attained at alpha = ½ or near there. The formula for that would be Ceiling[N^2 vol^2/4/a^2]. alpha > ½ is unstable and alpha ≪ ½ is inefficient. One should “shoot for” alpha = ½ (or slightly less).
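Andreas's sizing rule can be sketched numerically; `min_time_steps` is a hypothetical helper of mine implementing his Ceiling formula (with the maturity tt made explicit), and `alpha_of` recomputes alpha for a given grid:

```python
from math import ceil

def min_time_steps(n, vol, a, tt=1.0):
    # Smallest M keeping alpha <= 1/2: Ceiling[N^2 vol^2 tt / (4 a^2)].
    return ceil(n * n * vol * vol * tt / (4 * a * a))

def alpha_of(n, m, vol, a, tt=1.0):
    h = 2 * a / n                 # space step
    s = vol * vol * tt / 2 / m    # time step
    return s / (h * h)

# For the grid in the post (N = 1000, vol = 0.14, a = 5) this gives
# roughly 196 time steps, landing alpha at the efficient value of ~1/2.
m = min_time_steps(1000, 0.14, 5.0)
```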

@Andreas:

"You are just RUNNING it 1000 times in parallel". I changed the serial loop in the original into a parallel loop and improved scalability by avoiding contention for shared resources.

"The original M code isn’t run in parallel to begin with". Yes, I wonder why the author of the original "performance critical" Mathematica code did not parallelize it.

"it’s no surprise that things that require only a thread, and not a full kernel, are faster in threads". Like most modern solutions, F# uses tasks and not threads.

"Sometimes it helps to understand the actual subject matter one is dealing with to prevent utter nonsense". On the contrary, I found it very useful to learn that Mathematica becomes 100× slower when running a program that produces such numbers. That was actually another one of the reasons why I chose to stop using Mathematica for serious work.

@Andreas: I have updated the code to use sigma=0.14 so that alpha<0.5. The performance results and conclusions are unaffected.

I'm not an options trading expert, nor an F# expert.

Is this the sort of thing that “flash” traders are using? If so, is faster better in real usage vs study?

In any case I enjoy the Zen of F# approach Jon brings to all his blog and article writing.

Jon,

I was thinking of not posting on this topic any more, but your numbers really puzzled me, so I spent an hour and ran the tests. Here are my findings on the M side. To recap, here is my original (improved) version:

```mathematica
americanPutCompiledAlt = Compile[{kk, r, sigma, tt},
   With[{a = 5, nn = 1000, mm = 200, tt0 = sigma^2 tt/2, k = 2 r/sigma^2},
    Module[{alpha, h = 2 a/nn, s = tt0/mm, x, ss, tmax, f, pp0, u, z,
      mflags, fparts, fconst},
     alpha = s/h^2;
     x = Range[-a, a, h];
     ss = kk*Exp[x];
     tmax = Clip[1 - Exp[x], {0, 1}];
     pp0 = Clip[#, {0, Max[#]}] &[kk - ss];
     z = u = Exp[0.5 (k - 1) x] pp0/kk;
     fparts = alpha/kk Exp[0.5 (k - 1) a + 0.25 (k + 1)^2 Range[0, mm - 1] s];
     fconst = Exp[1/2 (k - 1)*x]*tmax;
     Do[
      z[[1]] = alpha*u[[2]] - (2 alpha - 1) u[[1]] + fparts[[j]];
      z[[2 ;; -2]] =
       alpha (Take[u, {3, nn + 1}] + Take[u, {1, nn - 1}]) - (2 alpha - 1)*Take[u, {2, nn}];
      z[[-1]] = 0;
      f = fconst*Exp[1/4 (k + 1)^2*(j - 1)*s];
      mflags = UnitStep[z - f];
      u = mflags*z + (1 - mflags) f;
      , {j, mm}];
     {ss, kk*Exp[-0.25 (k + 1)^2 tt0] Exp[-0.5 (k - 1) x] u}]]];
```

I modified the code somewhat upon reflection. This increased speed by only about 10 percent, but the code is now significantly shorter as well:

```mathematica
americanPutCompiledAltNew =
  Compile[{kk, r, sigma, tt},
   With[{a = 5, nn = 1000, mm = 200, tt0 = sigma^2 tt/2, k = 2 r/sigma^2},
    Module[{alpha, h = 2 a/nn, s = tt0/mm, ss, tk, u, fparts, fconst, t},
     alpha = s/h^2;
     t = Exp[Range[-a, a, h]];
     tk = t^(0.5 (k - 1));
     u = fconst = tk*Clip[1 - t, {0, 1}];
     fparts = alpha/kk Exp[0.5 (k - 1) a + 0.25 (k + 1)^2 Range[0, mm - 1] s];
     Do[u =
       (#2 + (# - #2)*UnitStep[# - #2] &)[
        Join[{alpha*u[[2]] - (2 alpha - 1) u[[1]] + fparts[[j]]},
         alpha (u[[3 ;; nn + 1]] + u[[;; nn - 1]]) - (2 alpha - 1)*u[[2 ;; nn]],
         {0}], fconst*Exp[1/4 (k + 1)^2*(j - 1)*s]], {j, mm}];
     kk {t, Exp[-0.25 (k + 1)^2 tt0] u/tk}]]];
```

Here are the timings on my 6-core 2.8 GHz AMD Phenom (without parallelization):

```mathematica
In[12]:= (testOld =
     Table[americanPutCompiledAlt[i, 0.05, 0.14, 1], {i, 1000}]); // AbsoluteTiming
Out[12]= {19.2968750, Null}

In[13]:= (testNew =
     Table[americanPutCompiledAltNew[i, 0.05, 0.14, 1], {i, 1000}]); // AbsoluteTiming
Out[13]= {17.3593750, Null}

In[14]:= testOld == testNew
Out[14]= True
```

Now I parallelize the new version (it does not matter much whether new or old):

```mathematica
In[15]:= DistributeDefinitions[americanPutCompiledAltNew];

In[16]:= (ptestNew =
     ParallelTable[americanPutCompiledAltNew[i, 0.05, 0.14, 1], {i, 1000}]); // AbsoluteTiming
Out[16]= {3.5000000, Null}

In[17]:= ptestNew == testNew
Out[17]= True
```

I am positive that Mathematica somehow used all my 6 kernels (it said it was launching 6 kernels, and the timing also suggests it - I did not get any errors either). I don't know how this squares with the 4-kernels-per-license rule - perhaps 6-core processors are still very new and M did not account for them. Anyway, this suggests I'd get 2.5 seconds if I had 8 kernels as you do. Now, you said that my original code takes ~36 seconds on your machine. OK, let us assume then that your machine is half as fast as mine, so the parallelized code would take about 5 seconds there (you can easily check that). You had 0.18 seconds, so here we go: roughly a 25× speed-up, which is not the 196× speed-up you mentioned. Where did you get your numbers from? Now, a 25× difference between heavily optimized M code and heavily optimized F# is normal - this is what I would expect from a high-level tool like M from the start. But 200 times or 1000 times seems a total misrepresentation. Or did you not parallelize your version (I think you did)?
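Leonid's back-of-envelope arithmetic can be spelled out in a few lines (all input numbers are from his comment; the linear 6→8 kernel scaling and the 2× machine factor are his assumptions, not measurements):

```python
serial_6core = 17.36     # s: Table timing on his 6-core box (In[13])
parallel_6core = 3.5     # s: ParallelTable over 6 kernels (In[16])

est_8_kernels = parallel_6core * 6 / 8   # ~2.6 s if 8 kernels scaled linearly
est_jons_box = est_8_kernels * 2         # Jon's machine assumed half as fast: ~5 s
speedup = est_jons_box / 0.18            # vs Jon's 0.18 s parallel F# timing
```

This lands the estimated gap in the 25-30× range he quotes, rather than 196×.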

@Leonid:

This blog post compared my latest parallel F# code with the original (serial) Mathematica code from Sal Mangano's book and found the F# to be 960× faster in this case.

Comparing my latest F# code with your latest parallel Mathematica code, I still get a 45× performance difference (0.186s vs 8.4s). Moreover, the Mathematica does not see linear speedup from more cores on this machine, so I cannot extrapolate how quickly the Mathematica might run if it worked correctly.

Interesting that you do not see this bug I'm seeing with Mathematica 7.0.1 trying to spawn 8 kernels and failing on half of them. Perhaps the difference is that this is a 2-socket machine containing two quadcore CPUs?

Anyway, this is still a huge performance difference in the context of a performance-critical application, and the code Sal presented in his book is far slower again. This example shows Mathematica in a poor light by playing to its weaknesses instead of its strengths.

@Jon,

Part I

Jon, since you can only use 4 cores out of 8, my guess would be that, should it be possible to use 8 cores, you'd get the 5 seconds I estimated instead of 8.4, which would give the 25-30× speedup that I estimated before.

Your guess about 2-socket vs 1-socket machines sounds right on target.

Regarding general performance, I both agree and disagree with what you posted. I agree that a performance difference of the order of 30 is significant. But I think that for really performance-critical tasks, those who implement them will know whether or not a given performance level is sufficient in the context of a specific problem. When people choose Mathematica, it is mostly for research and rapid prototyping, not for building performance-critical production systems entirely in it. For a prototype, I think the timings of the tuned M code are decent. OTOH, performance bottlenecks can be dealt with by calling routines written in more performant languages.

Perhaps the code in the book could have been tuned more (we now know that it is possible). We already agreed with Andreas that the improved version will appear on the book site (I think Sal would not mind). But the point of the original code in the book was probably not to present the ultimately tuned code full of tricks but to show which rather obvious optimizations can be done without spending too much time and effort. And let's give it some credit - without compiling and partially vectorizing the problem (both already done in the original code) it would be perhaps another 100 times slower or so, and then it would indeed be pretty much unusable.

@Jon,

Part II

I think it is fair to say that best-written M code will on average be 15-30 times slower than best-written F# code solving the same problem. Sometimes (much) more, sometimes less. I think that, at least on this example, I managed to prove it to you and anyone else interested. Whether or not this speed difference is acceptable depends on the context of the problem. And it only makes sense to compare best-written implementations if you want to draw conclusions about languages. I don't think it is fair to put numbers like 960 in the heading and say that this is F# vs M - this is one particular optimized and parallelized implementation in F# vs one particular non-parallelized and not fully optimized implementation in M, no more, no less. While you do say that, you tend to draw more general conclusions about F# vs M on the basis of this not quite fair comparison.

A speed difference of the order of 1000, when mentioned in a generic statement, would pretty much mean that the slower language is just a toy. A more realistic speed difference of the order of 30 means that it is a decent prototype-building tool. To make such generic statements, though, you need a pretty large sample of cases where best-written F# code and best-written M code differ 1000 times in speed. And that should include *most* important problems, not just a few special cases like a ray tracer. Frankly, I don't think you will have that sample any time soon, since I don't think that for most important problems (and by important I mean needed in real-world production systems with the best performance possible) the difference between best-written F# and M code is so dramatic.

Finally, I also don't understand why there should be a single winner, or a loser for that matter. I think you have demonstrated quite convincingly that translating M code into F# can give a major speed-up, while the code structure and used idioms are not very different. While niches for M and F# do overlap, I don't see them as direct competitors. M is well suited for explorations and estimates, I think more so than F# since strong typing gets in the way (the thinking mode is different) when you don't have a precise idea what you want, plus M provides a huge functionality out of the box to play with. What I am taking home from this discussion is that a hybrid M/F# development can be quite a thing. So, IMO discussions with a title *M and F#* would be far more productive and beneficial for everyone (including you) than *M vs F#*. Both tools have their strengths, why not concentrate on how one can use them together productively? Looks like this is a largely unexplored territory.

@Leonid: I do not understand how you can come to those conclusions given the evidence that we have accumulated here.

Without compilation, the original vectorized version is 5,000× slower than my F#. With compilation (and the optimization efforts of two experts), it was 960× slower. After one round of your own expert attention, that fell to 196× slower. Finally, with the combined efforts of three experts over the best part of a week, the fastest Mathematica code still runs 45× slower than my F# on this machine.

In contrast, a completely naive translation of the original Mathematica code to F# was immediately only 38× slower than the fastest (F#) solution. So the best Mathematica performance was beaten in minutes with F# without having to do any optimization at all.

So a little expertise gets you 1,000× slowdown and a lot of expertise gets you 40× slowdown. Firstly, if Mathematica is for the non-programmer then we see immediately that it will be a toy for them because they have no hope of attaining adequate performance. Secondly, if Mathematica is for rapid prototyping then we can see immediately that it will either bog the programmer down with appalling run times or demand a herculean effort to obtain merely awful performance.

So I cannot see how anyone can take Mathematica seriously but I do still love it for "edutainment".

Finally, you say that I should concentrate on marrying F# and Mathematica. Ironically, that is how this all began: I described how you can interoperate between F# and Mathematica in my book F# for Scientists, citing Mathematica's awful performance as a strong motivation for doing so. Sal Mangano took issue with my claim that F# has huge performance benefits over Mathematica, and that motivated me to draw this comparison with his own optimized Mathematica code. Suffice it to say, I think this case study has clearly vindicated my statements!

Jon, Leonid -- I'd be very curious to see the new Mathematica 8 compilation options in use to see how it stacks against F# now.

@Alexy: I'd expect to see huge performance gains from Mathematica 8's new JIT technology. Not least because Wolfram Research bought Mathematica-specific JIT know-how from us back in 2004... ;-)
