Friday, 14 May 2010

Java vs F#

Dr Cliff Click of Azul Systems, specialists in manycore JVM systems, recently published a blog post about the performance of Java compared primarily to C and C++ but also discussing C# and .NET. Three of Cliff's comments are of particular interest:

Under the heading "Places where C/C++ beats Java for obvious reasons":

"Value Types, such as a 'Complex' type require a full object in Java." - Dr Cliff Click

What Cliff forgot to mention is that .NET also provides value types and a far more compelling example than complex numbers is the humble hash table.

Consider the task of filling a hash table with 10,000,000 associations from integers to single-precision floating point numbers. This may be accomplished in Java as follows:

package hashtablebenchmark; import java.util.HashMap; public class Main { public static void main(String[] args) { int n = 10000000; for (int j=0; j<10; ++j) { long startTime = System.currentTimeMillis(); HashMap hashtable = new HashMap(n); for(int i=1; i<=n; ++i) { hashtable.put(i, 1.0f / i); } System.out.println("m[100] = " + hashtable.get(100)); long time = System.currentTimeMillis() - startTime; System.out.println("Took: " + time / 1e3 + "s"); } } }

The equivalent program in F# is not only shorter but also 17× faster:

let n = 10000000 let m = System.Collections.Generic.Dictionary(n) for i=1 to n do m.[i] <- 1.0f / float32 i printf "m[100] = %f\n" m.[100]

Specifically, Java takes 6.967s initially and 5.733s steady state whereas F# takes only 0.414s.

In fact, F# completes this benchmark so quickly that we would like to give it more work but Java cannot handle any more work without running out of memory on this 4Gb machine.

Elsewhere, Cliff also writes of Java:

"Very Good Multi-Threading Support. Parallel programming is just easier in Java." - Dr Cliff Click

and later:

"Not that I track C# all that closely but... I believe the JIT produces substantially slower code than Java" - Dr Cliff Click

Allow us to demonstrate the otherwise. The Computer Language Shootout contains a well-formed benchmark called spectral-norm for which the fastest Java solution is a 173-line parallel program. This implementation may be rewritten in only 24 lines of F#:

let A i j = 1.0 / float((i + j) * (i + j + 1) / 2 + i + 1) let inline mul A (u: _ []) (v: _ []) = System.Threading.Tasks.Parallel.For(0, v.Length, fun i -> let mutable vi = 0.0 for j = 0 to v.Length - 1 do vi <- vi + A i j * u.[j] v.[i] <- vi) |> ignore let AtAu u v = let w = Array.create (Array.length u) 0.0 mul (fun i j -> A i j) u w mul (fun i j -> A j i) w v do let n = 5500 let u, v = Array.create n 1.0, Array.create n 0.0 for i = 0 to 9 do AtAu u v AtAu v u let u, v = vector u, vector v printf "%0.9f\n" (sqrt(Vector.dot u v / Vector.dot v v))

In the Java, dozens of lines of code were devoted to parallelism. In contrast, only two lines of code were altered to parallelize the F#. So it would seem that parallel programming is not "just easier in Java".

The the serial Java takes 12.722s initially and 12.299s steady state whereas F# takes only 12.18s from a cold start. On this 8-core 2xE5405 2.0GHz Xeon, the parallel Java takes 1.839s initially and 1.820s steady state whereas the parallel F# takes only 1.60s from a cold start. The fact that Java is slower in every single case indicates that the CLR's JIT is not "producing substantially slower code than Java".

Finally, Cliff made no mention of two other design deficiencies that cripple Java's performance. Firstly, their type erasure approach to generics incurs massive performance degradation for a lot of generic code because it leads to unnecessary boxing. Secondly, the JVM's lack of tail call elimination is not only a growing hindrance in the era of functional programming but the only general-purpose workaround, trampolines, is typically 10× slower that necessary.

If you want to learn how to write fast parallel programs in F#, read our book Visual F# 2010 for Technical Computing.

29 comments:

Yin said...
This comment has been removed by the author.
Jan said...

As a fan of FP I have to agree it's a pity jvm doesn't have tail recursion yet. Otherwise you're completely wrong in this post!

1) equivalent java to your F# is actually half the size:
public class Main {
public static void main(String[] args) {
int n = 10000000;
java.util.HashMap hashtable = new java.util.HashMap(n);
for(int i=1; i<=n; ++i) {
hashtable.put(i, 1.0f / i);
}
System.out.println("m[100] = " + hashtable.get(100));
}
}

note that for interop with C# you would have to put F# code into class as well - that makes it the same number of lines (plus the two closing brackets in java)

2) let's compare apples to apples!
either:
a) use a HASHMAP instead of DICTIONARY into your F# and see your GC goes nuts.
or
b) google "java hashmap primitives" - it gives you http://b010.blogspot.com/2009/05/speed-comparison-of-1-javas-built-in.html
use colt - a drop in replacement for you're java code. Just replace java.util.HashMap with cern.colt.map.OpenIntDoubleHashMap.

That's all!!!! On my machine it goes down from about 5s to 0.3s on average!!!! (ubuntu, 64 bit jvm)

No more full gc trying to get a rid of exploding number of boxed objects :-)

If I need more performance I can get it much faster with a single one-liner but that's for another blog post.


If someone claim that java is slow - it means it's A LIE! It is
a) an intention
or
b) the author has no idea what he's talking about

Flying Frog Consultancy Ltd. said...

@Jan: Actually even with your shorter Java and making the F# interoperable with C# (for no reason) and ignoring the closing braces in the Java, F# still comes out one line shorter because there is no main function.

Crippling the F# implementation by using the hash table implementation from .NET 1.0 is not an "apples to apples comparison".

You can Google for elaborate workarounds from third parties to these design flaws in the JVM one special case at a time and bugger about with CLASSPATH trying to install them but you've lost genericity and can never fully recover the efficiency lost through this flaw in the VM. For example, all of the workarounds you cited still use 4× as much memory as the F#.

Eric Hauser said...

As far as multi-threaded code being easier in Java vs. C#, I think your argument would be enhanced by comparing some fork-join code in .NET vs Java. This is an area where the lack of closures really hurts Java readability and mainability.

Jan said...

@Frog: I've read a few post on your blog. They're good. You should talk about F# but not java. That's not your area of expertise.

There's nothing wrong with JVM, it's so robust and powerful ... you have no idea ...

There will never be all the libraries in JRE itself. If you put them in people will complain about the size - tday, .NET 4.0 is 48 MB, JRE 6.0 is 16MB. See the difference?

It's solid baseline. Need more for your problem? Google is here for answer.Use 3rd party lib in my project ... I do that all day long ... I'm building on shloulders of giant ... it takes seconds to add new dependency ... ever heard of maven?

concerning memory, you're right, the implementation I send you takes a lot of memory ... it's open source we can peak inside, but why would we? just take the other one (gnu throve) and you have memory consumption:
gnu.trove.map.hash.TIntFloatHashMap - 186MB
well, int is 4 bytes, float is 4 bytes - that makes it 80MB data written to the map and every map has to notably bigger than data it contains so it maintain reasonable number of collisions. So 180 MB is fair enough for me. How much is your F#?

BTW: There is yet another drop-in replacement for you: com.carrotsearch.hppc.IntFloatOpenHashMap from http://labs.carrotsearch.com/hppc-download.html

BTW: Collections are a great deal. There are apache commons collections, google collections just to name a few. There are thousands of maps implemented. Sometimes you need one with primitives, sometimes you need a concurrent one, sparse one, bidirectional, just name it! Each one is good for something else :-)

There's nothing I know of where jvm is slow. JVM is speed equivalent to C. Sometimes slower, sometimes faster. Slower on average. True. But say 0% -30% depending on the problem you're solving. If it's twice slower, you're doing it wrong.

Flying Frog Consultancy Ltd. said...

@Jan:

"There will never be all the libraries in JRE itself". Libraries are not a substitute for VM features like value types.

"If you put them in people will complain about the size - tday, .NET 4.0 is 48 MB, JRE 6.0 is 16MB. See the difference?". The difference is <1% of the RAM in an ordinary desktop computer.

"How much is your F#?". F# offers the best of both worlds out of the box: the memory footprint of the smallest Java and the speed of the fastest Java solution.

"There's nothing I know of where jvm is slow. JVM is speed equivalent to C. Sometimes slower, sometimes faster. Slower on average. True. But say 0% -30% depending on the problem you're solving.". That is contrary to almost every benchmark ever done. Even if you give Java every advantage with a warm start and plenty of opportunity for its profiling-based optimizations to kick in, it is still 2x slower than C on 5 of the 7 shootout benchmarks. Java is 95% slower than C++ on this ray tracer benchmark. Perhaps the most obvious example is the FFI, where Java is orders of magnitude slower than C. And so on...

"There's nothing wrong with JVM". There is plenty of room for improvement.

Jan said...

@Frog

The size of installation has nothing to do with RAM, it's about downloading - what the memory usage will be depend on program you run.

There's no reason for anything on JVM being any slower than .NET and just a little slower C. I'd really love to see a prove. I've never seen one.

There's no code for ray-tracer on the two links you posted.

I took "chameneos-redux", because it is claimed to be 7x slower in java. I've downloaded the code, run it - 1.7s for c, 0.19s for java! The java code gives correct result whereas C doesn't ... that's about the quality of benchmarks.

Just for fun I gave a try another one - spectral norm. 1.67s for java (the last iteration with 5500 argument); 1.61s for gcc. (Phenom II 4x3.1GHz - java 1.6.0_20 vs gcc 4.4 and 4.5 on 64bit linux) The implementation might be correct as they both printed the same result :-) I have no idea where "2x slower in java" on the web came from.

BTW: I haven't read the code in either language - maybe both of them can be optimized.

Room for JVM improvement? Sure. There always will be improvements. I just doubt you have any idea what it's capable of already: let's see ... make at most one line change/addition to your naive java code and make it run well bellow 0.5s per cycle. It may not be a nice solution, but it has to be rock solid - yep, that's right even your bad code can run on jvm your "F# speed" - a trivial task for java developers I know ;-)

Flying Frog Consultancy Ltd. said...

@Jan:

"There's no code for ray-tracer on the two links you posted". The code is here.

"It may not be a nice solution". Indeed, it isn't even a solution: you are still using over 4× as much memory as F# and you have lost the ability to write generic code. For example, F# provides a generic groupBy function that uses a hash table to accumulate elements from a sequence that share the same key. Even with your Java solution, you would have to rewrite the groupBy function for each and every separate type-specialized hash table implementation you downloaded from the web. If you want to store complex numbers or 3D vectors in your hash table then you need to use a completely different hash table implementation and reimplement all of your standard library functions over them. Moreover, your non-generic solution is not composable. What if you want to implement a trie out of hash tables? You will not find a pre-existing hash table implementation specialized over your trie data structure because you haven't written it yet. Maybe you think you can get by trying to work like that but there is no merit in pretending this is anything but another crippling deficiency in the JVM.

In effect, generics and value types in the CLR automate this process so you reap the same rewards using the built-in Dictionary.

Following Eric Hauser's advice here: perhaps we should benchmark parallel quicksort. I have an implementation in F# that is a few lines long but I cannot find a Java implementation.

Jan said...

@Frog:

You seem to misunderstand the distinction between Java and JVM. You mix the two every other sentence. Java is a language and JVM is a virtual machine - similar C# vs. CLR.

JVM is not crippled, so called "programmers" are. JVM can use relevant Collections even when its not bundled within JRE - just add a dependency.

JVM has tons of great features including freaking fast JIT. It's C/C++ speed equivalent within say 30%. Anytime you say JVM is twice slower or takes twice the memory you LIE or you do a tiny benchmark (below 1s, below 10M heap) or you're using specialized hw (not CPU). The feature for missing in JMV so far is missing tail call optimization within JIT. On the other hand it has what other can only dream about :-)

More specialization on top of any collection? Adding features to a class? Isn't there inheritance to do that?

There are also other languges like Scala (or Clojure) to get functional programming (and ton of other features) while keeping performance and robustness of JVM. Concise quick-sort? Composability? F# level in those languages. C# level in Java itself. None of those are any match to Haskell in composability, true.

Specialized hash map for complex data types? Just use the standard java.util.hashmap. Object boxing cost get amortized with growing object (and algorithm) complexity.

ray-tracing (on Phenom II 4x3.1GHz, ubuntu 10.04, jre 1.6.0_20, gcc 4.5):
g++-4.5 -O3 -ffast-math ray.cpp -o ray && time ./ray 9 512 >/dev/null
real 0m2.137s
javac ray.java && time java -Xms64m ray 9 512 >/dev/null
real 0m2.837s

Java 30% slower than latest gcc in this case.

We're talking about performance - so I tested the "Low-level optimizations" version. No idea where you took your "95%". I made no modifications with a single exception of adding "#include " to the cpp version. Where you get your numbers? Is it something to be found in "for scientists" book? :-)

"a generic groupBy function that uses a hash table" - where is a goupBy? There's no groupBy in map http://msdn.microsoft.com/en-us/library/ee353880%28v=VS.100%29.aspx
MSDN gives groupBy from linq or sequences. Do you mean Seq.groupBy. It takes and returns sequences! Not maps! Does it have a constant time lookup? Isn't that sole reason for hashmap to exists? If not than why would I put it into hashmap? It does not belong there at the first place!

Flying Frog Consultancy Ltd. said...

@Jan:

"More specialization on top of any collection? Adding features to a class? Isn't there inheritance to do that?". You sacrificed that ability when you introduced incompatible classes to work around the JVM's performance deficiencies.

"Specialized hash map for complex data types? Just use the standard java.util.hashmap. Object boxing cost get amortized with growing object (and algorithm) complexity". The original article already disproved this: the JVM was 17× slower precisely because boxing was not amortized.

"No idea where you took your '95%'". The exact hardware and software specs are given on the site. This 2x 2.1GHz 2352 quadcore Opterions with JDK 1.6 and g++ 4.4 gets 5.55s vs 3.59s, respectively. So Java is still over 50% slower.

"Do you mean Seq.groupBy. It takes and returns sequences! Not maps! Does it have a constant time lookup? Isn't that sole reason for hashmap to exists? If not than why would I put it into hashmap? It does not belong there at the first place!". The groupBy function uses a hash table internally and it is generic over the type of element. An equivalent generic function on the JVM would be vastly slower. Your "solution" of hand-rolling data structures for every combination of generic types used requires massive code duplication and only solves either performance or memory consumption but not both at the same time.

FWIW, I asked for parallel quicksort implementations in Java here and benchmarked them again a simple F# implementation. Even the type-specialized Java was almost 2× slower than the fully generic F#: 11.1s vs 5.5s.

Jan said...

@Frog:

When I need to write groupBy in java then yes, I have to make several specialized implementations to threat primitives without boxing and also generic objects. Java the language is not as feature rich as others.

Rectification of generics and references on the VM level would be nice. It will probably happen in future. However there's no problem for a language on top of JVM to automatically generate several implementations from the same source. They can do it today, they could do it a decade ago.

You still don't seem to understand the difference between a language and virtual machine :(

The more complex your object is the less you lose by boxing it. If you box object with two floats (a complex number) it takes the same memory as with one Float itself!!! You can also decrease it on 64-bit JVM by compressing pointers (unless you need more than 32GB heap) - the object "overhead" goes from 16 to 8 bytes per object (always 8B on 32bit JVM). Of course the right thing to do is to use specialized collection for primitives for large collections.

Concerning the speed - I make one line addition to your naive code, run it and it will run your "F# speed". The solution is to say "hey garbage collector, this code was written by lame, so please don't use settings for normal people here". That's it.

The biggest problem of Java by far is - it's so simple language that everyone believes he's the master of programming universe after four pages of "java for dummies". Sadly, it's not so.

50% on ? What have you used? JRE 1.6 first version is from 2006? Use newer one (current is update 20) and use 64bit if you're running 64bit machine.

BTW: I' confused - what have you "measured" 50 or 95? Was it ancient java vs. modern c? That way you can cook up your "benchmarks" even 100x worse for java. Just take java 1.0 or 1.1 - an infant implementation before hotspot (the JIT) came. That way you can "prove" java is 1000x slower than pretty much anything :-)

What's the code you used in the end for your quicksort "benchmark"? What have you compared? Code + arguments for both, please. Let me see it, let me run it... knowing hotspot - it optimize operations with arrays of primitives better gcc 4.4, threading is perfect - there's simply no way hotspot being outperformed on x86 (32 and 64 bit) by anything but GCC 4.5, probably ICC and maybe VS2010. The difference for those will fit well in the 30% range I said.

Eric Mariacher said...

From my point of view, I think it's wrong to compare F# with Java.

You should compare C# with Java and
F# with Scala.

C# and Java are both imperative programming languages while F# and Scala are both Functional programming languages.

C# and F# are running on .NET while
Java and Scala are running on the JVM.

C# and F# are "compatible" between themselves.

Java and Scala are "compatible" between themselves.

I guess in your example Scala is as concise as F#.

Flying Frog Consultancy Ltd. said...

@Eric Mariacher:

There is certainly some merit in comparing functional languages with functional languages but F# is a first-class fully-supported product from Microsoft now so its only equivalent on the JVM is, in that sense, Java.

The shootout actually already contains an 80-line Scala solution that is slower than the Java (over 3× slower than Java in one case!). So it looks like even Scala is over 3× more verbose and still slower than F#.

Flying Frog Consultancy Ltd. said...

@Jan:

"Rectification of generics and references on the VM level would be nice. It will probably happen in future". From what I hear, value types are a high priority but generics will never be fixed because it would require breaking changes to reflection.

"However there's no problem for a language on top of JVM to automatically generate several implementations from the same source. They can do it today, they could do it a decade ago". They can but they do not in general because you have to sacrifice a lot (e.g. incremental compilation and DLLs because it is a whole-program analysis, and interoperability because you are auto-generating incompatible classes). Hence knucleotide is 7× slower in Scala than C++.

"What have you used? JRE 1.6 first version is from 2006? Use newer one (current is update 20) and use 64bit if you're running 64bit machine.". This is 20.

"I' confused - what have you "measured" 50 or 95?". 50% on one machine and 95% on another.

"What's the code you used in the end for your quicksort "benchmark"?". I'll post my F# code in a new blog post.

Anonymous said...

If I can run F# in Linux with all supporting libs and tools, I will use it while FP programming shines.

Pushing one idea without considering its side effect or limits, is not what a good consultant should do to the customer.

Anonymous said...

I don't use Java and will certainly use F#. I think if you want to use Java feel free. I intend to use F# for functional programming in Microsoft Business environment. These tools are just that not religions.
I like Erlang (working on a use but it seems that Amazon has figured out one (DB), VB.NET R for stats and whatever else fits the bill for the Docs(PHD Types) is Clojure javas functional equivelent?

Bionic

Anonymous said...

As much as I think that @Jan is a senseless Java fanboy, for LOC count you should really compare Java to C#. The speed difference would probably be even wider as the C# compiler is optimised better than the F# one.

Anonymous said...

Parallel programming? I think you meant concurrent programming, using multithreading does not make program
parallel my Dr. Cliff , because every parallel program is concurrent but every concurrent is not parallel, the thing is OS is scheduling processes on cores not threads so in your test it does not matter if you use 8 cores processes one thread at a time will be executing at a core while the other threads will be waiting although you have 7 core free to use. 2 threads of one process CAN'T
executing on two cores at the same time. only a few OS do that and i think you didn't use one that schedule threads. so if the result are not reliable.

Anonymous said...

Perhaps I'm stupid but doesn't lazy evaluation mean your test is worthless ?

Are you sure F# actually does the work.

Jon Harrop said...

@Anonymous "Perhaps I'm stupid but doesn't lazy evaluation mean your test is worthless?". There is no lazy evaluation anywhere in either solution.

Unknown said...

Huh? Only a very inexperienced Java developer would use the default HashMap when values are primitives -- of course additional boxing will add significant memory and performance overhead.

Test should use one of multiple available Primitive Collections libraries (Colt, Guava, just to name two), and use one with double values.

MovGP0 said...

The book looks very interesting. Any plans to release it for Amazon Kindle? The paper book is not portable and expensive, specially when i need to pay for international shipping...

MovGP0 said...

what you probably wanted to write is "Every parallel program is concurrent, but not every concurrent program is parallel."

By putting the 'not' in the wrong place of the sentence, you make a logical mistake.

Brad Phelan said...

I guess the point is that only a very inexperienced Java programmer would make the mistake of using the default hash map. An inexperienced F# programmer would get the right thing automatically.

Jon Harrop said...

"Only a very inexperienced Java developer would use the default HashMap when values are primitives". You're restricting consideration to the JVM's primitive types. .NET has no such limitation.

"Test should use one of multiple available Primitive Collections libraries (Colt, Guava, just to name two), and use one with double values". Then it isn't a generic solution whereas the F# is.

"The book looks very interesting. Any plans to release it for Amazon Kindle? The paper book is not portable and expensive, specially when i need to pay for international shipping". We ship to anywhere in the world for free.

"I guess the point is that only a very inexperienced Java programmer would make the mistake of using the default hash map. An inexperienced F# programmer would get the right thing automatically". More importantly, .NET can express and efficient generic solution where as the JVM cannot because it lacks value types.

Jem said...

> F# is a first-class fully-supported product from Microsoft now so its only equivalent on the JVM is, in that sense, Java

That's is a non-argument. JVM bytecode is fully supported by Oracle and the Scala compiler is fully supported by Typesafe.

> So it looks like even Scala is over 3× more verbose

That sounds completely wrong. I haven't looked yet, but there's no way, *no way at all* anything but the worst-written Scala example could be more verbose than Java.

Jon Harrop said...

"the Scala compiler is fully supported by Typesafe". It is today but it wasn't when I wrote that comment 3 years ago.

"*no way at all* anything but the worst-written Scala example could be more verbose than Java". I wrote "than F#" not "than Java". The Scala was more verbose than the F#.

mrchief said...

Slow or not, I would stay away from Java just to reduce my exposure to potential viruses. Uninstall, Java/Flash and your machine becomes substantially safe. And in PC world, you can survive without JRE or Flash on your desktop.

http://securitywatch.pcmag.com/none/302019-security-warning-disable-java-now

Ken McWilliams said...

@mrchief That is ridiculous. If we're going to be asinine you can survive perfectly well without any Microsoft or proprietary code too!

The security issues people face come from the complexity of using a systems language in the browser.

If anything Java has had more security focus than any other language to date. Take your favorite systems language and let it execute code int the browser... see how many security issues pop up!