Although R and Python are the dominant languages for developing AI and data science applications, it turns out that Java is one of the most promising alternatives, particularly for big data applications. Java has better integration into the most popular stream processing tools. In addition, the core capabilities for doing data science directly in Java have been evolving as well.
“While Java is not very popularly used in data science, it is much closer to the data science world as compared to the rest of the languages except for Python and R,” said Sachin Gupta, CEO of HackerEarth, a developer hiring service. Big data and data science go together, and major big data frameworks such as Hadoop, Hive and Spark all run on the JVM. The release of Java 8 also introduced lambdas, which simplify the development of data science projects. Java is also one of the most established programming languages and is used in enterprise systems all over the world.
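As a rough illustration of how lambdas help with everyday data preparation, here is a minimal sketch that uses only the standard Java 8 stream API. The class name, method names and sample values are hypothetical and chosen for illustration; they do not come from any of the frameworks mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class LambdaStats {

    // Mean of the values strictly greater than the threshold; 0.0 if none qualify.
    public static double meanAbove(List<Double> values, double threshold) {
        return values.stream()
                .filter(v -> v > threshold)          // lambda as a row filter
                .mapToDouble(Double::doubleValue)    // unbox to a DoubleStream
                .average()
                .orElse(0.0);
    }

    // Rescale values linearly to the range [0, 1]; all zeros if every value is equal.
    public static List<Double> minMaxScale(List<Double> values) {
        double min = values.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
        double max = values.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        double range = max - min;
        return values.stream()
                .map(v -> range == 0.0 ? 0.0 : (v - min) / range)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> readings = Arrays.asList(2.0, 4.0, 6.0, 8.0);
        System.out.println(meanAbove(readings, 3.0)); // prints 6.0
        System.out.println(minMaxScale(readings));
    }
}
```

Before Java 8, each of these steps would typically have needed an explicit loop or an anonymous inner class, which is the simplification the lambda syntax provides.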
Java has been incredibly successful in the space of server-side applications. Its ability to scale to support global workloads (it is used by the likes of Netflix, Apple and eBay to run many of their services) is only one reason for its popularity in this space. “This success may, however, have worked against the widespread adoption of Java for data science applications,” said Simon Ritter, Deputy CTO of Azul Systems, which develops Java runtimes. “Developers in this field made the incorrect assumption that given Java was used for enterprise applications, it wasn’t suited to data science and they looked elsewhere.”
Ritter points to the success of many of the data science platforms that have been built on Java, including SAS, MATLAB, KNIME and RapidMiner. Other platforms like Alteryx readily integrate with Java. “Java has always been there behind the scenes, just not front-and-center,” Ritter said.
Advantages of Java
There are several benefits to using Java for data science, including ease of hiring talent, extensive availability of libraries, the power of the JVM, a promising roadmap and ease of integration. Java has one of the largest communities of skilled developers. This makes hiring easier than finding developers with in-depth knowledge of R for data science projects, said Ritter. It also has one of the richest ecosystems of development tools, and cloud providers all treat Java as a first-class platform.
Java also includes libraries and frameworks for every conceivable area of application development. This includes free and commercial libraries for data science and machine learning, along with tools that simplify other areas of application development.
The JVM also includes a managed runtime environment, which can free developers from the burden of memory management. Ritter said that using just-in-time compilation to generate native code can result in significant performance improvements over interpreted languages like Python and R or statically compiled languages like C or C++. Future developments like Project Panama under the OpenJDK could make it easier for developers to scale apps across GPUs and CPUs.
Most of the popular big data frameworks, such as Apache Spark and Apache Hadoop, also run on the JVM, which makes it easier to get started with them and to integrate them. “If you’re going to integrate your data processing with existing enterprise systems written in Java, using the same stack will help you minimize compatibility issues and context switching,” said Maria Khalusova, product manager of Kotlin for Data Science at JetBrains.
Although using Java for data science or developing AI models has many advantages, it is not a cakewalk. It can take longer for newer data scientists to get up to speed on Java, data science APIs can add more complexity for Java, and some tasks like writing notebooks can be more of a chore.
One of the attractions of Java is its readability as a programming language, which can be an advantage for projects with large teams and thousands of lines of code. But this can also be a disadvantage, since it takes longer to become proficient in Java than in languages like Python that use a simpler yet less readable syntax, Ritter argued.
Another concern is that some of the most powerful libraries for data science only support Java through a set of API wrappers. “This can add some additional complexity to the initial setup of a Java environment for developing data science applications,” Ritter said.
Java does not have native support for many types of data frame operations, and as a result developers are forced to use Apache Spark or other third-party libraries, said Vivek Ravisankar, CEO and co-founder of HackerRank, a developer hiring service. Another concern is that Java does not have native support for a REPL, which makes it harder to work with Jupyter notebooks. Although Scala and Kotlin do support a REPL, Python tends to work better with notebooks, Ravisankar said.
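To make the data frame gap concrete, consider a group-by followed by a mean, which is a single call in most data frame libraries. The sketch below shows one way to express it with nothing but the JDK stream collectors. The Sale row type, its fields and the sample figures are hypothetical, and this is only an illustration of the extra wiring involved, not a substitute for a real data frame library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example class, for illustration only.
class GroupByDemo {

    // Hypothetical row type; a data frame library would infer columns instead.
    static final class Sale {
        private final String region;
        private final double amount;

        Sale(String region, double amount) {
            this.region = region;
            this.amount = amount;
        }

        String getRegion() { return region; }
        double getAmount() { return amount; }
    }

    // Equivalent of "group by region, take the mean of amount".
    // A TreeMap keeps the regions in sorted order for stable output.
    public static Map<String, Double> meanAmountByRegion(List<Sale> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                Sale::getRegion,
                TreeMap::new,
                Collectors.averagingDouble(Sale::getAmount)));
    }

    public static void main(String[] args) {
        List<Sale> rows = Arrays.asList(
                new Sale("east", 10.0), new Sale("west", 30.0),
                new Sale("east", 20.0), new Sale("west", 50.0));
        System.out.println(meanAmountByRegion(rows)); // prints {east=15.0, west=40.0}
    }
}
```

The aggregation itself is short, but the row class, the collector wiring and any further steps such as joins or pivots all have to be written by hand, which is why teams tend to reach for Spark or another third-party library.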
Khalusova said that another difficulty of using Java for data science is the lack of data visualization libraries. It’s also more challenging to discover the appropriate data science library in Java. “While there are many libraries available at your disposal, it can be difficult to find the ones you need,” she said.
Programming the future of AI
Java clearly has a role in programming the data pipelines required for AI. But the jury is still out on whether or not it will be used to code the AI models themselves.
“I don’t see Java evolving toward more data science use,” said Ravisankar. “JVM languages, like Kotlin and Scala, are more suitable for data science workloads, but it’s unlikely they will replace Python, simply due to the learning curve.” He believes that a language like Julia has a better chance at being used for high-performance ML workloads, given its dynamism and speed.
In contrast, Khalusova expects to see JVM languages used more and more for deep learning in the near future. She notes that the JVM community appears to be the most active in this area of data science. For example, Deeplearning4j is gaining popularity, TensorFlow has a Java API, PyTorch has released an experimental Java API, and there’s a new deep learning framework called DJL (Deep Java Library). “Once those frameworks stabilize, I expect libraries with classical ML algorithms to follow,” she said.