Moving from Storm to Spark Streaming: Real-life Results and Analysis

Last Page: Environment for a Real-Life Head-to-Head Comparison

Storm vs. Spark Streaming:
Developing With Each

TL;DR: Our developers like building with Spark better, even though we had to build more plumbing for robust CEP.

Developing With Spark

The Spark library has more tools, especially if you are in the analytics business, as we are. You can use MLlib for outlier detection. You can use SparkSQL to interrogate the data in your RDDs as if it were in a database. Spark DataFrames are very useful for window-based analysis (and windowing itself is a benefit of micro-batch processing).
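To make the windowing point concrete, here is a minimal plain-Python sketch (no Spark required, all names hypothetical) of why micro-batches lend themselves to window-based analysis: because data already arrives in discrete batches, a sliding window is just the last N batches.

```python
from collections import deque

def sliding_window_means(batches, window_size):
    """Compute the mean over a sliding window of the most recent
    `window_size` micro-batches, emitting one value per batch."""
    window = deque(maxlen=window_size)  # drops the oldest batch automatically
    means = []
    for batch in batches:
        window.append(batch)
        values = [v for b in window for v in b]
        means.append(sum(values) / len(values))
    return means

# Each inner list is one micro-batch of readings.
batches = [[2, 4], [6], [8, 10]]
print(sliding_window_means(batches, window_size=2))  # -> [3.0, 4.0, 8.0]
```

In real Spark code the same idea is expressed declaratively (e.g. with DataFrame window operations) rather than hand-rolled like this.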

Our team found Spark’s library of transformations easier to use than Storm’s Topology patterns. This advantage compounded with every new Spark release (usually every twelve weeks) as Spark’s library grew in available features and improved in performance. Over the course of our migration, Spark released three major versions (1.2, 1.3, and 1.4). Since we went to full production, Spark has released two more major versions, each of which brings benefits for development (like IN clauses in SparkSQL) and operations (like improved memory management).

Spark runs on Java 8, which let us move to mixins for far better coding efficiency. Many of our data engineers went further and made the jump from Java to Scala to take advantage of Spark’s native language. None had any desire to move from Java to Clojure, Storm’s native language.

Finally, Spark was easier to set up in individual DEV environments. This not only let our developers get started more quickly; it also made on-boarding our data scientists easier. Spark, unlike Storm, appealed to our data scientists (they use PySpark via Jupyter notebooks). As I mentioned in my last post, consolidating our Speed Layer and Batch Layer processing on a single framework and technology has yielded many code-sharing benefits.
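The code-sharing benefit boils down to this: a transformation written once can be applied both per micro-batch (Speed Layer) and over the full history (Batch Layer). A minimal sketch, with hypothetical names and plain Python standing in for Spark transformations:

```python
def enrich(record):
    """Shared business logic: identical for both layers."""
    return {**record, "total": record["price"] * record["qty"]}

def speed_layer(micro_batch):
    """Applied to each small micro-batch as it streams in."""
    return [enrich(r) for r in micro_batch]

def batch_layer(history):
    """Applied to the full historical dataset in a batch job."""
    return [enrich(r) for r in history]
```

With separate Storm and batch stacks, `enrich` would have had to be written (and kept in sync) twice.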

Developing with Storm

Storm is not on Java 8 yet. (We were using version 0.9.4, which was still on Java 6; even Java 7 is pretty dicey on some versions.) As a result, Storm requires bundling older resources that can conflict with drivers for Cassandra, HDFS, etc.

Storm’s library is very clean and easy to use if you come from a classic MapReduce functional programming background. However, it is not as feature-rich as Spark’s. Also, for many years Storm’s online documentation has been fair at best, though this appears to be changing in the last few months.

Setting up Storm in individual DEV environments takes a good amount of work, much of it due to Java version conflicts. I expect this to clear up as Storm moves to Java 8.

Nevertheless, Storm is much farther along in building the basic plumbing for production-quality CEP at scale. For example, exactly-once processing comes out of the box. Our team had to enhance Spark’s RDD processing to achieve true exactly-once semantics without loss of data or partial batch replay.
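Our enhancement isn’t shown here, but the usual pattern behind this kind of plumbing is an idempotent sink: each micro-batch is committed atomically under its batch ID, so a replayed batch becomes a no-op and a partially processed batch is never visible. A minimal plain-Python sketch (names hypothetical; a production sink would use transactional writes to the store):

```python
class ExactlyOnceSink:
    """Idempotent sink sketch: commits are keyed by batch ID."""

    def __init__(self):
        self.committed = {}  # batch_id -> records, stands in for durable storage

    def write_batch(self, batch_id, records):
        if batch_id in self.committed:
            return False  # replay after a failure: already committed, skip
        # Materialize fully, then commit in one step (all-or-nothing).
        self.committed[batch_id] = list(records)
        return True

    def all_records(self):
        return [r for recs in self.committed.values() for r in recs]

sink = ExactlyOnceSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(0, ["a", "b"])  # replayed batch is ignored
print(sink.all_records())        # -> ['a', 'b'], no duplicates
```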

* * *

In the end, our software and data engineers preferred Spark to Storm. This is not surprising given that Spark’s open source contributors now outnumber Storm’s by 5:1.

Next Page: Storm vs. Spark Streaming: The Performance Comparison