Storm vs. Spark Streaming:
Playing Well with Others
TL;DR Version: Both technologies play well with other “Big Data” technologies. Storm edges out slightly on self-hosted pure Apache installations. Spark edges out on IaaS and PaaS distributions.
We do not use either Storm or Spark by itself; we use each in conjunction with Kafka, Cassandra, and HDFS. For each, we use the “Vanilla Apache” distributions. In reality, we had no choice:
- Storm was not bundled with any distribution when we adopted it (later Hortonworks picked it up).
- Spark is released so often that Cloudera, MapR, DataStax, SAP, and pretty much everyone except AWS EMR are always a few versions behind.
Here is how well each played with other technologies:
With Kafka: Storm is friendlier in terms of I/O; Spark Streaming in terms of CPU
Kafka is very, very CPU-efficient. However, Storm’s Spouts continuously ping each of their consumed Kafka Topics in a manner that feels like a little kid asking, “Are we there yet?” over and over on a long road trip. While this allows Storm to get data almost instantaneously, it comes at the price of higher CPU use. We found the load to be virtually equal (about 4-5% CPU per Topic on a c4.large EC2 host) whether the Topics were empty or slightly backlogged (something we tested by turning off a topology, filling Kafka, then turning Storm back on). Luckily, we do not need hundreds of Kafka Topics. If we did, we would need a medium-sized server farm.
Spark Streaming, on the other hand, is much more efficient in terms of how it taxes Kafka CPU use. The reason for this is Spark’s micro-batching. On average, we could get about 25x more throughput per Kafka CPU core with Spark than we could with Storm, while keeping the Spark Batch Interval at 1 second (our definition of near real-time).
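The CPU difference comes down to fetch frequency. Here is a toy model of the two fetch patterns (the function names and intervals are illustrative assumptions, not taken from either project’s code):

```python
import math

def storm_style_fetches(duration_s, poll_interval_ms=5):
    """Storm-like spout: polls the topic continuously, even when it is empty."""
    return math.ceil(duration_s * 1000 / poll_interval_ms)

def spark_style_fetches(duration_s, batch_interval_s=1):
    """Spark-like micro-batching: one ranged fetch per batch interval."""
    return math.ceil(duration_s / batch_interval_s)

# Over a 60-second window on an idle topic:
print(storm_style_fetches(60))  # 12000 fetch requests
print(spark_style_fetches(60))  # 60 fetch requests
```

With these (made-up) intervals, the broker fields 200x fewer requests under micro-batching, which is why an empty Topic still costs Storm real CPU while costing Spark almost nothing.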
Unfortunately, circa Version 1.4, Spark made a big change with its Discretized Streams (DStreams), shifting offset storage from Zookeeper to Kafka (Storm stores offsets in Zookeeper, and Spark did so by default in previous versions). The byproduct was increased I/O. How much this affects you depends on your offset interval and message size. In our original approach to exactly-once processing, it doubled I/O. As a result, we had to change our exactly-once model to accommodate this complication. When we did, we found a few more weaknesses in Kafka-based offset management (especially if you want to change the out-of-the-box offset partitioning). The bugs/shortcomings we found are all documented in Spark’s JIRA project; we hope they will be fixed before Spark’s big 2.0 uber-release.
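The usual shape of an exactly-once model like the one described above is: write idempotently, then commit the offset, so a replay after a crash rewrites the same rows instead of duplicating them. A minimal sketch of that pattern (the dicts here are stand-ins for the real sink and offset store, not our actual code):

```python
store = {}                   # stands in for an idempotent sink (keyed upserts)
committed = {"offset": 0}    # stands in for the external offset store

def process(messages, start_offset):
    """Replay-safe: re-running from the last committed offset is harmless
    because writes are keyed upserts, and the offset is committed only
    after all writes succeed."""
    for key, value in messages:
        store[key] = value   # idempotent upsert
    committed["offset"] = start_offset + len(messages)

msgs = [("a", 1), ("b", 2)]
process(msgs, committed["offset"])
# A crash before the commit would mean re-running process() with the same
# offset — the upserts repeat, but the end state is identical.
process(msgs, 0)
print(store, committed["offset"])
```

The cost of this pattern is exactly the extra I/O the post describes: every offset commit is its own write, so a shorter offset interval means proportionally more traffic.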
* * *
Which was better: optimizing on CPU or I/O? For us, Spark worked better. If Spark’s DStream partitioning bugs are fixed, Spark will pair with Kafka much, much better than Storm does.
With Cassandra: Storm is the clear winner
In Storm, each Executor is a persistent process. Each needs to establish a Cassandra session only once. This saves a great deal of overhead and makes prepared statements much more efficient. The DataStax driver (now even adopted by Netflix) will even re-establish the session for you if a Cassandra node fails or times out. This is about as easy and efficient as it gets.
However, Spark Streaming is an entirely different story. As mentioned above, each RDD is a new process. Each requires establishing new sessions to its remote resources. This means you pay the overhead of session creation on every batch (once per partition). This is about as inefficient as it gets. I hope Spark will address this in a later release.
* * *
Storm is much friendlier to Cassandra than Spark is. This is not unique to Cassandra. The same would be true for any database: HBase, Mongo, or even a traditional RDBMS like PostgreSQL.
With HDFS: Spark edged out
We Chaos Monkey-tested how Storm and Spark Streaming handled a range of HDFS component failures: Zookeeper failures, Data Node failures, and the dreaded Name Node failure. Both did very well, with the exception of Name Node failures: Spark handled these gracefully; Storm did not.
The reason, however, was not any inherent weakness in Storm. It was Cloudera’s support of each. Cloudera provides libraries for Spark that enable it to dynamically switch over from the Primary Name Node to the Standby Name Node. It does not do this for Storm.
* * *
Spark’s slight edge working with HDFS is a result of more Tier I Hadoop distributors supporting Spark than Storm (twice as many, in fact). This differentiation is even starker if you want to run Spark on AWS EMR: Spark support is out-of-the-box. Storm is not.