Moving from Storm to Spark Streaming: Real-life Results and Analysis

Last Page: Storm vs. Spark Streaming: Developing With Each

Storm vs. Spark Streaming:
The Performance Comparison

TL;DR version: Both scale to thousands of transactions per node per second. However, Spark delivers higher throughput, while Storm delivers lower latency. Both are durable. However, Storm lets you sleep more easily at night.

Speed Comparison: Storm processes any given transaction faster than Spark

Data traverses Storm faster, especially when moving through a multi-stage DAG. This is due to two reasons (the first is obvious; the second surprises many people I have spoken with).

The Obvious: Storm is event driven. Its Spouts continuously check their inbound queues for messages. Assuming your queue has no backlog, your messages begin processing virtually instantaneously. Spark Streaming, on the other hand, is micro-batch based. It only picks up messages on a timed interval (we tested batch intervals of 0.5 seconds and 1 second). Essentially, you are adding:

50% x yourBatchIntervalTime x yourNumberOfStages

to the average processing time. This adds up when you are traversing 3-7 stages along a DAG.
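
As a rough sanity check, plug in the 1-second batch interval and the five-stage DAG from our test below:

    50% x 1,000 msec x 5 stages ≈ 2,500 msec of added latency

which is close to the Spark Streaming average we measured.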

The Not-So-Obvious: Storm topologies run indefinitely. That means they set up things like sessions to Cassandra, HDFS and Kafka once and keep them for life (most drivers are good enough to re-establish these if they die at some point). That overhead is incurred only at startup, making processing faster. However, each Spark RDD is a new process. Each must establish a new session every time a new batch starts (e.g., every second, for every partition of the RDD). That is a good amount of overhead.
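
To make the Spark side concrete, here is a minimal Scala sketch of the commonly recommended mitigation: open one session per RDD partition inside foreachPartition (ideally borrowed from a pool) instead of one per record. The SessionPool helper and the socket source are illustrative placeholders, not our production code; even with pooling, the session is still touched once per partition per batch rather than once per topology lifetime.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical stand-in for a real connection pool (e.g., Cassandra sessions).
    object SessionPool {
      final class Session { def write(record: String): Unit = println(record) }
      def borrow(): Session = new Session
      def release(session: Session): Unit = ()
    }

    object ConnectionReuseSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("connection-reuse-sketch")
        val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

        // Placeholder source; in our stack this would be a Kafka stream.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // Re-created every batch for every partition -- unlike a Storm bolt,
            // which opens its sessions once at topology startup.
            val session = SessionPool.borrow()
            try {
              records.foreach(record => session.write(record))
            } finally {
              SessionPool.release(session)
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }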

Storm could process our five-stage DAG, re-queuing each stage in Kafka, in 66 msec. Spark Streaming averaged 2,561 msec to do the same: 38x longer.

Scalability: Spark Has Higher Throughput

Storm may be faster, but Spark could process more data in the same window of time. This is one example of where micro-batches are more efficient, especially when receiving thousands of streams of data in parallel. After much tuning of partition sizes, batch sizes, etc., we found that Spark could achieve 45x more transactional throughput on a per-CPU basis.
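
To give a flavor of that tuning, the sketch below shows the kind of Spark 1.x-era settings involved. The keys are real Spark Streaming configuration properties, but the values are illustrative assumptions, not our production numbers.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("throughput-tuning-sketch")
      // Cap how fast each Kafka partition is drained so one backlog
      // cannot swamp a batch (direct stream).
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      // Let Spark adapt the ingest rate to observed processing times (1.5+).
      .set("spark.streaming.backpressure.enabled", "true")
      // For receiver-based streams, controls how many blocks (and tasks)
      // each batch is split into.
      .set("spark.streaming.blockInterval", "200ms")

    // The batch interval is the other big lever: larger batches amortize
    // per-batch overhead (throughput), smaller batches cut latency.
    val ssc = new StreamingContext(conf, Seconds(1))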

* * *

Which wins out: 45x more throughput or 38x more speed? As our data takes seconds to reach us from remote sensors and customers, Spark wins. However, if I were automating stock trades (with a black fiber line tied into the NYSE), I would pick Storm.

Reliability: Both Have Built-in Distributed Self-healing

Both Storm and Spark, when used with good driver libraries, have a great deal of built-in fault tolerance. Both can handle failover of their internal components (for Storm: Zookeeper, Nimbus and Executors; for Spark: Masters and Workers) and keep on chugging. Both can handle interruption of their connections to external systems (like Kafka, HDFS and Cassandra), responding by automatically reconnecting and retrying whatever failed.

Over months of operation, we found both platforms highly reliable. Each had areas where it handled failure more elegantly than the other. For example, Spark handled HDFS master node failures seamlessly, while Storm handled network timeouts more gracefully.

…However, Storm is harder to kill

While both Storm and Spark Streaming have extremely high reliability, Storm is much harder to kill. In two years of 24×7 operation, we had only one Storm failure: it was caused by a major AWS DNS outage in their Oregon Data Center. Storm handled everything else—hardware failures, Zookeeper losses, HDFS data node outages—without any loss of operation. In addition, we never had an issue with memory leaks or abnormal process termination—even across multiple data centers. That is pretty amazing!

Spark is not nearly as robust. It will re-try transactions, connections, etc. However, after a certain number of attempts it will wind down its process and abort. With data staged in Kafka you do not lose anything. However, you still need to examine your resources before re-starting—a production annoyance. Spark still has a few memory leaks that cause processes to ghost, requiring automated termination (YARN is helpful here). In addition, you have to be careful to ensure you do not expand beyond your allocated memory or your submitted jobs will die. Yes, Spark is still a memory hog. However, it is getting better as Spark incorporates more and more Project Tungsten features.
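
For readers tuning this themselves, the settings we are alluding to are along these lines. The keys below are standard Spark-on-YARN properties from that era; the values are illustrative assumptions, not recommendations.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("resilience-sketch")
      // How many times a single task may fail before the whole job aborts.
      .set("spark.task.maxFailures", "8")
      // How many times YARN will re-submit the application after it aborts.
      .set("spark.yarn.maxAppAttempts", "4")
      // Off-heap headroom per executor (MB); too little and YARN kills the
      // container for exceeding its memory allocation.
      .set("spark.yarn.executor.memoryOverhead", "1024")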

That being said, alerting and winding down is not always bad. If you have a major infrastructure outage, winding down your DAG can remove load at a critical time and let systems recover automatically. Storm will just silently pound on a weakened cluster until all the nodes die painfully. Never point Storm at something that is not itself fault-tolerant and scaled with N+2 capacity (N+1 is not enough: if one node fails, you will get a cascading overload).

* * *

Currently, Storm lets your Ops teams sleep with fewer alerts. At the rate of current Spark adoption, I expect Spark Streaming reliability to get much better. Whether it reaches parity with Storm remains to be seen, as most of Spark’s focus is on the analytic side vs. CEP.

Durability: Both provide acceptable guaranteed message delivery

Both Storm and Spark Streaming provide guaranteed message delivery (GMD). In reality this translates to “keep trying until I receive positive acknowledgement of success.”

Storm does this in the manner that your product manager (or customer) would expect when reading GMD on the tin. It manages GMD one transaction at a time. If something fails, it re-tries until success.
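
To make the per-tuple model concrete, here is a minimal Scala sketch of a Storm bolt using the standard anchor/ack/fail pattern (written against the Storm 1.x org.apache.storm API; EnrichBolt and its enrich helper are illustrative, not our production code):

    import java.util.{Map => JMap}

    import org.apache.storm.task.{OutputCollector, TopologyContext}
    import org.apache.storm.topology.OutputFieldsDeclarer
    import org.apache.storm.topology.base.BaseRichBolt
    import org.apache.storm.tuple.{Fields, Tuple, Values}

    // Every tuple is either ack()ed on success or fail()ed, which tells the
    // spout to replay just that one transaction.
    class EnrichBolt extends BaseRichBolt {
      private var collector: OutputCollector = _

      override def prepare(conf: JMap[_, _], ctx: TopologyContext, out: OutputCollector): Unit = {
        collector = out
        // Long-lived sessions (Kafka, Cassandra, ...) would be opened once here.
      }

      override def execute(tuple: Tuple): Unit = {
        try {
          val enriched = enrich(tuple.getStringByField("payload")) // hypothetical field name
          collector.emit(tuple, new Values(enriched)) // anchored emit keeps the tuple tree intact
          collector.ack(tuple)                        // success: delivery confirmed for this tuple only
        } catch {
          case _: Exception =>
            collector.fail(tuple)                     // failure: only this tuple is replayed
        }
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("enriched"))

      private def enrich(payload: String): String = payload.toUpperCase // placeholder logic
    }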

Spark, on the other hand, manages GMD one batch at a time (actually, one RDD partition at a time). As a result, if any part of the batch fails, it will replay the entire batch, causing some transactions to repeat. This is sufficient for completely idempotent processes. However, it is bad when your process does something like sending an alert to a human.

Our team spent a lot of time modifying Spark’s DStream processing to mirror Storm’s transaction-level guarantee. However, we ultimately found this created too much I/O overhead (primarily due to changes in offset management introduced in Spark 1.4; more on this on Page 4). Instead, we implemented a guarding model in our applications to block repeat transactions. By combining at-least-once with never-more-than-once, we achieve effectively exactly-once distributed processing (as close to 100% as one can get):

{At Least Once} ∩ {No More Than Once} = {Exactly Once}

This add-on also guards against a weakness inherent in any GMD processing: an application can think a first attempt failed due to a timeout and re-attempt it, breaking the exactly-once paradigm. (How we do it, in microseconds, at scale, is a trade secret.) We saw this occasionally with Storm (especially when AWS was having particularly bad, persistent “noisy neighbor” problems).
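
Our exact guard is, as noted, a trade secret, but the general shape of such a duplicate blocker is common: atomically claim each transaction ID in a shared store and only perform the side effect if the claim wins. The Scala sketch below illustrates the idea with a local map; a real deployment would use a durable, shared store (for example, a conditional write in Cassandra) plus an expiry policy.

    import java.util.concurrent.ConcurrentHashMap

    // Illustrative duplicate guard: the first caller to claim a transaction ID
    // wins; batch replays and timeout-driven retries see "already processed"
    // and skip the side effect. A local map is shown for clarity only.
    object TransactionGuard {
      private val seen = new ConcurrentHashMap[String, java.lang.Boolean]()

      /** True only for the first attempt on a given transaction ID. */
      def firstAttempt(txnId: String): Boolean =
        seen.putIfAbsent(txnId, java.lang.Boolean.TRUE) == null
    }

    // Usage inside a processing step (Storm bolt or Spark partition loop):
    def alertOnce(txnId: String, sendAlert: () => Unit): Unit =
      if (TransactionGuard.firstAttempt(txnId)) sendAlert() // at-least-once + no-more-than-once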

Next Page: Storm vs. Spark Streaming: Playing Well with Others