Moving from Storm to Spark Streaming: Real-life Results and Analysis

Previous page: Storm vs. Spark Streaming: Playing Well with Others

Storm vs. Spark Streaming: Managing Operation of Each

TL;DR: If we only did CEP, our Ops Team would prefer Storm. Since we do both real-time and batch processing, they are happier to have one less technology to manage.

Both are easy to automate

Whether you are using Chef, Puppet, SaltStack, or Ansible, both Storm and Spark Streaming are easy to automate. Both are built for dynamic resource allocation via YARN, both can be rebalanced without an outage, and both have plenty of open source automation recipes to do this.
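For reference, here is a minimal sketch of what the Spark side of that dynamic allocation looks like from application code, assuming a YARN cluster with Spark's external shuffle service running on the NodeManagers (the app name and executor bounds are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: let YARN grow and shrink this app's executors with load.
// Assumes the external shuffle service is installed on the NodeManagers.
val conf = new SparkConf()
  .setAppName("elastic-streaming-app")               // placeholder name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")      // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")  // placeholder bounds
  .set("spark.dynamicAllocation.maxExecutors", "20")

val sc = new SparkContext(conf)
```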

Spark wins out slightly, as there are simply more distributors clamoring to support it. However, this should not be the deciding factor in which technology to use.

Fine-grained capacity allocation in Spark is easier

Both Storm and Spark have easy means to allocate relative capacity. Storm lets you designate the memory and number of Executors per host per Topology (where a Topology is a structured set of DAG computations). Spark lets you designate the memory, number of partitions, and virtual cores per Application (also a structured set of DAG computations). Both tools also provide nice UIs for browsing the performance data captured in logs (see the next two sections below). However, Spark allows more fine-grained capacity allocation and monitoring. This is due to two aspects of its architecture.
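As a rough illustration of those knobs (not our production settings; the topology and app names are made up), the allocation calls look something like this:

```scala
// Storm: per-Topology sizing (0.x-era backtype.storm package names).
import backtype.storm.Config

val stormConf = new Config()
stormConf.setNumWorkers(4)                                    // worker JVMs for this Topology
stormConf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx1024m")  // heap per worker JVM

// Spark: per-Application sizing.
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("clickstream-app")           // placeholder name
  .set("spark.executor.memory", "1g")      // memory per executor
  .set("spark.executor.cores", "2")        // Spark "cores" per executor (see below)
  .set("spark.default.parallelism", "32")  // default partition count
```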

First, Spark simply “runs cooler:”  As I mentioned before, Storm Spouts continuously ping for data, essentially keeping CPU higher even when data volumes are low. Often we found that increasing traffic from 1 message per node per second to 1,000 messages per node per second would only bump CPU from 40% to 42%. While this is great scalability, it makes it very hard to calculate capacity. Spark, on the other hand, runs “cool” when Topics are near-idle. This allowed our team to construct detailed projections of CPU needs (and TCO) vs. volume.

Second, allocated cores:  Spark has a little secret: the Spark cores you allocate per application actually have nothing to do with true CPU cores (or even AWS virtual compute cores). Essentially, you can declare that a cluster has a very high number of cores (say 16x or 24x the actual number); then allocate jobs in a very fine-grained manner (essentially handing out fractions of a percent of computing capacity). This is very useful when running many, many jobs. When combined with YARN, this gives you excellent insight into capacity management, prioritization, etc.

However, with Spark there are two catches.

  1. You should always specify a number of cores if you are running more than one job on one cluster and are not running YARN. If you don’t, the first app defaults to claiming an unlimited number of cores, causing puzzling errors when a second app is submitted in parallel. This wasted a good 40 hours of debug time, something I hope no one else has to repeat. (A minimal configuration sketch follows this list.)
  2. You DO need to be mindful of allocating enough memory per application so that it does not crash under maximum load, but also does not waste memory. Spark 1.5 added some nice memory management features in the UI to help avoid this. We have started testing the newly released dynamic memory management features of Spark 1.6 to see if they make Spark less of a memory hog.
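To make catch #1 concrete, here is a minimal sketch for a standalone (non-YARN) cluster; the values are placeholders, not recommendations:

```scala
import org.apache.spark.SparkConf

// On a standalone cluster, always cap cores explicitly; otherwise the first
// app claims every core the workers advertise and later apps starve.
// (Workers can advertise far more "cores" than the hardware has, e.g. via
// SPARK_WORKER_CORES in spark-env.sh, which is what enables fractional slices.)
val conf = new SparkConf()
  .setAppName("first-of-many-jobs")    // placeholder name
  .set("spark.cores.max", "8")         // total Spark cores this app may claim
  .set("spark.executor.memory", "2g")  // enough for peak load, no more (catch #2)
```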

As Spark adopts more and more Project Tungsten code, resource allocation should become more and more efficient.

UI Controls: Spark’s is more advanced, but has a doozy of a bug

Both Storm and Spark have nice web UIs to look at status and other metrics. Spark’s is slightly more advanced, especially in its ability to look at the map and filter stages of your graph as well as a timeline graph of processing per streaming micro-batch. However, Spark’s UI suffers from a few pretty bad bugs that tripped us up:

  • The first is a minor annoyance:  Spark serves a monitoring UI for each submitted app on ports counting up from 4040. Unfortunately, Chrome blocks port 4045. If your submitted app lands on this port, you need to kill it and re-submit it if you want to monitor it. (The first time this happened to us it was a bit disconcerting.)
  • The second bug was much worse:  Spark records detailed histories on completion of each batch. For batch jobs, this is fantastic. For streaming jobs, this creates a history entry for every micro-batch (i.e., every second if you make your batch interval near real-time). When you shut a job down (or worse, it crashes), this entire history has to be wrapped up. If your job has been running for weeks, this can take tens of minutes or more to process. As if that were not bad enough, the history server is embedded in the Spark master, which means the history wind-down process can take out your master node. We eventually disabled Spark histories, as this bug has not yet been resolved (at least more people are upvoting it now). A configuration sketch for working around both issues follows these bullets.
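Here is the configuration sketch mentioned above; the port and app name are examples, and disabling event logging is the blunt instrument we settled on rather than a general recommendation:

```scala
import org.apache.spark.SparkConf

// Work-arounds for the two UI issues above (example values only).
val conf = new SparkConf()
  .setAppName("monitored-streaming-app")  // placeholder name
  .set("spark.ui.port", "4050")           // pin the app UI off Chrome-blocked 4045
  .set("spark.eventLog.enabled", "false") // stop writing per-micro-batch histories
```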

This is clearly a byproduct of using a made-for-batch technology for real-time processing. We hope this is resolved fully as more people encounter it (comments on Apache’s JIRA are much more active now than they were six months ago). The fix requires two changes: 1) separate the history server out of the master process, and 2) implement an analog of log rotation for the histories of long-running streaming jobs.

Monitoring Logs: Storm is easier—but both share too much INFO

Log monitoring is essential to operations management. We use Sentry to monitor errors so we know as soon as something bad happens. We use FluentD, Graylog, and Graphyne to analyze logs and to set up threshold alerts. We link all of this to Slack for full ChatOps support (even easier today than the popular Hubot + HipChat combo of a few years ago).

Both Storm and Spark Streaming log huge amounts of data at INFO, so much that INFO is not usable for anything but non-production testing and debugging. Yes, detailed logging is essential to debugging distributed computing operations. Yes, these logs also provide fantastic data for their UI management consoles. However, if you want to log something that stands out to your DevOps teams, you will need to log it at WARN. This does make it harder to separate true warnings from useful information that has been promoted to WARN.
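In application code this boils down to two moves: turning down the frameworks’ own INFO chatter and emitting your operationally interesting events at WARN so the log pipeline can pick them out. A minimal Log4j 1.x sketch (Spark’s logging backend at the time; the logger names and message are examples):

```scala
import org.apache.log4j.{Level, Logger}

// Quiet the framework INFO firehose (Log4j 1.x, which Spark still ships with).
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

// Emit events you want DevOps to notice at WARN so they stand out downstream.
val appLog = Logger.getLogger("com.example.streaming")  // placeholder logger name
appLog.warn("recovered from checkpoint; resuming stream processing")
```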

Storm logging is a bit easier to subscribe to, as all Executors log to files with static names and Storm (as of 0.10, released in November) finally supports Apache Log4j 2. Spark is still on the original Log4j. In addition, each submitted Spark app AND each partition logs to a dynamically named location, so you have to do some automation and UNIX trickery to make it easy to manage log subscriptions like Sentry DSNs and FluentD processing.

Over time, I expect both to make log management easier. It would, however, be very nice to be able to bump some of their low-level logging from INFO to DEBUG.

* * *

Both Storm and Spark Streaming are easy for technically minded Ops folks to manage in large, 24×7 distributed computing environments. Storm is a bit farther along, as it has been used in this role at really big places (e.g., Twitter, Yahoo!) for several years. Spark still needs a bit more “seasoning”.

Next: Storm vs. Spark Streaming: The Winner