L5: Past Lessons

Lagrange Point 5 (L5): Looking at past tech developments to learn, grow and avoid mistakes

Moving from Storm to Spark Streaming: Real-life Results and Analysis

In my last post, I explained why we decided to move the Speed Layer of our Lambda Architecture from Apache Storm to Apache Spark Streaming. In this post, I get to the “real meat”:

  • How did it go?
  • What did we learn?
  • And most important of all… would we do it again?

This post recounts our detailed experiences moving to a 100%-Spark architecture. While Spark is arguably the most popular open source project in history (currently its only rival in terms of number of contributors is AngularJS), our experience with it was not all wine and roses. Some experiences were great. Others remain frustrating today after nearly nine months in live operation, streaming mission-critical data. Whether you love Spark or Storm, there are some bragging rights for your favorite platform.

Before I get started, I should warn you that this post is pretty long. I could have broken it up into several posts, one for each category of analysis. However, I thought it was more useful as a single blog post.

Our Real-world Environment

This is not one of those simple streaming analytics run-offs using the canonical “Twitter Word Count” test (Spark version, Storm version). This is a real-life comparison of Storm vs. Spark Streaming after months of live operation in production, analyzing complex, real-life data from many enterprise customers.

We do not use either technology by itself, but instead use it in conjunction with Apache Kafka (Cloudera’s distribution), Apache Cassandra (DataStax’s distribution), and Apache Hadoop (also Cloudera’s distribution, storing data in Apache Parquet format). I am not divulging any trade secrets here, as we list these technologies on our job descriptions for recruiting.

Similarly, we do not simply pass data through a single-stage graph (no robust real-world system uses a single-stage DAG). Instead, our DAG processing traverses 3-7 stages, depending on the type of data we receive. At each stage we persist data back to Kafka for durability and recovery.
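To make the per-stage persistence pattern concrete, here is a minimal sketch (not our production code, and deliberately not a real Kafka client) in which each stage consumes its input topic, transforms the records, and durably writes its output before the next stage runs. A plain dict of lists stands in for Kafka topics; all names are illustrative.

```python
from typing import Callable, Dict, List

topics: Dict[str, List[dict]] = {}  # stand-in for Kafka topics

def persist(topic: str, records: List[dict]) -> None:
    """Durably record stage output (in Kafka, a producer send + ack)."""
    topics.setdefault(topic, []).extend(records)

def run_stage(fn: Callable[[dict], dict], in_topic: str, out_topic: str) -> None:
    """Consume a stage's input topic, transform, and persist the result."""
    persist(out_topic, [fn(r) for r in topics.get(in_topic, [])])

# Example 3-stage DAG: raw -> parsed -> enriched
persist("raw", [{"payload": "42"}, {"payload": "7"}])
run_stage(lambda r: {"value": int(r["payload"])}, "raw", "parsed")
run_stage(lambda r: {**r, "squared": r["value"] ** 2}, "parsed", "enriched")

print(topics["enriched"])
```

Because every intermediate topic survives a crash, recovery means replaying from the last persisted stage rather than reprocessing the whole DAG from the original input.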

Obviously, everything we run is clustered (no single servers). Along these lines, we only use native installations of downloaded distributions. Everything here can be hosted anywhere you like: your own data center, GCE, AWS, Azure, etc. The results are not tied to IaaS solutions like AWS EMR.

This comparison is also not a short-duration test (which would also be artificial). We run our streaming processing 24×7, without scheduled downtime. Our Lambda Architecture enabled us to stream the same data into Storm and Spark at the same time, allowing a true head-to-head comparison of development, deployment, performance and operations.

Finally, these results are not just based on uniform sample data (e.g., 140-character Tweets). We used a wide range of real-life sensor data, in multiple encoding formats, with messages ranging from 100 bytes to 110 megabytes in size (i.e., real-world, multi-tenant complexity). We tested this at data rates exceeding 48 Gbps per node. We have come up with novel ways to stream data larger than Kafka’s message.max.bytes limit in real-time along our DAG (disclosing how we do this would reveal a trade secret 😉).

So what did we learn? I will discuss the results from four perspectives:

  1. Developing with each (a.k.a., the software engineering POV)
  2. Head-to-head performance comparison
  3. Using each with other “Big Data” technologies
  4. Managing operations of each (a.k.a., the DevOps POV)

BTW, Spark Streaming is of course a micro-batch architecture, while Storm is a true event-processing architecture. Storm Trident vs. Spark Streaming would be the true “apples-to-apples” comparison. However, that was not our real-life experience. The move from one-transaction-at-a-time to micro-batches required some changes in conceptual thinking (especially for “exactly once” processing). I include some learnings from this.
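The conceptual shift is easy to miss: with per-event processing you reason about acking individual messages, while with micro-batches you reason about atomically committing a whole batch’s output together with its offset range. Here is an illustrative sketch of that batch-offset style of “exactly once” (the names are mine, not any real Storm or Spark API):

```python
# A stream of (key, value) events, addressed by offset, as Kafka would hold them.
stream = [("k1", 1), ("k2", 2), ("k1", 3), ("k3", 4), ("k2", 5)]

def micro_batches(events, batch_size):
    """Yield (start_offset, end_offset, batch) slices of the stream."""
    for i in range(0, len(events), batch_size):
        yield i, min(i + batch_size, len(events)), events[i:i + batch_size]

committed_offset = 0  # persisted alongside results in a real system
totals = {}

for start, end, batch in micro_batches(stream, 2):
    if start < committed_offset:
        continue  # already-committed batch: replay after a failure is a no-op
    for key, value in batch:
        totals[key] = totals.get(key, 0) + value
    committed_offset = end  # output + offset committed as one unit

print(totals)  # {'k1': 4, 'k2': 7, 'k3': 4}
```

The key idea is that the unit of recovery is the batch, not the message: if the job dies mid-batch, you rerun from the last committed offset and the duplicate batch is skipped or overwritten, never double-counted.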


Moving from Storm to Spark Streaming: Taking the Jump

At work, my team has built a true Lambda data architecture for our SaaS services (if you don’t know what Lambda is, take a look at a free chapter of Nathan Marz’s book or my prior blog post on why I love Lambda so much). The Lambda Architecture is ideal for what we do, sensor analytics:

  • We use the speed layer to perform complex event processing on real-time analytic data
  • We then trigger follow-on windowing analysis based on deferred events and triggers (essentially deferred speed layer calculations across a wider time range)
  • We then follow this with multiple layers of batch computations that do everything from analytic wrap-up to discovery of strategic data patterns to updating machine learning models.
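The split described above can be sketched in a few lines: the speed layer maintains incremental real-time views, the batch layer periodically recomputes the authoritative view from the master dataset, and queries merge the two. This is a toy illustration of the pattern, not our actual code.

```python
batch_view = {}  # recomputed from the master dataset on each batch run
speed_view = {}  # incremental, covers events since the last batch run

def speed_ingest(key):
    """Real-time path: cheap incremental update."""
    speed_view[key] = speed_view.get(key, 0) + 1

def batch_recompute(master_events):
    """Batch path: recompute from scratch, absorbing the speed layer's events."""
    global batch_view, speed_view
    batch_view = {}
    for key in master_events:
        batch_view[key] = batch_view.get(key, 0) + 1
    speed_view = {}

def query(key):
    """Serving layer: merge batch and speed views."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

master = ["sensor-a", "sensor-a", "sensor-b"]
for event in master:
    speed_ingest(event)       # answers are fresh immediately...
assert query("sensor-a") == 2
batch_recompute(master)       # ...and survive the batch catch-up unchanged
assert query("sensor-a") == 2
```

The cost this post keeps returning to lives in those two code paths: `speed_ingest` and `batch_recompute` implement the same business logic twice, on two different platforms, and must stay in sync across releases.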

Spark: The Promise to Simplify Two Architectures into One

While Lambda is powerful, it is also complicated. It essentially requires you to maintain two virtually separate data architectures: something expensive to build and keep in sync over software releases. This cost is one of the reasons some argue against the use of Lambda.

Like many current practitioners of Lambda, we originally used Apache Storm for real-time speed computations and MapReduce for batch layer computations. Contrary to what you might read, you can build Lambda without Storm, MapReduce, or even Hadoop. I have built several using a wide range of technologies (even using human Mechanical Turks for the batch layer). However, up until about 2014, Storm and MapReduce (along with Kafka, HDFS and Cassandra) were the open source, high-scale tools of choice.

Then Spark graduated to a top-level Apache project in early 2014.

Since Spark added its real-time streaming capability, it has offered the potential of providing both real-time and batch computation from a single software library and platform. Yes, Spark Streaming is micro-batch, and more akin to Storm Trident than to pure Storm (soon, I will be writing a post on when I would use Storm vs. Spark Streaming vs. Flink). For all intents and purposes, though, Spark Streaming was close enough to real-time for a broad range of use cases, including ours: processing sensor signals that take anywhere from 2 seconds to several minutes to travel from devices over cellular connections to data centers.

So, what is the catch if Spark “offers it all”? There are several. While Spark is very powerful, it presents real limitations. First, its speed requires lots of memory, which means it is ultimately not as scalable as plain-old MapReduce. Second, it is a bit of a “Swiss Army Knife” of data platforms, which means it essentially force-fits a batch model into a real-time one. What would be the trade-offs of moving vs. staying with Storm, a real-time specialist proven for scale at places like Twitter and Yahoo!?

Experiment 1: Replacing MapReduce with Spark Batch Jobs

For these reasons, I took my time evaluating Spark before jumping in. First, I tried it as a batch-only tool, replacing some of our MapReduce jobs with Spark (0.9) ones. As expected, the Spark jobs were much faster: about 60x. Also as expected, I had to buy more expensive servers to get more memory (we use AWS EC2, not EMR, for several business-specific reasons). Overall, I got about a 60x performance boost for a 4x cost increase. Not too bad a trade-off.
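The back-of-envelope arithmetic behind that trade-off is worth spelling out: a 60x speedup bought with a 4x increase in server cost still nets out to roughly 15x more work done per dollar of compute.

```python
speedup = 60         # Spark jobs ran ~60x faster than their MapReduce versions
cost_multiplier = 4  # pricier, memory-heavy instances to feed Spark

# Work completed per dollar, relative to the MapReduce baseline of 1.0
perf_per_dollar = speedup / cost_multiplier
print(perf_per_dollar)  # 15.0
```

As the next paragraph notes, this calculus flips at much larger fleet sizes, where a 4x hardware multiplier on thousands of servers would dwarf the development savings.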

What I did not expect was what happened to my development costs: they went down significantly. Our developers found it much easier to build and debug jobs in Spark than in MapReduce. How much easier? We saw a 75% reduction in bugs and a 67% reduction in development timelines. These savings easily paid for some extra servers (of course, the story would have been different if I had thousands of servers vs. hundreds). When Spark set a new large-scale sorting record a few months later, we felt pretty good that our choice would have a good enough balance of speed vs. scale vs. cost for our needs.

Taking the Big Jump: 700+ Contributors Means A Lot

At this point, I waited. I was curious about the benefits of consolidation. However, the developers I trusted at places with enough resources to use whatever tools they wanted still said Storm was more scalable. Having supported thousands of transactions per second for the bulk of my career, I have always favored speed and scale (having seen teams die when they could not scale). So I watched.

In May 2014, Spark 1.0 came out. Four months later, 1.1 was released; three months after that, 1.2. I started to look at the release notes and noticed that Spark was becoming one of the most popular open source projects in history. How popular? 3x the contributors of MongoDB, nearly 6x Cassandra, almost 11x HBase (in fact, Spark has had 5x the number of contributors of the entire Hadoop ecosystem combined). By the time Spark 1.2 came out, it was clearly evolving faster than Storm, with almost 4x as many total contributors.

Growth in Contributors per Month to major Open Source data platforms (sources: Github and Black Duck software)

When this many people adopt something, good things generally happen. Yes, new features are added more quickly. More importantly, though, all those developers break things, and those breaks get found and fixed, resulting in less risk and improved performance. When Spark 1.3 came out (the third major release in 90 days), we started a project to move to Spark.

Since then, much of the industry has jumped into the Spark pool (IBM has probably jumped in with the most people). Spark is up to 1.6 as of a few weeks ago (Spark was advancing so quickly that we had to migrate to new major versions and regression test twice before going to production). Spark MLlib has given us some data science-friendly tools and capabilities. Spark SQL with Apache Parquet has been mind-blowing; more on that in a few weeks.

However, our journey was not all wine and roses. Some experiences were great. Others are still deeply frustrating today, after six months in live operation streaming mission-critical data 24×7. My next blog post will recount in detail what we learned, using data as real-world as you can get. Whether you love Spark or Storm, there are some bragging rights for your favorite platform. Stay tuned…

BTW: If you like this stuff, I am hiring 😉