At work, my team has built a true Lambda Data Architecture for our SaaS services (if you don’t know what Lambda is, take a look at a free chapter of Nathan Marz’ book or my prior blog post on why I love Lambda so much). The Lambda architecture is ideal for what we do: sensor analytics:
- We use the speed layer to perform complex event processing on real-time analytic data
- We then trigger follow-on windowing analysis based on deferred events and triggers (essentially deferred speed layer calculations across a wider time range)
- We then follow this with multiple layers of batch layer computations that do everything from analytic wrap-up to discovery of strategic data patterns to update of machine learning models.
Spark: The Promise to Simplify Two Architectures into One
While Lambda is powerful, it is also complicated. It essentially requires you to maintain two virtually separate data architectures: something expensive to build and keep in sync over software releases. This cost is one the reasons some argue against the use of Lambda.
Like many current practitioners of Lambda, we originally used Apache Storm for real-time speed computations and MapReduce for batch layer computations. Contrary to what you might read, you can build Lambda without Storm, MapReduce, or even Hadoop. I have built several using a wide range of technologies (even using human Mechanical Turks for the batch layer). However, up until about 2014, Storm and MapReduce (along with Kafka, HDFS and Cassandra) were the open source, high-scale tools of choice.
Then Spark graduated to a top level Apache project in late 2014.
Since Spark added its real-time streaming capability it has offered the potential of providing both real-time and batch computation from a single software library and platform. Yes, Spark Streaming is micro-batch and more akin to Storm Trident than pure Storm. However, for a broad range of use cases Spark Streaming approximates real-time (soon, I will be writing a post on when I would use Storm vs. Spark Streaming vs. Flink ). However, for all intents and purposes, Spark Streaming was close enough to real-time to process sensor signals that take anywhere from 2 seconds to several minutes to be transmitted from devices over cellular connections to data centers.
So, what is the catch if Spark “offers it all?” Many. While Spark is very powerful, it presents several limitations. First its speed requires lots of memory—which means it is ultimately not as scalable as plain-old MapReduce. Second, it is a bit of a “Swiss Army Knife” of data platforms—which means it essentially is force-fitting a batch model into a real-time one. What would be the tradeoffs of moving this vs. staying with Storm: a real-time specialist proven for scale at places like Twitter and Yahoo!?
Experiment 1: Replacing MapReduce with Spark Batch Jobs
For these reasons, I took my time evaluating Spark before jumping in. First I tried it as batch only tool, replacing some of our MapReduce jobs with Spark (0.9) ones. As expected, the Spark jobs were much faster: about 60x faster. Also as expected, I had to by more expensive servers to get more memory (we use AWS EC2 – not EMR – for several business-specific reasons). Overall I got about 60x performance boost for 4x cost increase. Not too bad of a trade-off.
What I did not expect was what happened to my development costs: they went down significantly. Our developers found it much easier to build and de-bug jobs in Spark vs. MapReduce. How easier: we saw a 75% reduction in bugs and a 67% reduction in development timelines. These savings easily paid for some extra servers (of course, the story would have been different if I had thousands of servers vs. hundreds). When Spark set a new large-scale sorting record a few months later, we felt pretty good that our choice would have a good enough balance of speed vs. scale vs. cost for our needs.
Taking the Big Jump: 700+ Contributors Means A Lot
At this point, I waited. I was curious about the benefits of consolidation. However the developers I trusted at places that had enough resources to use whatever tools still said Storm was more scalable. Having supported thousands of transactions per second for the bulk of my career, I have always been always favored speed and scale (having seen teams die when they could not scale). So I watched
In May 2014, Spark 1.0 came out. Four months later 1.1 was released. Three months later 1.2. I started to look at the release notes and noticed that Spark was becoming one of the most popular open source projects in history. How popular? 3x the contributors of MongoDB. Nearly 6x Cassandra. Almost 11x HBase. (In fact Spark has had 5x the number of contributors of the entire Hadoop ecosystem combined. By the time Spark 1.2 came out, it was clearly evolving faster than Storm, with almost 4x and many total contributors.
When this many people adopt something, good things generally happened. Yes, new features are added more quickly. However, more importantly, developers break things, resulting in less risk and improved performance. When Spark 1.3 came out (the third major release in 90 days), we started a project to move to Spark.
Since then, much of the industry has jumped into the Spark pool (IBM has probably jumped with the most people into the pool). Spark is up to 1.6 as of a few weeks ago (Spark was advancing so quickly we had to migrate to new major versions and regression test twice prior to going to production). Spark MLlib has given us some data science-friendly tools and capabilities. Spark SQL with Apache Parquet has been mind-blowing–more on that in a few weeks.
However, our journey was not all wine and roses. Some experiences were great. Others are still deeply frustrating today: after six months in live operation, streaming mission-critical data 24×7. My next blog post will recount in detail what we learned using as real-world data as you can get. Whether you love Spark or Storm, there are some bragging rights to your favorite platform. Stay tuned…
BTW: If you like this stuff, I am hiring 😉