Moving from Storm to Spark Streaming, Part 1: Taking the Jump

At work, my team has built a true Lambda Data Architecture for our SaaS services (if you don’t know what Lambda is, take a look at a free chapter of Nathan Marz’ book or my prior blog post on why I love Lambda so much). The Lambda architecture is ideal for what we do: sensor analytics:

  • We use the speed layer to perform complex event processing on real-time analytic data
  • We then trigger follow-on windowing analysis based on deferred events and triggers (essentially deferred speed layer calculations across a wider time range)
  • We then follow this with multiple layers of batch layer computations that do everything from analytic wrap-up to discovery of strategic data patterns to update of machine learning models.

Spark: The Promise to Simplify Two Architectures into One

While Lambda is powerful, it is also complicated. It essentially requires you to maintain two virtually separate data architectures: something expensive to build and keep in sync over software releases. This cost is one the reasons some argue against the use of Lambda.

Like many current practitioners of Lambda, we originally used Apache Storm for real-time speed computations and MapReduce for batch layer computations. Contrary to what you might read, you can build Lambda without Storm, MapReduce, or even Hadoop. I have built several using a wide range of technologies (even using human Mechanical Turks for the batch layer). However, up until about 2014, Storm and MapReduce (along with Kafka, HDFS and Cassandra) were the open source, high-scale tools of choice.

Then Spark graduated to a top level Apache project in late 2014.

Since Spark added its real-time streaming capability it has offered the potential of providing both real-time and batch computation from a single software library and platform. Yes, Spark Streaming is micro-batch and more akin to Storm Trident than pure Storm. However, for a broad range of use cases Spark Streaming approximates real-time (soon, I will be writing a post on when I would use Storm vs. Spark Streaming vs. Flink ). However, for all intents and purposes, Spark Streaming was close enough to real-time to process sensor signals that take anywhere from 2 seconds to several minutes to be transmitted from devices over cellular connections to data centers.

So, what is the catch if Spark “offers it all?” Many. While Spark is very powerful, it presents several limitations. First its speed requires lots of memory—which means it is ultimately not as scalable as plain-old MapReduce. Second, it is a bit of a “Swiss Army Knife” of data platforms—which means it essentially is force-fitting a batch model into a real-time one. What would be the tradeoffs of moving this vs. staying with Storm: a real-time specialist proven for scale at places like Twitter and Yahoo!?

Experiment 1: Replacing MapReduce with Spark Batch Jobs

For these reasons, I took my time evaluating Spark before jumping in. First I tried it as batch only tool, replacing some of our MapReduce jobs with Spark (0.9) ones. As expected, the Spark jobs were much faster: about 60x faster. Also as expected, I had to by more expensive servers to get more memory (we use AWS EC2 – not EMR – for several business-specific reasons). Overall I got about 60x performance boost for 4x cost increase. Not too bad of a trade-off.

What I did not expect was what happened to my development costs: they went down significantly. Our developers found it much easier to build and de-bug jobs in Spark vs. MapReduce. How easier: we saw a 75% reduction in bugs and a 67% reduction in development timelines. These savings easily paid for some extra servers (of course, the story would have been different if I had thousands of servers vs. hundreds). When Spark set a new large-scale sorting record a few months later, we felt pretty good that our choice would have a good enough balance of speed vs. scale vs. cost for our needs.

Taking the Big Jump: 700+ Contributors Means A Lot

At this point, I waited. I was curious about the benefits of consolidation. However the developers I trusted at places that had enough resources to use whatever tools still said Storm was more scalable. Having supported thousands of transactions per second for the bulk of my career, I have always been always favored speed and scale (having seen teams die when they could not scale). So I watched

In May 2014, Spark 1.0 came out. Four months later 1.1 was released. Three months later 1.2. I started to look at the release notes and noticed that Spark was becoming one of the most popular open source projects in history. How popular? 3x the contributors of MongoDB. Nearly 6x Cassandra. Almost 11x HBase. (In fact Spark has had 5x the number of contributors of the entire Hadoop ecosystem combined. By the time Spark 1.2 came out, it was clearly evolving faster than Storm, with almost 4x and many total contributors.

Growth in Contributors per Month to major Open Source data platforms (sources: Github and Black Duck software)

When this many people adopt something, good things generally happened. Yes, new features are added more quickly. However, more importantly, developers break things, resulting in less risk and improved performance. When Spark 1.3 came out (the third major release in 90 days), we started a project to move to Spark.

Since then, much of the industry has jumped into the Spark pool (IBM has probably jumped with the most people into the pool). Spark is up to 1.6 as of a few weeks ago (Spark was advancing so quickly we had to migrate to new major versions and regression test twice prior to going to production). Spark MLlib has given us some data science-friendly tools and capabilities. Spark SQL with Apache Parquet  has been mind-blowing–more on that in a few weeks.

However, our journey was not all wine and roses. Some experiences were great. Others are still deeply frustrating today: after six months in live operation, streaming mission-critical data 24×7.  My next blog post will recount in detail what we learned using as real-world data as you can get. Whether you love Spark or Storm, there are some bragging rights to your favorite platform. Stay tuned…

BTW: If you like this stuff, I am hiring 😉

internetofthings-1038x576

How to Architect for IoT

Last week I had the pleasure of doing a podcast with Forbe-contributor Mike Kavis on how to architect for the Internet of Things (“IoT”). We originally connected on Twitter regarding a discussion on whether the IoT and sensors are Big Data. That discussion led a podcast on architecture challenges–from device to data to data consumer–created by the onset of millions (or billions) of connected sensors and smart things.

Here in an excerpt of what we discussed

  • Connected devices bring back some classic engineering challenges back into the forefront.  How do you transmit data securely and with low power consumption? How do you handle lossy networks and cut-off transmissions?
  • Not everything is smartphone app transmitting JSON over HTTP (that would be cost prohibitive from both a hardware and bandwidth perspective). How do you handle communication myriad protocols, each of which could be using a near-infinite variety of data encoding formats?
  • IoT data is messy. Devices get cut off in mid-transition (or repeat over and over until they get an acknowledgement). How do you detect this–and clean it up–as data arrives?
  • IoT data is of incredibly high volume. By 2020, we will have 4x more sensor and IoT data than enterprise data. We already get more data today from sensors than we do from PCs. How do we scale to consume and use this. In addition, connected devices are not always smart or fault-tolerant. How do you ensure you are always ready to catch all that data (i.e., you need a zero-downtime IoT utility)
  • IoT and sensor and of itself is not terribly useful. It is rarely in a format that a (business or consumer) analyst would even be able to read. It would be incredibly wasteful to store all this as-is in a business warehouse, DropBox repo, etc.
  • IoT and sensor data needs context. Knowing device Knowing that FE80:0000:0000:0000:0202:B3FF:FE1E:8329 is at GPS location X,Y is of no use. You need to marry it to data about the “things” to get useful insights.
  • IoT data simultaneously “lives” in two points of view: what does this mean right now and what does this imply for the big picture. The Lambda Architecture is an ideal tool to handle this.
  • Finally, while all the attention is on the consumer stories, the real money is the Industrial and Enterprise Internet of Things. It’s also where smart things are far less creepy.

Listen to the podcast to hear more of the details

You can find the full podcast on Cloud Technology Partner’s website and SoundCloud:

I also want to take a moment to extend a big thank you to the folks at Cloud Technology Partners, SYS-CON Media, and Cloud Computing Journal for sharing this podcast.  I also want thanks to all of you on Twitter who retweeted it. I was happily overwhelmed by the sharing and interest!

5 points where tech balances between life and work