Tag Archives: systems engineering

Moving from Storm to Spark Streaming: Real-life Results and Analysis

In my last post, I explained why we decided to move the Speed Layer of our Lambda Architecture from Apache Storm to Apache Spark Streaming. In this post, I get to the “real meat”:

  • How did it go?
  • What did we learn?
  • And most important of all… would we do it again?

This post recounts our detailed experiences moving to a 100%-Spark architecture. While Spark is arguably the most popular open source project in history (currently its only rival in terms of number of contributors is AngularJS), our experience with it was not all wine and roses. Some experiences were great. Others remain frustrating today after nearly nine months in live operation, streaming mission-critical data. Whether you love Spark or Storm, there are some bragging rights for your favorite platform.

Before I get started, I should warn you that this post is pretty long. I could have broken it up into separate posts, one for each category of analysis. However, I thought it was more useful as a single blog post.

Our Real-world Environment

This is not one of those simple streaming analytics run-offs using the canonical “Twitter Word Count” test (Spark version, Storm version). This is a real-life comparison of Storm vs. Spark Streaming after months of live operation in production, analyzing complex, real-life data from many enterprise customers.

We do not use either technology by itself, but instead use it in conjunction with Apache Kafka (Cloudera’s distribution), Apache Cassandra (DataStax’s distribution), and Apache Hadoop (also Cloudera’s distribution, storing data in Apache Parquet format). I am not divulging any trade secrets here, as we list these technologies on our job descriptions for recruiting.

Similarly, we do not simply pass data through a single-stage graph (no robust real-world system uses a single-stage DAG). Instead, our DAG processing traverses 3 to 7 stages, depending on the type of data we receive. At each stage we persist data back to Kafka for durability and recovery.
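As a rough illustration (not our production code), here is a minimal, framework-agnostic sketch of one such stage: it consumes from an upstream Kafka topic, applies its transformation, and persists the result to the next topic before committing its consumer offsets. The topic names, group id, and the EnrichmentStage/transform names are hypothetical.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// One stage of a multi-stage DAG: consume from the upstream topic,
// transform, and persist the result to the next topic so the following
// stage (and any recovery) can pick it up from Kafka.
object EnrichmentStage {
  def main(args: Array[String]): Unit = {
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "kafka:9092")
    consumerProps.put("group.id", "enrichment-stage")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "kafka:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val consumer = new KafkaConsumer[String, String](consumerProps)
    val producer = new KafkaProducer[String, String](producerProps)
    consumer.subscribe(List("parsed-events").asJava) // hypothetical upstream topic

    while (true) {
      val records = consumer.poll(1000L)
      for (record <- records.asScala) {
        val transformed = transform(record.value()) // stage-specific logic goes here
        producer.send(new ProducerRecord[String, String]("stage-2-events", record.key(), transformed))
      }
      producer.flush()      // make sure the downstream topic has the data...
      consumer.commitSync() // ...before acknowledging the upstream offsets
    }
  }

  private def transform(payload: String): String = payload // placeholder
}
```

Flushing the producer before committing the consumer offsets is what makes each stage recoverable: if a stage dies mid-batch, it replays from the last committed offset instead of losing data.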

Obviously, everything we run is clustered (no single servers). Along these lines, we only use native installations of downloaded distributions. Everything here can be hosted anywhere you like: your own data center, GCE, AWS, Azure, etc. The results are not tied to managed services like AWS EMR.

This comparison is also not a short-duration test (which would also be artificial). We run our streaming processing 24×7, without scheduled downtime. Our Lambda Architecture enabled us to stream the same data into Storm and Spark at the same time, allowing a true head-to-head comparison of development, deployment, performance, and operations.

Finally, these results are not just based on uniform sample data (e.g., 140-character Tweets). We used a wide range of real-life sensor data, in multiple encoding formats, with messages ranging from 100 bytes to 110 megabytes in size (i.e., real-world, multi-tenant complexity). We tested this at data rates exceeding 48 Gbps per node. We have also come up with novel ways to stream data larger than Kafka’s message.max.bytes limit in real time along our DAG; disclosing how we do this would be a trade secret 😉

So what did we learn? I will discuss the results from four perspectives:

  1. Developing with each (a.k.a., the software engineering POV)
  2. Head-to-head performance comparison
  3. Using each with other “Big Data” technologies
  4. Managing operations of each (a.k.a., the DevOps POV)

BTW, of course Spark Streaming is a micro-batch architecture while Storm is a true event-processing architecture, so Storm Trident vs. Spark Streaming would be the true “apples-to-apples” comparison. However, that was not what we ran in real life. The move from one-transaction-at-a-time processing to micro-batches required some changes in conceptual thinking (especially for “exactly once” processing). I include some lessons learned from this; the sketch below illustrates the shift in mindset.
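As a rough sketch of that mindset shift (assuming the spark-streaming-kafka-0-10 integration; the topic, group id, and the placeholder write are hypothetical, not our actual code), the unit of work becomes the whole micro-batch: make the batch’s output idempotent, then commit the batch’s offset ranges.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object MicroBatchStage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-stage")
    val ssc  = new StreamingContext(conf, Seconds(5)) // every 5 seconds of events becomes one batch

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "speed-layer",
      "enable.auto.commit" -> (false: java.lang.Boolean) // we commit offsets ourselves, per batch
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("stage-2-events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // The unit of work is the whole micro-batch, not one event. "Exactly once"
      // here means: write the batch idempotently, then commit its offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { partition =>
        // Placeholder for an idempotent sink write (e.g., an upsert keyed so that
        // replaying the same batch after a failure is harmless).
        partition.foreach(record => println(s"${record.key()} -> ${record.value()}"))
      }
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With per-event processing in core Storm, the equivalent guarantee hangs off acking individual tuples; with micro-batches it hangs off the batch boundary, which is exactly the conceptual change described above.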

Next Page: Storm vs. Spark Streaming: Developing With Each

How to Architect for IoT

Last week I had the pleasure of doing a podcast with Forbes contributor Mike Kavis on how to architect for the Internet of Things (“IoT”). We originally connected on Twitter during a discussion on whether the IoT and sensors are Big Data. That discussion led to a podcast on the architecture challenges, from device to data to data consumer, created by the onset of millions (or billions) of connected sensors and smart things.

Here is an excerpt of what we discussed:

  • Connected devices bring some classic engineering challenges back to the forefront. How do you transmit data securely and with low power consumption? How do you handle lossy networks and cut-off transmissions?
  • Not everything is a smartphone app transmitting JSON over HTTP (that would be cost-prohibitive from both a hardware and bandwidth perspective). How do you handle communication across myriad protocols, each of which could be using a near-infinite variety of data encoding formats?
  • IoT data is messy. Devices get cut off mid-transmission (or repeat over and over until they get an acknowledgement). How do you detect this, and clean it up, as data arrives?
  • IoT data is of incredibly high volume. By 2020, we will have 4x more sensor and IoT data than enterprise data. We already get more data today from sensors than we do from PCs. How do we scale to consume and use all of it? In addition, connected devices are not always smart or fault-tolerant. How do you ensure you are always ready to catch all that data (i.e., you need a zero-downtime IoT utility)?
  • IoT and sensor data in and of itself is not terribly useful. It is rarely in a format that a (business or consumer) analyst would even be able to read. It would be incredibly wasteful to store all of it as-is in a business warehouse, DropBox repo, etc.
  • IoT and sensor data needs context. Knowing that device FE80:0000:0000:0000:0202:B3FF:FE1E:8329 is at GPS location X,Y is of no use. You need to marry it to data about the “things” to get useful insights (see the sketch after this list).
  • IoT data simultaneously “lives” in two points of view: what does this mean right now, and what does it imply for the big picture? The Lambda Architecture is an ideal tool to handle this.
  • Finally, while all the attention is on the consumer stories, the real money is in the Industrial and Enterprise Internet of Things. It’s also where smart things are far less creepy.
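To make the “context” point from the list above concrete, here is a toy sketch (hypothetical device, names, and values, not from the podcast): a raw reading keyed only by a device address becomes useful only once it is joined with registry data about the thing that produced it.

```scala
// Hypothetical illustration of adding context to a raw sensor reading:
// the reading is meaningless until it is joined with metadata about the
// "thing" (device) that produced it.
case class RawReading(deviceId: String, lat: Double, lon: Double, value: Double)
case class DeviceInfo(kind: String, site: String, unit: String)
case class EnrichedReading(kind: String, site: String, value: Double, unit: String, lat: Double, lon: Double)

object ContextEnricher {
  // In a real system the registry would live in a database or a reference
  // topic; an in-memory map keeps the sketch self-contained.
  val registry: Map[String, DeviceInfo] = Map(
    "FE80:0000:0000:0000:0202:B3FF:FE1E:8329" -> DeviceInfo("flow-meter", "plant-7", "L/min")
  )

  def enrich(r: RawReading): Option[EnrichedReading] =
    registry.get(r.deviceId).map(d => EnrichedReading(d.kind, d.site, r.value, d.unit, r.lat, r.lon))

  def main(args: Array[String]): Unit = {
    val raw = RawReading("FE80:0000:0000:0000:0202:B3FF:FE1E:8329", 42.36, -71.06, 13.7)
    println(enrich(raw)) // Some(EnrichedReading(flow-meter,plant-7,13.7,L/min,42.36,-71.06))
  }
}
```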

Listen to the podcast to hear more of the details.

You can find the full podcast on Cloud Technology Partners’ website and on SoundCloud.

I also want to take a moment to extend a big thank-you to the folks at Cloud Technology Partners, SYS-CON Media, and Cloud Computing Journal for sharing this podcast. I also want to thank all of you on Twitter who retweeted it. I was happily overwhelmed by the sharing and interest!