
Moving from Storm to Spark Streaming: Real-life Results and Analysis

In my last post, I explained why we decided to move the Speed Layer of our Lambda Architecture from Apache Storm to Apache Spark Streaming. In this post, I get to the “real meat”:

  • How did it go?
  • What did we learn?
  • And most important of all… would we do it again?

This post recounts our detailed experiences moving to a 100%-Spark architecture. While Spark is arguably the most popular open source project in history (currently its only rival in terms of number of contributors is AngularJS), our experience with it was not all wine and roses. Some experiences were great. Others remain frustrating today after nearly nine months in live operation, streaming mission-critical data. Whether you love Spark or Storm, there are some bragging rights for your favorite platform.

Before I get started, I should warn you that this post is pretty long. I could have broken it up into separate posts, one for each category of analysis. However, I thought it was more useful as a single blog post.

Our Real-world Environment

This is not one of those simple streaming analytic run-offs using the canonical “Twitter Word Count” test (Spark version, Storm version). This is a real-life comparison of Storm vs. Spark Streaming after months of live operation in production, analyzing complex, real-life data from many enterprise customers.

We do not use either technology by itself, but instead use it in conjunction with Apache Kafka (Cloudera’s distribution), Apache Cassandra (DataStax’s distribution), and Apache Hadoop (also Cloudera’s distribution, storing data in Apache Parquet format). I am not divulging any trade secrets here, as we list these technologies on our job descriptions for recruiting.

Similarly, we do not simply pass data through a single-stage graph (no robust real-world system uses a single-stage DAG). Instead, our DAG processing traverses 3 to 7 stages, depending on the type of data we receive. At each stage we persist data back to Kafka for durability and recovery.
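To make that pattern concrete, here is a minimal sketch (not our production code) of what one stage in such a DAG looks like: it consumes from the previous stage’s Kafka topic, applies a per-record transform, and publishes the result to the next stage’s topic before committing its own offsets, so a crashed stage can resume from Kafka without losing data. The topic names, group id, and the toUpperCase transform are placeholders, and I use the plain Kafka client here (rather than Storm or Spark) just to keep the pattern visible.

```scala
import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object StageTwo {
  def main(args: Array[String]): Unit = {
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "stage-two")       // hypothetical stage name
    consumerProps.put("enable.auto.commit", "false") // commit only after the next topic has the data
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("acks", "all")                 // wait for full durability before acknowledging upstream
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val consumer = new KafkaConsumer[String, String](consumerProps)
    val producer = new KafkaProducer[String, String](producerProps)
    consumer.subscribe(java.util.Arrays.asList("stage-one-out")) // placeholder topic name

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        val transformed = record.value().toUpperCase // stand-in for the real per-stage transform
        producer.send(new ProducerRecord("stage-two-out", record.key(), transformed))
      }
      producer.flush()      // make sure the next stage's topic has the data...
      consumer.commitSync() // ...before marking this stage's input as consumed
    }
  }
}
```

Chain a handful of these stages together (3 to 7, in our case) and you get a DAG where every hop is independently recoverable from Kafka.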

Obviously, everything we run is clustered (no single servers). Along these lines, we only use native installations of downloaded distributions. Everything here can be hosted anywhere you like: your own data center, GCE, AWS, Azure, etc. The results are not tied to IaaS solutions like AWS EMR.

This comparison is also not a short-duration test (which would also be artificial). We run our streaming processing 24×7, without scheduled downtime. Our Lambda Architecture enabled us to stream the same data into Storm and Spark at the same time, allowing a true head-to-head comparison of development, deployment, performance and operations.

Finally, these results are not just based on uniform sample data (e.g., 140-character Tweets). We used a wide range of real-life sensor data, in multiple encoding formats, with messages ranging from 100 bytes to 110 megabytes in size (i.e., real-world, multi-tenant complexity). We tested this at data rates exceeding 48 Gbps per node. We have come up with novel ways to stream data larger than Kafka’s message.max.bytes limit in real time along our DAG; disclosing how we do this would be a trade secret 😉

So what did we learn? I will discuss the results from four perspectives:

  1. Developing with each (a.k.a., the software engineering POV)
  2. Head-to-head performance comparison
  3. Using each with other “Big Data” technologies
  4. Managing operations of each (a.k.a., the DevOps POV)

BTW, Spark Streaming is of course a micro-batch architecture, while Storm is a true event-processing architecture; Storm Trident vs. Spark Streaming would be the true “apples-to-apples” comparison. However, that was not our real-life experience. The move from one-transaction-at-a-time processing to micro-batches required some changes in conceptual thinking (especially for “exactly once” processing). I include some learnings from this below.
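For the curious, here is a minimal sketch of that micro-batch mindset, written against the Spark Streaming + Kafka direct-stream integration (spark-streaming-kafka-0-10). The topic, group id, batch interval, and the println stand-in for the store write are illustrative, not our production code. The point is the pattern: each micro-batch exposes its exact Kafka offset ranges, so you write your outputs idempotently first and commit offsets second, which is how “effectively exactly once” is achieved without per-tuple acking.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

object SpeedLayerStage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("speed-layer-stage")
    val ssc  = new StreamingContext(conf, Seconds(2)) // micro-batch interval (illustrative)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "speed-layer",
      "enable.auto.commit" -> (false: java.lang.Boolean) // we commit offsets ourselves, after writing
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("sensor-events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Every micro-batch knows exactly which Kafka offsets it covers...
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ...so outputs can be written under deterministic keys (topic/partition/offset here).
      // If the batch is replayed after a failure, the same keys are overwritten, not double-counted.
      rdd.map(r => (s"${r.topic}/${r.partition}/${r.offset}", r.value))
         .foreachPartition { rows =>
           rows.foreach { case (idempotentKey, payload) =>
             // Stand-in for an idempotent upsert into the serving store (Cassandra, in our stack).
             println(s"$idempotentKey -> ${payload.length} bytes")
           }
         }

      // Only after the writes succeed do we commit the offsets back to Kafka.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Coming from Storm, the shift is from reasoning about acking individual tuples to reasoning about whole batches with known offset boundaries; that is where most of our conceptual rework around “exactly once” went.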


Why business owners should care about this thing called the Lambda Architecture

Updated on April 19, adding “Mapping this back to…” final section

In the past 25 years I have seen four things that really made me step back and say, “This changes everything.” The first was the browser (before that, we got data from the Internet using newsgroups and anonymous FTP). The second was open source distribution (we could get whole architectures up in hours, not weeks or months). The third was App Stores (Amazon and Apple allowed us to distribute software with zero marginal cost). The most recent was the Lambda Architecture.

Yep, it is that big.

If you are a business owner or product manager who is into Big Data, data-driven decision-making, iterative A/B testing, machine learning-driven recommendations or any similar analytics application, you have probably heard a passing reference to this thing called the Lambda Architecture. However, anyone digging in deeper immediately finds a menagerie of arcane terms that could only appeal to developers: Kafka, Storm, Spark, Cassandra, ElephantDB, Impala, Speed Layer, Batch Layer, Immutable Data Store, etc. This is unfortunate, because it obscures how disruptive a change the Lambda Architecture represents. As a result, many people with the decision-making authority to fund technology changes are missing out on something really big.

Life in the traditional architecture world

Traditional architectures are based on transactions. They force the collection of data into the formats required to complete a given transaction (i.e., I need to collect N fields of information to process the sale of an item). In addition, traditional architectures enable data to be changed: I can update my profile, update my shopping cart, update my order status, etc. This makes perfect sense if your objective is to complete a transaction.

But what if I want to understand more about who buys what, who is doing what, or, often more importantly, what leads something to happen (or not happen)? I cannot get this from the transaction data. Instead, I have to perform “data archaeology,” stitching multiple sources of data together to reconstruct what happened just before and after the transaction. If I am lucky, I have all this data. More often than not, however, I need to engage in development efforts to collect more data at the time of the transaction, log more information, pull it into a data warehouse, change my reports, and then dig in to see if I can figure things out. This not only takes a great deal of time and effort; it is also a ripe source of errors.

Lambda flips how we view data on its head

The Lambda Architecture starts with an entirely different premise: that it is impossible to understand today all the future uses and interpretations we will need from our data.

This is not just a platitude. It is an underlying philosophy: the value of data comes from the ability to ask it as many questions as you would ever want to ask. This drives entirely different approaches to how data is captured, stored, interpreted—and most importantly of all—continuously reinterpreted as you learn and discover more about your company, customers and operations:

  • First, data is preserved in its original form and never changed or destroyed. This lets you look at any piece of data at any point in time and factor in changes over time. For example, you could re-segment your customers every year, quarter, or even day as you learn new patterns.
  • Second, data is not forced into arbitrary formats (i.e., schemas) but is preserved raw, as you may want to go back and glean different elements from it. For example, you could later realize that a variable such as the source IP address of a customer’s visit to your site may entirely change how you measure, interpret and react to customers from that address.
  • Third, data is engineered so it can be easily reinterpreted as you learn more. This does not just focus on making reinterpretation fast; it also makes reinterpretation fault-tolerant (i.e., easy to correct in the event of a bug—without any loss of information).
  • Finally, all of this is available in real time through two points of view: a just-in-time view and a deep cross-sectional view (both of which are always current). This lets you make decisions quickly without sacrificing the 100% lossless accuracy needed for important business areas (such as finance, medicine, or mission-critical operations).

Once you have these capabilities, the things you can do with data—quickly and at scale—are pretty amazing. I will share some of these in future posts, as I want to keep this post short.

However, I will close this post out with a simple analogy…

“Think Like a Chef” vs. the Fast Food Menu

Traditional architectures are like fast food menus. You have these options. If you want to change the menu, we can do some market research, see what works, and roll out a new menu. If you want to change again (or explore “what if we had done this?”), we can repeat this process.

Lambda architecture is like the pantry of a great chef. You have all these ingredients. If you feel like duck à l’orange, we can make this. If you want a duck confit salad, we can re-purpose the ingredients. If you want really rich potatoes, we can render the fat and cook the potatoes in it. If you want vegan, we can pull other items out of the pantry and make something else. There are so many more options.

Mapping This Back to Things Business People Care About

So what does this mean for your business? Do you remember the last time you heard comments like these:

  • “You’ll see that report. It will be in our Data Warehouse–tomorrow around 10am.”
  • “Oh, that’s in our warehouse. We can build a program to convert and load the data into production. It will only take 3 weeks. Can you submit your TPS form to the Steering Committee so we can prioritize this?”
  • “Gee, it’s too bad we did not capture that data. We can start to capture it now. In a few months we can start analyzing it.”

With Lambda, all of these comments–and many more–go away. Data is never thrown away. It is always in production, ready to be used–for analysis or real-time transactions. There is no delay between transactional use and analysis–data flows down both paths at once.

Just imagine what problems you can solve when these limitations go away.