Tag Archives: Hadoop

Moving from Storm to Spark Streaming: Real-life Results and Analysis

In my last post, I explained why we decided to move the Speed Layer of our  Lambda Architecture from Apache Storm to Apache Spark Streaming in . In this post, I get the “real meat”:

  • How did it go?
  • What did we learn?
  • And most important of all… would we do it again?

This post recounts our detailed experiences moving to a 100%-Spark architecture. While Spark is arguably the most popular open source project in history (currently its only rival in terms of number of contributors is AngularJS), our experience with it was not all wine and roses. Some experiences were great. Others remain frustrating today after nearly nine months in live operation, streaming mission-critical data. Whether you love Spark or Storm, there are some bragging rights for your favorite platform.

Before I get started I should warn you that this post is pretty long. I could have broken it up into different posts,  one on each category of analysis. However, I thought it was more useful as a single blog post.

Our Real-world Environment

This is not one of those simple streaming analytic run-offs using the the canonical “Twitter Word Count” test (Spark version, Storm version). This is a real-life comparison of Storm vs. Spark Streaming after months of live operation in production, analyzing complex, real-life data from many enterprise customers.

We do not use either technology by itself, but instead use it in conjunction with Apache Kafka (Cloudera’s distribution), Apache Cassandra (DataStax’s distribution), and Apache Hadoop (also Cloudera’s distribution, storing data in Apache Parquet format). I am not divulging any trade secrets here, as we list these technologies on our job descriptions for recruiting.

Similarly, we do not simply pass data through a single stage graph (no robust real-world system uses a single-stage DAG). Instead our DAG processing traverses from 3-7 stages, depending on the type of data we receive. At each stage we persist data back to Kafka for durability and recovery.

Obviously, everything we run is clustered (no single servers). Along these lines,  we only use native installations of downloaded distributions. Everything here can be hosted anywhere you like: your own data center, GCE, AWS, Azure, etc. The results are not tied to IaaS solutions like AWS EMR.

This comparison is also not a short-duration test (which would also  be artificial). We run our streaming processing 24×7, without scheduled downtime. Our Lambda Architecture enabled us to stream the same data into Storm and Spark at the same time, allowing a true head-to-head comparison of development, deployment, performance and operations.

Finally, these results are not just based on uniform sample data (e.g., 140-character Tweets). We used a wide range of real-life sensor data, in multiple encoding formats, with messages ranging from 100 Bytes to 110 Megabytes in size (i.e., real-world, multi-tenant complexity). We tested this at data rates exceeding 48 Gbps per node. We have come up with novel ways to stream data larger than the message.max.bytes size in Kafka in real-time along our DAG–disclosing how we do this would be a trade secret 😉

So what did we learn? I will discuss the results from four perspectives:

  1. Developing with each (a.k.a., the software engineering POV)
  2. Head-to-head performance comparison
  3. Using each with other “Big Data” technologies
  4. Managing operations of each (a.k.a., the DevOps POV)

BTW, of course Spark Streaming is a micro-batch architecture while Storm is a true event processing architecture. Storm Trident vs. Spark Streaming would be a true “apple-to-apple” comparison. However, that was not our real-life experience. The move from one-transaction-at-a-time to micro-batches presented some changes in conceptual thinking (especially for “exactly once” processing). I include some learnings from this.

Next Page: Storm vs. Spark Streaming: Developing With Each

Ten Tech Trends for Your 2012 New Year’s Resolutions List

Article first published as Ten Tech Trends for Your 2012 New Year’s Resolutions List on Technorati

BabyNewYearOne of the most exciting things about working in tech is using it to create new ways to work, play—and even live. We have seen many great technology innovations develop over the past few years. Over 2012, ten of them will complete the jump from “new concept” to “mainstream trend.” How many of them are your ready for?

1. Everything Will Be Portable. The move to portable computing (smartphones, tablets and ultrabooks) will accelerate. Thick laptops and—even worse—desktops will be a relic of the past (except for those with high-power computing needs). If you are not yet mobile- and portable-ready, you better get there very soon.

2. Augmented Reality Will Go Mainstream. Augmented Reality (AR) is no longer a science fiction concept. Smartphones and (especially) tablets are mass-market platforms for everyday augmented reality. We are already seeing the first applications at Tech Meetups, CES and more. At least three innovators will exploit this, gaining mainstream adoption, by the end of 2012.

3. Touch Will Be Ubiquitous. Over the past five years, capacitive touch interfaces have re-programmed how millions of us interact with technology. As more devices are now sold today with touch than without, it is time to begin optimizing your user interface and user experience for touch (instead of a two-button mouse and keyboard).

4. Voice Will Be Next. While the intuitiveness of touch is a leapfrog improvement over mouse-and-keyboard, it still ties up our hands. Voice-based interaction is where we need to go. Apple’s Siri began the move of voice-driven interaction into the mainstream. This year, we’ll see SDKs for iOS and Android that harness the creativity of thousands to explode use of voice.

5. Fat Will Be the New Thin. Over a decade ago, broadband Internet enabled browsers to replace thick client applications. Now, portable computing usage across low power, lossy networks (e.g., mobile, WiFi, Bluetooth) coupled with AppStore Model has brought locally installed apps back in vogue. Building web apps is not enough; you need AppStore apps too.

6. Location-based Privacy Will Be Solved. Over the last two years location-based services became really hot. Unfortunately location-related privacy issues became hot too. The move of these services into mainstream populations of tens of millions will expand anecdotal security scares into weekly news stories, forcing adoption of safer location-based privacy policies.

7. Cloud Will Be the New Norm. Cloud computing is no longer an “edge market.” It is now adopted by big enterprises, public sector agencies—and even consumer tech providers. The cost, convenience and flexibility advantages of cloud computing will make it too hard for everyone not to use—everyday—by the end of this year.

8. …So Will Twitter. While people still love to debate the reasons to use Twitter, everything from the Arab Spring to the Charlie Sheen Meltdown showed that Twitter is now a well-recognized media channel. #Election2012 will accelerate mainstream use of Twitter—with the same overwhelming intensity we have seen for years in “traditional” campaign advertising.

9. ‘Consumerization of IT’ Planned and Budgeted. Consumer tech has become so sophisticated (without sacrificing ease-of-use and intuitiveness) that we began last year to demand its use in the enterprise. 2012—the first year in which most enterprise budgets include planned projects to support the consumerization of IT—will both accelerate and “lock in” this new tech trend.

10. 2012 Will Be Declared the Begin of “The ‘Big Data’ Era.” This year we will see another 40% increase in data we need to manage. This growth, coupled with recent releases of enterprise-ready high-scale NoSQL products will begin adoption of this tech by the entire industry. Looking back, 2012 will represent the start of the global, cross-industry Big Data era.

If you haven’t started embracing these already, now is a great time to add them to your “2012 Technology New Year’s Resolution List.” Sponsor a few pilot projects in your enterprise. Buy one or two Post-CES products to help you work more efficiently at the office. Or—if you want to include the whole family—buy one to use while you shop online, watch TV or manage your household.