Wrap Up: Which Is Best?
TL;DR: It depends on your use case.
What do I like best for CEP: Storm or Spark Streaming? It depends. I know, I hate answers like this. However, this is a true engineering tradeoff. Storm is better in some situations; Spark in others:
- If you are working on true real-time features where 1 second is too long (e.g., stock trading), then Storm wins. However, if you want to interleave machine learning with this to detect outliers or illicit patterns, you will need to build your own real-time ML infrastructure. This is not rocket science, but it is extra development work.
- If near real-time (2-4 seconds) meets your needs, then Spark Streaming is fast enough. In addition, if you are performing in-stream analytics or advanced data manipulation beyond simple ETL, or would benefit from slowing down and looking at your data in a window of time, then Spark Streaming is better for you. You will save on development and testing and will have fewer moving parts in your architecture.
- If you are handling billions of transactions per day, Storm will be more economical from an Ops point of view. However, if you are doing both real-time and batch processing, you will need to maintain two platforms and two libraries of code. In many companies this added Dev cost will outweigh the Operational savings. It depends on a comparison of feature vs. traffic costs.
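As a rough illustration, the three rules of thumb above can be written down as a small decision helper. This is only a sketch: the `Workload` fields, the `suggest_platform` function, and the exact cut-offs (1 second, one billion events per day) are hypothetical names and simplifications of the bullets, not a formula from any benchmark. Real decisions also hinge on team size and budget.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a streaming workload (illustrative only)."""
    max_latency_seconds: float      # longest acceptable end-to-end delay
    needs_windowed_analytics: bool  # in-stream analytics, windowing, more than simple ETL
    events_per_day: int             # expected traffic volume
    also_runs_batch: bool           # same team also maintains batch jobs


def suggest_platform(w: Workload) -> str:
    """Apply the rules of thumb from the list above, in priority order."""
    # Rule 1: if 1 second is too long, only Storm fits.
    if w.max_latency_seconds < 1.0:
        return "Storm"
    # Rule 2: windowed analytics, or a shared real-time/batch code base,
    # favors Spark Streaming (one platform, one library of code).
    if w.needs_windowed_analytics or w.also_runs_batch:
        return "Spark Streaming"
    # Rule 3: at billions of events per day with no batch side,
    # Storm tends to be cheaper to operate.
    if w.events_per_day >= 1_000_000_000:
        return "Storm"
    # Assumed default: near real-time is good enough.
    return "Spark Streaming"
```

For example, `suggest_platform(Workload(3.0, True, 50_000_000, True))` lands on Spark Streaming, while a sub-second trading workload lands on Storm regardless of the other fields.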
Either way, both are great technologies. Moving to Spark Streaming was right for us. I am glad we made the move. If I had had the freedom to use AWS EMR, I would have been even happier with the move.
Nevertheless, assuming both technologies continue to advance, I would not simply pick Spark for everything. That type of thinking is dangerous. I have been applying CEP for 23 years now (and have several CEP-related patents from the pre-OSS days). The one constant has been change. The way to guard against being caught out by it is to keep your eyes open and apply engineering judgment.
If presented with a platform choice again, I would look at my product-market use cases, team size, and budget and pick what provides the best balance of benefits and challenges. It could be Storm, it could be Spark, or it could be Flink or Samza. If you do not mind IaaS lock-in, your answer could even be AWS Lambda or Google's new Cloud Functions.
If you like this type of engineering work and analysis and enjoy using the latest data technologies to tackle hard problems, give me a call: I am hiring.