Tag Archives: developers

Spark Streaming and Expert Systems for the Industrial IoT

This week, at the Washington DC Spark Interactive, Savi Engineering shared some of our work on using Spark Streaming and Expert Systems technology (Drools) to analyze the Industrial IoT in near-real time.

At Savi, we use a hybrid Lambda Architecture (see my post on why Lambda is so important). By “hybrid” we mean that unlike pure Lambda Architectures, we cannot restate the past 100% as we have already notified humans of critical IoT events (e.g., theft, safety risk). We can only enrich and auto-resolve these as more data becomes available. You can find tips on how do this — in general with streaming technologies and specifically with Spark — in the following presentation. You can also learn more about tackling real-world IoT challenges:

In addition, at Savi we combine fully explicit rules with real-time machined learning algorithms to perform risk and performance analytics in near-real time (see my post on the differences in focus areas between our Data Engineers and Data Scientists).  James Nowell of our Engineering team provided a great presentation on how we run Drools inside Spark RDDs (yes–Drools, we do this without performance penalties) to create linear-scale expert systems to analyze all that IoT as if we were an omniscient human. You can find his presentation here:

In future presentations, we will expand on areas such as:

  • The differences in use of Spark (using the same data) between Data Scientists and Engineers
  • How we scale machine learning algorithms for real-time, sub-second execution (thousands of times per second)
  • Creating a DAG that combines hardware device edge intelligence with cloud-based intelligence

If you like what you see here, Savi is hiring. Take a look at here.


Data Scientists vs. Data Engineers: Facts vs. Interpretation

Some of the things we build at work are closed-loop, Internet-scale machine learning micro-services. We have created algorithms that run in milliseconds that we can invoke via REST calls, thousands of times per second. We also have created data pipeline processes that process new (mostly sensor) data and build and publish new models when critical thresholds are reached. This work requires the collaboration of two very in-demand specialists: Data Scientists and Data Engineers.

Contrary to the classic Math vs. Coding vs. Domain Expertise Venn diagram, Data Scientists and Data Engineers share many similarities. Both love data. Both have domain expertise. Both are great functional programmers. Both are good at solving complicate mathematical problems—both discrete and continuous. Both use many similar tools and languages (in our case, Spark, Hadoop, Python and Scala).

However, over the past two years, as we have improved the collaboration between each to build better machine learning services, we have some key differences between each role. These differences are not just based on skill set or disposition. They also include differences areas of responsibility that are essential to creating fast, scalable, and accurate machine learning services.

It is easy to muddle raw data from fully deterministic derived data from algorithmically derived data. Raw data never changes. Rules may change but are easy to manage with clean version controls. However, even the same deterministic algorithms can produce different results (one example: whenever you refit or rebuild a model using new data, your results can change). If you are building algorithmic services you need to keep everything clean and separate. If not, you cannot cleanly “learn” from new data and continuously improve your services.

We have found a very nice separation of responsibility that prevents muddling things:

  • Our Data Engineers are responsible for determinist facts
  • Our Data Scientists are responsible for interpretation of these

This boils down to this: determinist rules are the purview of engineers while algorithmic guesses come from scientists. This is a gross simplification (as both engineers deal in many, many complexities). However, this separate keeps it very clear, not only in determining “who does what” but also preventing errors, guesses, and other unintended consequences that pollute data driven decision-making.

Let’s take Google Now’s “Where you parked” service as an example. Data Engineers are responsible for processing the streaming sensor updates from your phone, combining this with past data, determining motion vs. at rest, factoring out duplicate transmission, geospatial drift, etc. Data Scientists are responsible for coming up with the algorithm to determine whether your detected stop state is a place where you parked (vs. simply being at work, at home, or at a really bad stop light). Essentially, Data Engineers capture and process the data to extract required model features while Data Scientists come up with the algorithm to interpret these features and provide an answer.

Once you have separation down, both teams can collaborate cleanly. Data Scientists experiment and test algorithms while Engineers design how to apply at scale, with sub-second execution. Data Scientists determine what approach is used to build models (and what triggers model optimization, build and re-fitting). Data Engineers build seamless implementation of this. Data Scientists build algorithm prototypes and MVPs; Data Engineers scale these into fast, reliable, services. Data Scientists worry about (and define rules) to exclude outliers that would wreak havoc on F-tests; Data Engineers implement defensive programming and automated test coverage to ensure unplanned data does not wreak havoc on production operation.