L2: In the Shadows

Lagrange Point 2 (L2): Potential surprises and developments under the radar

Data Scientists vs. Data Engineers: Facts vs. Interpretation

Some of the things we build at work are closed-loop, Internet-scale machine learning micro-services. We have created algorithms that run in milliseconds and that we can invoke via REST calls, thousands of times per second. We have also created data pipelines that process new (mostly sensor) data and build and publish new models when critical thresholds are reached. This work requires the collaboration of two very in-demand specialists: Data Scientists and Data Engineers.

Contrary to the classic Math vs. Coding vs. Domain Expertise Venn diagram, Data Scientists and Data Engineers share many similarities. Both love data. Both have domain expertise. Both are great functional programmers. Both are good at solving complicated mathematical problems, both discrete and continuous. Both use many of the same tools and languages (in our case, Spark, Hadoop, Python and Scala).

However, over the past two years, as we have improved the collaboration between the two roles to build better machine learning services, we have found some key differences between them. These differences are not just based on skill set or disposition. They also include different areas of responsibility that are essential to creating fast, scalable, and accurate machine learning services.

It is easy to muddle three kinds of data: raw data, deterministically derived data, and algorithmically derived data. Raw data never changes. Deterministic rules may change, but they are easy to manage with clean version control. Algorithmic results are different: even the same deterministic algorithm can produce different results over time (one example: whenever you refit or rebuild a model using new data, your results can change). If you are building algorithmic services, you need to keep these three kinds of data clean and separate. If not, you cannot cleanly “learn” from new data and continuously improve your services.
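
One way to keep the three kinds of data apart is to give each its own step in the code. The sketch below is only an illustration under assumed names, not our production code: `RawReading`, `RULE_VERSION`, `derive_speed_kmh` and `fit_speed_threshold` are all hypothetical, and the "model" is a deliberately trivial stand-in.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Raw data: immutable once captured (frozen=True forbids reassignment).
@dataclass(frozen=True)
class RawReading:
    sensor_id: str
    timestamp_s: float   # seconds since the start of the trip
    distance_m: float    # metres travelled since the previous reading
    interval_s: float    # seconds since the previous reading

# Deterministic derived data: same input + same rule version -> same output.
RULE_VERSION = "speed-v1"  # hypothetical version label kept under version control

def derive_speed_kmh(reading: RawReading) -> float:
    return reading.distance_m / reading.interval_s * 3.6

# Algorithmically derived data: depends on the data the model was fit on.
def fit_speed_threshold(speeds_kmh: list[float]) -> float:
    """Toy 'model': flag anything above mean + 2 standard deviations."""
    return mean(speeds_kmh) + 2 * pstdev(speeds_kmh)

readings = [
    RawReading("s1", 0.0, 10.0, 1.0),
    RawReading("s1", 1.0, 12.0, 1.0),
    RawReading("s1", 2.0, 11.0, 1.0),
]
speeds = [derive_speed_kmh(r) for r in readings]
threshold_before = fit_speed_threshold(speeds)

# New data arrives. The old raw readings and the rule are untouched,
# so the old derived speeds are identical, but the refit model moves.
readings_new = readings + [RawReading("s1", 3.0, 30.0, 1.0)]
speeds_new = [derive_speed_kmh(r) for r in readings_new]
threshold_after = fit_speed_threshold(speeds_new)
```

Here the derived speeds for the first three readings are the same before and after, while the fitted threshold shifts once the fourth reading is included. That is the difference between a versioned rule and a refit model.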

We have found a very nice separation of responsibility that prevents muddling things:

  • Our Data Engineers are responsible for deterministic facts
  • Our Data Scientists are responsible for the interpretation of these facts

It boils down to this: deterministic rules are the purview of engineers, while algorithmic guesses come from scientists. This is a gross simplification (as both roles deal in many, many complexities). However, this separation keeps things very clear, not only in determining “who does what” but also in preventing errors, guesses, and other unintended consequences that pollute data-driven decision-making.

Let’s take Google Now’s “Where you parked” service as an example. Data Engineers are responsible for processing the streaming sensor updates from your phone, combining them with past data, distinguishing motion from rest, factoring out duplicate transmissions and geospatial drift, etc. Data Scientists are responsible for coming up with the algorithm to determine whether your detected stop state is a place where you parked (vs. simply being at work, at home, or at a really bad stop light). Essentially, Data Engineers capture and process the data to extract the required model features, while Data Scientists come up with the algorithm to interpret these features and provide an answer.
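
To make the split concrete, here is a minimal sketch of what each side might own. It is purely hypothetical and says nothing about how Google actually implements the service: the `Ping` and `StopFeatures` types, the function names, and every threshold are assumptions invented for this illustration, and the "model" is a placeholder rule rather than a fitted algorithm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ping:
    timestamp_s: float
    lat: float
    lon: float
    speed_mps: float

@dataclass(frozen=True)
class StopFeatures:
    dwell_s: float           # how long the device stayed at rest
    speed_before_mps: float  # speed just before the stop began

# --- Data Engineering: deterministic feature extraction ---
def dedupe(pings):
    """Sort by time and drop duplicate transmissions (same timestamp)."""
    seen, out = set(), []
    for p in sorted(pings, key=lambda p: p.timestamp_s):
        if p.timestamp_s not in seen:
            seen.add(p.timestamp_s)
            out.append(p)
    return out

def extract_stop_features(pings, rest_speed_mps=0.5):
    """Features for the first at-rest run, or None if the device never stops."""
    clean = dedupe(pings)
    for i, p in enumerate(clean):
        if p.speed_mps <= rest_speed_mps:
            j = i
            while j + 1 < len(clean) and clean[j + 1].speed_mps <= rest_speed_mps:
                j += 1
            before = clean[i - 1].speed_mps if i > 0 else 0.0
            return StopFeatures(clean[j].timestamp_s - p.timestamp_s, before)
    return None

# --- Data Science: interpretation of the features ---
def looks_like_parking(f, min_dwell_s=300.0, min_driving_mps=5.0):
    """Placeholder rule standing in for a fitted model."""
    return f.dwell_s >= min_dwell_s and f.speed_before_mps >= min_driving_mps

pings = [
    Ping(0, 47.0, -122.0, 12.0),
    Ping(60, 47.001, -122.0, 0.0),
    Ping(60, 47.001, -122.0, 0.0),   # duplicate transmission
    Ping(400, 47.001, -122.0, 0.1),
]
features = extract_stop_features(pings)
parked = looks_like_parking(features)
```

The engineering functions always return the same features for the same pings. Only `looks_like_parking` is a guess, and it can be swapped for a better model without touching the extraction code.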

Once you have this separation down, both teams can collaborate cleanly. Data Scientists experiment with and test algorithms, while Data Engineers design how to apply them at scale, with sub-second execution. Data Scientists determine what approach is used to build models (and what triggers model optimization, build and re-fitting). Data Engineers build a seamless implementation of this. Data Scientists build algorithm prototypes and MVPs; Data Engineers scale these into fast, reliable services. Data Scientists worry about (and define rules to exclude) outliers that would wreak havoc on F-tests; Data Engineers implement defensive programming and automated test coverage to ensure unplanned data does not wreak havoc on production operations.
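
The last pairing above (outlier rules vs. defensive programming) can be sketched in a few lines. This is an assumed example, not our actual rules: the interquartile-range fence is a common textbook choice swapped in for illustration, and `iqr_bounds`, `clean_reading` and `prepare` are hypothetical names.

```python
import math
from statistics import quantiles

# Data Science side: a statistical rule for what counts as an outlier.
def iqr_bounds(values, k=1.5):
    """Tukey-style fences: (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Data Engineering side: defensive handling of data nobody planned for.
def clean_reading(raw):
    """Return a finite float, or None for anything malformed."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None

def prepare(raw_values, bounds):
    """Drop malformed readings first, then apply the scientists' outlier rule."""
    low, high = bounds
    cleaned = [v for v in map(clean_reading, raw_values) if v is not None]
    return [v for v in cleaned if low <= v <= high]

history = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
bounds = iqr_bounds(history)

# A batch with a missing value, junk text, a NaN and one wild reading.
batch = ["12.5", None, "oops", float("nan"), 500, 13]
usable = prepare(batch, bounds)
```

The malformed entries never reach the statistics, and the wild reading is excluded by a rule the Data Scientists can tune without changing the pipeline code.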

Why business owners should care about this thing called the Lambda Architecture

Updated on April 19, adding “Mapping this back to…” final section

In the past 25 years I have seen four things that really made me step back and say, “This changes everything.” The first was the browser (before that we got data from the Internet using newsgroups and anonymous FTP). The second was open source distribution (we could get whole architectures up in hours, not weeks or months). The third was App Stores (Amazon and Apple allowed us to distribute software with zero marginal cost). The most recent was the Lambda Architecture.

Yep, it is that big.

If you are a business owner or product manager who is into Big Data, data-driven decision-making, iterative A/B testing, machine learning-driven recommendations or any similar analytics application, you have probably heard a passing reference to this thing called the Lambda Architecture. However, anyone digging deeper immediately finds a menagerie of arcane terms that could only appeal to a developer: Kafka, Storm, Spark, Cassandra, ElephantDB, Impala, Speed Layer, Batch Layer, Immutable Data Store, etc. This is unfortunate, because it obscures how disruptive a change the Lambda Architecture represents. As a result, many people with decision-making authority to fund technology changes are missing out on something really big.

Life in the traditional architecture world

Traditional architectures are based on transactions. They force collection of data into the formats required to complete a given transaction (e.g., I need to collect N fields of information to process the sale of an item). In addition, traditional architectures enable data to be changed: I can update my profile, update my shopping cart, update my order status, etc. This makes perfect sense if your objective is to complete a transaction.

But what if I want to understand more about who buys what, who is doing what, or, often more importantly, what leads something to happen (or not happen)? I cannot get this from the transaction data. Instead, I have to perform “data archaeology,” stitching multiple sources of data together to reconstruct what happened just before and after the transaction. If I am lucky, I have all this data. However, more often than not I need to engage in development efforts to collect more data at the time of transaction, log more info, pull it into a data warehouse, change my reports, and then dig in to see if I can figure things out. This not only takes a lot of time and effort; it is also a ripe source of errors.

Lambda flips how we view data on its head

The Lambda Architecture starts with an entirely different premise: that it is impossible to understand today all the future uses and interpretations we will need from our data.

This is not just a platitude. It is the underlying philosophy: the value of data comes from its ability to answer as many questions as you would ever want to ask. This drives entirely different approaches to how data is captured, stored, interpreted and, most importantly of all, continuously reinterpreted as you learn and discover more about your company, customers and operations:

  • First, data is preserved in its original form and never changed or destroyed. This lets you look at any piece of data at any point in time and factor in changes over time. For example, you could re-segment your customers every year, quarter, or even day as you learn new patterns.
  • Second, data is not forced into arbitrary formats (i.e., schemas) but is preserved raw, as you may want to go back and glean different elements. For example, you could later realize that a variable such as the source IP address of a customer visit to your site may entirely change how you measure, interpret and react to customers from this address.
  • Third, data is engineered so that it can be easily reinterpreted as you learn more. This does not just focus on making reinterpretation fast; it also makes reinterpretation fault-tolerant (i.e., easy to correct in the event of a bug, without any loss of information).
  • Finally, it allows all of this in real time with two points of view: a just-in-time view and a deep cross-sectional view (both of which are always current). This lets you make decisions quickly without sacrificing the 100% lossless accuracy needed for important business areas (such as finance, medicine, or mission-critical operations).
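
For readers who want to see the four points above in miniature, here is a toy, in-memory sketch. It is a simplification under stated assumptions, not a real deployment: `Event`, `MasterLog` and `LambdaViews` are hypothetical names, there is no Kafka, Storm or Cassandra involved, and the "speed" view is simply recomputed at query time rather than updated incrementally as a real speed layer would be.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp_s: int
    customer_id: str
    payload: dict   # kept raw; no schema imposed at write time

class MasterLog:
    """Append-only store: events are added, never updated or deleted."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def since(self, timestamp_s=0):
        return [e for e in self._events if e.timestamp_s >= timestamp_s]

def count_by(events, interpret):
    """Apply an interpretation function to raw events and count the results."""
    return Counter(interpret(e) for e in events)

class LambdaViews:
    def __init__(self, log, interpret):
        self.log, self.interpret = log, interpret
        self.batch_view, self.batch_cutoff_s = Counter(), 0

    def run_batch(self, now_s):
        """Batch layer: recompute the whole view from raw data before now_s."""
        events = [e for e in self.log.since(0) if e.timestamp_s < now_s]
        self.batch_view = count_by(events, self.interpret)
        self.batch_cutoff_s = now_s

    def query(self):
        """Merge the batch view with events the batch has not seen yet."""
        recent = count_by(self.log.since(self.batch_cutoff_s), self.interpret)
        return self.batch_view + recent

log = MasterLog()
log.append(Event(1, "c1", {"ip": "10.0.0.1", "page": "/home"}))
log.append(Event(2, "c2", {"ip": "10.0.0.1", "page": "/buy"}))

by_page = LambdaViews(log, lambda e: e.payload["page"])
by_page.run_batch(now_s=3)
log.append(Event(5, "c1", {"ip": "10.0.0.2", "page": "/buy"}))  # arrives after the batch
page_counts = by_page.query()   # still current: includes the late event

# Later we decide source IP matters: reinterpret the same raw events.
by_ip = LambdaViews(log, lambda e: e.payload["ip"])
ip_counts = by_ip.query()
```

Because the log is never modified, a buggy interpretation function can be fixed and the views rebuilt from scratch without losing anything, and a brand-new question (here, counts by IP) needs no change to how the data was captured.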

Once you have these capabilities, the things you can do with data—quickly and at scale—are pretty amazing. I will share some of these in future posts, as I want to keep this post short.

However, I will close this post out with a simple analogy…

“Think Like a Chef” vs. the Fast Food Menu

Traditional architectures are like fast food menus. You have these options. If you want to change the menu, we can do some market research, see what works and roll out a new menu. If you want to change again (or explore “what if we had done this?”) we can repeat this process.

The Lambda Architecture is like the pantry of a great chef. You have all these ingredients. If you feel like duck à l’orange, we can make this. If you want a duck confit salad, we can re-purpose the ingredients. If you want really rich potatoes, we can render the fat and cook the potatoes in it. If you want vegan, we can pull other items out of the pantry and make something else. There are so many more options.

Mapping This Back to Things Business People Care About

So what does this mean for your business? Do you remember the last time you heard comments like these:

  • “You’ll see that report. It will be in our Data Warehouse–tomorrow around 10am.”
  • “Oh, that’s in our warehouse. We can build a program to convert and load the data into production. It will only take 3 weeks. Can you submit your TPS form to the Steering Committee so we can prioritize this?”
  • “Gee, it’s too bad we did not capture that data. We can start to capture it now. In a few months we can start analyzing it.”

With Lambda, all of these comments, and many more, go away. Data is never thrown away. It is always in production, ready to be used for analysis or real-time transactions. There is no delay between transactional use and analysis: data flows down both paths at once.

Just imagine what problems you can solve when these limitations go away.