Some of the things we build at work are closed-loop, Internet-scale machine learning micro-services. We have created algorithms that run in milliseconds and that we can invoke via REST calls thousands of times per second. We have also created data pipelines that process new (mostly sensor) data and build and publish new models when critical thresholds are reached. This work requires the collaboration of two very in-demand specialists: Data Scientists and Data Engineers.
Contrary to the classic Math vs. Coding vs. Domain Expertise Venn diagram, Data Scientists and Data Engineers share many similarities. Both love data. Both have domain expertise. Both are great functional programmers. Both are good at solving complicated mathematical problems, discrete and continuous alike. Both use many of the same tools and languages (in our case, Spark, Hadoop, Python and Scala).
However, over the past two years, as we have improved the collaboration between the two to build better machine learning services, we have found some key differences between the roles. These differences are not just based on skill set or disposition. They also include different areas of responsibility that are essential to creating fast, scalable, and accurate machine learning services.
It is easy to muddle raw data, fully deterministic derived data, and algorithmically derived data. Raw data never changes. Rules may change but are easy to manage with clean version control. However, even the same deterministic algorithm can produce different results over time (one example: whenever you refit or rebuild a model using new data, your results can change). If you are building algorithmic services, you need to keep these three kinds of data clean and separate. If not, you cannot cleanly “learn” from new data and continuously improve your services.
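To make the three layers concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than our production code: the `RawReading`, `DerivedFact`, and `ModelEstimate` types, the Fahrenheit-to-Celsius rule, and the toy "mean plus two standard deviations" model are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass(frozen=True)
class RawReading:
    """Raw data: stored exactly as received and never modified."""
    sensor_id: str
    fahrenheit: float


@dataclass(frozen=True)
class DerivedFact:
    """Deterministic output of a versioned rule applied to raw data."""
    source: RawReading
    rule_version: str
    celsius: float


@dataclass(frozen=True)
class ModelEstimate:
    """Algorithmic output, tagged with the model fit that produced it."""
    fact: DerivedFact
    model_version: str
    is_anomaly: bool


def derive(reading: RawReading) -> DerivedFact:
    # Same input and same rule version always give the same fact.
    celsius = (reading.fahrenheit - 32.0) * 5.0 / 9.0
    return DerivedFact(reading, "rules-v1", celsius)


def fit_threshold(facts: List[DerivedFact]) -> float:
    # Toy "model" (an assumption): flag values above mean + 2 std deviations.
    values = [f.celsius for f in facts]
    return mean(values) + 2.0 * pstdev(values)


def estimate(fact: DerivedFact, threshold: float, model_version: str) -> ModelEstimate:
    return ModelEstimate(fact, model_version, fact.celsius > threshold)


history = [derive(RawReading("s1", f)) for f in (32.0, 50.0, 68.0)]
new_fact = derive(RawReading("s1", 86.0))
first = estimate(new_fact, fit_threshold(history), "model-v1")

# Refit on more data: the raw reading and the derived fact are untouched,
# but the interpretation of that fact changes.
more_history = history + [new_fact, derive(RawReading("s1", 104.0))]
second = estimate(new_fact, fit_threshold(more_history), "model-v2")
```

Here `first` flags the reading as an anomaly and `second` does not, even though both point at the identical `DerivedFact`. Keeping the model version on the estimate, and the rule version on the fact, is what lets you tell afterwards which layer changed.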
We have found a very nice separation of responsibility that prevents muddling things:
- Our Data Engineers are responsible for deterministic facts
- Our Data Scientists are responsible for the interpretation of these facts
This boils down to this: deterministic rules are the purview of engineers, while algorithmic guesses come from scientists. This is a gross simplification (as both engineers and scientists deal in many, many complexities). However, this separation keeps things very clear, not only in determining “who does what” but also in preventing errors, guesses, and other unintended consequences that pollute data-driven decision-making.
Let’s take Google Now’s “Where you parked” service as an example. Data Engineers are responsible for processing the streaming sensor updates from your phone, combining them with past data, determining motion vs. at rest, and factoring out duplicate transmissions, geospatial drift, etc. Data Scientists are responsible for coming up with the algorithm to determine whether your detected stop state is a place where you parked (vs. simply being at work, at home, or at a really bad stop light). Essentially, Data Engineers capture and process the data to extract the required model features, while Data Scientists come up with the algorithm to interpret these features and provide an answer.
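The split in this example can be sketched in a few lines of Python. This is purely illustrative and is not how Google Now works: the `Ping` and `StopFeatures` types, the 0.5 m/s at-rest cutoff, and the hand-set thresholds in `is_probably_parked` are all assumptions, and the last function stands in for whatever fitted model a Data Scientist would actually supply.

```python
from dataclasses import dataclass
from typing import List, Optional

AT_REST_MPS = 0.5  # assumed speed below which the phone counts as "at rest"


@dataclass(frozen=True)
class Ping:
    timestamp: float  # seconds
    speed_mps: float  # metres per second


@dataclass(frozen=True)
class StopFeatures:
    dwell_seconds: float
    max_speed_before_mps: float


def extract_stop_features(pings: List[Ping]) -> Optional[StopFeatures]:
    """Engineering side: deterministic feature extraction."""
    # Drop duplicate transmissions (same timestamp) and order by time.
    unique = sorted({p.timestamp: p for p in pings}.values(),
                    key=lambda p: p.timestamp)
    # Walk back from the latest ping to find where the current rest began.
    start = len(unique)
    while start > 0 and unique[start - 1].speed_mps <= AT_REST_MPS:
        start -= 1
    if start == len(unique):
        return None  # no pings at all, or the phone is still moving
    dwell = unique[-1].timestamp - unique[start].timestamp
    before = max((p.speed_mps for p in unique[:start]), default=0.0)
    return StopFeatures(dwell, before)


def is_probably_parked(features: StopFeatures,
                       min_dwell_seconds: float = 120.0,
                       min_driving_mps: float = 5.0) -> bool:
    """Science side: the interpretation. A placeholder for a fitted model."""
    return (features.dwell_seconds >= min_dwell_seconds
            and features.max_speed_before_mps >= min_driving_mps)


drive = [Ping(0, 12.0), Ping(10, 12.0), Ping(10, 12.0),
         Ping(20, 0.0), Ping(200, 0.1)]
walk = [Ping(0, 1.4), Ping(10, 0.0), Ping(300, 0.0)]

drive_features = extract_stop_features(drive)
walk_features = extract_stop_features(walk)
```

The point of the design is that `extract_stop_features` returns the same features for the same pings every time, while `is_probably_parked` can be swapped for a refitted model without touching the extraction code.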
Once you have this separation down, both teams can collaborate cleanly. Data Scientists experiment with and test algorithms, while Data Engineers design how to apply them at scale, with sub-second execution. Data Scientists determine what approach is used to build models (and what triggers model optimization, building, and re-fitting). Data Engineers build a seamless implementation of this. Data Scientists build algorithm prototypes and MVPs; Data Engineers scale these into fast, reliable services. Data Scientists worry about (and define rules to exclude) outliers that would wreak havoc on F-tests; Data Engineers implement defensive programming and automated test coverage to ensure unplanned data does not wreak havoc on production operation.
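The last pairing, outlier rules versus defensive programming, is easy to confuse, so here is a small hedged sketch of the difference in Python. Both functions are hypothetical: `validate_reading` is one possible engineering guard, and `exclude_outliers` uses Tukey's interquartile-range fences as a stand-in for whatever exclusion rule the Data Scientists actually define.

```python
import math
import statistics
from typing import Any, List


def validate_reading(raw: Any) -> float:
    """Engineering guard: reject input that is not a finite number."""
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        raise ValueError(f"not a number: {raw!r}")
    value = float(raw)
    if not math.isfinite(value):
        raise ValueError(f"not finite: {raw!r}")
    return value


def exclude_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Statistical rule: keep only values inside the Tukey IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


readings = [validate_reading(r) for r in (10, 11, 12, 13, 14, 15, 16, 1000)]
kept = exclude_outliers(readings)
```

A reading of `1000` is valid input as far as the guard is concerned, so it passes `validate_reading` and is only dropped by the statistical rule. A value such as `"7"` or `NaN` never reaches the statistics at all.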