Lean + Data Science


I joined this field because of the excitement we feel upon discovering new patterns in data. With time, though, it became clear that identifying patterns is just the first half of the journey. The other half is about sharing the discovery: in the form of a presentation, a change to an existing product, or even an entirely new product. The two parts together, nurturing ideas from inception to product, make up the full stack of data science.

In this post, I'm sharing my experience of how to build machine learning models into software efficiently. Before going into the details, let's lay out some underlying assumptions:

Lean product development in data science

Lean manufacturing has a proven track record in several industries. Although it was originally developed as a methodology for the automobile industry, it has since been extended to software development as well, thanks to the brilliant work of Mary and Tom Poppendieck. The central tenet of lean is to reduce waste in the process in order to maximise productivity. The Poppendiecks identify seven types of waste in software. Here they are, anchored to data science practice.

  1. Partially done work. I consider most Jupyter notebooks prime examples. They are partially done work because you usually add neither documentation nor tests. Importantly, the waste is not in the past: it is in the future. Waste every time someone wonders what the purpose of that notebook was; waste every time someone wants to build on it and discovers that you left off with a non-linear run of the cells.
  2. Extra features. Not so long ago we were really pushing the roll-out of a feature because we wanted to prove that it works. It worked. Unfortunately, we failed to see that the doubt was not whether the feature would work from a machine learning perspective, but whether it would work on the market. Because of this misunderstanding, we wasted two weeks of effort plus a few hours of maintenance from time to time.
  3. Relearning. A straightforward illustration is the code that everyone writes from scratch every time, yet which is intricate enough that reimplementing it takes a good two hours. Think of reading the data, setting up the cross-validation and so on (see the sketch after this list).
  4. Handoffs. A little different from relearning in the sense that handoffs cannot be avoided entirely. The handover process, though, can be optimised. One prominent handover point I often see is between the prototyping and the product data teams.
  5. Task switching. You are in a hurry (which is the usual case), so you decide to work on two features at the same time. Maybe you even do this because you feel that otherwise you would spend too much time on one and have no time left for the second.
  6. Delays. Waiting for budget approval, for allocation, for QA, for a model to train, for the tests to pass and so on. Although it would be ideal if people could relax for ten minutes during these gaps, that usually does not happen. Instead, out of frustration, they start working around the delay (e.g. by beginning to implement the next feature).
  7. Defects. The loss goes NaN; a probability comes out as 1.2; the model was serialised with a different version; data is missing in production… and so on. Writing well-covered code does not mean you will never face defects, but fixing them will be significantly easier (see the validation sketch after this list).
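
For the relearning waste, one lightweight remedy is to move the boilerplate that everyone keeps rewriting into a small, tested utility module. Here is a minimal sketch assuming a scikit-learn workflow; the module and function names (shared_utils, load_dataset, make_cv) are illustrative, not an existing internal API:

```python
# shared_utils.py -- hypothetical helper module: data loading and
# cross-validation setup are written (and tested) once, not per notebook.
from pathlib import Path

import pandas as pd
from sklearn.model_selection import StratifiedKFold


def load_dataset(path: str, target_column: str):
    """Read a CSV in one agreed-upon way and split features from the target."""
    df = pd.read_csv(Path(path))
    return df.drop(columns=[target_column]), df[target_column]


def make_cv(n_splits: int = 5, seed: int = 42) -> StratifiedKFold:
    """One shared, reproducible cross-validation setup for every experiment."""
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
```

With something like this in place, a new experiment starts from `X, y = load_dataset("data/train.csv", "label")` instead of a two-hour reimplementation.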
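
For the defects waste, cheap runtime guards catch many of these problems close to their source. A minimal sketch of such checks (the function names are illustrative assumptions, not taken from any particular codebase):

```python
import math

import numpy as np


def check_training_step(loss: float) -> None:
    """Fail fast on a diverged loss instead of training for hours on NaNs."""
    if math.isnan(loss) or math.isinf(loss):
        raise ValueError(f"Loss diverged: {loss}")


def check_probabilities(probs: np.ndarray) -> None:
    """A probability of 1.2 should never reach a downstream consumer."""
    if np.any(np.isnan(probs)):
        raise ValueError("NaN probabilities in model output")
    if np.any((probs < 0.0) | (probs > 1.0)):
        raise ValueError("Probabilities outside [0, 1]")
```

Calling these after every training step and before every prediction is served costs almost nothing, and it turns a silent defect into an immediate, easy-to-locate failure.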

We can easily agree that reducing these wastes helps in providing more reliable output. The usual question is whether there is a way to overcome them in a time- and cost-efficient manner. I believe the key is embracing uncertainty and designing for it from day one.

Embracing uncertainty

Embracing uncertainty means accepting the fact that we don't know everything. We don't know whether that GAN will be better than a simple affine transformation for data generation; we don't know whether the product idea we have now is of any interest to real customers; we don't know many things. However, we can adopt good practices that help us answer these questions in the right way at the right time.

Harry called everyone to a brainstorming meeting to do some market research. The aim of the meeting was to fine-tune the scope of the company's next breakthrough product.

These good practices rest on two bases: communication and hypotheses. In a previous post of this series, I already talked about the importance of hypotheses. There it was hypotheses at the unit and system level; here, I'm going to talk about hypotheses at the business level. The most important reason products fail is not meeting a market need. What often happens with data science products is that the team keeps delaying the release because, for example, the model does not perform well in all cases. Then, when the release finally happens, suddenly no one seems to care.

The reason behind these stories is a wrong concept of product development. Before spending a year implementing a feature, we first have to make sure the feature is needed. Take the story of x.ai, an AI-driven meeting scheduler. They had ideas about how to solve the problem with machine learning, but instead of implementing it from the ground up first, they started to test the concept with the help of a human meeting scheduler. They learnt a tremendous amount about market needs, and thanks to that they now have a growing business. This was only possible by making the right choice: embracing uncertainty and reducing waste by not implementing features before it was clear whether they would bring value to customers.

The full stack of data science

After reading the TDD post, a senior colleague of mine asked how much of it is meant for data scientists, how much for the software engineers interacting with them, and how much for data engineers. This question made me think about how I see the data science practice. I understand the data science team as a special part of the software development process. Since machine learning requires in-depth knowledge of math (especially statistics), people often join the team from various non-software-development fields. That is good, since these skills are traditionally missing from software engineering studies.

Sometimes waste is hard to distinguish from expertise.

However, a machine learning model is part of the software and therefore has to be integrated seamlessly. Let's say you have built an excellent image classifier, but it does not fit into the target device's memory; the data points on the stream do not arrive in temporal order; the data has to be read very fast to provide real-time feedback; the output of the model should be displayed in a valid yet easy-to-understand way. This means a good data scientist should understand the end-to-end process as well.

Nowadays the market calls such a professional the full-stack data scientist (often referred to as the unicorn). I would add that the fastest way to become a full-stack data scientist is communication. You go to the software developer's desk and ask (it's okay to ask the same thing even twice). You learn from the DevOps colleague when he explains how he deploys models efficiently. You tell the data engineer about your new idea for model access; maybe she has already implemented that feature or knows about it. Luckily, I have several examples of colleagues who became data science unicorns by this simple recipe.

As a final thought, I'm convinced that before looking for unicorns on the job market, every company should look inside and see whether it can nurture them. The answer may be 'yes, but we don't have time for that'; nevertheless, only a culture that nurtures such talent can utilise its full potential.