Following our Sandbox Session with David Hoyle last May, To buy or not to buy, that is the question: Predicting what shoppers will buy in grocery retail, the community submitted more questions than we could cover live. So, David has kindly taken the time to answer them here!

Keep reading to see if your question from this Sandbox Session has been answered.

Need a memory refresh? Click here to watch this session back.

1. How did you calculate the adjacency graph in your model?

The adjacency graph is a graph (network) that tells us which products we expect to cannibalize sales from other products. Cannibalization predominantly happens between similar products. For example, you may choose between two similar brands of tinned tomatoes. The tins of tomatoes are very similar to each other and differ mainly in price, branding, and marketing. If you choose brand A over brand B, then brand B has lost sales to brand A. In other words, brand A has cannibalized sales from brand B. We can construct this cannibalization graph (network) by calculating a product-to-product similarity measure for each product pairing, say across a whole category such as dairy, and then running community detection on that graph to find the subgroups of products that we think will be strong substitutes for each other. There are many different product-to-product similarity measures and many different community detection algorithms one can use. The precise details of what we do are, of course, part of dunnhumby’s intellectual property (IP), so I can’t go into them here.
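
To make the general idea concrete, here is a minimal sketch (not dunnhumby’s actual method): compute a similarity score for every product pairing, keep the strongest links as edges of a graph, and run an off-the-shelf community detection algorithm. The feature vectors, cosine similarity, threshold, and the networkx greedy modularity algorithm are all illustrative choices.

```python
# Illustrative sketch: build a product-to-product similarity graph and find
# groups of likely substitutes with community detection.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical product feature vectors (e.g. attribute or co-purchase profiles),
# one row per product in the category.
products = ["tinned_toms_A", "tinned_toms_B", "yoghurt_A", "yoghurt_B"]
X = np.array([
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.8],
    [0.1, 0.0, 0.8, 1.0],
])

# Cosine similarity for every product pairing.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T

# Keep only sufficiently similar pairs as weighted edges.
G = nx.Graph()
G.add_nodes_from(products)
threshold = 0.5
for i in range(len(products)):
    for j in range(i + 1, len(products)):
        if sim[i, j] > threshold:
            G.add_edge(products[i], products[j], weight=sim[i, j])

# Each detected community is a candidate group of strong substitutes.
for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))
```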

 

2. You said that the results of the model are passed into a transformation function – can you expand on how and why?

As with many machine learning and statistical models, the scale on which we use the predictive features to calculate a linear predictor is not necessarily the scale of the observed data, and so we have to use a non-linear transformation to map between the linear predictor and the expected sales. In a classical statistical model, this is what an (inverse) link function does. In a machine learning model such as a neural network, this is what the transfer (activation) function of the nodes does, i.e. it transforms a linear combination of features from the preceding layer into the output of a node in the current layer.
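
As a simple generic illustration (a GLM-style log link, not the specific transformation used in dunnhumby’s models), the linear predictor lives on an unconstrained scale and the inverse link maps it back to positive expected sales:

```python
# Illustrative log link: exp() is the inverse link that maps the linear
# predictor back onto the (positive) scale of expected sales.
import numpy as np

beta = np.array([2.0, -1.5, 0.6])              # intercept, price and promo coefficients (made up)
features = np.array([1.0, np.log(1.99), 1.0])  # 1, log(price), promotion flag

linear_predictor = features @ beta             # unconstrained, can be negative
expected_sales = np.exp(linear_predictor)      # inverse link guarantees a positive expectation
print(expected_sales)
```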

 

3. You mentioned you use a probabilistic programming language at dunnhumby – what is it, or can you provide a little more explanation?

A Probabilistic Programming Language (PPL) is a domain-specific language focused on allowing the user to create and use probabilistic models. With a PPL I can create a probabilistic model in a declarative programming style, i.e. in the PPL syntax I just state what kind of model I want, in much the same way I would write that model mathematically on paper. The PPL then takes that model and automatically works out the gradient of the likelihood with respect to the model parameters for me – I don’t have to write any more code. This means I can focus on experimenting with the mathematical form of the model and not get bogged down writing new optimization code to do the model training. A PPL allows me to focus on the science of building a probabilistic model by freeing me from a lot of the coding associated with it. There are many PPLs, e.g. Stan, PyMC, NumPyro. At dunnhumby we have used Stan. You can read more about how we have used the Stan PPL in these two blog posts:

https://medium.com/dunnhumby-data-science-engineering/what-are-probabilistic-programming-languages-and-why-they-might-be-useful-for-you-a4fe30c4d409

https://medium.com/dunnhumby-data-science-engineering/how-we-have-used-probabilistic-programming-languages-at-dunnhumby-18454c0802ba 
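
For a flavour of the declarative style, here is a minimal sketch in PyMC (one of the PPLs mentioned above). The model itself – a toy Poisson regression of sales on log-price and a promotion flag, with simulated data – is invented purely for illustration:

```python
# A toy demand model in PyMC: you state the model; the PPL works out the
# gradients and runs the sampler for you.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
log_price = np.log(rng.uniform(0.5, 3.0, size=200))
promo = rng.integers(0, 2, size=200)
sales = rng.poisson(np.exp(2.0 - 1.2 * log_price + 0.5 * promo))

with pm.Model() as demand_model:
    intercept = pm.Normal("intercept", mu=0.0, sigma=5.0)
    beta_price = pm.Normal("beta_price", mu=0.0, sigma=2.0)
    beta_promo = pm.Normal("beta_promo", mu=0.0, sigma=2.0)

    mu = pm.math.exp(intercept + beta_price * log_price + beta_promo * promo)
    pm.Poisson("sales", mu=mu, observed=sales)

    # Gradients are derived automatically and used by the NUTS sampler.
    idata = pm.sample(1000, tune=1000, chains=2)
```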

 

4. So in summary, simple statistical models like SARIMAX are more often used than global deep learning models? Is my understanding correct?

Yes, in general. There are two main reasons for this:

  1. Interpretability – demand models can have many applications and often we need the model predictions to be transparent and the model parameters to be interpretable. This favours simpler models over more complex models. 
  2. When putting any model into production you have to balance model complexity against how robust that model needs to be in a real-world operational pipeline, where data may be noisier and messier than the data you used to develop the model. Often we don’t have the luxury of a “human in the loop” in a large-scale operational process, and so the models must not break even though the data and other inputs may be very different and more challenging than the data used in the research phase of a project. Again, simpler models tend to be more robust in real-world operational settings.

As with anything, this is not a hard and fast rule. There will be situations where a more complex model, such as a deep learning neural network that can capture much more complex patterns of variation, is justified, perhaps because you have a situation where you know you will be able to control the operational settings very well.
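
For reference, fitting a “simple” model of the SARIMAX kind is only a few lines with statsmodels. The orders, the weekly seasonal period, and the exogenous price/promotion features below are illustrative placeholders, not a recommended specification:

```python
# Illustrative SARIMAX fit with weekly seasonality and exogenous price/promo features.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(1)
n = 156  # three years of weekly data (simulated)
exog = pd.DataFrame({
    "log_price": np.log(rng.uniform(0.8, 1.5, n)),
    "promo": rng.integers(0, 2, n),
})
sales = 50 - 20 * exog["log_price"] + 15 * exog["promo"] + rng.normal(0, 3, n)

model = SARIMAX(sales, exog=exog, order=(1, 0, 1), seasonal_order=(1, 0, 0, 52))
result = model.fit(disp=False)

# Forecast the next 4 weeks, given planned prices/promotions for those weeks.
future_exog = exog.iloc[-4:].reset_index(drop=True)
print(result.forecast(steps=4, exog=future_exog))
```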

 

5. How do you deal with the promotion data (feeding the models), which is challenging to capture and keep up to date with current promotions?

The historical promotion information is supplied to us by the supermarkets, along with the historical price and sales volume data. This means we know when each product was promoted and how it was promoted, e.g. was it marketed as a “Buy one get one free” offer with a bright yellow sign on the edge of the shelf in the supermarket. When we onboard a new client to use our software, our Data Engineering teams spend a significant amount of time setting up the promotional data feeds and ensuring they are of sufficient quality. For example, we can check whether there is a corresponding increase in sales (seen in the sales data) when the retailer says the product was promoted.
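
As a generic illustration of that kind of check (not the actual pipeline), you can compare average sales in promoted versus non-promoted weeks for each product and flag products where the recorded promotions show no visible uplift. The column names and uplift threshold here are assumptions:

```python
# Illustrative data-quality check: does the promotion flag line up with a sales uplift?
import pandas as pd

df = pd.DataFrame({
    "product_id": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "week":       [1, 2, 3, 4, 1, 2, 3, 4],
    "on_promo":   [0, 0, 1, 1, 0, 1, 0, 1],
    "units_sold": [10, 12, 25, 27, 8, 8, 7, 7],
})

uplift = (
    df.groupby(["product_id", "on_promo"])["units_sold"].mean()
      .unstack("on_promo")
      .rename(columns={0: "baseline", 1: "promoted"})
)
uplift["ratio"] = uplift["promoted"] / uplift["baseline"]

# Flag products whose recorded promotions produce no visible uplift in the sales data.
print(uplift[uplift["ratio"] < 1.1])
```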

 

6. What are the more complex models?

Complex models can range from Bayesian hierarchical multinomial regression models through to deep learning neural network models. The level of complexity you choose should reflect the level of detail you want to capture in shoppers’ responses. For example, if you only want to model broad, simple seasonal patterns along with responses to price changes and promotions, then a relatively simple model form will suffice.

 

7. What are the methods you used to validate and check the accuracy of your model?

As with any forecasting model we can check the accuracy of the model by making predictions on a hold-out sample and comparing the predictions to the actuals. With forecasting models it is essential that the hold-out sample is always in the future ahead of the last time point in the training data, i.e. we make predictions over a forward horizon.
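
A minimal sketch of that kind of check (generic, with made-up data and column names): split the series in time, never at random, and score predictions over the forward horizon. The naive baseline below simply stands in for the real model:

```python
# Illustrative forward-horizon evaluation: the hold-out weeks come strictly after
# the training weeks, and accuracy is scored with MAPE.
import numpy as np
import pandas as pd

history = pd.DataFrame({
    "week": pd.date_range("2023-01-02", periods=60, freq="W-MON"),
    "units_sold": np.random.default_rng(2).poisson(20, 60),
})

horizon = 8
train = history.iloc[:-horizon]      # everything up to the cut-off
holdout = history.iloc[-horizon:]    # strictly in the future of the training data

# Placeholder forecast: a simple recent-mean baseline stands in for the real model.
forecast = np.full(horizon, train["units_sold"].tail(13).mean())

mape = np.mean(np.abs(holdout["units_sold"] - forecast) / holdout["units_sold"]) * 100
print(f"MAPE over the {horizon}-week horizon: {mape:.1f}%")
```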

There are also other checks we can do on the models to assess their quality and validity. For example, we can check whether the direct elasticities estimated are of the correct sign that we would expect for a consumption good, i.e. they should be negative. We can also check that any seasonal patterns make sense, e.g. an estimated (modelled) seasonality profile for an ice-cream product that had its peak in winter months might cause suspicions that the modelling had failed. When onboarding new clients we do an extensive manual check on the quality of the models being produced by our software and make any necessary tweaks to the software settings. After that we will continually monitor, using specific metrics we have designed, both the data quality coming into the models and the quality of the models being produced.     
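
Those validity checks can be as simple as a few rules applied to the fitted quantities. A generic sketch, with hypothetical field names and values:

```python
# Illustrative sanity checks on fitted demand models; the dictionary of fitted
# quantities and its field names are hypothetical.
fitted_models = {
    "ice_cream_500ml": {"price_elasticity": -1.8, "seasonal_peak_month": 7},
    "tinned_toms_400g": {"price_elasticity": 0.3, "seasonal_peak_month": 11},
}

for product, m in fitted_models.items():
    if m["price_elasticity"] >= 0:
        print(f"{product}: suspicious non-negative direct elasticity "
              f"({m['price_elasticity']:.2f}) for a consumption good")
    if "ice_cream" in product and m["seasonal_peak_month"] in (12, 1, 2):
        print(f"{product}: seasonal peak in winter looks implausible")
```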

 

8. Do you use shelf location/store location as a feature in the modelling? Also how do you identify a promo period when the data quality of the promo tool used is poor?

  1. Yes, we use shelf location as a feature in the modelling. Store location can be used as a feature, although it is more common to group stores in a similar geographical location together for modelling purposes. In the USA, these store zones can be fairly granular, e.g. a large retailer may have 10-50 of them. In the UK and Europe, the store zones are typically larger.
  2. One approach to identifying promotional periods when the quality of the promotional data is poor is to look for periods of significantly increased sales in the sales data. A sharp, non-seasonal increase in sales of a product will often be due to a drop in the price of the product, i.e. the product was promoted (see the sketch below).
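
Here is a minimal sketch of that second idea, using a rolling median as a crude baseline; the simulated data, window length, and spike threshold are arbitrary choices for illustration:

```python
# Illustrative spike detection as a proxy for missing promotion flags:
# compare weekly sales to a rolling-median baseline and flag sharp increases.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
sales = pd.Series(rng.poisson(20, 52).astype(float))
sales.iloc[[10, 11, 30]] *= 2.5   # inject three promotion-like spikes

baseline = sales.rolling(window=9, center=True, min_periods=5).median()
spike_ratio = sales / baseline

inferred_promo_weeks = spike_ratio[spike_ratio > 1.5].index.tolist()
print(inferred_promo_weeks)
```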

 

9. Any tips for predicting demand for products that come in limited editions?

Modelling limited editions can be tricky, as there is an implication that the product is being bought to be kept, and so it is not a straightforward consumption good. The product may be a luxury item and so has been bought to indicate or signal the owner’s wealth and status. Such luxury goods are often called ‘Veblen’ goods after the economist Thorstein Veblen. Luxury or Veblen goods will have a positive direct price elasticity, meaning the more expensive they are, the more desirable they become. Marketing and brand features can also have a bigger impact than when modelling ordinary supermarket consumption goods such as tinned tomatoes. My recommendation would be to use your domain knowledge of what makes someone want to own that limited edition item, e.g. is it the particular brand, or the year of manufacture, and use features that reflect those drivers of demand in your model.

 

10. If the forecast is done for a category of products and something goes wrong in the forecast, how do you find which product went off?

Our software works at the category-level, which means we use information from all the products in a category to help estimate the parameters in each model. However, we have a model for each product in the category, so our forecasts are at the product level. This means we can easily see which product is producing an ‘off’ forecast if the forecasted category total has gone wrong in some way.
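
To illustrate with made-up numbers: since the category forecast is just the sum of the product-level forecasts, an ‘off’ category total can be traced back by ranking products by their contribution to the error:

```python
# Illustrative: decompose a category-level forecast error into product-level errors.
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "forecast": [120.0, 80.0, 60.0, 40.0],
    "actual":   [118.0, 82.0, 25.0, 41.0],
})
df["error"] = df["forecast"] - df["actual"]

print("Category error:", df["error"].sum())
# Products with the largest absolute error come first.
print(df.sort_values("error", key=lambda s: s.abs(), ascending=False))
```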

 

11. Has the move to online shopping resulted in fewer factors in the Demand Models (you perhaps don’t need to worry so much about product placement in the aisle) or has it introduced other factors?

Modelling online sales has its own challenges compared to modelling in-store sales. For example, in-store I know what marketing (promotion details) you are exposed to because it is largely controlled by the supermarket – the supermarket determines what promotional labels get put on the shelves and what other promotional signs get displayed. Consequently, I can take those promotional details into account in the demand model I build. In an online setting, I don’t know what other rival promotional offers you are looking at, or have looked at recently, when you are deciding to make an online purchase. Consequently, it is less clear what the relevant promotion features are that I should include in my demand model and what the values of those promotional features should be. Another important factor affecting the modelling of online sales is the ease with which shoppers can compare product offerings across different retailers.

 

12. With the pandemic affecting some businesses’ historicals, did you see this in some of your product categories? How did you consider these? Was it better to remove them since they added noise?

Yes, like all grocery retailers, our clients saw an impact on sales volume due to the Covid-19 pandemic. Fortunately, our demand modelling software already has mechanisms built into it to identify shocks in the sales and adjust the models appropriately. Again, the details of those algorithms are part of dunnhumby IP, so I can’t go into them here. Whilst sales data from the pandemic has been challenging to model fully, it has not been as challenging for us as it might have been had we not already had these adaptive algorithms in our modelling software.

 

13. It seems to me that there is no silver bullet for demand forecasting in retail. Some products require simple model families, and other products are almost impossible to handle even with complex models. How much human intervention (manual model selection) do you need to choose the suitable model families? Do you think it is possible to create a system that chooses the most suitable model families without human intervention (model selection by an analyst)?

I agree, no single model form/family will be optimal for modelling demand of all products in a large retailer. However, for grocery retailers it is possible to use a single model form by incorporating a great deal of flexibility into the details of how the features are used and how the model fitting is done. For example, our demand modelling software allows a user to specify which promotional variables are included, how many basis functions should be included when modelling the seasonal component of a product’s sales pattern, how stringent the detection of sales shocks should be, and so on. In all, we have over 100 different model settings and options that we can set to optimise the demand model building process for each client. Selecting and setting these modelling options is something we do when we first onboard a client and when we do a first manual review of the models that our software produces for that client.
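
As one concrete example of what such a setting can control (a generic illustration, not dunnhumby’s implementation), the number of seasonal basis functions might correspond to how many Fourier harmonics are used to represent the yearly pattern:

```python
# Illustrative Fourier basis for a weekly seasonal component: more harmonics allow
# a more detailed seasonal shape, at the cost of flexibility versus robustness.
import numpy as np

def seasonal_basis(weeks: np.ndarray, n_harmonics: int, period: float = 52.0) -> np.ndarray:
    """Return a (len(weeks), 2 * n_harmonics) design matrix of sine/cosine terms."""
    columns = []
    for k in range(1, n_harmonics + 1):
        columns.append(np.sin(2 * np.pi * k * weeks / period))
        columns.append(np.cos(2 * np.pi * k * weeks / period))
    return np.column_stack(columns)

weeks = np.arange(104)                              # two years of weekly data
X_simple = seasonal_basis(weeks, n_harmonics=1)     # broad annual cycle only
X_detailed = seasonal_basis(weeks, n_harmonics=4)   # sharper seasonal features
print(X_simple.shape, X_detailed.shape)             # (104, 2) (104, 8)
```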

Is it possible to automate the model family selection process? Yes. Will it lead to better quality models? Not necessarily. Is it necessary? No. It is often better to use a single simple model form that reflects the shopper process, i.e. how shoppers make choices, and then add flexibility to that model by having different options about which features are included or how the model fitting is done.