Small Business Software Reviews, Services Insight and Resources

Data Wrangling for Higher Prediction Accuracy


Imagine you’re forecasting sales for a retailer with limited data (which is, sadly, quite often the case), and you need to extract maximum value from that data to give the retailer the best possible advice.

They have the following data at the level of transactions:


  • SKU identifier
  • SKU long name
  • Quantity sold
  • Shelf price
  • Check price
  • Channel
  • Stock

That’s pretty much it.

Enough for a high-accuracy predictive model? Nope.

So let’s tackle this issue by enriching the data and see how much accuracy we can gain.

Step 1. Seasonality. This is usually the first thing standing between you and catching the trend. You need to take into account how sales differ by season (high vs low, local festivities, etc.). In our opinion, weeks are your best friends when you are trying to catch seasonality patterns.
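Here is a minimal sketch of what a weekly seasonality feature could look like in Python. It assumes each transaction carries a date (not in the field list above), and the function names are made up for illustration; this is not the production code.

```python
from collections import defaultdict
from datetime import date


def iso_week(d: date) -> int:
    """Return the ISO week number (1-53) of a date."""
    return d.isocalendar()[1]


def weekly_seasonality_index(rows):
    """rows: iterable of (date, quantity_sold) tuples (assumed layout).

    Returns {week_number: average weekly sales for that week number
    divided by the average over all observed weeks}.
    """
    totals = defaultdict(float)  # (iso_year, iso_week) -> units sold
    for d, qty in rows:
        iso_year, week, _ = d.isocalendar()
        totals[(iso_year, week)] += qty

    by_week = defaultdict(list)  # week number -> totals across years
    for (_, week), total in totals.items():
        by_week[week].append(total)

    overall = sum(totals.values()) / len(totals)
    return {week: (sum(v) / len(v)) / overall for week, v in by_week.items()}
```

A value above 1 marks a week that tends to sell more than average, below 1 a slower one. The week number itself can also go into the model as a categorical feature.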

Step 2. Getting the most out of the SKU’s long name. It is essential for a model to understand how products come together to form groups that behave homogeneously, a.k.a. clusters.

Rather than wrestling with the curse of dimensionality and trying to guess the optimal number of clusters, we let the price optimization model decide which product groupings work best. We also need to acknowledge that the retailer in question has gone through piles of heuristics to end up with product classifications that work in the market. We were able to extract five different pieces of information from the long name with simple pattern matching (thanks for this, Python’s ‘re’ library!), so we introduce five categorical variables, each offering a different product segmentation.
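To give a flavour of this kind of parsing, here is a small sketch with Python’s ‘re’ module. The long-name layout and the five field names (brand, category, variant, size, unit) are purely hypothetical; the retailer’s real naming scheme and the fields we actually extracted are not shown here.

```python
import re

# Hypothetical long-name layout: "<brand> <category> <variant> <size><unit>"
LONG_NAME = re.compile(
    r"^(?P<brand>\S+)\s+(?P<category>\S+)\s+(?P<variant>.+?)\s+"
    r"(?P<size>\d+(?:\.\d+)?)\s*(?P<unit>ml|l|kg|g)$",
    re.IGNORECASE,
)


def parse_long_name(long_name):
    """Return the five parsed fields as a dict, or None if nothing matches."""
    match = LONG_NAME.match(long_name.strip())
    return match.groupdict() if match else None
```

Each returned field can then serve as one categorical variable. Names that do not match come back as None, so they can be reviewed by hand rather than silently mislabeled.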

Step 3. Checking the availability of products. Rather than using stock as is, which can introduce a lot of noise into the model, we recode it to flag whether the product had any availability issues during the observation period. Hence the binary variable.
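The recoding can be sketched in a few lines. The (sku, stock_level) layout and the zero threshold are assumptions for illustration; what counts as an "availability issue" depends on the retailer.

```python
def availability_flags(rows, threshold=0):
    """rows: iterable of (sku, stock_level) observations (assumed layout).

    Returns {sku: 1 if stock was ever at or below the threshold, else 0}.
    """
    flags = {}
    for sku, stock in rows:
        hit = int(stock <= threshold)
        # Once a SKU has been flagged, it stays flagged.
        flags[sku] = max(flags.get(sku, 0), hit)
    return flags
```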

Step 4. Making the price numbers work. Price is arguably the go-to factor when it comes to boosting sales (or making them bust). It influences a shopper’s moods and decisions in numerous ways. First of all, the shelf price itself forms the first impression, and that impression lasts.

Another thing is the attractiveness of the discount one can get. Finally, the perceived relative affordability of prices is also vital. To take all of this into account, we have created a plethora of features out of just the two price variables shared by the customer. As mentioned, we check how shelf prices work, along with the discount percent we calculate, and, to crown it all, we introduce four price indexes vs the extracted segments and one vs the total assortment. They are all highly dynamic and highly impactful features, and dealing with them correctly is essential. Luckily, here at Competera, we know a thing or two about how prices work, so it wasn’t a significant issue for us.
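As a rough illustration of these features, the sketch below computes a discount percent and a simple price index. It assumes the check price is what the shopper actually paid and defines the index as a SKU’s price divided by the mean price of its segment. Both are simplifying assumptions, and the dict keys are hypothetical.

```python
from collections import defaultdict


def discount_pct(shelf_price, check_price):
    """Discount as a percentage of the shelf price."""
    if shelf_price <= 0:
        return 0.0
    return 100.0 * (shelf_price - check_price) / shelf_price


def price_index(products, segment_key=None):
    """Price of each SKU relative to the mean price of its segment.

    products: list of dicts with 'sku', 'price' and, optionally, segment_key.
    segment_key=None compares every SKU with the total assortment.
    """
    def segment_of(p):
        return p[segment_key] if segment_key else "all"

    groups = defaultdict(list)
    for p in products:
        groups[segment_of(p)].append(p["price"])
    means = {seg: sum(v) / len(v) for seg, v in groups.items()}
    return {p["sku"]: p["price"] / means[segment_of(p)] for p in products}
```

Calling price_index once per parsed segment and once with segment_key=None gives one index per segmentation plus one against the whole assortment. An index above 1 means the SKU is priced above its peers.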

Step 5. Adding the macroeconomic dimension. The truth of life is that the result is the sum of your effort and the impact of the environment, so we need to take the latter into account. For a consumer goods retailer, it always makes sense to check whether the environment is your friend, so we have calculated the real income index to capture the external long-term sales trends. It can also be useful to check total country retail sales, the birth rate in some cases, etc. In our case, real income was the right factor to choose.
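One common way to build such an index, sketched here under our own assumptions, is to deflate nominal income by a consumer price index and rescale to a base period. The inputs (per-period nominal income and CPI series) and the function name are illustrative; the exact sources and formula used in the project are not described here.

```python
def real_income_index(nominal_income, cpi, base=0):
    """Real income index with the base period set to 100.

    nominal_income, cpi: equal-length lists, one value per period.
    Real income is nominal income deflated by the price index.
    """
    if len(nominal_income) != len(cpi):
        raise ValueError("series must have the same length")
    real = [n / c for n, c in zip(nominal_income, cpi)]
    return [100.0 * r / real[base] for r in real]
```

If income and prices both rise by 10%, the index stays at 100: shoppers are no better off in real terms.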

Step 6. The burden of history. Another bit of wisdom is that there is a time lag between the thought of buying something and the actual purchase. Also, if a typical shopper buys a month’s worth of something, they will not come back to the category for the next four weeks. On top of that, different goods and brands have different velocities simply because of historical awareness and average purchase frequency. All these recent behaviours need to be integrated into our calculations, so we take the sales lag that correlates most strongly with current sales to make the model’s predictions sharper.
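The lag selection can be sketched as follows. It assumes a plain list of sales per period and uses the signed Pearson correlation; the max_lag parameter and the function names are hypothetical, and a real pipeline would likely pick the lag per SKU or per segment.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no signal
    return cov / sqrt(vx * vy)


def best_lag(sales, max_lag):
    """Return (lag, correlation) for the lag in 1..max_lag whose lagged
    series correlates most strongly with current sales."""
    best = None
    for lag in range(1, max_lag + 1):
        current, lagged = sales[lag:], sales[:-lag]
        if len(current) < 3:
            break  # too few overlapping points to trust
        r = pearson(current, lagged)
        if best is None or r > best[1]:
            best = (lag, r)
    return best


def lagged_feature(sales, lag):
    """Sales shifted by `lag` periods; the first `lag` values are unknown."""
    return [None] * lag + sales[:-lag]
```

For a series that repeats every four weeks, best_lag picks a lag of 4, which matches the "month’s worth" shopper described above.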

There are many other steps to make sure we don’t insert junk, of course, but that is for another time.

To keep the story short, this is what we get: the accuracy of the model’s sales predictions has grown from 65% on the data as is to 97% on the feature-engineered dataset.

Not too shabby, right?