Habr, hello!
There are plenty of articles on the site about forecasting sell-out for various FMCG categories, while other product segments get much less attention. Moreover, what is usually studied is retail with a horizon of a few days ahead, not a distributor forecasting two months out, even though the latter often needs a good forecast even more than a retailer does. As an additional challenge, a large share of distributors are conservative in their choice of forecasting technology. The typical mid-tier sales forecasting practice is easy to picture: sales are pulled from SAP, combined with master data in a hand-built Excel tool, and automation stops at a fairly simple method that never strays far from an average or a linear trend, with Holt-Winters as the local rocket science.
It so happened that, in the line of duty, I came across a manufacturer of an inexpensive decorative-cosmetics brand (about 500 SKUs) and saw the sad consequences of conservative forecasting in the form of low KPIs. Small changes had to be made to the planning system, including the forecasting process, and below I describe that work.
Status Quo Ante Bellum
For several years the brand's main headache had been an extremely low service level (CSS = Shipped / Ordered), well below the company average, aggravated by a forecast based on average sales (Forecast = Average sales). A vicious circle: the customer orders a volume, does not receive it in full, and comes back next time with inflated demand, while we keep planning the future only from the satisfied part of that demand.
Based on this, it was decided to forecast not sales but orders cleaned of repetitions (i.e. if a client wants 100 pieces and re-orders them every week, we assume monthly demand is 100 pieces, not 400), so from here on "sales" means exactly these deduplicated orders. Once the service level improves, the difference between the two terms disappears.
To estimate forecast quality the company uses the formula Forecast accuracy = 1 − Σ|fact − forecast| / Σ forecast, and we will use it too. An important point: the forecast here is a lag-2 forecast, i.e. when we evaluate accuracy for October, we compare October sales with the forecast made back in August. A result above 35% is considered satisfactory. It is also worth noting that, at first, we did not actually expect accuracy to grow; we expected the service level to grow, and we will judge the results primarily by the quality of shipments of goods to customers.
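The accuracy KPI is easy to express as a small helper. A minimal sketch in Python with illustrative numbers (the function name and the sample values are mine, not from the article):

```python
import numpy as np

def forecast_accuracy(actual, forecast):
    """Company KPI: 1 - sum(|actual - forecast|) / sum(forecast).

    `forecast` is the lag-2 forecast: e.g. October actuals are compared
    against the forecast produced back in August.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 1.0 - np.abs(actual - forecast).sum() / forecast.sum()

# October actuals vs. the forecast made in August (lag 2)
print(round(forecast_accuracy([100, 80, 120], [90, 100, 110]), 2))
```

Note that the metric can go negative for very bad forecasts, which is why anything above 0.35 already counts as satisfactory here.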
I didn’t have to sweat much over the training sample: although the company has no analytical DWH, it does have monthly order uploads, which a small Python loop collected and cleaned. Master data was obtained the same way. Events and promos had to be left out of the calculations because of excessive noise (we were afraid of adding more noise than useful information), since the largest retailer periodically ran promos without agreement or notice, gave only rough estimates of event volumes, and so on.
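That "small Python loop" can be sketched roughly as follows. The file layout and column names are assumptions for illustration (the article does not describe the upload format); a temporary directory with two sample CSVs stands in for the real monthly uploads:

```python
import pandas as pd
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    # Stand-in for the real uploads: one CSV per month with sku, client, qty.
    for month, qty in [("2019-01", 100), ("2019-02", 120)]:
        pd.DataFrame({"sku": ["A1"], "client": ["X"], "qty": [qty]}).to_csv(
            Path(tmp) / f"orders_{month}.csv", index=False
        )

    # The collection loop: read every monthly file, tag it with its month,
    # and stack everything into one training table.
    frames = []
    for path in sorted(Path(tmp).glob("orders_*.csv")):
        df = pd.read_csv(path)
        df["month"] = path.stem.split("_")[1]
        frames.append(df)
    orders = pd.concat(frames, ignore_index=True)

print(orders)
```

Deduplication of repeated orders (the "100 pieces, not 400" rule above) would then be a `groupby` over client, SKU and month on this table.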
As models we decided to try three families: naive methods, exponential smoothing, and machine learning.
Naive Forecasting
The main reason naive forecasting is still in use is how easy it is for humans to interpret. Indeed, if a company is used to analyzing every SKU code in detail with the 5 Whys methodology, an average-based forecast fits perfectly. The customer orders 10 pieces a month on average? Then logically he will order 10 pieces next month. The result was not what we expected and the client ordered 50 pieces? The client probably just cannot forecast / the auto-order broke / the robot went berserk, etc., so you should roll out Joint Forecasting and exchange files to relieve stress and improve forecast accuracy (a joke, of course, but with a grain of truth).
To make the methods a little more relevant, we assumed that the seasonality of each product matches the seasonality of its category (the statsmodels package was used to identify seasonality; the picture shows the three main results).

We took four methods: the mean and the median over the entire sales history (accuracy 0.32 and 0.30 respectively) and over the last 6 points (0.36 and 0.26). 0.36 becomes our benchmark: from here on we have to do better.
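The four naive baselines fit in a few lines. A sketch with invented sales numbers (note how a single promo spike of 40 drags the full-history mean up while the medians shrug it off):

```python
import numpy as np

def naive_forecasts(history):
    """Four naive one-number forecasts for the next month."""
    h = np.asarray(history, dtype=float)
    return {
        "mean_all": h.mean(),
        "median_all": np.median(h),
        "mean_last6": h[-6:].mean(),
        "median_last6": np.median(h[-6:]),
    }

# Twelve months of invented SKU sales with one promo spike (40).
history = [10, 12, 9, 11, 40, 10, 11, 12, 10, 9, 11, 10]
print(naive_forecasts(history))
```

Each forecast would then be multiplied by the category seasonal index of the target month, per the assumption above.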
ESM Forecasting
Exponential smoothing is often the ultimate dream in demand planning, and it is easy to see why. Head-on accuracy estimates show comparable results for ESM and ML, ESM is built into all the industrial forecasting systems (JDA, Oracle RDF, etc.), and the computation is faster and easier to interpret, which is why the classics remain alive and well. On the other hand, such comparisons may not be entirely fair when the ML side's features are underprepared.
For forecasting we used the same statsmodels package. To begin with, we took the Holt class over the entire sales history with category seasonality removed, and over the last 6 points (accuracy 0.34 and 0.37).
Next, we split the SKUs into two groups: products with a long history were forecast with the HoltWinters class, while for short histories we kept Holt on 6 points. The result, 0.44, was much better, and that is easy to explain: at the top level all the goods look like twins, but going down to the subcategory level the differences become visible.

ML Forecasting
The main drawback of the standard methods is the shallowness of the calculation: we do not use all the available information. Take the history of events: in the classical approach we should subtract past promo volumes (pipes) from the history, build a baseline forecast, and then add the planned pipes back according to the promo calendar. If events are recorded inaccurately, the result can be disappointing (with Holt-Winters this cost us about 5 p.p. of accuracy). There are many possible reasons; for example, category KAMs understate volumes in order to show overfulfillment and collect bonuses. If we move from the (Forecast = Baseline + Pipes) approach to a feature-based forecast, we can use an additional part of the information available to us. To do this, we compiled a list of 50 features (prices, master data, sales, customer split, etc.). From the sklearn library we took the basic Lasso / Ridge / KNN regressors, which often give a quick win, but in our case only KNN delivered, with an accuracy of 0.44. A random forest performs well on small samples; here it reached 0.48. And of course we did not forget about Xgboost, which after a little cross-validation produced the best result: 0.51. Below is a graph with the accuracy of each model.
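The model bake-off can be sketched as follows. The feature table here is synthetic (the real 50 features are not published), and sklearn's `GradientBoostingRegressor` stands in for Xgboost so the sketch has no extra dependency; the ranking on this toy data will not match the article's numbers:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real table: 50 features (prices, master
# data, lagged sales, customer split, ...) and a demand target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 50))
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 3, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

def accuracy(actual, forecast):
    # Same KPI as above: 1 - sum(|fact - forecast|) / sum(forecast)
    return 1 - np.abs(actual - forecast).sum() / forecast.sum()

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),  # Xgboost stand-in
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: {accuracy(y_te, pred):.2f}")
```

In practice the same loop would be wrapped in cross-validation with a hyperparameter search for the boosting model, which is where the extra 3 p.p. over the random forest came from.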

Concerns
What might conservatives object to in using ML for sales forecasting? Feedback, for example: a number of e-com and offline retailers use neural networks and gradient boosting and periodically scare suppliers with crazy orders. Yes, if the retailer itself measures forecast accuracy, the result may look optimistic, but that is to some extent a self-fulfilling prophecy: if there is an error, a sale / Black Friday / the chain's birthday gets switched on, and so on.
Therefore, it was important for us to show that the Xgboost result is not only more accurate but also more stable (and not random roulette, as it might seem to an unprepared planner). To do this, we compared the error distributions of HoltWinters and Xgboost and made sure that the latter has a denser center around zero and lighter tails.
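The "denser center, lighter tails" check boils down to comparing the spread and excess kurtosis of the two error samples. A numpy-only sketch on simulated errors (the real error vectors are not published; a heavy-tailed t distribution stands in for the less stable model):

```python
import numpy as np

def error_profile(errors):
    """Spread and tail weight (excess kurtosis) of forecast errors."""
    e = np.asarray(errors, dtype=float)
    z = (e - e.mean()) / e.std()
    return {"std": e.std(), "kurtosis": (z ** 4).mean() - 3}

rng = np.random.default_rng(1)
# Simulated stand-ins: heavier tails for the less stable model,
# a denser centre for the more stable one.
hw_errors = rng.standard_t(df=3, size=1000) * 5
xgb_errors = rng.normal(0, 3, size=1000)

for name, err in [("HoltWinters", hw_errors), ("Xgboost", xgb_errors)]:
    p = error_profile(err)
    print(f"{name}: std={p['std']:.2f}, excess kurtosis={p['kurtosis']:.2f}")
```

A distribution with excess kurtosis near zero and a smaller std is exactly the "not roulette" argument: large surprise orders become rare, not just rarer on average.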


Intermediate Results
Since the article was started / the first forecast run, two more decorative-cosmetics brands have been switched to Xgboost thanks to the positive results. By the end of November the average gain in service level was +16%, i.e. the brand reached the company average.