Statistical learning theory is concerned with quantifying the amount of data needed to reach a given prediction accuracy. There are differences between statistics and machine learning, even though the two fields share common goals.
Differences between statistics and machine learning
Indeed, both try to use data to improve decisions. While the two fields have evolved in the same direction and now share many aspects, they started out quite different. Statistics existed long before machine learning and was already a fully developed scientific discipline by the 1920s, thanks in particular to the contributions of R. Fisher, who popularized maximum likelihood estimation (MLE) as a systematic tool for statistical inference.
However, MLE essentially requires working knowledge of the probability distribution from which the data are drawn, up to an unknown parameter of interest. Often, the unknown parameter has a physical meaning, and estimating it is essential to better understand some phenomenon. Applying MLE therefore requires knowing a lot about the data-generating process: this is known as modeling. Modeling can be driven by physics or by prior knowledge of the problem. In any case, it requires substantial domain knowledge.
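As a minimal sketch (not from the text), the modeling step can be made concrete: assume the data are i.i.d. draws from a Gaussian with unit variance and unknown mean — that assumption is exactly the "working knowledge" MLE needs. Under it, the MLE of the mean has a closed form, the sample mean:

```python
import math
import random

random.seed(0)

# Modeling assumption: data are i.i.d. Gaussian with unknown mean mu
# and known unit variance. The true mean (3.0) is hidden from the estimator.
data = [random.gauss(3.0, 1.0) for _ in range(1000)]

def log_likelihood(mu, xs):
    """Gaussian log-likelihood with unit variance, up to an additive constant."""
    return -0.5 * sum((x - mu) ** 2 for x in xs)

# For this model the MLE has a closed form: the sample mean.
mle = sum(data) / len(data)

# Sanity check: the log-likelihood at the MLE beats nearby candidate values.
assert log_likelihood(mle, data) >= log_likelihood(mle + 0.1, data)
assert log_likelihood(mle, data) >= log_likelihood(mle - 0.1, data)
print(round(mle, 2))
```

If the Gaussian assumption were wrong, the estimator would still run but its interpretation as a maximum-likelihood estimate would no longer hold — which is precisely why modeling demands domain knowledge.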
Introducing the new types of data
More recently (examples date back to the 1960s), new kinds of data sets (demographic, social, medical) have become available. Modeling such data, however, is much more hazardous because we do not understand the input/output mechanism, which calls for a distribution-free approach. A typical example is image classification, where the goal is to label an image directly from its digitized pixels. Understanding what makes a picture a cat or a dog, for example, is a very complicated process.
However, for the classification task we do not need to understand the labeling process, only to reproduce it. In this sense, machine learning favors a black-box approach.
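The black-box attitude can be illustrated with a toy classifier (a hypothetical example, not from the text): a nearest-neighbor rule copies the label of the closest training example without any model of why points are labeled "cat" or "dog":

```python
import math

# Hypothetical toy data: points in the plane with "cat"/"dog" labels.
# No attempt is made to model the process that produced these labels.
train = [((0.0, 0.0), "cat"), ((0.1, 0.2), "cat"),
         ((1.0, 1.0), "dog"), ((0.9, 1.1), "dog")]

def predict(x):
    """Reproduce the labeling by copying the nearest training example's label."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], x))
    return label

print(predict((0.05, 0.1)))  # a point near the "cat" cluster
print(predict((1.05, 0.9)))  # a point near the "dog" cluster
```

The rule predicts labels purely by proximity to past examples — it reproduces the labeling without ever explaining it, which is the black-box stance described above.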
These differences between statistics and machine learning have narrowed in recent decades. On the one hand, statistics is increasingly concerned with finite-sample analysis, model misspecification, and computational considerations. On the other hand, probabilistic modeling is now inherent in machine learning. At the intersection of the two fields lies statistical learning theory, which deals primarily with questions of sample complexity.
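A sample-complexity question can be made concrete with a standard textbook bound (an illustration of the kind of result the field produces, not a claim from the text): by Hoeffding's inequality and a union bound over a finite hypothesis class of size |H|, n ≥ ln(2|H|/δ) / (2ε²) samples suffice for every hypothesis's empirical error to be within ε of its true error with probability at least 1 − δ:

```python
import math

def sample_complexity(n_hypotheses, epsilon, delta):
    """Smallest n guaranteed by Hoeffding's inequality plus a union bound
    over a finite hypothesis class: with probability at least 1 - delta,
    every hypothesis's empirical error is within epsilon of its true error."""
    return math.ceil(math.log(2 * n_hypotheses / delta) / (2 * epsilon ** 2))

# Example: 1000 hypotheses, accuracy 0.05, confidence 0.95.
print(sample_complexity(1000, 0.05, 0.05))
```

Note how the bound grows only logarithmically in the number of hypotheses but quadratically in 1/ε — halving the target error roughly quadruples the data required.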