Here’s a hypothetical example to illustrate my point. An aspiring Data Scientist researches a particular problem and finds a blog post, a paper, and/or a forum post recommending a regression model trained with Stochastic Gradient Descent for that problem space. The following screenshot is an excerpt from the documentation of Python’s most excellent scikit-learn library.
NOTE – Rest assured that similar R examples exist as well (e.g., the awesome glmnet package). I only use scikit-learn here because its HTML documentation is more visually attractive ;-).
The above green boxes illustrate some of the mathematical knowledge required to use this algorithm to build the most effective model. For example:
- The Stochastic Gradient Descent algorithm – what is it and how does it work.
- Regularization – what is it and how does it work.
- The differences between L1 and L2 regularization – why a Data Scientist might want one vs. the other or a blend of both.
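To make the three items above a bit more concrete, here is a minimal sketch in plain Python of linear regression trained with Stochastic Gradient Descent and a blended L1/L2 (elastic net) penalty. It is not scikit-learn's implementation. The function name `sgd_linear_regression`, the toy data, and the learning-rate settings are all assumptions made for illustration; the `alpha` and `l1_ratio` parameter names are only loosely modeled on scikit-learn's `SGDRegressor`.

```python
import random


def sgd_linear_regression(X, y, alpha=0.0, l1_ratio=0.0, lr=0.01, epochs=50, seed=0):
    """Fit y ~ w.x + b with SGD on squared error plus an elastic net penalty.

    Penalty (hypothetical form chosen for this sketch):
        alpha * (l1_ratio * |w|_1 + 0.5 * (1 - l1_ratio) * |w|_2^2)
    l1_ratio=0 gives pure L2, l1_ratio=1 gives pure L1.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # "stochastic": visit samples in random order
        for i in indices:
            pred = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            err = pred - y[i]  # gradient of 0.5 * err^2 with respect to pred
            for j in range(n_features):
                sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|
                penalty_grad = alpha * (l1_ratio * sign + (1 - l1_ratio) * w[j])
                w[j] -= lr * (err * X[i][j] + penalty_grad)
            b -= lr * err  # the intercept is left unpenalized
    return w, b


# Toy data (an assumption for this sketch): y depends on the first feature only.
data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [3.0 * x0 + data_rng.gauss(0, 0.1) for x0, _ in X]

w_plain, _ = sgd_linear_regression(X, y)                          # no penalty
w_l2, _ = sgd_linear_regression(X, y, alpha=0.5, l1_ratio=0.0)    # L2 only
w_l1, _ = sgd_linear_regression(X, y, alpha=0.5, l1_ratio=1.0)    # L1 only

print("no penalty:", w_plain)
print("L2 penalty:", w_l2)
print("L1 penalty:", w_l1)
```

Comparing the three printed weight vectors shows how each penalty pulls the coefficients toward zero relative to the unpenalized fit. Knowing why that happens, and when it helps, is exactly the math the list above refers to.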
I believe this relatively simple example illustrates my point about math and programming. Specifically, it shows that without the required math knowledge, the Data Scientist has little hope of writing code that trains the most effective model, because every one of these settings calls for an informed choice.
For these reasons, our students learn every highlighted item above as part of our curriculum’s coverage of regression. We also teach our students the mathematics and theory for other important topics like decision trees, boosting, and recommender systems. These are also the reasons I tell the aspiring Data Scientists I mentor that they will eventually need to dust off their math textbooks.
Until next time, happy data sleuthing!