In machine learning and statistics, a loss function measures how well a machine learning algorithm performs. Loss functions are typically minimized to improve the predictive performance of the model. One loss function that you may be familiar with is the sum of squared errors. In linear regression, the coefficients of the linear equation are chosen to minimize the sum of squared errors over the training data set.
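As a concrete illustration of the linear regression case, here is a minimal sketch using NumPy. The synthetic data, the seed, and the names `sum_squared_error` and `theta_hat` are assumptions made for this example and do not come from any particular data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: y = 2*x + 1 plus a little Gaussian noise
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# Design matrix with a column of ones so the model has an intercept
X = np.column_stack([x, np.ones_like(x)])


def sum_squared_error(theta):
    """Sum of squared residuals of the linear model X @ theta."""
    residuals = y - X @ theta
    return float(np.sum(residuals ** 2))


# Least-squares coefficients: the theta that minimizes the sum of squared errors
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("fitted coefficients:", theta_hat)
print("loss at the fit:", sum_squared_error(theta_hat))
print("loss at a perturbed fit:", sum_squared_error(theta_hat + np.array([0.1, 0.0])))
```

Any perturbation of the fitted coefficients should give a larger loss, which is what "minimized" means here.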


The sum of squared errors can be derived by minimizing the KL divergence between the true but unknown theoretical distribution and our machine learning model. The KL divergence can be thought of as a dissimilarity measure between two distributions; minimizing it maximizes the similarity between the theoretical distribution and our machine learning model.
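To get a feel for the KL divergence as a dissimilarity measure, the following sketch computes it for small discrete distributions. The function name `kl_divergence` and the three example distributions are hypothetical, chosen only for illustration.

```python
import numpy as np


def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


p = [0.5, 0.3, 0.2]          # stands in for the "true" distribution
q_close = [0.45, 0.35, 0.2]  # a model that is close to p
q_far = [0.1, 0.1, 0.8]      # a model that is far from p

print("D_KL(p || p)       =", kl_divergence(p, p))
print("D_KL(p || q_close) =", kl_divergence(p, q_close))
print("D_KL(p || q_far)   =", kl_divergence(p, q_far))
```

The divergence is zero when the two distributions are identical and grows as the model moves away from p.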


Let p(\theta) be the true but unknown theoretical distribution and q(\theta) be the machine learning model. For instance, let's assume it is a linear regression model.


Then, using the definition of the KL divergence:


 L = D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^\infty p(\theta) \, \log\frac{p(\theta)}{q(\theta)} \, {\rm d}\theta
 = \int_{-\infty}^\infty p(\theta) \, \log p(\theta) \, {\rm d}\theta - \int_{-\infty}^\infty p(\theta) \, \log q(\theta) \, {\rm d}\theta
 = \underbrace{\int_{-\infty}^\infty p(\theta) \, \log p(\theta) \, {\rm d}\theta}_{\text{Does not depend on the model $q$, so this term is a constant.}} - \int_{-\infty}^\infty p(\theta) \, \log q(\theta) \, {\rm d}\theta


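We cannot evaluate p(\theta) directly, but we can approximate the expectation under p(\theta) by summing over samples drawn from the true distribution p(\theta^{*}), which in practice are the training data. The constant factor 1/N that comes from averaging over N samples does not change the minimizer, so it is dropped: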


L \approx \text{constants} - \sum \log q(\theta)
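The sampling step can be checked numerically. The sketch below assumes, purely for illustration, that the true distribution is a standard normal and that the model is a unit-variance normal with mean `mu`; the names `log_q`, `sampled_cross_entropy` and `exact_cross_entropy` are hypothetical. It uses the sample mean rather than the plain sum, which differs only by the factor 1/N.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for this example: the "true" distribution p is N(0, 1)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)


def log_q(x, mu):
    """Log density of the model q = N(mu, 1)."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2


def sampled_cross_entropy(mu):
    """Monte Carlo estimate of -E_p[log q] using samples drawn from p."""
    return float(-np.mean(log_q(samples, mu)))


def exact_cross_entropy(mu):
    """Closed form of -E_p[log q] when p = N(0, 1) and q = N(mu, 1)."""
    return float(0.5 * np.log(2 * np.pi) + 0.5 * (1.0 + mu ** 2))


for mu in (0.0, 0.5, 1.0):
    print(mu, sampled_cross_entropy(mu), exact_cross_entropy(mu))
```

The sampled and exact values should agree closely, and both are smallest when the model mean matches the mean of the true distribution.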


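Since we are minimizing the KL divergence with respect to the model, the constants can be ignored. Writing the model explicitly as a density that depends on the data x and the parameters \theta, the equation becomes:
L = - \sum \log q(x \mid \theta)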


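Now, going back to the linear regression example, the model predicts a target y from an input x. Let's assume that q is a normal density over y with mean x \theta and variance \sigma^{2} = 1, i.e. y \sim \mathcal{N}(x \theta, 1). The negative log-likelihood is then: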


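L = - \sum \left( - \frac{1}{2} \log(2 \pi) - \frac{1}{2} \log(\sigma^{2}) - \frac{1}{2 \sigma^{2}} (y - x \theta)^{2} \right)
= \sum \left( \frac{1}{2} \log(2 \pi) + \frac{1}{2} \log(\sigma^{2}) + \frac{1}{2 \sigma^{2}} (y - x \theta)^{2} \right)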


Since our goal is to minimize the KL divergence with respect to \theta, and thereby maximize the similarity between the model and the true distribution, we can safely ignore the terms that do not depend on \theta. And since we assume a normal distribution with variance 1, we can substitute 1 for \sigma^{2}.


The loss equation then becomes:


 L = \sum \frac{1}{2} (y - x \theta)^{2}


which is the equation for the sum of squared errors, up to the constant factor \frac{1}{2}.
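The equivalence can be verified numerically. The following is a sketch under stated assumptions: the data are synthetic, the model has a single coefficient and no intercept, and the names `negative_log_likelihood`, `half_sse`, `theta_ls` and `theta_nll` are made up for this example. It checks that the Gaussian negative log-likelihood with variance 1 differs from the loss above only by a constant, so both are minimized by the same \theta.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 3*x plus Gaussian noise with variance 1
x = rng.uniform(-2.0, 2.0, size=500)
y = 3.0 * x + rng.normal(scale=1.0, size=500)


def negative_log_likelihood(theta):
    """-sum log N(y | x*theta, 1) over the data set."""
    residuals = y - x * theta
    return float(np.sum(0.5 * np.log(2 * np.pi) + 0.5 * residuals ** 2))


def half_sse(theta):
    """The derived loss: sum of 1/2 * (y - x*theta)^2."""
    return float(np.sum(0.5 * (y - x * theta) ** 2))


# Closed-form least-squares solution for one coefficient without intercept
theta_ls = float(np.sum(x * y) / np.sum(x * x))

# Minimize the negative log-likelihood by brute force over a grid of candidates
grid = np.linspace(2.0, 4.0, 2001)
theta_nll = float(grid[np.argmin([negative_log_likelihood(t) for t in grid])])

# The two losses differ by this theta-independent constant
constant = 0.5 * len(y) * np.log(2 * np.pi)

print("least-squares theta:", theta_ls)
print("NLL-minimizing theta (grid):", theta_nll)
print("NLL - loss at theta = 2.5:", negative_log_likelihood(2.5) - half_sse(2.5))
print("expected constant:", constant)
```

The grid search is deliberately naive; it is only there to show that the two objectives share a minimizer, to within the grid spacing.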