Overview¶
- Model
  - Basic articulation
  - Mathematical tools
- Cost/Loss
  - Goals
  - Options
- Optimization/Learning
  - Goals
  - Closed form
  - Stochastic gradient descent
Reflection on Learning¶
Why wouldn't we always use the closed form solution for linear regression?
Let's consider the computational complexity of the relevant linear algebra operations
- For a full list, check out this Wikipedia page
There are two operations to focus on: matrix multiplication and the calculation of a matrix's inverse
- Matrix multiplication, which we need to form $\mathbf{X}^\mathsf{T}\mathbf{X}$, is $O(n^2m)$ for $m$ examples and $n$ features
- Calculating the inverse of the resulting $n \times n$ matrix is $O(n^3)$
That means the closed-form solution is $O(n^2m + n^3)$ because of the inverse, while a single step of gradient descent is $O(n^2m)$
- One step updates $\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{2}{m} (\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{w} - \mathbf{X}^\mathsf{T}\mathbf{y})$
When we have many features, gradient descent is preferable. Additionally, with lots of data, gradient descent may converge before it has processed every example, which also speeds things up.
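To make the comparison concrete, here is a minimal NumPy sketch of both approaches on synthetic data (the shapes, learning rate, and iteration count are made-up illustration values, not anything from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m examples, n features
m, n = 1_000, 20
X = rng.normal(size=(m, n))
true_w = rng.normal(size=n)
y = X @ true_w + rng.normal(scale=0.1, size=m)

# Closed form: O(n^2 m) to form X^T X plus O(n^3) for the inverse
# (in practice we solve the linear system rather than explicitly inverting)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: each step is O(n^2 m) as written below
# (or O(nm) if we compute X @ w first instead of forming X^T X)
w = np.zeros(n)
eta = 0.1
for _ in range(1_000):
    w = w - eta * (2 / m) * (X.T @ X @ w - X.T @ y)

print(np.allclose(w_closed, w, atol=1e-6))  # both arrive at the same weights
```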
Data (or organizing our observations)¶
Maximum likelihood estimation (or how a good model makes the data probable)¶
There is a deeper connection between minimizing MSE and modeling a linear relationship
Let's think about the source of our data
In science, we propose that the universe is governed by universal laws that describe the true, unobservable nature of processes we observe
That is, for some problems, there is a true linear relationship between some variable and some output
- A running example, Ohm’s Law: The current flowing through a conductor is directly proportional to the voltage applied.
We sample data after some manipulation
- We want to measure the conductivity of a material, so we apply a voltage and measure the resultant current
- Current = Conductivity of material * Voltage
There is error in our measurement that is drawn from some Gaussian distribution
- Our instruments are good, but they can't record anything perfectly. Perhaps there is limited precision (e.g., a kitchen scale doesn't tell you a weight to the 10th decimal place), or perhaps a non-essential influence interferes (think air resistance for gravity -- in our case this could be some slight impurity in the material being evaluated)
- Current = Conductivity of material * Voltage + error
We assume this measurement error is Gaussian, so each observation is drawn from a normal distribution centered on the true measurement
- $\mathcal{N}(\textrm{current}, \sigma^2)$
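A quick simulation makes this data-generating story concrete. The conductivity and noise level below are made-up illustration values; the point is just that each measured current is the true linear response plus Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(1)

conductivity = 0.5   # hypothetical "true" value we are trying to recover
sigma = 0.05         # standard deviation of the measurement noise

voltage = np.linspace(0.0, 10.0, 50)                  # voltages we choose to apply
noise = rng.normal(loc=0.0, scale=sigma, size=voltage.shape)
current = conductivity * voltage + noise              # observed currents

# Each observed current is a draw from N(conductivity * voltage, sigma^2)
```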
We can now articulate the likelihood of observing labels given our model
$$\textrm{P}(\mathbf{y} | \mathbf{X}) = \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2)$$
- Expanding this expression yields
$$ \textrm{P}(\mathbf{y} | \mathbf{X}) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(\mathbf{y}-\mathbf{X}\mathbf{w})^2}{2\sigma^2}}$$
- What makes a model a good model of reality? It assigns a high probability to our observations. This principle is called maximum likelihood estimation. In machine learning, this asserts that a good model is one whose parameters maximize the likelihood of the entire dataset:
$$ \textrm{P}(\mathbf{y} | \mathbf{X}) = \prod_{i=1}^{m} \textrm{P}(y^{(i)} | \mathbf{x}^{(i)}) $$
- There's an issue with taking the joint probability of all of our data (i.e., the product of the likelihoods). Multiplying many numbers between 0 and 1 produces smaller and smaller values, which can underflow and cause precision errors on our computers.
- We can reformulate this using logs, where products become sums (e.g., log(A*B) = log(A) + log(B))
- Additionally, we negate the log-likelihood and frame this as a minimization problem, which is the typical framing of an optimization problem
\begin{align} -\textrm{log P}(\mathbf{y} | \mathbf{X}) &= -\sum_{i=1}^{m} \textrm{log }\textrm{P}(y^{(i)} | \mathbf{x}^{(i)}) \\ &= -\sum_{i=1}^{m} \textrm{log }\left(\frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(y^{(i)}- \mathbf{x}^{(i)}\mathbf{w})^2}{2\sigma^2}}\right)\\ &= \sum_{i=1}^{m} \textrm{log }(\sqrt{2\pi \sigma^2}) + \frac{(y^{(i)}- \mathbf{x}^{(i)}\mathbf{w})^2}{2\sigma^2} \\ &= m\,\textrm{log }(\sqrt{2\pi \sigma^2}) + \frac{1}{2\sigma^2}\sum_{i=1}^{m} (y^{(i)}- \mathbf{x}^{(i)}\mathbf{w})^2 \\ &= \alpha + \frac{1}{2\sigma^2}\sum_{i=1}^{m} (y^{(i)}- \mathbf{x}^{(i)}\mathbf{w})^2 \end{align}
- We ignore $\alpha$, the constant term that does not depend on our parameters $\mathbf{w}$.
- What remains is the sum of squared errors scaled by the positive constant $\frac{1}{2\sigma^2}$, so minimizing the negative log-likelihood is the same as minimizing MSE. We recover MSE directly from the aim of maximizing the likelihood of our data!
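As a sanity check, here is a small sketch (using hypothetical simulated data and SciPy's generic optimizer, not anything from the slides) showing both the underflow problem with raw likelihood products and that minimizing the negative log-likelihood lands on the same weight as ordinary least squares:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# Simulated data: y = 0.5 * x + Gaussian noise (illustration values only)
m, sigma = 10_000, 1.0
x = rng.uniform(0.0, 10.0, size=m)
y = 0.5 * x + rng.normal(scale=sigma, size=m)

# The raw product of per-example likelihoods underflows to 0.0
print(np.prod(norm.pdf(y, loc=0.5 * x, scale=sigma)))

# Negative log-likelihood under the model y ~ N(w * x, sigma^2)
def neg_log_likelihood(w):
    return -np.sum(norm.logpdf(y, loc=w * x, scale=sigma))

# Closed-form least-squares fit (single feature, no intercept)
w_lsq = np.sum(x * y) / np.sum(x * x)

# Minimizing the negative log-likelihood gives (numerically) the same answer
w_mle = minimize_scalar(neg_log_likelihood, bounds=(0.0, 1.0), method="bounded").x
print(w_lsq, w_mle, np.isclose(w_lsq, w_mle, atol=1e-4))
```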
Batches¶
- See the slides
Splits and k-fold¶
- See the slides
As Neural Networks (or how we pretend we have a brain)¶
- See the slides