High-Dimensional Learning

Context

  1. Basics of Statistical Learning: the decomposition of error

  2. The Curse of Dimensionality: adversarial phenomena that emerge as the input space becomes high-dimensional

  3. Addressing the curse: signals

Statistical Learning

Data distribution

Model

Synonyms: hypothesis class, function approximation

  • Neural Networks

Complexity measure: Some type of norm or quantity that you can evaluate on a hypothesis, meant to divide your hypotheses into those that are simple and those that are complicated. E.g.: how many neurons are in the neural network?

Sobolev Norm
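For instance (standard definition, stated here for concreteness and up to normalization conventions), the Sobolev norm of order $s$ can be written through the Fourier transform as

$$
\|f\|_{H^s}^2 \;=\; \int_{\mathbb{R}^d} \big(1 + \|\omega\|^2\big)^{s}\, \big|\hat f(\omega)\big|^2 \, d\omega,
$$

so a larger $s$ penalizes high-frequency, less smooth hypotheses more heavily.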

Learner: Tries to find a hypothesis with low complexity.

Error Metric

Idea: Notion of comparing or measuring errors quantitatively.

Loss function: For example, measuring the squared distance (mean squared error).

Energy functions.

Point-wise measure.

Notion of average:

  1. Population average: the expectation over the data distribution of this point-wise measure. It tells us how well we are going to do on unseen data.

  2. Empirical average: commonly called the training loss.

Conclusion: We are going to try to reduce the population error, but we can only work with the empirical error. How do these two notions of average relate to each other?
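In symbols (standard notation, with loss $\ell$, data distribution $P$ and samples $(x_i, y_i)$):

$$
R(f) \;=\; \mathbb{E}_{(x, y) \sim P}\big[\ell(f(x), y)\big],
\qquad
\hat R_n(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).
$$

We care about $R$, but we can only compute $\hat R_n$.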

Training and test errors can't be related point-wise.

Do not see the error at a fixed hypothesis as a single random variable; rather, view the empirical error as a random function of the hypothesis. Training and test errors, as objects, as functions, should be close to each other. This is where Rademacher complexities enter.
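For reference, the empirical Rademacher complexity of a class $\mathcal{F}$ on the sample $x_1, \dots, x_n$ is (standard definition)

$$
\hat{\mathcal{R}}_n(\mathcal{F}) \;=\; \mathbb{E}_{\sigma}\left[\,\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, f(x_i)\right],
\qquad \sigma_i \ \text{i.i.d. uniform on } \{-1, +1\},
$$

and, up to constants and composition with the loss, it controls the uniform gap $\sup_{f \in \mathcal{F}} |R(f) - \hat R_n(f)|$ in expectation.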

Empirical Risk Minimization: The Algorithm

Supervised learning sense.

Focus on low complexity hypothesis class.

Consider an algorithm as an estimator.

Convex optimization: See also convex constraints. The constrained form is not easy to use in practice, and that is why we can also consider the penalized form.

Penalized form: Introduces a Lagrange multiplier so that the constraint becomes part of the optimization objective.

Hyperparameter: Lambda now indirectly controls the strength of the regularization.
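As a minimal sketch of the penalized form (using ridge regression as the hypothesis class; the data and the lambda values below are made up for illustration), increasing lambda shrinks the norm, i.e. the complexity, of the learned hypothesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form minimizer of (1/n)*||Xw - y||^2 + lam * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

for lam in [0.0, 0.01, 1.0, 100.0]:
    w = ridge(X, y, lam)
    print(f"lambda = {lam:7.2f}   ||w|| = {np.linalg.norm(w):.3f}")
```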

Interpolation form: Popularized by large neural networks. The functions are so hugely expressive that we can even fit the data completely, which implies an empirical risk of exactly 0. Strong assumption for this method to work: we must assume that there is no noise in the data. This luxury only occurs in certain data sets.
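Schematically, the three forms of ERM mentioned above can be written as follows (standard formulations, with $\hat R_n$ the empirical risk, $\|\cdot\|$ the complexity measure, $\delta$ a budget and $\lambda$ the multiplier):

$$
\underbrace{\min_{\|f\| \le \delta} \hat R_n(f)}_{\text{constrained}}
\qquad
\underbrace{\min_{f} \;\hat R_n(f) + \lambda \|f\|}_{\text{penalized}}
\qquad
\underbrace{\min_{f} \;\|f\| \ \ \text{s.t.}\ \ \hat R_n(f) = 0}_{\text{interpolation}} .
$$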

Basic Decomposition of Error

Idea: It's always the same. We have a quantity that we want to control, and the way to actually make progress is always the same: add and subtract the appropriate quantity.

We want a guarantee on the test-set performance once an arbitrary hypothesis has been chosen.

Infimum error: The best we could do a posteriori. If we had an oracle who could give us as many samples as we want, together with a theoretically infinite amount of computational resources, we could effectively have selected the best hypothesis.

To further interpret the previous difference, we need to transform it into multiple differences.

Add and subtract the best we can do by restricting the complexity.

Now we are going to introduce the empirical objective, the test error over the "ball":

Thus the decomposition progresses as follows,

We also add and subtract the infimum of the training error:

Interpretation of the terms now appearing in the decomposition.

It's the training error of my hypothesis minus the best training error in my ball. So if we are able to solve the empirical risk minimization problem, how much does this term cost?

If we solve the empirical problem exactly, this term is going to be zero; if we are merely good at optimizing, it is going to be small.

Interpretation: Optimization error.

2 terms remain:

Interpretation: They really look like something that relates the population objective with the training objective.

We can unify these two terms together and upper bound them:

This term is two times the largest fluctuation between the training and test errors over our "ball" of available hypotheses.

Conceptualization: If we have two functions and we minimize each of them, the difference between their minima can always be upper-bounded by the largest pointwise difference between the two functions.
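The elementary fact behind this step: for any two functions $A$ and $B$ over the same set of hypotheses,

$$
\Big|\inf_{f \in \mathcal{F}_\delta} A(f) \;-\; \inf_{f \in \mathcal{F}_\delta} B(f)\Big|
\;\le\;
\sup_{f \in \mathcal{F}_\delta} \big|A(f) - B(f)\big| .
$$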

Thus our expression for the test error looks as follows:
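A reconstruction in standard notation, writing $\hat f$ for the hypothesis returned by the algorithm, $\mathcal{F}_\delta = \{f : \|f\| \le \delta\}$ for the "ball", and assuming $\hat f \in \mathcal{F}_\delta$:

$$
R(\hat f) - \inf_{f} R(f) \;\le\;
\underbrace{\inf_{f \in \mathcal{F}_\delta} R(f) - \inf_{f} R(f)}_{\text{approximation}}
\;+\;
\underbrace{2 \sup_{f \in \mathcal{F}_\delta} \big| R(f) - \hat R_n(f) \big|}_{\text{statistical}}
\;+\;
\underbrace{\hat R_n(\hat f) - \inf_{f \in \mathcal{F}_\delta} \hat R_n(f)}_{\text{optimization}} .
$$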

The error made is a contribution of three different sources of error.

Missing term

  • Approximation error: Exploit as much as we can the prior information we have on the target function when choosing the hypothesis class.

  • Statistical error: Control the fluctuation between the empirical and population objectives over the hypothesis class.

  • Optimization error: Solve the empirical minimization problem in an efficient way.

Conclusion: If we want to be able to learn in high dimensions, we need to control these three errors at the same time.

Appendix questions. Question: Does the empirical risk minimization always need to have a convex constraint?

Answer: For the sake of simplicity, always assume the hypothesis space to be convex. In the context of neural networks this amounts to considering the last layer to be hugely wide (as wide as you can). Nevertheless, even if our hypothesis space were not convex, our previous decomposition of the error still holds. All the presented equations are dimension-free.

Answer: It depends on the dimensionality, but it also depends on finer properties of the functional class.

How can we simultaneously control all sources of error in the high-dimensional regime?

The curse of dimensionality

Statistical perspective

Question: How do the terms that emerged in our decomposition of error behave as a function of the dimensionality of the input space?

Dynamic programming: the term "curse of dimensionality" was originally coined in the context of dynamic programming (Bellman); here it appears in high-dimensional statistics.

Basic Principle of Learning: Interpolation

Propagate the information observed at the training points to new points: the prediction at a point is formed from its neighbours.

Similarity: The principle of learning by finding similar patterns is something that suffers a lot in high dimensions, as the small experiment below illustrates.
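A small numpy experiment (not from the lecture; the sample size and dimensions are arbitrary) illustrating why similarity breaks down: with a fixed budget of points, the distance to the nearest neighbour grows quickly with the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # fixed budget of "training" points

for d in [1, 2, 10, 50, 100]:
    X = rng.uniform(size=(n, d))   # training points in the unit cube [0, 1]^d
    x = rng.uniform(size=d)        # a fresh test point
    nearest = np.linalg.norm(X - x, axis=1).min()
    print(f"d = {d:3d}  ->  distance to nearest neighbour = {nearest:.3f}")
```

In one or two dimensions the nearest training point sits essentially on top of the test point; in a hundred dimensions it is far away, so local interpolation carries almost no information.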

Learning Lipschitz Functions: Understand the role of locality in learning

The Lipschitz class encapsulates the notion of locality: the hypothesis is a function whose variation is controlled locally, so the value of the function at one point is going to be close to the value of the function at a neighbouring point.
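Formally (the standard definition, with notation not taken from the notes), a function $f$ is $L$-Lipschitz when

$$
|f(x) - f(x')| \;\le\; L\,\|x - x'\| \qquad \text{for all } x, x',
$$

so nearby inputs are guaranteed to produce nearby outputs.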

Number of samples needed to learn over an arbitrary d-dimensional input space

Setup:

Our hypothesis space is going to be all the functions that are Lipschitz, with a bounded Lipschitz constant.

This space can indeed be shown to be a Banach space, which means it comes equipped with a notion of complexity: a norm. Now we can define our estimator.

Estimator: ERM in the interpolation form.

This implies the estimator passes through all the training points.

Question: How do we compute the error between this estimator and the ground truth?

Mechanism to compute bounds in machine learning: Add and subtract trick.

Now we add and subtract the value at the training point closest to our test point. Thus the previous expression becomes a sum of three terms that we need to bound.

The expression can be simplified by substituting the middle term by 0: this point was picked from the training set, and by definition our interpolant passes through all the training points, so that term is 0 by construction. The two remaining terms are controlled by the Lipschitz property.

In conclusion we have, having fixed the Lipschitz constant to be one, that the error at a test point is bounded by twice its distance to the nearest training point, defined as the closest point of the training set.
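A reconstruction of the add-and-subtract chain, writing $f^*$ for the target, $\hat f$ for the interpolant, and $x_{i(x)}$ for the training point closest to the test point $x$ (both functions taken 1-Lipschitz):

$$
\big|f^*(x) - \hat f(x)\big| \;\le\;
\underbrace{\big|f^*(x) - f^*(x_{i(x)})\big|}_{\le\, \|x - x_{i(x)}\|}
+ \underbrace{\big|f^*(x_{i(x)}) - \hat f(x_{i(x)})\big|}_{=\,0\ \text{(interpolation)}}
+ \underbrace{\big|\hat f(x_{i(x)}) - \hat f(x)\big|}_{\le\, \|x - x_{i(x)}\|}
\;\le\; 2\,\|x - x_{i(x)}\| .
$$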

Optimal transport and the exact Wasserstein loss

Since we want to make the previous expression equal to epsilon, we can solve for the number of samples needed, as reconstructed below.

Conclusion: This lower bound on the number of samples is actually the necessary amount of samples in order to properly learn.
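In symbols (a reconstruction under the assumptions above: data roughly uniform on a d-dimensional domain, both the target and the interpolant 1-Lipschitz, and $x_{i(x)}$ the training point nearest to the test point $x$):

$$
\mathbb{E}\,\big|f^*(x) - \hat f(x)\big| \;\lesssim\; \mathbb{E}\,\|x - x_{i(x)}\| \;\asymp\; n^{-1/d}
\qquad\Longrightarrow\qquad
n \;\gtrsim\; \epsilon^{-d} \ \text{ samples to reach error } \epsilon .
$$

The expected nearest-neighbour distance is the optimal-transport quantity alluded to above, and its $n^{-1/d}$ scaling is exactly where the exponential dependence on the dimension enters.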

Pure approximation perspective

Optimization perspective

Sweep the space: evaluate the objective at every possible point and find the smallest value. This of course has an exponential dependency on the dimension.

Exponential blow up of complexity. Thus we need to make assumptions.

This is overcome in practice by working with objectives that are nearly convex: they possess no bad local minima.

Instead of finding a global minimum, we focus on finding a local minimum.

Local minima are formally called second-order stationary points.
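One common formalization (standard in the optimization literature, not spelled out in the notes): for a smooth objective $F$ with $\rho$-Lipschitz Hessian, a point $x$ is an $\epsilon$-second-order stationary point when

$$
\|\nabla F(x)\| \;\le\; \epsilon
\qquad \text{and} \qquad
\lambda_{\min}\big(\nabla^2 F(x)\big) \;\ge\; -\sqrt{\rho\,\epsilon},
$$

i.e. the gradient is small and there is no strongly negative curvature direction to escape along.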

Question: How hard is it to find a local minimum in high dimensions?

Strong assumption of no bad local minima: this might not always be the case.
