Gist: Fundamental Principles underlying Deep Representation Learning architectures. Single word to convey gist of the course: Symmetry
Symmetry, as wide or as narrow as you may define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty and perfection. H. Weyl 1952
The term symmetry has Greek origin (συμμετρία). Symmetry literally translates as same measure and ancient greeks used this term to somehow vaguely convey the the beauty of proportion in arts and harmony.
Plato considered the five regular polyhedra (~370 BC) what we now call the platonic solids. Originally thought as fundamental must be building blocks that shape the physical world. This idea was not very far from the truth. Kepler many centuries after attempted a rigorous analysis of the symmetry. He was in particular concerned with the symmetric shape of water crystals and he wrote a book titled “On the six-cornered snowflake” (1611)
Nowadays known as hexagonal packing of particles, it was an idea that clearly preceded the understanding of how matter is formed and the concept of atoms, molecules and crystals. It holds today as the basis of modern crystallography.
Modern geometry is also traced back to ancient Greece and the seminal work of Euclide's elements.
In a plane, given a line and a point not on it, at most one line parallel to the given line can be drawn through the point. Euclide (~300 BC)
This type of geometry was the only one known for over 2000 years. At the core of Euclidean geometry was a set of 5 basic assumptions, or how Euclid called them, postulates that he used to derive certain properties and prove results. For hundred of years the fifth postulate of Euclidian geometry is stating that:
It's possible to pass only one line parallel to a given line through a point that lies outside of it.
This fifth postulate subsequently defied any attempt to try to derive it from the other postulates of geometry.
In the 19th the Euclidian monopoly came to an end. The 19th century was really a remarkable burst of creativity that made geometry into probably one of the most exciting fields of mathematics. First it was the development of what is called projective geometry (J. V. Poncelet 1822).
In projective geometry, points and lines are interchangeable and there is no such thing as parallelism. Any two lines intersect at one exactly point. Nowadays it's very popular in computer graphics. Non considered non-euclidian geometry strictly. But it was probably the first one to undermine this euclidian concept of parallelism.
First published construction of a non-euclidian geometry. Is credited to N. Lobachevsky (1826), a Russian mathematician that considered the fifth axiom of euclidian geometry a completely arbitrary limitation.
In geometry I find certain imperfections which I hold to be the reason why this science [...] can as yet make no advance from that state in which it came to us from Euclid. As belonging to these imperfections, I consider [...] the momentous gap in the theory of parallels, to fill which all efforts of mathematicians have so far been in vain. N. Lobachevsky 1826
Alternative proposed postulate: More than one line can pass through a point that is parallel to a given one. Such construction required a space with negative curvature. This type of space is now called a hyperbolic space.
Academia rejection of the proposal
This idea was so unconventional and theoretical at the time of the publication that he was openly derided by colleagues at his university for writing and publishing such nonsense. Some mathematicians of his time said that this was at the level of some school teacher not a university professor.
Different persons came to the same idea from different backgrounds
J. Bolyai (1832), a Hungarian mathematician apparently came to the same ideas together with Lobachevsky.
I have discovered such wonderful things that I was amazed...out of nothing I have created a strange new universe. — Jánus Bolyai to his father
To praise it would amount to praising myself. For the entire content of the work...coincides almost exactly with my own meditations [in the] past thirty or thirty-five years.” — Gauss to Farkas Bolyai
Manifolds in which, as in the plane and in space, the line-element may be reduced to the form,are therefore only a particular case of the manifolds to be here investigated; they require a special name, and therefore these manifolds in which the square of the line-element may be expressed as the sum of the squares of complete differentials I will call flat.” Riemann
Last nail in Euclid's coffin
Gauss's own Ph.D student Bernard Riemann. In his lecture on the hypothesis on which geometry is based, he basically formulated what is nowadays called differential geometry of surfaces.
Euclidian geometry second axiom doesn't hold
He also constructed non-euclidian geometry on the sphere, that is sometimes called the Riemann Geometry in the narrow sense, and in this case the fifth postulate doesn't hold as well as in the construction of Lobachevsky, BUT also the second postulate doesn't hold.
Second postulate: All straight lines can be continued indefenitely. On the sphere all straight lines have finite lengths.
Entire zoo of different geometries emerged
Towards the end of the 1800s these geometries became non-unified fields. Mathematicians were debating which geometry is the right one and WHAT actually defines the geometry.
Given a [homogeneous] manifold and a transformation group acting [transitively] on it, to investigate those properties of figures on that manifold which are invariant under transformations of that group. F. Klein 1872
Klein proposed a very radical approach of treating geometry as the study of invariants and symmetries. These are the properties that remain unchanged under some class of transformation. This approach immediately created clarity by showing that different geometries could be defined by an appropriated choice of symmetry.
Example: Euclidian geometry are rigid motions. These are translations, reflections and rotations that preserves properties such as angles, distances, areas, parallelism of lines and their intersections.
The language of group theory as the language to formalize the notion of symmetry Group theory was also a shiny mathematical field that was born in the 19th century and the term group was first used by E. Galois (1832).
These ideas of symmetry on geometry were very profound and it also spilled into other fields, in particular in physics.
Every [differentiable] symmetry of the action of a physical system [with conservative forces] has a corresponding conservation law. E. Noether 1918
E. Noether, she was Klein's colleagues in Gothengen, she proved that every differentiable symmetry of the action of a physical system has a corresponding conservation law. By all means this was a really stunning result because beforehand you had to do very detailed materials experimental observations to discover fundamental laws such as conservation of energy. So you would measure the energy in many experiments and you will see that up some small errors the energy remains the same. It was an empirical result not coming from anywhere. Noether's theorem established is that the conservation of energy emerges from translational symmetry of time. So it's rather intuitive idea that the results of an experiment would be the same if you did it yesterday, today or if you do it tomorrow.
H. Weyl cited his poetic definition of symmetry. He used these ideas to develop the concept of what he called Gauge Invariance. It was a principle from which electromagnetism could be derived. He also speculated that he tried to unify with gravitation
After several decades non-abelian gauge theory, it was finally possible in a theory that was developed by Yang & Mills in 1954 to provide a unified framework that describes all the fundamental forces of nature, with the exception of gravity. This is what is called the standard model in the particle physics and it unifies the description of electromagnetism, weak interactions and strong interactions. All of this using the language of group theory and gauge invariance.
Unification of electromagnetic and weak forces (modelled with the groups U(1) × SU(2)) and the strong force (based on the group SU(3)) C. N. Yang & R. L. Mills 1954
It is only slightly overstating the case to say that Physics is the study of symmetry. P. Anderson 1972
Interesting resource on the topic of symmetry in physics (Roger Penrose) Fearful Symmetry: The Search for Beauty in Modern Physics If you were to distill these thousand plus pages into one word, it would be symmetry. He devotes a lot of time to group theory to the concepts of symmetry and how they are fundamental in physics.
The current state of affairs in the field of Deep Learning reminds a lot of the situation of geometry in the 19th century. In the past decade Deep Learning has brought a true revolution in the data science world. It made possible many tasks that previously maybe 20 years ago would be considered nearly science fiction. Whether it's computer vision that powers autonomous driving, speech recognition behind every single mobile phone today, natural language translation sometimes trained without supervision, playing intelligent games like go or doing science tasks as solving the protein folding problem.
So, on the other hand we nowadays have this zoo of different neuronal network architectures for different types of data but very few unifying principles. As a consequence it's difficult to understand the relations between different methods and this inevitably leads to the reinvention and rebranding of the same concepts.
20th Century Zoo of Neuronal Network Architectures
The same ideas are presented and published several times in different communities under different names. Sometimes it also brings unpleasant and bitter fights over priority, so this is really an unhealthy situation.
We need some form of unification and we want to do it in the spirit of the Erlangen Programme. This is what we call Geometric Deep Learning and it serves 2 purposes:
- 1.Pedagogical perspective: It provides a common mathematical framework to study the most successful neuronal architectures that are currently used ubiquitously in the field of Deep Learning or Deep representation Learning.
- 2.Constructive procedure to incorporate prior knowledge into neural networks: Build future architectures in a principled way.
Question: What will be after Deep Learning or maybe beyond learning by backpropagation?
Answer: The hope is that these principles will outlast their particular implementations.
Question: When did this theory born?
Where actually symmetry and how exactly it is manifested in Machine Learning and why it is important?
If we consider the simplest setting of machine learning, supervised learning, this is essentially a glorified function estimation problem. Problem description: We're given some unknown function and we observe its output on what is called a training set. What we try to do is to find a function that fits well the training data while keeping generalization properties. This function comes from some class of functions, some hypothesis class, and this way we try to predict outputs on previously unseen inputs, the so called the test set.
The typical example that is given is image classification where you have a set of dog and cat set images and it's a binary classification. This means that given any possible input image, assumed that it contains a cat or a dog, the function would classify them into class of cats or dogs.
Supervised ML = Function Approximation
Question: What happened over the past decade?
2 trends coincided:
- 2.Sufficient computational resources: The continuous improvements and development of GPUs, that were thought and originated as graphics hardware, are especially well suitable for general purpose computations that you encounter in deep learning and AI in general (array programming).
These two trends have led to the design of rich function classes that have the capacity, at least in theory, to interpolate such large datasets.
Neural Networks are a suitable choice to represent functions.
Perceptrons, Rosenblatt 1957
Neural networks of course are not new at all. They are at least 70 years old, so the first works are from the 50s and with a very simple choice of architecture, the so called perceptron. This type of architecture is probably the earliest and simplest neural network.
We can show that if we connect just 2 layers of such networks, perceptron, what is called multi-layer perceptron or a perceptron with one hidden layer, it produces a dense class of functions. In other words, we can approximate any continuous function or even broader class of functions to any desired accuracy. We call this property universal approximation. It is a very general architecture than can represent practically anything.
Cybenko 1989; Hornik 1991; Barron 1993; Leshno et al 1993; Maiorov 1999; Pinkus 1999
and asked whether its solution
, seen as a function of the three parameters a, b and c, can be written as the composition of functions of only two variables.
Hilbert's 13th problem statement: Solve 7th degree equation using algebraic (variant: continuous) functions of two parameters.
Kolmogorov-Arnold representation theorem: The model assumes that an output parameter
depends on vector input parameters
. Kolmogorov-Arnold formal model requires the continuity, but we relax this condition and assume that small increments in the inputs are causing small differences in the output. (_Ref )
The model itself is a set of unspecified functions.
be continuous. There exists univariate continuous functions
, such that
Nowadays there are better representations than the one exposed here. There are 3 main reasons why there are no explicit formulas for
- 1.The proofs of these theorems are non-constructive.
- 2.The outer function,highly depends onso we can't just choose one.
- 3.The inner functionis continuous.
NOTE: THE THEOREM ONLY STATES EXISTENCE OF THESE FUNCTIONS.
This form has similarities with a 2-layered neural network but this is highly debated.
e.g: The function,
can be decomposed as in the theorem since it's a summation of continuous univariate functions. Let
and for all the other functions to be set with
This is a very well studied problem in approximation theory and in low dimensions we have a lot of results that tell us exactly how the error will behave. An example of these behaviours could be how the error would occur if we sample our data in a certain way. It has been studied very extensively over the past century or even more but the situation appears to be absolutely different in high dimensions. Even if we pick a very nice class of functions of the so called Lipschitz continuous functions.
Lipschitz functions: superposition of gaussian blobs put in quadrants of a unit cube
What we will find out very quickly that as the dimensionality of this space, the unit cube, grows then the number of samples grow exponentially. This phenomenon is colloquially known as the curse of dimensionality. Modern machine learning methods need to operate with data not in two or three dimensions but in thousands or even millions dimensions. Images can serve as an illustrative example of very high dimensional input space.
Exponential growth explosion
The curse of dimensionality is simply an inevitable God's curse accompanying every machine learning problem. Therefore the previous exposed naive approach to learning is completely impossible to achieve.
This is probably best seen in computer vision applications like image classification. Since images are well known as high dimensional input space, even for example tiny images from MNIST data set, they are almost thousand dimension. But if we look deeper at this problem, intuitively we see that there is a lot of structure.
MNIST data set example: Number three
If we treat the input image as a d-dimensional vector, to for example the perceptron, we would destroy the input structure. In this case we are referring to the local spatial connectivity between pixels in the image. As a result now if we take the image and shift it by just one pixel, this vectored input will be very different and the neuronal network will need to be shown a lot of examples (data augmentation) in order to learn that shifted inputs must be classified in the same way as if they hadn't been applied any spatial transformation. Conclusion: MUST LEARN SHIFT INVARIANCE FROM DATA
MNIST data set example: Number three shifted by one single pixel
Question: How do we learn spatial transformations applied to the inputs? Answer: Using a standard practice called data augmentation. If we don't know how to model specific environments or certain way of telling that the shifted versions of these digits must be the same, we just add such examples to the training set in order to increase the diversity of known input space. This can be enormously wasteful.
Question: If there exists structure and shared information in the input space, how can we exploit that to achieve better learning and understanding?
Hubel, Wiesel 1962. LeCun et al. 1989
Answer: In the field of computer vision, the solution came from a classic work in neuroscience. In particular after the winning work by Hubel and Wiesel who studied the organization of the visual cortex. They showed that brain neurons are organised into what they called local receptive field. This was enough inspiration to influence a new type of architectures with local shared weights. Remarkable citation to the neocognitron of Fukushima (1980). In his paper from 1980 were already contained many elements of modern deep learning such as for example the activation function that he used was already a ReLU (Rectified Linear Unit). This culminated in Convolutional Neural Networks (CNN) by the pioneering work of Yann LeCun (1989). The concept of weight sharing across the image effectively partially solved the cursed of dimensionality.
Graph representation of a caffeine molecule
Chemical compounds like this caffeine molecule can be represented using a graph, where the nodes act as the atoms and edges as chemical compounds. If we were to apply neural network to this input, for example to predict some chemical properties like its binding energy or some other property, we can again parse it into a vector and give it to the neural network as its input. But now we can see that we can rearrange the nodes in any way basically. We can effectively permute the nodes of our graph because in graphs and like images we don't have a particular preferential or default order scheme for the nodes.
Molecular graphs appear to be just one example of data with this irregular non-Euclidean structure on which we would like to apply deep learning. Another prominent examples are social networks. They are tremendously gigantic with hundreds of millions of nodes and billions of edges among their nodes. We can also find different kinds of networks in biology such as interaction networks, manifolds in computer graphics and many other examples of data which could benefit from a more structured and principled mechanism of learning.
Fortunately we do have additional structure that comes from the geometry of the input signal. We call this structure symmetry prior. It's a general and powerful principle that gives us hope dealing with the curse of dimensionality. In the our previous example of the image classification, the input image is not just a d-dimensional vector, it's a signal defined on some geometric domain, denote here by
. In this example the input space is defined by a two-dimensional grid and we denote the space of signals by
The structure of the domain is captured by a symmetry group that is denoted by
, in this case it's the group of two-dimensional translations that acts on the points of
. For now, and without any further detailed explanation about groups, we need to understand that we have the domain and we have a group that acts on it, so it describes its structure (symmetry). The action of the group on points on the domain is manifested on signals defined on this domain to what is called the group representation, denoted by
. In the case of this example, pursuing an effort of further clarification, the representation is a matrix that acts on the dimensional vectors. It's a
matrix that in this example is simply a translation or shift operator.
The geometric structure of the domain that underlies the input signal affects the functions that we define on the signals and try to learn. Let the function here be denoted by
. We can then have functions that are unaffected by the action of the group, what we call invariant functions.
For example, in the image classification problem no matter where for example the cat may be located in the image, we still want to say that it's a cat what the image contains. This is an example of shift invariance.
Invariant functions: Shift invariance
We can also have cases where the function has the same input and output structure, as it occurs in the image segmentation problems, the output will be a pixel wise label mask. In these cases we want the output to be transformed in exactly the same way as the input. We name this class of functions as equivariant functions. The following example shows an example of shift equivariance in the scenario of an image segmentation problem.
Equivariant functions: Shift equivariance
In some cases we can construct a multi-scale hierarchy of domains. For example, continuing with the image classification example of the number three, coarsing or stretching the grid we now have another version of the grid denoted by
. This stretching assimilates nearby points on the domain and produces also a hierarchy of signals spaces that are related by what is called the coarse graining operator, denoted by
. On this coarse grid we can define a new coarse scale function, denoted by
, and we can say that our function is locally stable if it can be approximated as the composition of this corresponding operation and the coarse scale function.
Scale Separation Prior
While the original function might depend on long range interactions on the domain, these spatial locally stable functions can possible separate the interaction across scales. In our discussed example on the image classification, this corresponds to applying a classifier at a lower resolution image.
These exposed principles can probably recognize the majority of popular neural architectures. We can apply a sequence of equivariant layers that preserves the structure of the domain, possible followed by an invariant global polling layer that aggregates everything into a single output. In some cases we might also need to have a hierarchy of domains by some coarse procedure that takes the form of local polling.
Geometric Deep Learning general framework
As this framework acts broadly across the structure of the input domain, it can be applied to any kind of geometric data.
The “5G” of Geometric Deep Learning
The implementation of these principles, in the form of inductive biases, leads to some of the most popular architectures that exist today in deep representation learning (CNN, Graph Neural Networks, Transformers,Long Short Term Memory).
Conclusion: Every popular neural network architecture can be derived from fundamental principles of symmetry, same way as all physics can be derived from respective symmetries. We may refer to this as the deep learning-physics symmetry correspondence.