# 1. Introduction

**Gist:**Fundamental Principles underlying Deep Representation Learning architectures. Single word to convey gist of the course:

**Symmetry**

Symmetry, as wide or as narrow as you may define its meaning, isone ideaby which man through the ages has tried to comprehend and create order, beauty and perfection.H. Weyl 1952

The term

**symmetry**has**Greek**origin (*συμμετρία*). Symmetry literally translates as**same measure**and ancient greeks used this term to somehow**vaguely**convey the the**beauty**of proportion in arts and**harmony**.**Plato**considered the

**five regular polyhedra**(~370 BC) what we now call the

**platonic solids**. Originally thought as fundamental must be

**building blocks**that shape the physical world. This idea was not very far from the truth.

**Kepler**many centuries after attempted a rigorous analysis of the symmetry. He was in particular concerned with the symmetric shape of

**water crystals**and he wrote a book titled

**“On the six-cornered snowflake”**(1611)

Nowadays known as hexagonal packing of particles, it was an idea that clearly preceded the understanding of

**how matter is formed**and the concept of atoms, molecules and crystals. It holds today as the basis of modern**crystallography**.**Modern geometry**is also traced back to ancient Greece and the seminal work of

**Euclide's elements**.

In a plane, given a line and a point not on it, at most one line parallel to the given line can be drawn through the point.Euclide (~300 BC)

This type of geometry was the only one known for

**over 2000 years**. At the**core**of Euclidean geometry was a set of**5 basic assumptions**, or how Euclid called them,**postulates**that he used to derive certain**properties**and**prove**results. For hundred of years the**fifth postulate**of Euclidian geometry is stating that:It's possible to pass onlyone line parallelto a given line through a point that lies outside of it.

This fifth postulate subsequently defied any attempt to try to derive it from the other postulates of geometry.

In the

**19th**the Euclidian monopoly came to an end. The 19th century was really a remarkable burst of creativity that made**geometry**into probably one of the most exciting fields of mathematics.**First**it was the development of what is called**projective geometry**(J. V. Poncelet**1822**).In projective geometry,

**points**and**lines**are**interchangeable**and there is**no**such thing as**parallelism**. Any two lines**intersect**at one exactly**point**. Nowadays it's very popular in**computer graphics**.**Non considered non-euclidian geometry strictly**. But it was probably the first one to undermine this euclidian concept of**parallelism**.**First published construction of a non-euclidian geometry**. Is credited to

**N. Lobachevsky**(1826), a Russian mathematician that considered the fifth axiom of euclidian geometry a completely

**arbitrary limitation**.

In geometry I find certain imperfections which I hold to be the reason why this science [...] can as yet make no advance from that state in which it came to us from Euclid. As belonging to these imperfections, I consider [...] the momentous gap in the theory of parallels, to fill which all efforts of mathematicians have so far been in vain.N. Lobachevsky 1826

**Alternative proposed postulate**:

**More**than

**one**line can pass through a

**point**that is

**parallel**to a given one. Such construction required a

**space**with

**negative curvature**. This type of space is now called a

**hyperbolic space**.

**Academia rejection of the proposal**

This idea was so

**unconventional**and**theoretical**at the time of the publication that he was openly derided by colleagues at his university for writing and publishing such nonsense. Some mathematicians of his time said that this was at the level of some**school teacher**not a university professor.**Different persons came to the same idea from different backgrounds**

**J. Bolyai**(1832), a Hungarian mathematician apparently came to the same ideas together with Lobachevsky.

I have discovered such wonderful things that I was amazed...out of nothing I have created a strange new universe. —Jánus Bolyai to his father

To praise it would amount to praising myself. For the entire content of the work...coincides almost exactly with my own meditations [in the] past thirty or thirty-five years.” —Gauss to Farkas Bolyai

Manifoldsin which, as in the plane and in space, the line-element may be reduced to the form,$\sqrt{\sum{dx^2}}$are therefore only a particular case of the manifolds to be here investigated; they require a special name, and therefore these manifolds in which the square of the line-element may be expressed as the sum of the squares of complete differentials I will callflat.”Riemann

**Last nail in Euclid's coffin**

**Gauss's**own Ph.D student

**Bernard Riemann**. In his lecture on the hypothesis on which geometry is based, he basically formulated what is nowadays called

**differential geometry of surfaces**.

**Euclidian geometry second axiom doesn't hold**

He also constructed

**non-euclidian geometry**on the**sphere**, that is sometimes called the**Riemann Geometry**in the narrow sense, and in this case the**fifth postulate**doesn't hold as well as in the construction of**Lobachevsky**,**BUT**also the**second**postulate doesn't hold.**Second postulate**: All straight lines can be

**continued indefenitely**. On the

**sphere**all straight lines have

**finite**lengths.

**Entire zoo of different geometries emerged**

Towards the end of the

**1800s**these geometries became**non-unified**fields. Mathematicians were**debating**which geometry is the**right**one and**WHAT**actually defines the geometry.Given a [homogeneous] manifold and a transformation group acting [transitively] on it, to investigate those properties of figures on that manifold which are invariant under transformations of that group.F. Klein 1872

**Erlangen Programme**

Klein proposed a very

**radical**approach of**treating**geometry as the**study of invariants**and**symmetries**. These are the**properties**that remain**unchanged**under some class of**transformation**. This approach immediately created clarity by showing that**different**geometries could be defined by an appropriated choice of**symmetry**.Erlangen Programme

**Example**: Euclidian geometry are

**rigid motions**. These are

**translations, reflections and rotations**that

**preserves**properties such as

**angles, distances, areas, parallelism of lines and their intersections**.

**The language of group theory as the language to formalize the notion of symmetry**Group theory was also a shiny mathematical field that was born in the

**19th**century and the term group was first used by

**E. Galois**(1832).

These ideas of symmetry on geometry were very profound and it also spilled into other fields, in particular in

**physics**.Every [differentiable] symmetry of the action of a physical system [with conservative forces] has a corresponding conservation law.E. Noether 1918

E. Noether, she was Klein's colleagues in Gothengen, she

**proved**that every differentiable symmetry of the action of a physical system**has a corresponding conservation law**. By all means this was a really**stunning result**because beforehand you had to do**very detailed**materials**experimental**observations to discover fundamental laws such as**conservation of energy**. So you would measure the energy in many experiments and you will see that up some small**errors**the energy remains the same. It was an**empirical result**not coming from anywhere.**Noether's theorem**established is that the conservation of energy emerges from translational symmetry of time. So it's rather**intuitive**idea that the results of an experiment would be the same if you did it yesterday, today or if you do it tomorrow.**H. Weyl**cited his poetic definition of symmetry. He used these ideas to develop the concept of what he called

**Gauge Invariance**. It was a principle from which

**electromagnetism**could be

**derived**. He also speculated that he tried to unify with gravitation

After several decades

**non-abelian gauge theory**, it was finally possible in a theory that was developed by**Yang**&**Mills**in**1954**to provide a**unified framework**that describes all the fundamental forces of nature, with the exception of gravity. This is what is called the standard model in the particle physics and it unifies the description of electromagnetism, weak interactions and strong interactions. All of this using the**language**of**group theory**and**gauge invariance**.Unification of electromagnetic and weak forces (modelled with the groups U(1) × SU(2)) and the strong force (based on the group SU(3))C. N. Yang&R. L. Mills 1954

It is only slightly overstating the case to say that Physics is the study ofsymmetry.P. Anderson 1972

**Interesting resource on the topic of symmetry in physics**(Roger Penrose) Fearful Symmetry: The Search for Beauty in Modern Physics If you were to distill these thousand plus pages into one word, it would be

**symmetry**. He devotes a lot of time to

**group theory**to the concepts of

**symmetry**and how they are fundamental in physics.

The current state of affairs in the field of Deep Learning reminds a lot of the situation of

**geometry**in the**19th**century. In the past decade Deep Learning has brought a true revolution in the data science world. It made**possible**many tasks that previously maybe**20 years ago**would be considered nearly**science fiction**. Whether it's**computer vision**that powers**autonomous driving**,**speech recognition**behind every single mobile phone today,**natural language translation**sometimes**trained without supervision**,**playing**intelligent**games**like**go**or doing**science**tasks as solving the**protein folding**problem.So, on the other hand we

**nowadays**have this**zoo**of different**neuronal network architectures**for different types of**data**but very**few unifying principles**. As a consequence it's**difficult**to**understand**the**relations**between different methods and this**inevitably leads**to the**reinvention**and**rebranding**of the same concepts.20th Century Zoo of Neuronal Network Architectures

The

**same ideas**are presented and published several times in**different communities**under**different names**. Sometimes it also brings unpleasant and bitter fights over**priority**, so this is really an**unhealthy**situation.We need some form of

**unification**and we want to do it in the spirit of the**Erlangen Programme**. This is what we call**Geometric Deep Learning**and it serves**2 purposes**:- 1.
**Pedagogical perspective**: It provides a**common**mathematical**framework**to study the most successful neuronal architectures that are currently used ubiquitously in the field of Deep Learning or Deep**representation**Learning. - 2.
**Constructive procedure to incorporate prior knowledge into neural networks**: Build**future**architectures in a**principled**way.

Question:

**What will be after Deep Learning**or maybe beyond learning by**backpropagation**?Answer: The hope is that these principles will

**outlast**their particular implementations.Question:

**When did this theory born?**First appearance

Answer: The term Geometric Deep Learning was popularised in a

**2017**paper in the**IEEE***Single Processing*magazine. Co-Authorised by Michael M. Bronstein and Joan Bruna.Most recently the theory has been extended, to the length of approximately a

**book**, and refined with the help of new contributors such as Taco Cohen and Petar Veličković. The new paper is available**online**.**Where**actually

**symmetry**and

**how**exactly it is

**manifested**in Machine Learning and

**why**it is important?

If we consider the

**simplest**setting of machine learning,**supervised learning**, this is essentially a**glorified function estimation**problem.**Problem description**: We're given some**unknown**function and we**observe**its**output**on what is called a**training set**. What we try to**do**is to**find**a**function**that**fits**well the**training data**while keeping**generalization**properties. This function**comes**from some**class of functions**, some**hypothesis class**, and this way we try to**predict**outputs on previously**unseen inputs**, the so called the**test set**.The typical example that is given is

**image classification**where you have a**set of dog**and**cat set**images and it's a**binary classification**. This means that given any possible input image, assumed that it contains a cat or a dog, the function would classify them into class of cats or dogs.Supervised ML = Function Approximation

Question:

**What happened over the past decade?****2 trends**coincided:

- 1.
**Availability of large and high quality data sets**: For example data sets such as Imagenet that has millions of**labeled**images. - 2.
**Sufficient computational resources**: The continuous improvements and development of**GPUs**, that were thought and originated as**g**raphics hardware, are especially well suitable for general purpose computations that you encounter in deep learning and AI in general (**array programming**).

These two trends have led to the design of

**rich function classes**that have the capacity, at least in**theory**, to**interpolate**such large datasets.**Neural Networks are a suitable choice to represent functions**.

Perceptrons, Rosenblatt 1957

Neural networks of course are

**not new at all**. They are at least**70 years**old, so the first works are from the**50s**and with a very**simple**choice of**architecture**, the so called**perceptron**. This type of architecture is probably the earliest and**simplest**neural network.We can show that if we

**connect**just**2 layers**of such networks,**perceptron**, what is called**multi-layer perceptron**or a perceptron with one**hidden layer**, it produces a**dense**class of functions. In other words, we can**approximate any continuous function**or even broader class of functions to**any**desired**accuracy**. We call this**property**universal approximation. It is a very**general**architecture than can**represent**practically**anything**.Cybenko 1989; Hornik 1991; Barron 1993; Leshno et al 1993; Maiorov 1999; Pinkus 1999

At the second International Congress of Mathematicians in Paris 1900, Hilbert presented ten of his 23 problems, including the 13th problem about equations of degree seven. He considered the following equation,

$x^7+ax^3+bx^2+cx+1=0,$

and asked whether its solution

$x(a,b,c)$

, seen as a function of the three parameters a, b and c, can be written as the **composition**of functions of only two variables.**Hilbert's 13th problem statement**:

*Solve 7th degree equation using algebraic (variant: continuous) functions of two parameters*.

**Kolmogorov-Arnold representation theorem**: The model assumes that an output parameter

$z$

depends on vector input parameters $x$

. Kolmogorov-Arnold formal model requires the continuity, but we relax this condition and assume that small increments in the inputs are causing small differences in the output. (_Ref [4])$z = F(x_1,x_2,...,x_n) \\ z \: + \Delta z = F(x_1 \: + \Delta x_1,x_2 \: + \Delta x_2,...,x_n \: + \Delta x_n \\ z = \sum_{k=1}^{2n+1}\Phi_{k}\Bigg[ \sum_{j=1}^{n} f_{k,j}(x_{j})\Bigg]$

The model itself is a set of unspecified functions.

Let

$f \: [0,1]^{d} \mapsto \mathbb{R}$

be continuous. There exists **univariate**continuous functions$g_{q}, \Psi_{p,q}$

, such that $f(x_{1},...,x_{d}) = \sum_{q=0}^{2d}g_{q} \Big ( \Psi_{p,q}(x_{p})\Big)$

Nowadays there are better representations than the one exposed here. There are **3**main reasons why there are no explicit formulas for$g_{q}, \Psi_{p,q}$

.- 1.The proofs of these theorems are
**non-constructive**. - 2.The
**outer function**,$g_{q}$highly depends on$f$so we can't just choose one. - 3.The
**inner function**$\Psi_{p,q}$is continuous.

**: THE THEOREM ONLY STATES EXISTENCE OF THESE FUNCTIONS.**

*NOTE*This form has

**similarities**with a**2-layered**neural network but this is highly debated.**e.g:**The function,

$f_{1}(x_1,x_2) = x_{1}^{2} \: + x_{2}$

can be decomposed as in the theorem since it's a summation of continuous univariate functions. Let

$g_{1}(x) = x \:, \Psi_{1,1}(x_1) = x_{1}^2 \:, \Psi_{2,1}(x_2) = x_2$

and for all the other functions to be set with

$g_q = \Psi_{p,q} =0$

.This is a very well studied problem in

**approximation theory**and in**low dimensions**we have a lot of results that tell us exactly how the**error**will behave. An example of these behaviours could be**how**the error would occur if we**sample**our data in a certain way. It has been studied very extensively over the**past century**or even more but the situation appears to be**absolutely**different in**high dimensions**. Even if we pick a very nice class of functions of the so called**Lipschitz continuous functions**.Lipschitz functions: superposition of gaussian blobs put in quadrants of a unit cube

What we will find out very quickly that as the

**dimensionality**of this**space**, the unit cube, grows then the**number of samples**grow**exponentially**. This phenomenon is colloquially known as the curse of dimensionality. Modern machine learning methods need to operate with data not in two or three dimensions but in**thousands**or even**millions**dimensions.**Images**can serve as an illustrative example of very high dimensional input space.Exponential growth explosion

The curse of dimensionality is simply an inevitable God's curse accompanying every machine learning problem. Therefore the previous exposed naive approach to learning is completely impossible to achieve.

This is probably best seen in

**computer vision**applications like**image classification**. Since images are well known as high dimensional input space, even for example tiny images from MNIST data set, they are almost thousand dimension. But if we look deeper at this problem, intuitively we see that there is a lot of**structure**.MNIST data set example: Number three

If we treat the input image as a

**d-dimensional vector**, to for example the perceptron, we would**destroy**the**input structure**. In this case we are referring to the**local spatial connectivity**between pixels in the image. As a result now if we take the image and**shift**it by just**one pixel**, this vectored input will be very**different**and the neuronal network will need to be shown**a lot of examples**(*data augmentation*) in order to**learn**that**shifted**inputs must be classified in the same way as if they hadn't been applied any**spatial transformation**.**Conclusion**: MUST LEARN SHIFT INVARIANCE FROM DATAMNIST data set example: Number three shifted by one single pixel

Question:

**How do we learn spatial transformations applied to the inputs?**Answer: Using a standard practice called**data augmentation**. If we don't know how to model specific environments or certain way of telling that the shifted versions of these digits**must**be the**same**, we just**add**such examples to the**training set**in order to**increase**the diversity of known input space. This can be enormously**wasteful**.Question:

**If there exists structure and shared information in the input space, how can we exploit that to achieve better learning and understanding?**Hubel, Wiesel 1962. LeCun et al. 1989

Answer: In the field of

**computer vision**, the solution came from a classic work in neuroscience. In particular after the winning work by**Hubel**and**Wiesel**who studied the**organization of the visual cortex**. They showed that brain**neurons**are organised into what they called**local receptive field**. This was enough inspiration to influence a new type of**architectures**with**local shared weights**. Remarkable citation to the**neocognitron**of Fukushima (1980). In his paper from 1980 were already contained many elements of modern deep learning such as for example the**activation function**that he used was already a**ReLU**(**Re**ctified**L**inear**U**nit). This culminated in**C**onvolutional**N**eural**N**etworks (**CNN**) by the pioneering work of**Yann LeCun**(1989). The concept of**weight sharing**across the image effectively partially solved the cursed of dimensionality.Graph representation of a caffeine molecule

Chemical compounds like this caffeine

**molecule**can be represented using a**graph**, where the**nodes**act as the atoms and**edges**as chemical compounds. If we were to apply neural network to this input, for example to**predict**some chemical properties like its binding energy or some other property, we can again**parse**it into a**vector**and give it to the neural network as its input. But now we can see that we can**rearrange**the**nodes**in any way basically. We can effectively**permute**the nodes of our graph because in graphs and like images we don't have a particular preferential or default**order**scheme for the nodes.Molecular graphs appear to be just one example of data with this

**irregular non-Euclidean structure**on which we would like to apply deep learning. Another prominent examples are**social networks**. They are tremendously gigantic with hundreds of millions of nodes and billions of edges among their nodes. We can also find different kinds of networks in**biology**such as**interaction networks**,**manifolds**in computer graphics and many other examples of data which could benefit from a more structured and principled mechanism of learning.Fortunately we do have

**additional structure**that comes from the**geometry**of the input**signal**. We call this structure**symmetry prior**. It's a general and powerful principle that gives us hope dealing with the curse of dimensionality. In the our previous example of the image classification, the input image is**not**just a**d-dimensional**vector, it's a**signal**defined on some**geometric domain**, denote here by$\Omega$

. In this example the input space is defined by a **two-dimensional grid**and we denote the**space of signals**by$\mathcal{X}(\Omega)$

.The structure of the domain is captured by a

**symmetry group**that is denoted by$\boldsymbol{\mathfrak{G}}$

, in this case it's the group of **two-dimensional translations**that acts on the points of$\Omega$

. For now, and without any further detailed explanation about groups, we need to understand that we have the **domain**and we have a**group**that**acts**on it, so it describes its**structure (symmetry)**. The action of the group on points on the domain is manifested on signals defined on this domain to what is called the**group representation**, denoted by$ho$

. In the case of this example, pursuing an effort of further clarification, the representation is a **matrix**that acts on the dimensional vectors. It's a$d \times d$

matrix that in this example is simply a **translation**or**shift operator**.The geometric structure of the domain that underlies the input signal

**affects**the**functions**that we define on the**signals**and try to**learn**. Let the function here be denoted by$f$

. We can then have functions that are **unaffected**by the**action**of the group, what we call**invariant functions**.For example, in the image classification problem

**no**matter where for example the**cat**may be**located**in the image, we still want to say that it's a cat what the image contains. This is an example of**shift invariance**.Invariant functions: Shift invariance

We can also have cases where the function has the

**same**input and output**structure**, as it occurs in the**image segmentation**problems, the output will be a pixel wise label mask. In these cases we want the**output**to be**transformed**in exactly the same way as the**input**. We name this class of functions as**equivariant functions**. The following example shows an example of shift equivariance in the scenario of an image segmentation problem.Equivariant functions: Shift equivariance

In some cases we can construct a

**multi-scale hierarchy**of domains. For example, continuing with the image classification example of the number three,**coarsing**or stretching the**grid**we now have another version of the grid denoted by$\Omega'$

. This stretching assimilates **nearby**points on the domain and produces also a hierarchy of**signals spaces**that are related by what is called the**coarse graining operator**, denoted by$P$

. On this coarse grid we can define a **new coarse scale function**, denoted by$f'$

, and we can say that our function is **locally stable**if it can be**approximated**as the**composition**of this corresponding operation and the coarse scale function.Scale Separation Prior

While the original function might

**depend**on**long range interactions**on the domain, these**spatial locally stable functions**can possible separate the interaction across**scales**. In our discussed example on the image classification, this corresponds to**applying**a classifier at a**lower resolution**image.These exposed principles can probably

**recognize**the majority of popular neural**architectures**. We can apply a**sequence of equivariant**layers that**preserves**the**structure**of the domain, possible followed by an**invariant global polling**layer that**aggregates**everything into a**single**output. In some cases we might also need to have a**hierarchy of domains**by some**coarse procedure**that takes the form of**local polling**.Geometric Deep Learning general framework

As this framework acts broadly across the structure of the input domain, it can be applied to any kind of geometric data.

The “5G” of Geometric Deep Learning

The implementation of these principles, in the form of inductive biases, leads to some of the most

**popular**architectures that exist today in deep representation learning (**CNN**,**G**raph**N**eural**N**etworks,**Transformers**,**L**ong**S**hort**T**erm**M**emory).**Conclusion**: Every popular neural network architecture can be

**derived**from fundamental principles of

**symmetry**, same way as all physics can be derived from respective symmetries. We may refer to this as the

**deep learning-physics symmetry correspondence**.

Last modified 6mo ago