Geometric Deep Learning


1. Introduction

What is Geometric Deep Learning all about?

Symmetry, as wide or as narrow as you may define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty and perfection. (H. Weyl, 1952)

Historical background

The term **symmetry** has **Greek** origin (*συμμετρία*). Symmetry literally translates as **same measure**, and the ancient **Greeks** used this term to **vaguely** convey the **beauty** of proportion in art and **harmony**.

The idea nowadays known as hexagonal packing of particles clearly preceded the understanding of **how matter is formed** and the concepts of atoms, molecules and crystals. It still holds today as the basis of modern **crystallography**.

Euclidean geometry

In a plane, given a line and a point not on it, at most one line parallel to the given line can be drawn through the point. (Euclid, ~300 BC)

This type of geometry was the only one known for **over 2000 years**. At the **core** of Euclidean geometry was a set of **5 basic assumptions**, or, as Euclid called them, **postulates**, which he used to derive **properties** and **prove** results. For hundreds of years the **fifth postulate** of Euclidean geometry stated that:

It is possible to pass only **one line parallel** to a given line through a point that lies outside of it.

This fifth postulate subsequently defied every attempt to derive it from the other postulates of geometry.

End of Euclid’s Monopoly

In the **19th century** the Euclidean monopoly came to an end. The 19th century saw a remarkable burst of creativity that made **geometry** probably one of the most exciting fields of mathematics. **First** came the development of what is called **projective geometry** (J. V. Poncelet, **1822**).

No notion of parallelism exists

In projective geometry, **points** and **lines** are **interchangeable** and there is **no** such thing as **parallelism**. Any two lines **intersect** at exactly one **point**. Nowadays it is very popular in **computer graphics**. **Strictly speaking, it is not considered a non-Euclidean geometry**, but it was probably the first to undermine the Euclidean concept of **parallelism**.

Hyperbolic Geometry

In geometry I find certain imperfections which I hold to be the reason why this science [...] can as yet make no advance from that state in which it came to us from Euclid. As belonging to these imperfections, I consider [...] the momentous gap in the theory of parallels, to fill which all efforts of mathematicians have so far been in vain. (N. Lobachevsky, 1826)

This idea was so **unconventional** and **theoretical** at the time of publication that Lobachevsky was openly derided by colleagues at his university for writing and publishing such nonsense. Some mathematicians of his time said that this was work at the level of a **school teacher**, not a university professor.

I have discovered such wonderful things that I was amazed... Out of nothing I have created a strange new universe. (János Bolyai to his father)

To praise it would amount to praising myself. For the entire content of the work... coincides almost exactly with my own meditations [in the] past thirty or thirty-five years. (Gauss to Farkas Bolyai)

Riemann (Differential) Geometry

Manifolds in which, as in the plane and in space, the line-element may be reduced to the form $\sqrt{\sum{dx^2}}$ are therefore only a particular case of the manifolds to be here investigated; they require a special name, and therefore these manifolds in which the square of the line-element may be expressed as the sum of the squares of complete differentials I will call flat. (Riemann)

Riemann also constructed a **non-Euclidean geometry** on the **sphere**, sometimes called **Riemannian geometry** in the narrow sense. In this construction not only does the **fifth postulate** fail, as in the construction of **Lobachevsky**, **BUT** the **second** postulate fails as well.

Towards the end of the **1800s** these geometries had become separate, **non-unified** fields. Mathematicians were **debating** which geometry is the **right** one and **WHAT** actually defines a geometry.

WHAT actually defines geometry? The Erlangen Programme

Given a [homogeneous] manifold and a transformation group acting [transitively] on it, to investigate those properties of figures on that manifold which are invariant under transformations of that group. (F. Klein, 1872)

Klein proposed a very **radical** approach: **treating** geometry as the **study of invariants** and **symmetries**, that is, the **properties** that remain **unchanged** under some class of **transformations**. This approach immediately created clarity by showing that **different** geometries could be defined by an appropriate choice of **symmetry** group.

Erlangen Programme

Influence to other fields

These ideas about symmetry in geometry were very profound and spilled over into other fields, in particular **physics**.

Every [differentiable] symmetry of the action of a physical system [with conservative forces] has a corresponding conservation law. (E. Noether, 1918)

E. Noether, Klein's colleague in Göttingen, **proved** that every differentiable symmetry of the action of a physical system **has a corresponding conservation law**. By all means this was a **stunning result**: beforehand, discovering fundamental laws such as the **conservation of energy** required very detailed **experimental** observations. You would measure the energy in many experiments and see that, up to small **errors**, it remains the same; it was an **empirical result** not derived from anywhere. What **Noether's theorem** established is that the conservation of energy emerges from the translational symmetry of time. It is a rather **intuitive** idea: the results of an experiment should be the same whether you ran it yesterday, run it today, or run it tomorrow.

Gauge Invariance

Standard model in particle physics

After several decades, **non-abelian gauge theory**, developed by **Yang** & **Mills** in **1954**, finally made it possible to provide a **unified framework** that describes all the fundamental forces of nature, with the exception of gravity. This is what is called the Standard Model of particle physics, and it unifies the description of electromagnetism, the weak interaction and the strong interaction, all in the **language** of **group theory** and **gauge invariance**.

Unification of the electromagnetic and weak forces (modelled with the groups U(1) × SU(2)) and the strong force (based on the group SU(3)). (C. N. Yang & R. L. Mills, 1954)

It is only slightly overstating the case to say that Physics is the study of symmetry. (P. Anderson, 1972)

What does all this historical background have to do with Deep Learning?

The current state of affairs in the field of Deep Learning is reminiscent of the situation of **geometry** in the **19th** century. In the past decade Deep Learning has brought a true revolution to the data science world. It made **possible** many tasks that perhaps **20 years ago** would have been considered nearly **science fiction**: the **computer vision** that powers **autonomous driving**, the **speech recognition** behind every single mobile phone today, **natural language translation** sometimes **trained without supervision**, **playing** intelligent **games** like **Go**, or **scientific** tasks such as solving the **protein folding** problem.

On the other hand, we **nowadays** have a **zoo** of different **neural network architectures** for different types of **data** but very **few unifying principles**. As a consequence it is **difficult** to **understand** the **relations** between different methods, which **inevitably leads** to the **reinvention** and **rebranding** of the same concepts.

20th Century Zoo of Neural Network Architectures

The **same ideas** are presented and published several times in **different communities** under **different names**. Sometimes this also brings unpleasant and bitter fights over **priority**, so it is really an **unhealthy** situation.

The Erlangen Programme of ML: Geometric Deep Learning

We need some form of **unification**, and we want to do it in the spirit of the **Erlangen Programme**. This is what we call **Geometric Deep Learning**, and it serves **2 purposes**:

1. **Pedagogical perspective**: It provides a **common** mathematical **framework** to study the most successful neural architectures that are currently used ubiquitously in Deep Learning, or Deep **representation** Learning.
2. **Constructive procedure to incorporate prior knowledge into neural networks**: Build **future** architectures in a **principled** way.

Question: **What will come after Deep Learning**, or perhaps beyond learning by **backpropagation**?

Answer: The hope is that these principles will **outlast** their particular implementations.

Question: **When was this theory born?**

First appearance

Answer: The term Geometric Deep Learning was popularised in a **2017** paper in the **IEEE** *Signal Processing* Magazine, co-authored by Michael M. Bronstein and Joan Bruna.

More recently the theory has been extended, to approximately the length of a **book**, and refined with the help of new contributors such as Taco Cohen and Petar Veličković. The new paper is available **online**.

Motivation

If we consider the **simplest** setting of machine learning, **supervised learning**, it is essentially a **glorified function estimation** problem. **Problem description**: we are given some **unknown** function and we **observe** its **output** on what is called a **training set**. What we try to **do** is **find** a **function**, drawn from some **class of functions** (a **hypothesis class**), that **fits** the **training data** well while keeping its **generalization** properties. This way we try to **predict** outputs on previously **unseen inputs**, the so-called **test set**.
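As a minimal sketch of this setting (with a made-up target function and a polynomial hypothesis class chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# The "unknown" function, observed (with noise) only on a training set.
f_unknown = lambda x: np.cos(3 * x)
x_train = rng.uniform(-1, 1, 30)
y_train = f_unknown(x_train) + 0.05 * rng.normal(size=30)

# Hypothesis class: polynomials of degree 5; fit to the training data by least squares.
coeffs = np.polyfit(x_train, y_train, deg=5)

# Generalization: predict on previously unseen inputs (the test set).
x_test = rng.uniform(-1, 1, 200)
test_error = np.mean((np.polyval(coeffs, x_test) - f_unknown(x_test)) ** 2)
print(f"test mean squared error: {test_error:.4f}")
```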

Learn to discriminate dogs from cats via Supervised Learning

The typical example given is **image classification**, where you have a set of **dog** and **cat** images and the task is **binary classification**. This means that, given any possible input image assumed to contain a cat or a dog, the function should classify it into the class of cats or the class of dogs.

Supervised ML = Function Approximation

Question: **What happened over the past decade?**

1. **Availability of large and high-quality data sets**: For example, data sets such as ImageNet with millions of **labeled** images.
2. **Sufficient computational resources**: The continuous improvement and development of **GPUs**, which were originally conceived as **g**raphics hardware, turned out to be especially well suited to the general-purpose computations encountered in deep learning and AI in general (**array programming**).

These two trends have led to the design of **rich function classes** that have the capacity, at least in **theory**, to **interpolate** such large datasets.

Perceptron

Perceptrons, Rosenblatt 1957

Neural networks are of course **not new at all**. They are at least **70 years** old; the first works date from the **1950s**, with a very **simple** choice of **architecture**, the so-called **perceptron**. This type of architecture is probably the earliest and **simplest** neural network.

Universal Approximation

We can show that if we **connect** just **2 layers** of such neurons, obtaining what is called a **multi-layer perceptron** or a perceptron with one **hidden layer**, we get a **dense** class of functions. In other words, we can **approximate any continuous function**, or even broader classes of functions, to **any** desired **accuracy**. We call this **property** universal approximation. It is a very **general** architecture that can **represent** practically **anything**.

Cybenko 1989; Hornik 1991; Barron 1993; Leshno et al 1993; Maiorov 1999; Pinkus 1999
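As a rough numerical illustration of this density property (not a proof), we can approximate a 1-D continuous function with a single hidden layer of ReLU units, fitting only the output weights by least squares; all names and constants below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target continuous function we want to approximate on [0, 1].
target = lambda x: np.sin(2 * np.pi * x)

# One hidden layer of ReLU units: h(x) = relu(x * w + b).
n_hidden = 200
w = rng.normal(scale=10.0, size=n_hidden)
b = rng.uniform(-10.0, 10.0, size=n_hidden)

def hidden(x):
    return np.maximum(0.0, np.outer(x, w) + b)

# Fit only the output layer by least squares (enough to illustrate the idea).
x_train = rng.uniform(0, 1, 1000)
coef, *_ = np.linalg.lstsq(hidden(x_train), target(x_train), rcond=None)

x_test = np.linspace(0, 1, 500)
err = np.max(np.abs(hidden(x_test) @ coef - target(x_test)))
print(f"max approximation error: {err:.4f}")  # shrinks as the hidden layer widens
```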

Kolmogorov–Arnold representation theorem: Model as a form of Artificial Intelligence

At the second International Congress of Mathematicians in Paris in 1900, Hilbert presented ten of his 23 problems, including the 13th problem about equations of degree seven. He considered the following equation,

$x^7+ax^3+bx^2+cx+1=0,$

and asked whether its solution $x(a,b,c)$, seen as a function of the three parameters a, b and c, can be written as the **composition** of functions of only two variables.

In the Kolmogorov-Arnold formal model, the output $z$ depends on the vector of input parameters $x$. The formal model requires continuity, but we relax this condition and only assume that small increments in the inputs cause small differences in the output. (Ref [4])

$z = F(x_1,x_2,\dots,x_n)$

$z + \Delta z = F(x_1 + \Delta x_1, x_2 + \Delta x_2, \dots, x_n + \Delta x_n)$

$z = \sum_{k=1}^{2n+1}\Phi_{k}\Bigg[ \sum_{j=1}^{n} f_{k,j}(x_{j})\Bigg]$

The model itself is a set of unspecified functions.

Let $f : [0,1]^{d} \mapsto \mathbb{R}$ be continuous. There exist **univariate** continuous functions $g_{q}, \Psi_{p,q}$ such that

$f(x_{1},\dots,x_{d}) = \sum_{q=0}^{2d}g_{q} \Bigg( \sum_{p=1}^{d} \Psi_{p,q}(x_{p})\Bigg)$

Nowadays there are better representations than the one exposed here. There are **3** main reasons why there are no explicit formulas for $g_{q}, \Psi_{p,q}$:

1. The proofs of these theorems are **non-constructive**.
2. The **outer function** $g_{q}$ highly depends on $f$, so we can't just choose one.
3. The **inner function** $\Psi_{p,q}$ is continuous but typically highly non-smooth.

This form has **similarities** with a **2-layered** neural network but this is highly debated.

For example, the function $f_{1}(x_1,x_2) = x_{1}^{2} + x_{2}$ can be decomposed as in the theorem, since it is a summation of continuous univariate functions. Let $g_{1}(x) = x$, $\Psi_{1,1}(x_1) = x_{1}^2$, $\Psi_{2,1}(x_2) = x_2$, and set all the other functions to $g_q = \Psi_{p,q} = 0$.
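A quick numerical sanity check of this decomposition (function names are ours, purely illustrative):

```python
import numpy as np

# Decomposition of f1(x1, x2) = x1**2 + x2 in Kolmogorov-Arnold form:
# f1(x1, x2) = g1( Psi_11(x1) + Psi_21(x2) ), all other terms set to zero.
g1     = lambda t: t           # outer function (identity)
psi_11 = lambda x1: x1 ** 2    # inner function acting on x1
psi_21 = lambda x2: x2         # inner function acting on x2

def f1_direct(x1, x2):
    return x1 ** 2 + x2

def f1_ka(x1, x2):
    return g1(psi_11(x1) + psi_21(x2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(1000, 2))
assert np.allclose(f1_direct(x[:, 0], x[:, 1]), f1_ka(x[:, 0], x[:, 1]))
```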

The Curse of Dimensionality

This is a very well studied problem in **approximation theory**, and in **low dimensions** we have many results that tell us exactly how the **error** behaves, for example **how** the approximation error decays if we **sample** our data in a certain way. It has been studied very extensively over the **past century** or even more, but the situation is **completely** different in **high dimensions**, even if we pick a very nice class of functions, the so-called **Lipschitz continuous functions**.

Lipschitz functions: superposition of Gaussian blobs placed in the quadrants of a unit cube

We find very quickly that, as the **dimensionality** of this **space**, the unit cube, grows, the **number of samples** needed grows **exponentially**. This phenomenon is colloquially known as the curse of dimensionality. Modern machine learning methods need to operate with data not in two or three dimensions but in **thousands** or even **millions** of dimensions. **Images** serve as an illustrative example of a very high-dimensional input space.

Exponential growth explosion
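To get a feel for the numbers: if we wanted roughly one sample per grid cell of side $\varepsilon$ in the unit cube, the required sample count $(1/\varepsilon)^d$ explodes with the dimension $d$; a minimal sketch:

```python
# Covering the unit cube [0, 1]^d with a grid of spacing eps needs about
# (1/eps)^d samples, i.e. exponentially many in the dimension d.
eps = 0.1
cells_per_axis = round(1 / eps)
for d in (1, 2, 3, 10, 100, 784):      # 784 = 28*28, the dimension of an MNIST image
    n_samples = cells_per_axis ** d    # exact integer arithmetic
    print(f"d = {d:4d}: ~10^{len(str(n_samples)) - 1} samples")
```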

The curse of dimensionality is simply an inevitable curse accompanying every machine learning problem. The naive approach to learning exposed above is therefore completely infeasible.

This is probably best seen in **computer vision** applications like **image classification**. Images are a notoriously high-dimensional input space: even tiny images from the MNIST data set have almost a thousand dimensions. But if we look deeper at this problem, we intuitively see that there is a lot of **structure**.

MNIST data set example: Number three

If we treat the input image as a **d-dimensional vector** and feed it, for example, to a perceptron, we **destroy** the **input structure**, in this case the **local spatial connectivity** between pixels of the image. As a result, if we take the image and **shift** it by just **one pixel**, the vectorised input will be very **different**, and the neural network will need to be shown **a lot of examples** (*data augmentation*) in order to **learn** that **shifted** inputs must be classified the same way as inputs to which no **spatial transformation** has been applied. **Conclusion**: SHIFT INVARIANCE MUST BE LEARNED FROM DATA.

MNIST data set example: Number three shifted by one single pixel

Question: **How do we learn spatial transformations applied to the inputs?** Answer: using a standard practice called **data augmentation**. If we don't have a way of modelling specific transformations, or of telling the model that shifted versions of these digits **must** be treated as the **same**, we simply **add** such examples to the **training set** in order to **increase** the diversity of the inputs seen during training. This can be enormously **wasteful**.
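A small sketch of why this is wasteful: a one-pixel shift leaves the image content untouched but changes the flattened vector dramatically (toy example, illustrative values only):

```python
import numpy as np

# A toy 28x28 "image": a single bright blob, and the same blob shifted by one pixel.
img = np.zeros((28, 28))
img[10:14, 10:14] = 1.0
shifted = np.roll(img, shift=1, axis=1)   # shift right by one pixel

# As flattened d-dimensional vectors the two inputs look very different ...
v, v_shifted = img.ravel(), shifted.ravel()
print(np.linalg.norm(v - v_shifted))      # large, comparable to the norm of the image
print(np.linalg.norm(v))

# ... even though the content is identical up to a translation, which is why a
# plain perceptron must be shown many augmented copies to learn this equivalence.
```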

Learning Structure

Question: **If there exists structure and shared information in the input space, how can we exploit that to achieve better learning and understanding?**

Hubel, Wiesel 1962. LeCun et al. 1989

Answer: In the field of **computer vision**, the solution came from a classic work in neuroscience, in particular the Nobel Prize-winning work of **Hubel** and **Wiesel**, who studied the **organization of the visual cortex**. They showed that brain **neurons** are organised into what they called **local receptive fields**. This was enough inspiration to influence a new type of **architecture** with **local shared weights**. A remarkable early example is the **neocognitron** of Fukushima (1980); his 1980 paper already contained many elements of modern deep learning, for example the **activation function** he used was already a **ReLU** (**Re**ctified **L**inear **U**nit). This line of work culminated in **C**onvolutional **N**eural **N**etworks (**CNNs**) with the pioneering work of **Yann LeCun** (1989). The concept of **weight sharing** across the image effectively, if partially, solved the curse of dimensionality.
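A minimal sketch of weight sharing and the resulting shift equivariance, using a single shared 3×3 filter (an off-the-shelf convolution routine stands in for a CNN layer):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

# A single 3x3 filter whose 9 weights are shared across every image location
# (local receptive field + weight sharing), regardless of the image size.
kernel = rng.normal(size=(3, 3))
image = rng.normal(size=(28, 28))

out = convolve2d(image, kernel, mode="same")
out_from_shifted = convolve2d(np.roll(image, 1, axis=0), kernel, mode="same")

# Away from the boundary, shifting the input shifts the output identically:
print(np.allclose(np.roll(out, 1, axis=0)[2:-1], out_from_shifted[2:-1]))  # True
```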

Beyond Grids

Graph representation of a caffeine molecule

Chemical compounds like this caffeine **molecule** can be represented using a **graph**, where the **nodes** represent atoms and the **edges** represent chemical bonds. If we were to apply a neural network to this input, for example to **predict** some chemical property like its binding energy, we could again flatten it into a **vector** and feed it to the neural network. But now we see that we can **rearrange** the **nodes** in basically any way: we can effectively **permute** the nodes of our graph, because in graphs, unlike images, there is no preferential or canonical **ordering** of the nodes.
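A minimal sketch of this requirement: permuting the node ordering changes the adjacency matrix and the feature matrix, but a permutation-invariant prediction should not change (the readout below is a toy choice, not any specific GNN):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy molecular graph: adjacency matrix A (bonds) and node features X (atom features).
n = 5
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T                  # symmetric, no self-loops
X = rng.normal(size=(n, 3))

# Permuting the (arbitrary) node ordering relabels rows/columns of A and rows of X.
perm = rng.permutation(n)
P = np.eye(n)[perm]
A_perm, X_perm = P @ A @ P.T, P @ X

# A prediction f(A, X) for the molecule should not depend on that ordering,
# e.g. a simple permutation-invariant readout: sum of neighbourhood-averaged features.
def readout(A, X):
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    return ((A @ X) / deg).sum(axis=0)

print(np.allclose(readout(A, X), readout(A_perm, X_perm)))  # True
```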

Molecular graphs are just one example of data with this **irregular non-Euclidean structure** on which we would like to apply deep learning. Another prominent example is **social networks**, which are tremendously large, with hundreds of millions of nodes and billions of edges. We can also find different kinds of networks in **biology**, such as **interaction networks**, as well as **manifolds** in computer graphics, and many other examples of data that could benefit from a more structured and principled mechanism of learning.

Symmetry Priors

Fortunately we do have **additional structure** that comes from the **geometry** of the input **signal**. We call this structure a **symmetry prior**. It is a general and powerful principle that gives us hope of dealing with the curse of dimensionality. In our previous example of image classification, the input image is **not** just a **d-dimensional** vector: it is a **signal** defined on some **geometric domain**, denoted here by $\Omega$, which in this example is a **two-dimensional grid**, and we denote the **space of signals** on it by $\mathcal{X}(\Omega)$.

The structure of the domain is captured by a **symmetry group**, denoted by $\boldsymbol{\mathfrak{G}}$, in this case the group of **two-dimensional translations**, that **acts** on the points of the **domain** and thus describes its **structure (symmetry)**. The action of the group on points of the domain is manifested on signals defined on this domain through what is called the **group representation**, denoted by $\rho$. For now, and without any further detailed explanation about groups, we only need to understand that we have this representation. In the case of this example, pursuing an effort of further clarification, the representation is a $d \times d$ **matrix** that acts on d-dimensional vectors; it is simply a **translation** or **shift operator**.

Invariant functions: Image Classification
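As a minimal sketch of this last point (on a one-dimensional toy domain rather than the two-dimensional grid discussed above), the representation of a translation by one step can be written explicitly as a $d \times d$ permutation matrix acting on the signal vector:

```python
import numpy as np

# 1-D toy domain Omega with d = 8 points; a signal x in X(Omega) is a d-vector.
d = 8
x = np.arange(d, dtype=float)

# The representation rho of a translation by one step is the d x d shift matrix:
# (rho @ x)[i] = x[i - 1] (cyclic at the boundary, for simplicity).
rho = np.roll(np.eye(d), 1, axis=0)

print(rho @ x)         # [7. 0. 1. 2. 3. 4. 5. 6.]
print(np.roll(x, 1))   # the same: rho acts on signals as a shift operator
```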

The geometric structure of the domain that underlies the input signal **affects** the **functions** that we define on the **signals** and try to **learn**. Let the function here be denoted by $f$. We can then have functions that are **unaffected** by the **action** of the group, what we call **invariant functions**. For example, in the image classification problem, **no** matter where the **cat** is **located** in the image, we still want to say that the image contains a cat. This is an example of **shift invariance**.

Invariant functions: Shift invariance

Equivariant functions: Image Segmentation

We can also have cases where the function has the **same** input and output **structure**, as occurs in **image segmentation** problems, where the output is a pixel-wise label mask. In these cases we want the **output** to be **transformed** in exactly the same way as the **input**. We call this class of functions **equivariant functions**. The following example shows shift equivariance in the setting of an image segmentation problem.

Equivariant functions: Shift equivariance
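A minimal numerical sketch of the two behaviours, using a cyclic shift as the group action and toy stand-ins for a classifier (global pooling) and a segmentation-style map (a pointwise operation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))                  # toy image signal
gx = np.roll(x, shift=(2, 3), axis=(0, 1))     # group action: a (cyclic) shift

# Invariant function (think "classification"): global sum pooling.
f_inv = lambda img: img.sum()
print(np.isclose(f_inv(x), f_inv(gx)))         # True: f(g.x) = f(x)

# Equivariant function (think "segmentation"): a pointwise/local operation.
f_eqv = lambda img: np.maximum(img, 0.0)       # e.g. a ReLU applied per pixel
print(np.allclose(np.roll(f_eqv(x), (2, 3), (0, 1)), f_eqv(gx)))  # True: f(g.x) = g.f(x)
```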

Scale Separation Prior

In some cases we can construct a **multi-scale hierarchy** of domains. For example, continuing with the image classification example of the number three, by **coarsening** or stretching the **grid** we obtain another version of the grid, denoted by $\Omega'$. This coarsening assimilates **nearby** points on the domain and also produces a hierarchy of **signal spaces** that are related by what is called the **coarse-graining operator**, denoted by $P$. On this coarse grid we can define a **new coarse-scale function**, denoted by $f'$, and we say that our function is **locally stable** if it can be **approximated** as the **composition** of the coarse-graining operator and the coarse-scale function.

Scale Separation Prior

While the original function might **depend** on **long-range interactions** on the domain, these **locally stable functions** can possibly separate the interactions across **scales**. In our image classification example, this corresponds to **applying** a classifier to a **lower-resolution** image.
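A minimal sketch of this idea, with the coarse-graining operator $P$ taken to be 2×2 average pooling and a toy linear map standing in for the coarse-scale function $f'$ (all names here are illustrative):

```python
import numpy as np

# Coarse-graining operator P: 2x2 average pooling from a 28x28 grid Omega
# to a 14x14 grid Omega'.
def P(x):
    return x.reshape(14, 2, 14, 2).mean(axis=(1, 3))

# Locally-stable idea: instead of evaluating f on the fine grid, apply a
# coarse-scale function f' to the pooled signal, i.e. approximate f ≈ f' ∘ P.
rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28))

w_coarse = rng.normal(size=(14, 14))              # toy coarse-scale linear "classifier" f'
f_prime = lambda u: float((w_coarse * u).sum())

print(f_prime(P(x)))   # prediction made at the lower resolution
```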

Geometric Deep Learning Blueprint

Using these principles we can recognize the majority of popular neural **architectures**: we apply a **sequence of equivariant** layers that **preserve** the **structure** of the domain, possibly followed by an **invariant global pooling** layer that **aggregates** everything into a **single** output. In some cases we might also build a **hierarchy of domains** by some **coarsening procedure** that takes the form of **local pooling**.

Geometric Deep Learning general framework

As this framework acts broadly across the structure of the input domain, it can be applied to any kind of geometric data.
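A minimal sketch of the blueprint on a 2-D grid, assuming shift-equivariant layers implemented as convolutions, local pooling as 2×2 max pooling, and a global mean as the invariant aggregation (toy code, not a reference implementation):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

# A minimal instance of the blueprint on a 2-D grid:
#   shift-equivariant layers -> local pooling (coarsening) -> global invariant pooling.
kernels = [rng.normal(size=(3, 3)) for _ in range(2)]

def equivariant_layer(x, k):
    return np.maximum(convolve2d(x, k, mode="same"), 0.0)    # conv + ReLU

def local_pool(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))  # 2x2 max pooling

def model(x):
    for k in kernels:
        x = local_pool(equivariant_layer(x, k))
    return x.mean()                                           # global invariant pooling

print(model(rng.normal(size=(28, 28))))
```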

The “5G” of Geometric Deep Learning

The implementation of these principles, in the form of inductive biases, leads to some of the most **popular** architectures that exist today in deep representation learning: **CNNs**, **G**raph **N**eural **N**etworks, **Transformers** and **L**ong **S**hort-**T**erm **M**emory networks. The "5G" refers to the classes of geometric domains on which they operate: Grids, Groups, Graphs, Geodesics and Gauges.

References
