Graphs II

Purpose

On graphs and sets we study:
  • Permutation invariance
  • Permutation equivariance
Guiding principles to synthesise and steer the design of deep learning architectures:
  1. Symmetries
  2. \boldsymbol{\mathfrak{G}}-Invariance
  3. \boldsymbol{\mathfrak{G}}-Equivariance
  4. Locality
  5. Scale Separation

The Building Blocks of Geometric Deep Learning

The following statements encapsulate and sum up the general geometric deep learning framework.
Let \Omega and \Omega' be domains, and \boldsymbol{\mathfrak{G}} a symmetry group over \Omega. We write \Omega' \subseteq \Omega to mean that \Omega' can be considered a compact (coarsened) version of \Omega.
We define the following building blocks:
Linear \boldsymbol{\mathfrak{G}}-equivariant layer
B: \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}')
satisfying linearity and equivariance to the group action: no matter which symmetry group transformation \boldsymbol{\mathfrak{g}} we take and which feature field x,
B(\boldsymbol{\mathfrak{g}}.x) = \boldsymbol{\mathfrak{g}}.B(x) \quad \forall \boldsymbol{\mathfrak{g}} \in \boldsymbol{\mathfrak{G}} \text{ and } x \in \mathcal{X}(\Omega,\mathcal{C})
Nonlinearity
\sigma: \mathcal{C} \to \mathcal{C}'
applied element-wise as
\big( \boldsymbol{\sigma}(x) \big)(u) = \sigma(x(u))
In deep learning we typically inject nonlinearities through point-wise activation functions such as the sigmoid, hyperbolic tangent, or ReLU.
When we compose linear \boldsymbol{\mathfrak{G}}-equivariant layers and nonlinearities, we end up with universal approximators over the domain.
Local pooling (coarsening)
P: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C})
such that \Omega' \subseteq \Omega.
Coarsening: Sometimes it is useful to coarsen the domain. This means going from an initial \Omega to a slightly smaller \Omega' contained within the previous domain.
\boldsymbol{\mathfrak{G}}-invariant layer (global pooling)
A: \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{Y}
satisfying
A(\boldsymbol{\mathfrak{g}}.x) = A(x) \quad \forall \boldsymbol{\mathfrak{g}} \in \boldsymbol{\mathfrak{G}} \text{ and } x \in \mathcal{X}(\Omega, \mathcal{C})
If we need answers on the level of the entire domain, we can use a global pooling layer, which is invariant to the symmetry group transformations.
Geometric Deep Learning blueprint

GDL Framework in action: Graphs & Sets

Motivation

We start by demonstrating the blueprint on graphs, because graphs and sets give us a discrete domain with minimal geometric assumptions that is easy to analyse.
Also, with some assumptions and transformations, any domain can be seen as a graph.
Question: Why might we want to study data that lives on graphs?
Answer: Graphs are common structures all around us. They can be thought of as one of the main modalities in which data from nature arrives.
e.g.:
  • Molecular graphs
  • Transportation networks
  • Social networks
The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system. (Jay Wright Forrester, 1971)

Learning On Sets: Graphs Without Connections

Why?

  1. Graphs without connections, called unordered sets, have only nodes and no connections between them. This structure constitutes a simpler domain, so it will be much easier to analyse the geometric architectures that arise.
  2. A significant part of the conclusions we reach on sets will naturally carry over to graphs.
  3. Even if we restrict ourselves entirely to processing data on sets with this kind of structure, it would still be a relevant area, e.g. point cloud data.
Setup
Assume our graph has no edges:
\textcolor{Red}{\Omega} = \mathcal{V}, \quad E = \emptyset
where \mathcal{V} is the set of nodes.
We typically want to featurize this domain in some way. Thus we will assume that node i has features x_i \in \mathbb{R}^k, a real-valued vector with k-dimensional features. This corresponds to our feature space \mathcal{C}.
Our feature space then is
\textcolor{Red}{\mathcal{C}} = \mathbb{R}^{k}
Once we have these features, a typical way to process data like this is by stacking the node features into a node feature matrix \boldsymbol{X}:
\boldsymbol{X} = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)^{\top}
This feature matrix \boldsymbol{X}, of shape |\mathcal{V}| \times k, is interpreted so that row i corresponds to the features of the i-th node.
NOTE: The moment we build this feature matrix we have already introduced some notion of order, breaking our strong initial assumption that the set is completely unordered. We would therefore like a neural network that operates on \boldsymbol{X} but does not depend on the order of its rows.
e.g.:
Given an unordered set of five nodes with features \{x_1, x_2, x_3, x_4, x_5\}, we are asked to learn some class of functions (neural networks) that take the features of these individual nodes and produce an output on the level of the entire set. This function f should be insensitive to the order in which we feed it these features:
f(\{x_1, x_2, x_3, x_4, x_5\}) = y = f(\{x_2, x_5, x_4, x_3, x_1\})
Conclusion: Even if we completely perturb the order in which we give the set elements to the function f, we expect the output y to be unaffected. In the GDL framework this corresponds to using the n-element permutation group \Sigma_n as our symmetry group \boldsymbol{\mathfrak{G}}. The different group elements \boldsymbol{\mathfrak{g}} \in \boldsymbol{\mathfrak{G}} are the permutations; a permutation changes the order in which we see the elements of the set, and we want our neural networks to be insensitive to this.

Permutations and Permutation Matrices

Idea
It's useful to think about operators that change the node order, e.g.:
(1,2,3,4) \rightarrow (2,4,1,3)
This example permutation maps the following actions over the set:
  • y_1 \leftarrow x_2
  • y_2 \leftarrow x_4
  • y_3 \leftarrow x_1
  • y_4 \leftarrow x_3
Formally, a permutation on a set S is equivalently
  1. a bijection (a one-to-one correspondence) from S to itself,
  2. a pair of linear orderings on S,
A linear order on a set S is a binary relation < with the following properties:
  • irreflexivity: x \nless x,
  • asymmetry: if x < y, then y \nless x,
  • transitivity: if x < y < z, then x < z,
  • comparison: if x < z, then x < y or y < z,
  • connectedness: if x \nless y and y \nless x, then x = y.
  3. an element of the symmetric group on S.
As automorphisms \sigma: S \to S in the category of sets, the permutations of S naturally form a group under composition, called the symmetric group (or permutation group) on S. This group may be denoted by S_S, \Sigma_S, or S!, but we are going to denote it by \boldsymbol{\mathfrak{G}}. When S is a finite set \{1, \ldots, n\}, its symmetric group is a finite group of cardinality n!, typically written \Sigma_n.
Within linear algebra, each permutation defines a |\mathcal{V}| \times |\mathcal{V}| (i.e. n \times n) matrix. Such matrices, called permutation matrices (the group representation \rho(\boldsymbol{\mathfrak{g}})), are zero almost everywhere and have exactly one 1 in every row and column. The positions of these ones correspond to the order of the nodes in the permutation.
e.g.: For our initial example
\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \boldsymbol{x}_3 \\ \boldsymbol{x}_4 \end{bmatrix}
the permutation matrix looks as follows:
\boldsymbol{P}_{(2,4,1,3)} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
Its effect when left-multiplied is to permute the rows of \boldsymbol{X}:
\boldsymbol{P}\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_2 \\ \boldsymbol{x}_4 \\ \boldsymbol{x}_1 \\ \boldsymbol{x}_3 \end{bmatrix}
Conclusion: This example illustrates that the permutation matrix is exactly the representation of the group action of the permutation group. Thus we can use this matrix representation to reason about permutations within the universe of linear algebra, which will make all of the subsequent analysis much easier.
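The row-permutation effect above can be checked numerically. A minimal sketch in NumPy, with a made-up 4×3 feature matrix as an illustrative assumption:

```python
import numpy as np

# A hypothetical 4-node feature matrix X (rows = node features).
X = np.arange(4 * 3).reshape(4, 3)

# Build the permutation matrix for (1,2,3,4) -> (2,4,1,3):
# row i of P has a 1 in the column of the node it pulls from.
perm = [1, 3, 0, 2]   # zero-based indices of (2, 4, 1, 3)
P = np.eye(4)[perm]

# Left-multiplying permutes the rows of X.
assert np.array_equal(P @ X, X[perm])

# Permutation matrices are orthogonal: P^T undoes the permutation.
assert np.array_equal(P.T @ (P @ X), X)
```

The orthogonality P^⊤P = I is what later lets P^⊤ act as the inverse permutation on adjacency-matrix columns.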

Permutation invariance

We want functions f(\boldsymbol{X}) that operate over set node features but do NOT depend on the order of the nodes. Equivalently, applying a permutation matrix to \boldsymbol{X} should not change the result. Thus f(\boldsymbol{X}) is permutation invariant if and only if, for all permutation matrices \boldsymbol{P}:
f(\boldsymbol{P}\boldsymbol{X}) = f(\boldsymbol{X})
This property is desirable for set neural networks when we assume our domain is an unordered set: no matter how the rows of \boldsymbol{X} are ordered, the result stays unchanged.
We can compare this with the \boldsymbol{\mathfrak{G}}-invariant layer requirement of the geometric deep learning framework: in this case our f corresponds to the \boldsymbol{\mathfrak{G}}-invariant layer A, and the group action \boldsymbol{\mathfrak{g}} corresponds to applying a permutation matrix. Therefore we have a one-to-one correspondence with the framework.
Deep Sets model
This is a general procedure for performing representation learning on sets in a permutation-invariant way. It applies a point-wise neural network \psi to every single node in isolation. Then, to guarantee permutation invariance, we sum the features of all the nodes in the set and finally transform the result with another neural network, denoted \phi:
f(\boldsymbol{X}) = \textcolor{Orange}{\phi \Bigg (} \sum_{i \in \mathcal{V}} \textcolor{Green}{\psi(} x_i \textcolor{Green}{)} \textcolor{Orange}{\Bigg )}
In this expression \textcolor{Orange}{\phi} and \textcolor{Green}{\psi} are learnable functions.
The sum aggregation achieves permutation invariance because, no matter how we permute the rows of a matrix, their sum is always the same. We can also use other aggregators, such as the maximum or the average, which likewise preserve permutation invariance. Thus we can generalise the previous expression to
f(\boldsymbol{X}) = \textcolor{Orange}{\phi \Bigg (} \bigoplus_{i \in \mathcal{V}} \textcolor{Green}{\psi(} x_i \textcolor{Green}{)} \textcolor{Orange}{\Bigg )}
From this point on, whenever we see the generic aggregation operator \bigoplus, it can stand for anything that is permutation invariant over its arguments.

Permutation equivariance

Permutation invariant models are good for set-level outputs, i.e. when we want outputs on the level of the entire set.
Question: What if we would like answers at the node level?
Answer: To preserve and identify node outputs we cannot use the permutation-invariant summation; summation would destroy node-level outputs, or at least make them very hard to recover.
Rather than forcing the output to be unchanged by permutation, we now let the permutation produce a new, different result, but in a predictable way: permuting the inputs and then applying the function gives the same result as applying the function and then permuting the outputs. In other words, it does not matter whether we apply the permutation before or after the function:
\boldsymbol{F}(\boldsymbol{P}\boldsymbol{X}) = \boldsymbol{P}\boldsymbol{F}(\boldsymbol{X})
NOTE: To stay consistent with linear-algebra notation, the function is capitalised and bold because it returns a matrix as output.
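The equivariance condition F(PX) = PF(X) can be verified for the simplest candidate, a shared point-wise map applied to every row; the weights here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))  # stand-in weights for a shared point-wise map

def F(X):
    """Shared function applied to each row independently: permutation equivariant."""
    return np.tanh(X @ W)

X = rng.normal(size=(5, 3))
perm = rng.permutation(5)
P = np.eye(5)[perm]

# Permuting the inputs permutes the outputs in exactly the same way.
assert np.allclose(F(P @ X), P @ F(X))
```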
Compare with GDL framework: Linear \boldsymbol{\mathfrak{G}}-equivariant layer B: \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}'), satisfying linearity and equivariance to the group action, no matter which symmetry group transformation \boldsymbol{\mathfrak{g}} we take and which feature field x.

Locality

Strict invariance or equivariance to symmetries is not the only property we want. It's also desirable that our neural networks' predictions do not collapse under imperfect symmetry transformations.
In practice data is rarely transformed by an exact symmetry element; noise is acquired and added during the transformation, generating distortions. We would therefore want the signal to be stable under slight deformations of the domain.
To achieve this stability under distortions, it is crucial to compose local operations to model large-scale ones, as local operations do not propagate errors globally. This implies that we want graph neural network (GNN) layers that behave locally with respect to the structure of the domain.
Question: What does it mean for an equivariant layer on sets to be local?
Answer: One way we can enforce locality in equivariant set functions is through a shared function \psi applied to every node in isolation:
\boldsymbol{h}_i = \psi(\boldsymbol{x}_i)
Stacking the \boldsymbol{h}_i in the same way we stacked the original \boldsymbol{x}_i yields a new matrix \boldsymbol{H}, which is the output of our function \boldsymbol{F}:
\boldsymbol{H} = \boldsymbol{F}(\boldsymbol{X})
Compare with GDL framework: stacking \textcolor{Cyan}{equivariant} functions, potentially with an \textcolor{Orange}{invariant} tail, yields useful set neural networks:
f(\boldsymbol{X}) = \textcolor{Orange}{\phi \Bigg (} \bigoplus_{i \in \mathcal{V}} \textcolor{Green}{\psi(} x_i \textcolor{Green}{)} \textcolor{Orange}{\Bigg )}
This expression is quite close to what our framework would construct. The point-wise function \textcolor{Green}{\psi} over nodes corresponds precisely to a local equivariant function; then, if we need predictions at the set level, we can add a global pooling layer at the very end, here the generic aggregator \bigoplus followed by \textcolor{Orange}{\phi}.
Conclusion: This is probably the best and most general form of a neural network that operates on sets without assuming or inferring additional structure or relations (edges).

Learning On Graphs

Now we can generalise the structure of the set of nodes, assuming they have edges between them constructing relations. Thus we consider graphs \mathcal{G} = (\mathcal{V}, E), where E \subseteq \mathcal{V} \times \mathcal{V} is a subset of the Cartesian product.
Further information about graph definitions and properties can be found in the chapter:
To stay within the realm of linear algebra, we represent the edges with a binary adjacency matrix, denoted \boldsymbol{A}. With this representation, edges have become part of the domain. Further additions, such as edge features or edge weights, are entirely possible.
Nevertheless, we will limit ourselves to working without any additional information attached to edges; all the mathematics derived for edges as a binary matrix carries over to more complex structures. So every edge is simply a binary indicator of whether or not two nodes are connected.
Our main desire with sets holds for graphs: we also want our graph neural networks to be insensitive to permutations of the nodes and edges that constitute the graph.
Permutation invariance & equivariance
We assume there is some connectivity structure between our nodes. Therefore, when we permute the nodes, we expect the edges to be permuted accordingly: applying the permutation yields an isomorphic graph.
When the nodes are permuted, we need to permute both the rows and the columns of the adjacency matrix \boldsymbol{A}. Thus if we use a permutation matrix \boldsymbol{P} to permute our nodes, we reorder the adjacency matrix as \boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^{\top}: the first \boldsymbol{P} permutes the rows, and \boldsymbol{P}^{\top} permutes the columns.
  • Invariance:
f(\boldsymbol{P}\boldsymbol{X}, \boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^{\top}) = f(\boldsymbol{X}, \boldsymbol{A})
  • Equivariance:
\boldsymbol{F}(\boldsymbol{P}\boldsymbol{X}, \boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^{\top}) = \boldsymbol{P}\boldsymbol{F}(\boldsymbol{X}, \boldsymbol{A})
Conclusion: These two properties carry over very naturally from sets.
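Both graph-level properties can be checked on a toy example. A sketch with an assumed neighbourhood-sum layer and a random symmetric adjacency matrix (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 3))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)  # symmetric (undirected) adjacency matrix

def F(X, A):
    """A simple equivariant layer: aggregate neighbour features via A."""
    return np.tanh(A @ X)

def f(X, A):
    """An invariant readout: sum the equivariant layer's node outputs."""
    return F(X, A).sum(axis=0)

perm = rng.permutation(n)
P = np.eye(n)[perm]
PX, PAPt = P @ X, P @ A @ P.T

assert np.allclose(F(PX, PAPt), P @ F(X, A))  # equivariance
assert np.allclose(f(PX, PAPt), f(X, A))      # invariance
```

The equivariance follows from P^⊤P = I: tanh(PAP^⊤PX) = tanh(PAX) = P tanh(AX).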

Locality: Neighbourhoods

Locality doesn't carry over from sets as naturally as invariance and equivariance do. We now assume there may be relations between nodes, which means we should not transform every node purely in isolation. Graphs give us broader context through the so-called node neighbourhood.
Node neighbourhood: For a node i, we can derive its 1-hop neighbourhood, denoted \mathcal{N}_i, as the set of all nodes j connected to i by exactly one edge.
Accordingly, we can extract the neighbourhood features \boldsymbol{X}_{\mathcal{N}_i} like so:
\boldsymbol{X}_{\mathcal{N}_i} = \{ \boldsymbol{x}_j : j \in \mathcal{N}_i \}
Now we can do something more generic than a point-wise local function: a local function, denoted \phi, that operates over the node itself and its neighbourhood,
\phi(\boldsymbol{x}_i, \boldsymbol{X}_{\mathcal{N}_i})
Doing so, we can construct permutation equivariant functions \boldsymbol{F}(\boldsymbol{X}, \boldsymbol{A}) by appropriately applying the local function \phi over all neighbourhoods:
\boldsymbol{F}(\boldsymbol{X}, \boldsymbol{A}) = \begin{bmatrix} \phi(\boldsymbol{x}_1, \boldsymbol{X}_{\mathcal{N}_1}) \\ \vdots \\ \phi(\boldsymbol{x}_n, \boldsymbol{X}_{\mathcal{N}_n}) \end{bmatrix}
To ensure equivariance, it is sufficient that \phi does not depend on the order of the nodes in \mathcal{N}_i.

Graph Neural Networks

Graph Neural Network example:
\boldsymbol{X}_{\mathcal{N}_i} = \{\!\!\{ \boldsymbol{x}_a, \boldsymbol{x}_b, \boldsymbol{x}_c, \boldsymbol{x}_d, \boldsymbol{x}_e \}\!\!\}
We have built an equivariant function on graphs by using a local permutation-invariant, learnable function \phi acting at the node level. The graph neural network computes a new feature representation for a node given its neighbourhood features, denoted \boldsymbol{X}_{\mathcal{N}_i}, which for the sake of extra context can contain the node itself. Applying the local function \phi corresponds to the expression:
\boldsymbol{h}_i = \phi(\boldsymbol{x}_i, \boldsymbol{X}_{\mathcal{N}_i})
We then apply this local function to every single node of the graph in parallel.

General procedure

A graph neural network takes as input a graph: node feature inputs \boldsymbol{X} (with \boldsymbol{x}_i the feature representation of a particular node) and an adjacency matrix \boldsymbol{A}. Applying the network updates the features (inputs) to their latent representations:
\boldsymbol{GNN}(\boldsymbol{X}, \boldsymbol{A}) \to (\boldsymbol{H}, \boldsymbol{A})
The update of a particular node depends on its neighbourhood features and, if desired, on the node itself. The adjacency matrix is preserved through this update, although there is room for more sophisticated procedures that explicitly change the relations living in the graph.
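A minimal sketch of this general procedure in NumPy: a sum-aggregating, untrained stand-in for φ updates every node from its neighbourhood and leaves A untouched. All weights and shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_self  = rng.normal(size=(3, 4))  # stand-in weights: transform the node itself
W_neigh = rng.normal(size=(3, 4))  # stand-in weights: transform aggregated neighbours

def gnn_layer(X, A):
    """GNN(X, A) -> (H, A): update node features from neighbourhoods, keep A."""
    agg = A @ X                               # sum of neighbour features per node
    H = np.tanh(X @ W_self + agg @ W_neigh)   # phi(x_i, X_{N_i})
    return H, A

n = 6
X = rng.normal(size=(n, 3))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)

H, A_out = gnn_layer(X, A)
assert H.shape == (n, 4) and np.array_equal(A_out, A)
```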
From this general procedure several possible tasks emerge:
  • \textcolor{Red}{Node} classification: we can learn a classifier acting at the node level to predict a node-level output.
\boldsymbol{z}_i = f(\boldsymbol{h}_i)
  • \textcolor{Green}{Graph} classification: classification at the level of the entire domain forces us to transform our node features in a permutation-invariant way. Generally this is achieved using a generic aggregation operator \bigoplus, such as the sum or the average.
\boldsymbol{z}_G = f\Big(\bigoplus_{i \in \mathcal{V}} \boldsymbol{h}_i\Big)
  • \textcolor{Blue}{Link} prediction (edge classification): because edges are now an important part of our domain, we can also make predictions over edges; edge classification and edge regression are feasible. We can also estimate the likelihood of an edge existing between two nodes. We need a function that considers both latents, the one coming from node i and the one from node j; in addition, we can use any existing information attached to their edge, denoted \boldsymbol{e}_{ij}.
\boldsymbol{z}_{ij} = f(\boldsymbol{h}_i, \boldsymbol{h}_j, \boldsymbol{e}_{ij})
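The three task heads can be sketched on top of some latent node representations H; the classifier weights and the dot-product link scorer here are illustrative stand-ins, not a prescribed design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
H = rng.normal(size=(n, d))   # latent node representations from a GNN
w = rng.normal(size=d)        # stand-in classifier weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Node classification: one prediction per node.
z_nodes = sigmoid(H @ w)              # shape (n,)

# Graph classification: invariant pooling first, then one prediction per graph.
z_graph = sigmoid(H.sum(axis=0) @ w)  # scalar

# Link prediction: score for a candidate edge (i, j), here via a dot product.
i, j = 0, 3
z_link = sigmoid(H[i] @ H[j])

assert z_nodes.shape == (n,) and 0 < float(z_link) < 1
```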
Local permutation invariant function
The local permutation-invariant function \phi(\boldsymbol{x}_i, \boldsymbol{X}_{\mathcal{N}_i}), shared and applied to every node, guarantees that we can in fact construct permutation equivariant functions on graphs, \boldsymbol{F}(\boldsymbol{X}, \boldsymbol{A}). Depending on the literature, this \phi may be called:
  1. Message passing
  2. Diffusion
  3. Propagation
\boldsymbol{F}, in turn, corresponds to a layer in the GNN architecture. Finding and building good candidates for this local function \phi is one of the most active areas of research in the field of GNNs today. Nevertheless, all of the proposed local functions can be derived and reasoned about under the geometric deep learning framework using three spatial flavours:
  1. Convolutional
  2. Attentional
  3. Message-passing
As we go from 1 (convolutional) to 3 (message-passing), the number of representable functions grows: convolutional layers are a special case of attentional layers, and attentional layers can in turn be represented by message-passing layers. The further along we go, the more complex the problems we can fit, but at the expense of scalability and an increased risk of overfitting the data. There is room for trade-offs depending on our context.
Convolutional
We have not yet defined a method to decide how to aggregate all the neighbours in a neighbourhood. The neighbourhood size of a given node can vary hugely from the graph average, so we cannot rely on size-dependent kernels (filters), whose behaviour and configuration strictly depend on their input size.
Question: What do convolutional GNNs accomplish?
Answer: They quantify how meaningful an individual neighbour is to a specific node. To express how a particular neighbour influences a node, we use fixed weights, denoted c_{ij}. Consequently, we apply a point-wise transformation to all of our node features, and then every node computes a weighted combination of its neighbours based on the c_{ij} coefficients:
\boldsymbol{h}_i = \phi \Bigg ( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Orange}{c_{ij}} \psi(\boldsymbol{x}_j) \Bigg )
(figure: convolutional GNN)
e.g.: We sum over all neighbours, each multiplied by its corresponding weight.
The essential part of this procedure is that these weights are predefined before applying the graph neural network: the adjacency matrix dictates the weights according to the graph topology.
Widespread instances of these kinds of layers include, but are not limited to:
  • ChebyNet (Defferrard et al., NeurIPS'16)
  • Graph Convolutional Network (Kipf & Welling, ICLR'17): currently the most-cited graph neural network paper.
  • Simplified Graph Convolution (Wu et al., ICML'19): showed how simple these layers can become while still handling most real-world graphs.
Because convolutional GNNs specify the weights up front, usually as some function of the graph topology, these models are helpful when our graphs are homophilous.
Homophilous graphs: graphs whose edges encode label similarity; the majority of a node's neighbours are expected to share its label with high statistical significance. Averaging is then a robust regulariser, since the feature representation of a particular node relies heavily on the features of all its neighbours.
As the coefficients are predefined, this equation can be computed efficiently using sparse matrix multiplication. This explains why this type of GNN is the most widely deployed in industry today: they can handle large graphs of practical value.
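A minimal dense-matrix sketch of a convolutional layer with GCN-style fixed coefficients c_ij = 1/√(d_i d_j), derived purely from the topology; the weights are untrained stand-ins and the graph is random:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)         # add self-loops so each node also sees itself

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
C = D_inv_sqrt @ A @ D_inv_sqrt  # fixed coefficients c_ij from topology only

W = rng.normal(size=(3, 4))      # stand-in weights for psi

def conv_layer(X):
    """h_i = ReLU(sum_j c_ij * psi(x_j)) with predefined c_ij."""
    return np.maximum(C @ (X @ W), 0)

X = rng.normal(size=(n, 3))
H = conv_layer(X)
assert H.shape == (n, 4)
```

Since C is as sparse as A, in practice this whole layer reduces to one sparse matrix multiplication.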
Attentional
Given a particular neighbourhood, we do not need to assume that edges necessarily encode similarity, as we did in convolutional GNNs. Predefining the weights between nodes in the same neighbourhood does not generalise; it is not suitable for learning on graphs with more complex relations.
e.g. social network interactions: retweeting somebody else's opinion does not strictly mean we share the same thoughts; we may even wholly disagree.
More suitable graph neural network architectures exist to encapsulate more complex relations among nodes. They quantify node relations as a function of the features, rather than as a pure graph-topology computation:
\boldsymbol{h}_i = \phi \Bigg ( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Green}{a(}\boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Green}{)} \psi(\boldsymbol{x}_j) \Bigg )
One very natural way of achieving this is with an attention mechanism. In convolutional graph neural networks we had fixed interaction coefficients, whereas now these coefficients are computed from the sender and receiver node features.
We have replaced the constant coefficient c_{ij} with an attention function that takes the features of x_i and x_j and produces a scalar, denoted \alpha_{ij}:
\alpha_{ij} = \textcolor{Green}{a(}\boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Green}{)}
Popular models that brought this attention idea to the graph domain include:
  • MoNet (Monti et al., CVPR'17)
  • Graph Attention Network, GAT (Veličković et al., ICLR'18)
  • GATv2 (Brody et al., 2021): an improved successor of GAT that solves the static attention problem in graph attention networks.
We are no longer obligated to assume that our graph is homophilous; we can now learn a rich set of complex new interactions between nodes. The computational cost remains feasible, with just one scalar per edge, so we can increase expressivity without sacrificing scalability. Once these new scalars have been computed, we are back to the sparse matrix multiplication employed in the convolutional world.
Attentional architectures are the de facto standard today for learning complex interactions. This architecture shines on graphs where we expect complex behaviour to emerge along the edges. More about this field, including the acclaimed Transformer architecture, in the upcoming chapter.
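A dense-matrix sketch of an attentional layer: a simple linear scoring function a(x_i, x_j) stands in for the attention mechanism, masked to existing edges and normalised per neighbourhood with a softmax. All parameters are untrained illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)         # self-loops guarantee a non-empty neighbourhood

w_att = rng.normal(size=2 * d)   # stand-in attention parameters
W = rng.normal(size=(d, 4))      # stand-in weights for psi

def att_layer(X, A):
    """h_i = sum_j alpha_ij * psi(x_j), alpha_ij computed from (x_i, x_j)."""
    # Raw attention scores from each sender/receiver feature pair.
    scores = np.array([[w_att @ np.concatenate([X[i], X[j]]) for j in range(n)]
                       for i in range(n)])
    # Mask non-edges, then normalise per neighbourhood with a softmax.
    scores = np.where(A > 0, scores, -np.inf)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ (X @ W)

H = att_layer(X, A)
assert H.shape == (n, 4)
```

Note that only one scalar per existing edge is produced, so the aggregation step is again a (sparse) weighted matrix product.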
Message-passing
This is the most generic form of graph neural network layer we can achieve. We no longer assume that our edges encode a particular weighted combination to be computed; edges give us a recipe to pass data, and the message function determines what data actually moves to which node. We can compute arbitrary vectors, so-called message vectors, to be sent across edges:
\boldsymbol{h}_i = \phi \Bigg ( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Red}{\psi(} \boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Red}{)} \Bigg )
This differs substantially from the convolutional and attentional mechanisms, where we computed something on each neighbour and then aggregated it in a weighted combination. Here the sender still influences the aggregation, but the receiver also shapes what is aggregated.
The \psi function now acts on both the sender and the receiver node; they cooperate to compute a message vector
\boldsymbol{m}_{ij} = \textcolor{Red}{\psi(} \boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Red}{)}
which the receiver node then aggregates in some permutation-invariant way.
Many popular layers implement message-passing computation:
  • Interaction Networks (Battaglia et al., NeurIPS'16)
  • Message Passing Neural Networks (Gilmer et al., ICML'17)
  • DeepMind GraphNets (Battaglia et al., 2018)
Nevertheless, this immense expressivity does not come for free: these graph neural networks are harder to interpret, scale, and train, as there are many parameters to learn within this mechanism. Still, this expressivity is unavoidable in tasks linked to computational chemistry, reasoning, and simulation, which require edges to act as recipes for passing messages.
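A minimal sketch of a message-passing layer: a vector-valued message m_ij = ψ(x_i, x_j) is computed per edge from both endpoints, sum-aggregated, and fed to an update function φ. Weights and shapes are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)

W_msg = rng.normal(size=(2 * d, 4))  # stand-in weights for the message function psi
W_upd = rng.normal(size=(d + 4, 4))  # stand-in weights for the update function phi

def mp_layer(X, A):
    """h_i = phi(x_i, sum_{j in N_i} psi(x_i, x_j)) with vector-valued messages."""
    H = np.zeros((n, 4))
    for i in range(n):
        # Message vectors m_ij computed from receiver AND sender features.
        msgs = [np.maximum(np.concatenate([X[i], X[j]]) @ W_msg, 0)
                for j in range(n) if A[i, j] > 0]
        agg = np.sum(msgs, axis=0) if msgs else np.zeros(4)  # invariant sum
        H[i] = np.tanh(np.concatenate([X[i], agg]) @ W_upd)  # phi update
    return H

H = mp_layer(X, A)
assert H.shape == (n, 4)
```

Setting ψ to depend only on x_j, scaled by a constant or an attention score, recovers the convolutional and attentional flavours as special cases.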
Conclusion: The set of functions representable by convolutional GNNs is contained in the set representable by attentional GNNs, which in turn is contained in the set representable by message-passing GNNs. They are in increasing order of generality:
convolutional \subseteq attentional \subseteq message-passing