Graphs II

Purpose

On graphs and sets, we study functions that are:

  • Permutation invariant

  • Permutation equivariant

Guiding principles to synthesise and steer the design of deep learning architectures:

  1. Symmetries

  2. Locality

  3. Scale Separation

The Building Blocks of Geometric Deep Learning

The following statements sum up the general geometric deep learning framework.

We define the following building blocks:

In deep learning, we typically inject nonlinearities through the use of point-wise nonlinearities such as the sigmoid, hyperbolic tangent, or ReLU.

Local pooling (coarsening)

If we need answers on the level of the entire domain, we can use a global pooling layer which is invariant to the symmetry group transformation.
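As a toy illustration, here is a minimal numpy sketch of two of these building blocks, a point-wise nonlinearity and a sum-based global pooling, together with a quick check that the pooled output does not depend on the node ordering:

```python
import numpy as np

def relu(X):
    """Point-wise nonlinearity: applied independently to every entry of X."""
    return np.maximum(X, 0.0)

def global_sum_pool(X):
    """Global pooling: sums over the node dimension, so the result does not
    depend on the order of the rows (nodes)."""
    return X.sum(axis=0)

# Toy node-feature matrix (4 nodes, 3 features) and a random node reordering.
X = np.random.randn(4, 3)
perm = np.random.permutation(4)

# The pooled representation is invariant to the node ordering.
assert np.allclose(global_sum_pool(relu(X)), global_sum_pool(relu(X[perm])))
```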

GDL Framework in action: Graphs & Sets

Motivation

We are going to start by demonstrating the blueprint on graphs and sets, because they give us a discrete domain with minimal geometric assumptions that is easy to analyse.

Also with some assumptions and transformations, any domain can be seen as a graph.

Question: Why might we want to study data that lives on graphs?

Answer: Graphs are ubiquitous structures all around us. They can be thought of as one of the main modalities of data we encounter in nature.

e.g.:

  • Molecular graphs

  • Transportation networks

  • Social networks

“The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.” (Jay Wright Forrester, 1971)

Learning On Sets: Graphs Without Connections

Why?

  1. Graphs without connections, called unordered sets, have only nodes and no edges between them. This structure constitutes a simpler domain, so it will be much easier to analyse the geometric architectures that arise.

  2. A significant part of the conclusions we reach on sets will naturally carry over to graphs.

  3. Even if we restrict ourselves entirely to processing data on sets, it would still be a relevant setting, e.g. point cloud data.

Setup

Assume our graph has no edges: we only have \mathcal{V} = \{1, \dots, n\}, the set of nodes, where each node i carries a feature vector \mathbf{x}_i \in \mathbb{R}^k.

Our feature space then is the matrix of stacked node features,

\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)^\top

e.g. a set of four nodes with feature vectors \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4.

A permutation changes the order in which we see the elements of this set, and we want our neural networks to be resistant to this phenomenon.

Permutations and Permutation Matrices

Idea

It's useful to think about operators that change the node order.

Consider, for example, a permutation that acts on the set as follows: element 1 goes to position 3, element 2 to position 1, element 3 to position 4, and therefore element 4 to position 2.

Concretised in a single expression:

\boldsymbol{\mathfrak{g}} = \left( 1 \mapsto 3, \;\; 2 \mapsto 1, \;\; 3 \mapsto 4, \;\; 4 \mapsto 2 \right)

e.g. For our initial example, the corresponding permutation matrix, which reorders the rows of \mathbf{X} into (\mathbf{x}_2, \mathbf{x}_4, \mathbf{x}_1, \mathbf{x}_3), looks as follows:

\mathbf{P} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}

Conclusion: This example illustrates that the permutation matrix is exactly the representation of the group action of the permutation group. Thus we can use this matrix representation to reason about permutations while staying in the universe of linear algebra, which will make all of the subsequent analysis a lot easier.
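To make this concrete, here is a minimal numpy sketch of the example permutation matrix, assuming the convention that row i of \mathbf{P}\mathbf{X} holds the features of the node moved to position i:

```python
import numpy as np

def permutation_matrix(new_order):
    """Build P such that (P @ X)[i] == X[new_order[i]]."""
    n = len(new_order)
    P = np.zeros((n, n))
    P[np.arange(n), new_order] = 1.0
    return P

# g: 1->3, 2->1, 3->4, 4->2 puts the nodes in the new order (2, 4, 1, 3),
# which is (1, 3, 0, 2) with 0-based indices.
P = permutation_matrix([1, 3, 0, 2])
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one scalar feature per node
print(P @ X)                                # rows reordered to 2, 4, 1, 3
```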

Permutation invariance

Deep Sets model

f(\mathbf{X}) = \phi\left( \sum_{i \in \mathcal{V}} \psi(\mathbf{x}_i) \right)

The sum aggregation is what achieves permutation invariance: no matter how we permute the rows of the matrix, their sum is always the same. We can also use other aggregators, such as the maximum or the average, which equally preserve permutation invariance. Thus we can generalise the previous expression to

f(\mathbf{X}) = \phi\left( \bigoplus_{i \in \mathcal{V}} \psi(\mathbf{x}_i) \right)

where \bigoplus is any permutation-invariant aggregator.
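A minimal numpy sketch of such a Deep Sets style model, with small random linear maps standing in for the learnable functions \psi and \phi, and a check that both sum and max aggregation yield permutation invariance:

```python
import numpy as np

rng = np.random.default_rng(0)
W_psi = rng.normal(size=(3, 8))   # stands in for the learnable psi
W_phi = rng.normal(size=(8, 2))   # stands in for the learnable phi

def deep_sets(X, aggregate=np.sum):
    """f(X) = phi( aggregate_i psi(x_i) ): invariant to row permutations."""
    H = np.maximum(X @ W_psi, 0.0)      # psi applied point-wise to every node
    pooled = aggregate(H, axis=0)       # sum / max / mean over the node axis
    return pooled @ W_phi               # phi on the pooled representation

X = rng.normal(size=(5, 3))
perm = rng.permutation(5)
assert np.allclose(deep_sets(X), deep_sets(X[perm]))                  # sum
assert np.allclose(deep_sets(X, np.max), deep_sets(X[perm], np.max))  # max
```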

Permutation equivariance

Permutation invariant models are good for set-level outputs: this is the property we want when the desired output is at the level of the entire set.

Question: What if we would like answers at the node level?

Answer: In order to preserve and identify node-level outputs, we cannot make use of the permutation-invariant summation. Here, summation would destroy node-level information, or at least make it very hard to recover.

Now, rather than requiring the output to be unaffected by a permutation, we want the permutation to produce a new, different result, but in a predictable way. Therefore we want a function that commutes with the node reordering: applying the permutation and then the function gives the same result as applying the function and then the permutation,

\mathbf{F}(\mathbf{P}\mathbf{X}) = \mathbf{P}\,\mathbf{F}(\mathbf{X})

so it does not matter at which point we apply the permutation.

NOTE: To stay consistent with the linear algebra notation, the function is capitalised and written in bold because it returns a matrix as output.
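A minimal numpy sketch of such a permutation-equivariant layer, again with a random weight matrix standing in for the learnable shared function:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))   # shared weights, applied identically to every node

def equivariant_layer(X):
    """F(X): the same point-wise map on every row, so permuting the input
    rows simply permutes the output rows, i.e. F(P X) = P F(X)."""
    return np.maximum(X @ W, 0.0)

X = rng.normal(size=(5, 3))
perm = rng.permutation(5)
assert np.allclose(equivariant_layer(X)[perm], equivariant_layer(X[perm]))
```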

Locality

Strict resistance to symmetries is not the only property we want. It is also desirable that our neural network's predictions do not collapse under the effect of imperfect symmetry transformations.

In practice, data is rarely transformed by an exact symmetry element: noise is acquired and added along the way, generating distortions. We therefore want the signal to be stable under slight deformations of the domain.

If we want this kind of stability under distortions, it is crucial to compose local operations to model large-scale ones, as local operations do not propagate errors globally. This implies that we want graph neural networks (GNNs) to have layers which behave locally with respect to the structure of the domain.

Question: What does it mean for an equivariant layer on sets to be local?

Answer: On a set there are no edges, so the only information local to a node is the node itself. A local equivariant layer therefore applies a shared function \psi to every node in isolation and stacks the results:

\mathbf{F}(\mathbf{X}) = \begin{pmatrix} \psi(\mathbf{x}_1) \\ \psi(\mathbf{x}_2) \\ \vdots \\ \psi(\mathbf{x}_n) \end{pmatrix}

Compare with the GDL framework: this is a composition of local equivariant layers, optionally followed by an invariant global pooling when set-level outputs are needed.

Conclusion: This is probably the best and most general form of a neural network that operates on sets, without assuming or inferring additional structure or relations (edges).
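A minimal numpy sketch of this general set network, stacking two local (node-wise) equivariant layers and finishing with an invariant sum readout; the weights are random placeholders for learned ones:

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(3, 8))     # local equivariant layer 1
W2 = rng.normal(size=(8, 8))     # local equivariant layer 2
W_out = rng.normal(size=(8, 2))  # set-level readout

def set_network(X):
    """Stack of local (node-wise) equivariant layers followed by an
    invariant global pooling for a set-level output."""
    H = np.maximum(X @ W1, 0.0)
    H = np.maximum(H @ W2, 0.0)
    return H.sum(axis=0) @ W_out

X = rng.normal(size=(6, 3))
perm = rng.permutation(6)
assert np.allclose(set_network(X), set_network(X[perm]))
```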

Learning On Graphs

Further information about graph definitions and properties can be found in the chapter:

Nevertheless, we are going to limit ourselves to working without any additional information attached to the edges, although all the math derived for edges as a binary matrix carries over to more complex structures. So we will treat every edge as just a binary indicator of whether or not two nodes are connected.

Our main requirement from sets holds for graphs as well: we want our graph neural networks to be resistant to permutations of the nodes and edges which constitute the graph.

Permutation invariance & equivariance

We now assume there is some connectivity structure between our nodes. Therefore, when we permute the nodes, we expect the edges to be permuted accordingly, acting identically on the rows and columns of the adjacency matrix. After applying the permutation we obtain a graph isomorphic to the original one.

  • Invariance: f(\mathbf{P}\mathbf{X}, \mathbf{P}\mathbf{A}\mathbf{P}^\top) = f(\mathbf{X}, \mathbf{A})

  • Equivariance: \mathbf{F}(\mathbf{P}\mathbf{X}, \mathbf{P}\mathbf{A}\mathbf{P}^\top) = \mathbf{P}\,\mathbf{F}(\mathbf{X}, \mathbf{A})

Conclusion: These two properties carry over very naturally from sets.

Locality: Neighbourhoods

Locality doesn't carry over from sets as naturally as invariance and equivariance do. We now assume there may be relations between nodes, which means we must not transform every node purely in isolation. Graphs give us broader context through the so-called node neighbourhood.

\mathcal{N}_i = \{ j : (i,j) \in E \;\lor\; (j,i) \in E \}
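A small numpy helper that reads these neighbourhoods off a binary adjacency matrix (the matrix A, the function name, and the undirected treatment are illustrative assumptions):

```python
import numpy as np

def neighbourhoods(A):
    """N_i = { j : (i, j) in E  or  (j, i) in E }, read off a binary
    (possibly directed) adjacency matrix A."""
    A_sym = np.maximum(A, A.T)   # treat every edge as undirected
    return {i: {int(j) for j in np.flatnonzero(A_sym[i])}
            for i in range(A.shape[0])}

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
print(neighbourhoods(A))   # {0: {1}, 1: {0, 2}, 2: {1}}
```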

Graph Neural Networks

We construct a local, permutation-invariant function g(\mathbf{x}_i, \mathbf{X}_{\mathcal{N}_i}) operating on a node and its neighbourhood features, and then apply it in parallel, in isolation, to every single node of the graph.

General procedure

The update of a particular node depends on its neighbourhood's features and, if desired, on its own. The adjacency matrix structure is preserved through this update, although there is room for more sophisticated procedures which explicitly change the relations in the graph.
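A minimal numpy sketch of this general procedure, assuming a dense binary adjacency matrix and an arbitrary shared local function; note that only the node features are updated, while the adjacency structure is passed through unchanged:

```python
import numpy as np

def gnn_layer(X, A, local_fn):
    """Apply a shared local function to every node and its neighbourhood,
    in parallel; the adjacency matrix A is passed through unchanged."""
    A_sym = np.maximum(A, A.T)
    H = []
    for i in range(X.shape[0]):
        neigh = np.flatnonzero(A_sym[i])          # indices of N_i
        H.append(local_fn(X[i], X[neigh]))        # h_i = g(x_i, X_{N_i})
    return np.stack(H), A                         # features updated, structure kept

# One simple permutation-invariant choice of local function:
# concatenate x_i with the mean of its neighbours' features.
def local_mean(x_i, X_neigh):
    agg = X_neigh.mean(axis=0) if len(X_neigh) else np.zeros_like(x_i)
    return np.concatenate([x_i, agg])
```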

From this general procedure, several possible tasks emerge, such as node-level, edge-level, and graph-level predictions.

Local permutation invariant function

Applying this local, permutation-invariant function over every neighbourhood is commonly referred to as:

  1. Message passing

  2. Diffusion

  3. Propagation

Depending on how the neighbours are treated, we usually distinguish three flavours of GNN layers:

  1. Convolutional

  2. Attentional

  3. Message-passing

As we go from 1 (convolutional) to 3 (message-passing), the number of representable functions grows: convolutional layers are a special case of attentional layers, and attentional layers can in turn be represented by message-passing layers. The further right we go, the more complex the problems we can fit, but at the expense of scalability and an increased risk of overfitting the data. There is room for trade-offs depending on our context.

Convolutional

We have not yet defined how to aggregate all the neighbours in a neighbourhood. The neighbourhood size of a given node can vary hugely compared to the other nodes of the graph, so we cannot rely on size-dependent kernels (filters), i.e. filters whose behaviour and configuration strictly depend on the size of their input.

Question: What do convolutional GNNs accomplish?

Answer: They quantify how important an individual neighbour is to a specific node. To express how a particular neighbour influences a node, we use fixed weights, denoted c_{ij}. Consequently, we apply a point-wise transformation to all of our node features, and then every node computes a weighted combination of its neighbours based on the c_{ij} coefficients.

e.g.

\mathbf{h}_i = \phi\left( \mathbf{x}_i, \bigoplus_{j \in \mathcal{N}_i} c_{ij}\, \psi(\mathbf{x}_j) \right)

We sum over all our neighbours, each multiplied by its corresponding weight c_{ij}.

The essential part of this whole procedure is that these weights are predefined before applying the graph neural network: the adjacency matrix dictates them according to the graph topology.
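A minimal numpy sketch of a convolutional-flavour layer, using mean-over-neighbourhood coefficients as one possible topology-defined choice of c_{ij}; the random weight matrices stand in for the learnable \psi and \phi, and three input features per node are assumed:

```python
import numpy as np

rng = np.random.default_rng(3)
W_psi = rng.normal(size=(3, 4))      # assumes 3 input features per node
W_phi = rng.normal(size=(3 + 4, 4))

def conv_gnn_layer(X, A):
    """h_i = phi( x_i, sum_j c_ij * psi(x_j) ), with the c_ij fixed in
    advance by the topology (here a simple mean over the neighbourhood;
    GCN-style degree normalisation is another common choice)."""
    A_hat = A + np.eye(A.shape[0])                   # add self-loops
    C = A_hat / A_hat.sum(axis=1, keepdims=True)     # c_ij depend only on the graph
    msgs = C @ np.maximum(X @ W_psi, 0.0)            # weighted sum of psi(x_j)
    return np.maximum(np.concatenate([X, msgs], axis=1) @ W_phi, 0.0)
```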

Widespread instances of these kinds of layers include, but are not limited to:

  • ChebyNet (Defferrard et al., NeurIPS'16)

  • Graph Convolutional Network (Kipf & Welling, ICLR'17): This is currently the top-cited graph neural network paper.

  • Simplified Graph Convolutional (Wu et al., ICML'19): Has shown how simple these layers can become while still performing well on many real-world graph benchmarks.

Because convolutional GNNs specify the weights in advance, usually as some function of the graph topology, these models are helpful when our graphs are homophilous.

Homophilous graphs: graphs whose edges encode label similarity, i.e. the majority of a node's neighbours are expected to share its label. In this setting, averaging acts as a robust regulariser, since the feature representation of a node relies heavily on the features of its neighbours.

As the coefficients are predefined, this equation can be computed efficiently using sparse matrix multiplication. This explains why this type of GNN is the most widely used in industry today: they can handle large graphs of practical value.

Attentional

Given a particular neighbourhood, we do not need to assume that edges necessarily encode similarity, as we did for convolutional GNNs. That precondition of fixing weights between nodes in the same neighbourhood does not generalise: it will not be suitable for learning on graphs with more complex relations.

e.g. social network interactions

Retweeting somebody else's opinion does not strictly mean that we share the same thoughts; we might even completely disagree.

More suitable graph neural network architectures exist to capture these more complex relations among nodes. They quantify node relations as a function of the node features rather than as a pure graph-topology computation.

One very natural way of achieving this is with an attention mechanism. Whereas convolutional graph neural networks use fixed interaction coefficients, here the coefficients are computed from the sender and receiver node features:

\mathbf{h}_i = \phi\left( \mathbf{x}_i, \bigoplus_{j \in \mathcal{N}_i} a(\mathbf{x}_i, \mathbf{x}_j)\, \psi(\mathbf{x}_j) \right)
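A minimal numpy sketch of an attentional-flavour layer in the spirit of GAT, simplified to a single head, a random scoring vector in place of a learned one, and no output nonlinearity; three input features per node are assumed:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 4))      # shared feature transform (3 input features assumed)
a_vec = rng.normal(size=(8,))    # scores a concatenated (receiver, sender) pair

def attn_gnn_layer(X, A):
    """h_i = sum_j alpha(x_i, x_j) * psi(x_j): one learned scalar per edge,
    computed from the features of both endpoints and softmax-normalised
    over each neighbourhood."""
    H = X @ W
    n = X.shape[0]
    A_hat = A + np.eye(n)                                   # include self-loops
    E = np.array([[a_vec @ np.concatenate([H[i], H[j]]) for j in range(n)]
                  for i in range(n)])                       # raw scores e_ij
    E = np.where(A_hat > 0, E, -np.inf)                     # only attend over N_i
    alpha = np.exp(E - E.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)               # softmax per node
    return alpha @ H                                        # weighted combination
```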

Some popular models bring this attention idea to the graph domain:

  • MoNet (Monti et al., CVPR’17)

  • Graph ATtention network (Veličković et al., ICLR’18)

  • GATv2 (Brody et al., 2021): An improved successor of GAT that solves the static attention problem of graph attention networks.

We are no longer obliged to assume that our graph is homophilous: we can now learn a richer set of complex interactions between nodes. The computational cost remains feasible, with just one scalar per edge, so we increase expressivity without sacrificing scalability. Once these scalars have been computed, we are back to the sparse matrix multiplication used in the convolutional setting.

Attentional architectures are the de facto standard today for learning complex interactions. They shine on graphs where we expect complex behaviour to emerge along the edges. More about this field, including the acclaimed Transformer architecture, in the upcoming chapter.

Message-passing

This is the most generic form of graph neural network layer we can realistically achieve. We no longer assume that edges encode a particular weighted combination to be computed; instead, edges give us a recipe for passing data, and a message function determines what data is actually sent to which node. We can compute arbitrary vectors, the so-called message vectors, to be sent across edges.

This is a substantial departure from the convolutional and attentional mechanisms, where we computed something on each neighbour and then aggregated it in a weighted combination. Here, the sender still determines what is passed along, but the receiver also shapes the message: each edge carries a message vector

\mathbf{m}_{ij} = \psi(\mathbf{x}_i, \mathbf{x}_j)

which the receiver node then aggregates in some permutation-invariant form:

\mathbf{h}_i = \phi\left( \mathbf{x}_i, \bigoplus_{j \in \mathcal{N}_i} \mathbf{m}_{ij} \right)
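A minimal numpy sketch of a message-passing layer, where each message depends on both the receiver and the sender, and the receiver aggregates the messages with a sum; the weight matrices are random placeholders and three input features per node are assumed:

```python
import numpy as np

rng = np.random.default_rng(5)
W_msg = rng.normal(size=(6, 4))   # psi: message from (receiver, sender), 3 features each
W_upd = rng.normal(size=(7, 4))   # phi: update from (x_i, aggregated messages)

def mpnn_layer(X, A):
    """m_ij = psi(x_i, x_j);  h_i = phi( x_i, sum_{j in N_i} m_ij )."""
    A_sym = np.maximum(A, A.T)
    H = []
    for i in range(X.shape[0]):
        neigh = np.flatnonzero(A_sym[i])
        if len(neigh):
            pairs = np.concatenate([np.repeat(X[i][None], len(neigh), axis=0),
                                    X[neigh]], axis=1)         # (|N_i|, 6)
            msgs = np.maximum(pairs @ W_msg, 0.0).sum(axis=0)  # aggregate messages
        else:
            msgs = np.zeros(W_msg.shape[1])
        H.append(np.maximum(np.concatenate([X[i], msgs]) @ W_upd, 0.0))
    return np.stack(H)
```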

There exist plenty of popular layers that implement the message-passing computation:

  • Interaction Nets (Battaglia et al., NeurIPS’16)

  • Message Passing Neural Networks (Gilmer et al., ICML’17)

  • Deepmind GraphNets (Battaglia et al., 2018)

Nevertheless, this immense expressivity does not come for free: these graph neural networks are harder to interpret, scale, and train, and there are many parameters to learn within this mechanism. Still, this expressivity is unavoidable in tasks related to computational chemistry, reasoning, and simulation, which genuinely require edges to act as a recipe for passing messages.

Conclusion: The set of functions representable by convolutional GNNs is contained in the set representable by attentional GNNs, which in turn is contained in the set representable by message-passing GNNs. The three flavours are in increasing order of generality.
