# Graphs II

On **graphs** and **sets** we study:

- Permutation invariance
- Permutation equivariance

**Guiding principles** used to synthesise and steer the **design** of deep learning architectures:

1. **Symmetries**
2. $\boldsymbol{\mathfrak{G}}$-**Invariance**
3. $\boldsymbol{\mathfrak{G}}$-**Equivariance**
4. **Locality**
5. **Scale Separation**

The following statements encapsulate and sum up the general geometric deep learning framework.

Let $\Omega$ and $\Omega'$ be domains, and $\boldsymbol{\mathfrak{G}}$ a symmetry group over $\Omega$. We write $\Omega' \subseteq \Omega$ when $\Omega'$ can be considered a compact version of $\Omega$.

We define the following building blocks:

**Linear** $\boldsymbol{\mathfrak{G}}$-**equivariant layer** $B: \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C'})$, satisfying linearity and equivariance to the group action: no matter which symmetry group transformation $\boldsymbol{\mathfrak{g}}$ we take and which feature map $x$ we apply it to,

$B(\boldsymbol{\mathfrak{g}}.x) = \boldsymbol{\mathfrak{g}}.B(x) \quad \forall \boldsymbol{\mathfrak{g}} \in \boldsymbol{\mathfrak{G}}, \; x \in \mathcal{X}(\Omega,\mathcal{C})$

**Nonlinearity** $\sigma: \mathcal{C} \to \mathcal{C'}$, applied element-wise as $\big ( \boldsymbol{\sigma}(x) \big)(u) = \sigma(x(u))$

In deep learning we typically inject nonlinearities through point-wise nonlinearities such as the **sigmoid**, the hyperbolic tangent, or the ReLU. When we compose linear $\boldsymbol{\mathfrak{G}}$-equivariant layers and nonlinearities, we end up with universal approximators over the domain.

**Local pooling** (coarsening)

$P: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C})$ such that $\Omega' \subseteq \Omega$.

Coarsening: sometimes it is useful to **coarsen** the domain. This means going from an initial $\Omega$ to an $\Omega'$ which is slightly **smaller** and is contained within the previous domain.

$\boldsymbol{\mathfrak{G}}$-**invariant layer** (global pooling) $A: \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{Y}$, satisfying

$A(\boldsymbol{\mathfrak{g}}.x) = A(x) \quad \forall \boldsymbol{\mathfrak{g}} \in \boldsymbol{\mathfrak{G}}, \; x \in \mathcal{X}(\Omega, \mathcal{C})$

If we need **answers** on the level of the entire **domain**, we can use a global pooling layer which is **invariant** to the **symmetry** group transformations.

**Geometric Deep Learning blueprint**

We are going to start by demonstrating the blueprint on graphs, because graphs and sets give us a very nice **discrete domain** with **minimal geometric assumptions** that is easy to analyse. Also, with some assumptions and transformations, any domain can be seen as a graph.

Question: **Why might we want to study data that lives on graphs?**

Answer: Graphs are common structures all around us. They can be thought of as the **main** modality of data we encounter in **nature**. *e.g.*:

- Molecular graphs
- Transportation networks
- Social networks

“The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected **concepts**, and **relationships** between them, and uses those to represent the real system.” (Jay Wright Forrester, 1971)

1. Graphs without connections, called **unordered sets**, have only **nodes** and no connections between them. This structure constructs a simpler **domain**, so it is going to be much easier to analyse the geometric architectures that arise.
2. A significant part of the conclusions we reach for sets will naturally carry over to graphs.
3. Even if we restrict ourselves completely to processing data on sets, it would still be a relevant area. *e.g.*: **point cloud** data

**Setup**

Assume our graph **has no edges**: $\textcolor{Red}{\Omega} = \mathcal{V}$, the set of nodes, and $E = \emptyset$.

We typically want to **featurize** this domain in some way. Thus we will assume that node $i$ has **features** $x_{i} \in \R^{k}$, a real-valued **vector** with **k-dimensional** features. This corresponds to our **feature space** $\mathcal{C}$:

$\textcolor{Red}{\mathcal{C}} = \R^{k}$

Once we have these features, a typical way to process data like this is by **stacking** the **node features** into a **node feature matrix** $\boldsymbol{X}$:

$\boldsymbol{X} = (\boldsymbol{x_{1}},...,\boldsymbol{x_{n}})^{\top}$

This feature matrix $\boldsymbol{X}$, of shape $|\mathcal{V}| \times k$, is interpreted such that **row** $i$ corresponds to the **features** of the $i$-th **node**.

**NOTE**: At the moment we build this feature matrix, we have already

**introduced** some notion of **order**, breaking our strong initial **assumption** that the sets are completely unordered. We would therefore like a neural network that operates on $\boldsymbol{X}$ but does **not depend** on the **order** of the **rows** of $\boldsymbol{X}$.

*e.g*:

Given an unordered set of **five nodes** with features $\{ x_1, x_2, x_3, x_4, x_5\}$, we are asked to learn some class of functions (neural networks) that takes the features of these individual nodes and produces an output on the level of the entire set. This function $f$ should be **insensitive** to the **order** in which we feed it these features:

$f(\{ x_1, x_2, x_3, x_4, x_5\}) = y = f(\{x_2, x_5, x_4, x_3, x_1\})$

**Conclusion**: Even if we completely perturb the order in which we give the set elements to the function $f$, we expect the output $y$ to be unaffected. In the framework of **GDL** this corresponds to using the **n-element permutation group** $\Sigma_{n}$ as our **symmetry group** $\boldsymbol{\mathfrak{G}}$. The different group elements within this symmetry group are the **permutations** $\boldsymbol{\mathfrak{g}} \in \boldsymbol{\mathfrak{G}}$.

A permutation changes the order in which we see the elements of the set, and we want our neural networks to be resistant to this phenomenon.

**Idea**

It's useful to think about operators that **change** the node order. *e.g*:

$(1,2,3,4) \rightarrow (2,4,1,3)$

This example permutation maps the following actions over the set:

- $y_1 \leftarrow x_2$
- $y_2 \leftarrow x_4$
- $y_3 \leftarrow x_1$
- $y_4 \leftarrow x_3$

Formally, a permutation of a set $S$ is a bijection $\sigma: S \to S$; equivalently, it can be seen as a passage from one **linear order** on $S$ to another. A **linear order** on a set $S$ is a binary **relation** $<$ that is irreflexive, transitive, and total.

As bijections $\sigma: S \to S$ (automorphisms in the category of sets), the permutations of $S$ naturally form a **group** under **composition**, called the **symmetric group** (or **permutation group**) on $S$. This group may be denoted by $S_{S}$, $\Sigma_{S}$ or $S!$, but we are going to denote it by $\boldsymbol{\mathfrak{G}}$. When $S$ is a finite set $(n) = \{ 1,...,n\}$, its symmetric group is a **finite group** of cardinality $n!$, typically written as $\Sigma_n$.

Within **linear algebra**, each permutation defines a $|\mathcal{V}| \times |\mathcal{V}|$ **matrix** ($n \times n$). Such matrices, called **permutation matrices** (the group representation $\rho(\boldsymbol{\mathfrak{g}})$), have **zeros** almost everywhere, with exactly one **one** in every **row** and **column**. The positions of these ones correspond to the order of the nodes in the permutation. *e.g*: for our initial example

$\boldsymbol{X}= \begin{bmatrix} \boldsymbol{x_1} \\ \boldsymbol{x_2} \\ \boldsymbol{x_3} \\ \boldsymbol{x_4} \\ \end{bmatrix}$

the permutation matrix looks as follows:

$\boldsymbol{P}_{(2,4,1,3)} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \end{bmatrix}$

Its effect when **left-multiplied** is to permute the rows of $\boldsymbol{X}$ as follows:

$\boldsymbol{P}\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x_2} \\ \boldsymbol{x_4} \\ \boldsymbol{x_1} \\ \boldsymbol{x_3} \\ \end{bmatrix}$
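As a quick sanity check, the permutation matrix above can be built and applied with a few lines of NumPy (a sketch; the feature values are made up for illustration):

```python
import numpy as np

# Permutation (2, 4, 1, 3): y1 <- x2, y2 <- x4, y3 <- x1, y4 <- x3.
perm = [1, 3, 0, 2]  # 0-indexed version of (2, 4, 1, 3)

# Build the permutation matrix: row i has its single one in column perm[i].
P = np.eye(4)[perm]

# A toy node feature matrix X with one feature per node.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Left-multiplying by P permutes the rows of X to x2, x4, x1, x3.
print(P @ X)
```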

**Conclusion**: This example illustrates that the permutation matrix is exactly the representation of the **group action** of the **permutation group**. Thus we can use this **matrix representation** to reason about permutations while staying in the universe of **linear algebra**, which makes all of the subsequent analysis a lot easier.

We want functions $f(\boldsymbol{X})$ that operate over set node features and do **NOT** depend on the **order** of the nodes. Equivalently, applying a permutation matrix to $\boldsymbol{X}$ **shouldn't change** the **result**. Thus $f(\boldsymbol{X})$ is permutation invariant if and only if, for **all** permutation matrices $\boldsymbol{P}$:

$f(\boldsymbol{PX}) = f(\boldsymbol{X})$

This property is desirable for set neural networks when we **assume** that our domain consists of **unordered** sets: the set neural network $f(\boldsymbol{X})$ is permutation invariant if and only if, independently of the ordering of the rows of $\boldsymbol{X}$, the result remains **unchanged**.

We can compare this with the requirement in the geometric deep learning framework for a $\boldsymbol{\mathfrak{G}}$-**invariant layer**: in this case, our $f$ corresponds to the $\boldsymbol{\mathfrak{G}}$-**invariant layer** $A$, and the group action $\boldsymbol{\mathfrak{g}}$ corresponds to applying a permutation matrix. Therefore we have a **one-to-one** correspondence with the framework.

**Deep Sets model**

This is a general procedure to perform representation learning on sets in a way that is permutation invariant. It applies a **point-wise** neural network $\psi$ to **every** single **node** in **isolation**. Then, in order to guarantee permutation invariance, we **sum** up the features of **all** the nodes in the set and finally **transform** the result with another neural network, denoted by $\phi$:

$f(\boldsymbol{X}) = \textcolor{Orange}{\phi \Bigg (} \Sigma_{i \in \mathcal{V}} \textcolor{Green}{\psi(} x_i \textcolor{Green}{)} \textcolor{Orange}{\Bigg )}$

In this expression $\textcolor{Orange}{\phi}$ and $\textcolor{Green}{\psi}$ are **learnable** functions.

The **sum** aggregation is responsible for achieving permutation invariance, because no matter how we permute the rows of a matrix, their sum is always the same. We can also use other **aggregators**, such as the **maximum** or the **average**, which likewise preserve permutation invariance. Thus we can **generalize** the previous expression to

$f(\boldsymbol{X}) = \textcolor{Orange}{\phi \Bigg (} \bigoplus_{i \in \mathcal{V}} \textcolor{Green}{\psi(} x_i \textcolor{Green}{)} \textcolor{Orange}{\Bigg )}$
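The Deep Sets expression can be sketched in a few lines of NumPy. Here $\psi$ and $\phi$ are stand-ins, single random linear maps with a ReLU, not trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learnable maps: psi and phi as single linear layers.
W_psi = rng.normal(size=(3, 8))   # psi: R^3 -> R^8, applied point-wise
W_phi = rng.normal(size=(8, 2))   # phi: R^8 -> R^2, applied to the aggregate

def f(X):
    """Deep Sets: f(X) = phi( sum_i psi(x_i) )."""
    H = np.maximum(X @ W_psi, 0)   # psi on every node in isolation
    pooled = H.sum(axis=0)         # permutation-invariant aggregation
    return pooled @ W_phi          # phi on the pooled representation

X = rng.normal(size=(5, 3))        # five nodes with 3-d features
P = np.eye(5)[rng.permutation(5)]  # random permutation matrix

# Invariance: permuting the rows leaves the output unchanged.
assert np.allclose(f(P @ X), f(X))
```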

From this moment on, whenever we see the generic aggregation operator, denoted by $\bigoplus$, we should read it as **anything** which is **permutation invariant** over its arguments.

Permutation invariant models are good for **set-level** outputs: invariance is the desirable property when we want outputs on the level of the entire set.

Question:

**What if we would like answers at the node level?**

Answer: In order to preserve and identify **node outputs** we cannot make use of the permutation-invariant summation. Here **summation** would potentially **destroy** node-level outputs, or at least make them really hard to recover.

Rather than restricting the output to be unaffected by permutations, we now let a permutation produce a **new** and different **result**, but in a **predictable** way. Therefore we want a function that **doesn't change** the node **order** when applied, so that it doesn't matter **whether** we apply the **permutation** before or after the function:

$\boldsymbol{F}(\boldsymbol{PX}) = \boldsymbol{PF}(\boldsymbol{X})$

*NOTE*: To stay consistent with the linear algebra notation, the function is capitalized and in **bold**, because it returns a **matrix** as **output**.

Compare with the **GDL** framework: **Linear** $\boldsymbol{\mathfrak{G}}$-**equivariant layer** $B: \mathcal{X}(\Omega,\mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C'})$, satisfying linearity and equivariance to the group action, no matter which symmetry group transformation $\boldsymbol{\mathfrak{g}}$ and feature map $x$ we take.

Strict resistance to symmetries is not the only property we want. It is desirable that our neural network's

**predictions** do not drastically **collapse** under the effect of a **non-perfect** symmetry transformation. Data is not usually transformed by an exact symmetry element; rather, **noise** is acquired and added in the transformations, generating **distortions**. We would therefore want the signal to be **stable** under slight **deformations** of the domain.

If we want to achieve this level of resistance to distortions, it is crucial to **compose local** operations to model **large-scale** ones, as local operations **won't globally** propagate **errors**. This implies that we would want graph neural networks (GNNs) to have **layers** which behave **locally** with respect to the structure of the domain.

Question: **What does it mean for an equivariant layer on sets to be local?**

Answer: One way we can enforce **locality** in equivariant set functions is through a **shared** function $\psi$ applied to every **node** in **isolation**:

$\boldsymbol{h}_{i} = \psi(\boldsymbol{x}_{i})$

Stacking the $\boldsymbol{h}_{i}$ in the same way we stacked the original $\boldsymbol{x}_{i}$ yields a new matrix, denoted by $\boldsymbol{H}$, which is the output of our function $\boldsymbol{F}$:

$\boldsymbol{H} = \boldsymbol{F}(\boldsymbol{X})$
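A minimal NumPy sketch of this shared point-wise layer, with a hypothetical random weight matrix standing in for the learnable $\psi$, confirms the equivariance property $\boldsymbol{F}(\boldsymbol{PX}) = \boldsymbol{P}\boldsymbol{F}(\boldsymbol{X})$:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))  # hypothetical weights for the shared psi

def F(X):
    """Shared psi applied to every node: row i of F(X) is psi(x_i)."""
    return np.maximum(X @ W, 0)

X = rng.normal(size=(5, 3))        # five nodes with 3-d features
P = np.eye(5)[rng.permutation(5)]  # random permutation matrix

# Equivariance: permuting inputs permutes outputs the same way.
assert np.allclose(F(P @ X), P @ F(X))
```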

Compare with the **GDL** framework: (stacking) $\textcolor{Cyan}{equivariant}$ functions, potentially with an $\textcolor{Orange}{invariant}$ tail, yields any useful set neural network:

$f(\boldsymbol{X}) = \textcolor{Orange}{\phi \Bigg (} \bigoplus_{i \in \mathcal{V}} \textcolor{Green}{\psi(} x_i \textcolor{Green}{)} \textcolor{Orange}{\Bigg )}$

This expression is quite close to what our framework would have constructed. There is a point-wise function, denoted by $\textcolor{Green}{\psi}$, which corresponds precisely to a local equivariant function over **nodes**; then, if we need predictions at **set level**, we can add a global pooling layer at the very end: the aggregation $\bigoplus$ followed by $\textcolor{Orange}{\phi}$.

**Conclusion**: This is probably the best and most general form of a neural network that operates on sets, without assuming or inferring additional structure or relations (**edges**).

Now we can generalize the structure of the set of nodes by assuming there are **edges** between them, constructing relations. Thus we consider graphs $\mathcal{G} = (\mathcal{V},\mathcal{E})$, where $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is a subset of the Cartesian product. Further information about graph definitions and properties can be found in the corresponding chapter.

Our representation of **edges** will stay within the realm of **linear algebra**: we represent them in an **adjacency matrix**, a binary matrix denoted by $\boldsymbol{A}$. With this representation, edges have now become part of the domain. Further additions such as **edge features** or **weights** are entirely possible. Nevertheless, we are going to limit ourselves to working without any additional information attached to edges; all the math derived for edges as a binary matrix will carry over to more complex structures. So every edge is just a binary indicator of whether or not two nodes are connected.

Our **main** desire for sets holds for graphs: we also want our graph neural networks to be **resistant** to **permutations** of the **nodes** and **edges** which construct the graph.

### Permutation invariance & equivariance

We assume there is some **connectivity** structure between our nodes. Therefore, when we permute the nodes, we expect the **edges** to be permuted **identically**, so that after applying the permutation we obtain **two** isomorphic **graphs**.

When the nodes are permuted, we need to appropriately permute both **rows** and **columns** of the adjacency matrix $\boldsymbol{A}$. Thus, if we are using a **permutation matrix** $\boldsymbol{P}$ to permute our nodes, we reorder our adjacency matrix $\boldsymbol{A}$ as $\boldsymbol{PAP^{\top}}$. The first $\boldsymbol{P}$ is responsible for permuting the **rows**, and the transposed permutation matrix handles the permutation of the **columns**.

**Invariance**

$f(\boldsymbol{PX},\boldsymbol{PAP^{\top}}) = f(\boldsymbol{X},\boldsymbol{A})$

**Equivariance**

$\boldsymbol{F}(\boldsymbol{PX},\boldsymbol{PAP^{\top}}) = \boldsymbol{P}\boldsymbol{F}(\boldsymbol{X},\boldsymbol{A})$

**Conclusion**: These two properties carry over very naturally from sets.
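These properties are easy to check numerically. The sketch below builds a small made-up adjacency matrix and verifies that $\boldsymbol{PAP^{\top}}$ yields the adjacency matrix of an isomorphic graph:

```python
import numpy as np

# A small undirected graph on 4 nodes with edges (1,2), (2,3), (3,4).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

# The permutation (2, 4, 1, 3) from the earlier example.
P = np.eye(4, dtype=int)[[1, 3, 0, 2]]

# Permuting rows and columns together relabels the nodes:
# the result describes an isomorphic graph.
A_perm = P @ A @ P.T

assert A_perm.sum() == A.sum()           # same number of edges
assert np.array_equal(A_perm, A_perm.T)  # still symmetric (undirected)
assert sorted(A_perm.sum(axis=0)) == sorted(A.sum(axis=0))  # same degrees
```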

Locality doesn't carry over from sets as naturally as invariance and equivariance do. We now **assume** there may exist **relations** between nodes, which means we mustn't transform every node in **isolation**. Graphs give us broader context through the so-called node **neighbourhood**.

**Node neighbourhood**: For a node $i$, we can derive its **1-hop** neighbourhood, denoted by $\mathcal{N}_i$, as the set of all **nodes** $j$ which are **connected** to $i$ by exactly one **edge**.

Accordingly, we can extract the neighbourhood features, $\boldsymbol{X}_{\mathcal{N}_i}$, like so:

$\boldsymbol{X}_{\mathcal{N}_i} = \{ \boldsymbol{x}_j : j \in \mathcal{N}_i \}$

Now we can do something more **generic** than just having a point-wise **local function**: we can have a local function, denoted by $\phi$, that operates over the **node** itself and its **neighbourhood**:

$\phi(\boldsymbol{x}_i,\boldsymbol{X}_{\mathcal{N}_i} )$

Doing so, we can construct permutation **equivariant** functions $\boldsymbol{F(X,A)}$ by appropriately applying the local function $\phi$ over all **neighbourhoods**:

$\boldsymbol{F(X,A)} = \begin{bmatrix} \phi(\boldsymbol{x}_1,\boldsymbol{X}_{\mathcal{N}_1} ) \\ \vdots \\ \phi(\boldsymbol{x}_n,\boldsymbol{X}_{\mathcal{N}_n} ) \\ \end{bmatrix}$

To ensure equivariance, it is sufficient that $\phi$ **doesn't** depend on the **order** of the nodes in $\mathcal{N}_i$.

**Graph Neural Network example**

$\boldsymbol{X}_{\mathcal{N}_{i}} = \{ \{ \boldsymbol{x}_a, \boldsymbol{x}_b, \boldsymbol{x}_c, \boldsymbol{x}_d, \boldsymbol{x}_e \} \}$

We have built an **equivariant** function on graphs by using a **local permutation invariant** function, **learnable** and denoted by $\phi$, acting at **node** level. The graph neural network computes a **new feature representation** for a node given its neighbourhood features, denoted by $\boldsymbol{X}_{\mathcal{N}_{i}}$, which for the sake of extra context can **contain the node itself**. Applying the local function $\phi$ corresponds to the expression:

$\boldsymbol{h_i} = \phi(\boldsymbol{x}_i, \boldsymbol{X}_{\mathcal{N}_{i}})$

We then **parallelize** the application of this local function, in **isolation**, to every single node of the **graph**.

Given as inputs to a graph neural network a graph with input features, denoted by $\boldsymbol{X}$ (with a particular node's feature representation being $\boldsymbol{x}_i$), and some adjacency matrix $\boldsymbol{A}$, the application of this function updates our features (inputs) to their **latent** representation:

$\boldsymbol{GNN(X,A)} \to \boldsymbol{(H,A)}$

The update of a particular node depends on its neighbourhood features and, if desired, on itself. The adjacency matrix structure is preserved through this update, although there is room for more sophisticated procedures which could explicitly change the relations living in the graph.
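A minimal sketch of one such layer, assuming a simple choice of $\phi$ (a linear map on the node plus a linear map on its summed neighbourhood, with a ReLU), together with a check that the layer is permutation equivariant:

```python
import numpy as np

rng = np.random.default_rng(2)
W_self = rng.normal(size=(3, 4))   # hypothetical weights for the node itself
W_neigh = rng.normal(size=(3, 4))  # hypothetical weights for the neighbourhood

def gnn_layer(X, A):
    """One generic layer: h_i = phi(x_i, X_{N_i}) with a sum over neighbours."""
    neigh_sum = A @ X  # row i = sum of x_j over N_i
    H = np.maximum(X @ W_self + neigh_sum @ W_neigh, 0)
    return H, A        # adjacency structure is preserved

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 3))
H, _ = gnn_layer(X, A)

# Equivariance check: GNN(PX, PAP^T) gives (PH, PAP^T).
P = np.eye(3)[[2, 0, 1]]
H_perm, _ = gnn_layer(P @ X, P @ A @ P.T)
assert np.allclose(H_perm, P @ H)
```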

From this general procedure emerge several possible tasks:

- $\textcolor{Red}{Node}$ **classification**: We can learn a classifier acting at node level to predict a node-level output.

$\boldsymbol{z}_i = f(\boldsymbol{h}_i)$

- $\textcolor{Green}{Graph}$ **classification**: Classification at the level of the entire domain forces us to somehow transform our node features in a **permutation invariant** way. Generally this is achieved using a generic **aggregation operator** $\bigoplus$, which could be the **sum** or the average.

$\boldsymbol{z}_G = f(\bigoplus_{i \in \mathcal{V}}\boldsymbol{h}_i)$

- $\textcolor{Blue}{Link}$ **prediction** (edge classification): Because edges are now an important part of our domain, we can also make predictions over edges. For example, edge classification and edge regression are feasible. We can also estimate the likelihood of an edge existing between two nodes. We need a function that takes into consideration both neighbourhoods, the one coming from node $i$ and the one from node $j$. In addition, we can use any information present in their edge, denoted by $\boldsymbol{e}_{ij}$.

$\boldsymbol{z}_{ij} = f(\boldsymbol{h}_i, \boldsymbol{h}_j, \boldsymbol{e}_{ij})$
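The three task heads can be sketched on top of some latent matrix $\boldsymbol{H}$; the weight matrices here are hypothetical stand-ins for learned classifiers:

```python
import numpy as np

rng = np.random.default_rng(6)
H = rng.normal(size=(5, 4))       # latent node features from some GNN
W_node = rng.normal(size=(4, 2))  # hypothetical node classifier weights
W_graph = rng.normal(size=(4, 2)) # hypothetical graph classifier weights
W_edge = rng.normal(size=(8, 1))  # hypothetical edge scorer weights

# Node classification: z_i = f(h_i), a shared map applied to each node.
Z_nodes = H @ W_node               # one prediction per node

# Graph classification: aggregate first (here a sum), then classify.
z_graph = H.sum(axis=0) @ W_graph  # one prediction for the whole graph

# Link prediction: z_ij = f(h_i, h_j), scoring an edge from both endpoints.
def edge_score(i, j):
    return np.concatenate([H[i], H[j]]) @ W_edge

z_01 = edge_score(0, 1)            # one score for the candidate edge (0, 1)
```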


**Local permutation invariant function**

The local permutation invariant function, denoted by $\phi(\boldsymbol{x}_i, \boldsymbol{X}_{\mathcal{N}_{i}})$ and applied in a shared fashion to every node, guarantees that we can in fact construct permutation equivariant functions on graphs, $\boldsymbol{F(X,A)}$. Depending on the literature, this $\phi$ can be called:

1. **Message passing**
2. **Diffusion**
3. **Propagation**

Whereas $\boldsymbol{F}$ corresponds to a **layer** in the GNN architecture, finding and building good candidates for this local function $\phi$ is one of the most active areas of research in the field of GNNs nowadays. Nevertheless, all of the proposed local functions can be **derived** and reasoned about under the geometric deep learning framework using **three spatial flavours**:

1. **Convolutional**
2. **Attentional**
3. **Message-passing**

As we go from **1**, convolutional, to **3**, message-passing, the number of representable functions **grows**: we can represent convolutional layers as a special case of attentional layers, and attentional layers as a special case of message-passing layers. The further right we go, the more complex the problems we can fit, but at the expense of **poorer scalability** and an increased risk of **overfitting** the data. There is room for trade-offs depending on our context.

**Convolutional**

We have not yet defined a method to decide how to aggregate all the neighbours in a neighbourhood. The **neighbourhood size** of a given node can vary hugely compared to the average over the other nodes of the graph, so we cannot rely on **size-dependent kernels (filters)**, i.e. filters whose behaviour and configuration strictly depend on their input size.

Question: **What do convolutional GNNs accomplish?**

**Answer:** They quantify how meaningful an individual neighbour is relative to a specific node. To express how a particular neighbour influences a node, we use **weights**, denoted by $c_{ij}$. Consequently, we apply a **point-wise** transformation to all of our node features, and then every node computes a **weighted combination** of all of its neighbours based on the $c_{ij}$ coefficients:

$\boldsymbol{h}_i = \phi \Bigg ( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i}\textcolor{Orange}{c_{ij}} \psi(\boldsymbol{x_j}) \Bigg )$
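A sketch of such a convolutional layer in NumPy. The symmetric normalisation $c_{ij} = 1/\sqrt{d_i d_j}$ used here is one common topology-derived choice (the one popularised by GCN), not the only option:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(3, 4))  # hypothetical psi as a single linear map

def conv_gnn_layer(X, A):
    """Convolutional flavour: h_i = phi(x_i, sum_j c_ij * psi(x_j)).

    The coefficients c_ij are fixed by the graph topology before learning;
    here c_ij = 1 / sqrt(d_i * d_j) over the self-loop-augmented graph.
    """
    A_hat = A + np.eye(len(A))           # include the node itself
    d = A_hat.sum(axis=1)
    C = A_hat / np.sqrt(np.outer(d, d))  # c_ij, zero where no edge
    return np.maximum(C @ (X @ W), 0)    # aggregate, then nonlinearity

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.arange(9, dtype=float).reshape(3, 3)
H = conv_gnn_layer(X, A)

# Equivariance still holds: the fixed weights travel with the nodes.
P = np.eye(3)[[1, 2, 0]]
assert np.allclose(conv_gnn_layer(P @ X, P @ A @ P.T), P @ H)
```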


*e.g.*

We sum all our neighbors, where each one is multiplied by the corresponding weight.

The essential part of this whole procedure is that these weights are predefined before applying the graph neural network: the **adjacency matrix** dictates these weights according to the graph topology.

Widespread instances of these kinds of layers include, but are not limited to:

- ChebyNet (Defferrard *et al.*, NeurIPS'16)
- **G**raph **C**onvolutional **N**etwork (Kipf & Welling, ICLR'17): currently the most-cited graph neural network paper.
- **S**implified **G**raph **C**onvolutional (Wu *et al.*, ICML'19): has shown how simple these layers can become while still fitting most real-world graphs.

Because convolutional GNNs **specify the weights** up front, usually as some function of the **graph topology**, these models are helpful when our graphs are **homophilous**.

**Homophilous graphs**: Graphs whose edges encode label similarity. The majority of a node's neighbours are expected to share its label with high statistical significance. Thus averaging is a robust **regularizer**, since the feature representation of a particular node will rely heavily on the features of all its neighbours.

As the coefficients are predefined, this equation can be efficiently computed using **sparse matrix** multiplication. This explains why this type of GNN is the most industrially deployed nowadays: they can handle large graphs of practical value.

**Attentional**

Given a particular neighbourhood, we do **not** need to **assume** that edges necessarily encode **similarity**, as we did in convolutional GNNs. The precondition of predefining weights between nodes in the same neighbourhood does not **generalize**: it is not suitable for learning on graphs with more **complex** relations. *e.g.* social network interactions: retweeting somebody else's opinion does not strictly mean that we share the same thoughts; we may even wholly disagree.

More suitable graph neural network architectures exist to encapsulate more complex graph relations among nodes. They quantify node relations in a **feature-dependent** manner rather than by a pure graph-topology computation:

$\boldsymbol{h}_i = \phi \Bigg ( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Green}{a(}\boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Green}{)} \psi(\boldsymbol{x_j}) \Bigg )$


One very natural way of achieving this is by using an **attention mechanism**. In convolutional graph neural networks we had fixed interaction coefficients, whereas now these coefficients are computed from the **sender** and **receiver** node features.

We have **replaced** the **constant** coefficient $c_{ij}$ with an **attention function** that takes the features $x_i$ and $x_j$ and produces a **scalar**, denoted by $\alpha_{ij}$:

$\alpha_{ij} = \textcolor{Green}{a(}\boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Green}{)}$
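A toy sketch of an attentional layer. The scoring function below is a hypothetical stand-in (GAT, for instance, uses a learned linear scorer); the point is that the coefficients are computed from sender and receiver features and normalised with a softmax over each neighbourhood:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 4))    # hypothetical psi
a_vec = rng.normal(size=(6,))  # hypothetical attention parameters

def attention(x_i, x_j):
    """Toy scoring function a(x_i, x_j) -> scalar from both feature vectors."""
    return np.tanh(np.concatenate([x_i, x_j]) @ a_vec)

def att_gnn_layer(X, A):
    """Attentional flavour: alpha_ij computed from sender and receiver."""
    H = np.zeros((len(X), W.shape[1]))
    for i in range(len(X)):
        neigh = np.flatnonzero(A[i])
        scores = np.array([attention(X[i], X[j]) for j in neigh])
        alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over N_i
        H[i] = np.maximum(alpha @ (X[neigh] @ W), 0)   # weighted combination
    return H

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 3))
H = att_gnn_layer(X, A)

# The coefficients travel with the features, so equivariance is preserved.
P = np.eye(3)[[2, 0, 1]]
assert np.allclose(att_gnn_layer(P @ X, P @ A @ P.T), P @ H)
```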

Some popular models that brought this attention idea into the graph domain:

- MoNet (Monti *et al.*, CVPR'17)
- **G**raph **AT**tention network (Veličković *et al*., ICLR'18)
- GATv2 (Brody *et al.*, 2021): an improved successor of **GAT** that solves the **static attention** problem in graph attention networks.

We are **not** obligated to **assume** that our graph is **homophilous**: we can now learn a radically new set of complex interactions between nodes. The computational cost remains feasible, with just **one scalar** per **edge**; thus we increase our learnability bounds without sacrificing scalability. Once these new scalars have been computed, we are back to the sparse matrix multiplication employed in the convolutional world.

Attentional architectures are the de facto standard today for learning complex interactions. This architecture shines on graphs where we expect complex behaviour to emerge along the edges. More about this field, including the acclaimed **transformer** architecture, in the upcoming chapter.

**Message-passing**

These are the most generic form of graph neural network layer we can tangibly achieve. We **no** longer **assume** that our edges encode a particular **weighted combination** that needs to be computed. Edges give us a **recipe** to pass **data**, and the **message function** determines what data actually moves to which node. We can compute arbitrary vectors, the so-called **message vectors**, to be **sent** across **edges**:

$\boldsymbol{h}_i = \phi \Bigg ( \boldsymbol{x}_i, \bigoplus_{j \in \mathcal{N}_i} \textcolor{Red}{\psi(} \boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Red}{)} \Bigg )$


There is a vast difference from the procedures employed in the convolutional and attentional mechanisms, where we computed **something** per **neighbour** and then **aggregated** it in a weighted combination. Here the $\psi$ function acts on both the **sender** and the **receiver** node: they cooperate to compute a message vector, denoted by

$\boldsymbol{m}_{ij} = \textcolor{Red}{\psi(} \boldsymbol{x}_i, \boldsymbol{x}_j \textcolor{Red}{)}$

which the **receiver** node then aggregates in some **permutation invariant** way.

There exist plenty of popular layers built on message passing computation:

- Interaction Nets (Battaglia *et al.*, NeurIPS'16)
- **M**essage **P**assing **N**eural **N**etworks (Gilmer *et al*., ICML'17)
- DeepMind **GraphNets** (Battaglia *et al.*, 2018)
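The generic message-passing update can be sketched as follows, with hypothetical random weight matrices standing in for the learnable $\psi$ and $\phi$:

```python
import numpy as np

rng = np.random.default_rng(5)
W_msg = rng.normal(size=(6, 4))  # hypothetical psi over [x_i ; x_j]
W_upd = rng.normal(size=(7, 4))  # hypothetical phi over [x_i ; aggregate]

def mpnn_layer(X, A):
    """Message passing: m_ij = psi(x_i, x_j), then h_i = phi(x_i, sum_j m_ij)."""
    H = np.zeros((len(X), W_upd.shape[1]))
    for i in range(len(X)):
        msgs = [np.maximum(np.concatenate([X[i], X[j]]) @ W_msg, 0)
                for j in np.flatnonzero(A[i])]      # one vector per edge
        agg = np.sum(msgs, axis=0)                  # permutation-invariant sum
        H[i] = np.concatenate([X[i], agg]) @ W_upd  # receiver updates itself
    return H

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 3))
H = mpnn_layer(X, A)

# Messages depend on both endpoints, yet the layer stays equivariant.
P = np.eye(3)[[1, 2, 0]]
assert np.allclose(mpnn_layer(P @ X, P @ A @ P.T), P @ H)
```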

Nevertheless, these immense expressivity capabilities do not come without expense: these graph neural networks become **harder** to interpret, scale, **and learn**, with many parameters to be learned within this mechanism. Still, this expressivity is unavoidable in tasks linked to computational chemistry, reasoning, and simulations, which require edges to act as a recipe for passing messages.

**Conclusion:** The set of functions representable by convolutional GNNs is contained in the set representable by attentional GNNs, which in turn is contained in the set representable by message-passing GNNs. They are in increasing order of generality:

**convolutional** $\subseteq$ **attentional** $\subseteq$ **message-passing**
