# Pay Attention, Relations are Important

### Overview

In recent years, Knowledge Graphs have been used to solve many real-world problems, such as Semantic Search, Dialogue Generation, and Question Answering, to name a few. Knowledge Graphs suffer from incompleteness in the form of missing entities and relations, which has fueled a lot of research on Knowledge Base completion. Relation prediction is one of the most widely used approaches to address this incompleteness.

Here we will present our ACL 2019 work, Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs, and introduce a novel neural network architecture which operates on Knowledge Graphs and learns to extract more expressive feature representations for entities and relations. Our model also addresses the shortcomings of previous methods like ConvKB, ConvE, RGCN, TransE, DistMult and ComplEx.

The code for this publication is provided here. We will discuss how to reproduce the results given in the paper at the end of this post.

### Graph Convolution Networks

Convolutional Neural Networks (CNNs) have significantly improved the state of the art in Computer Vision research. Image data can be seen as a *spatial grid* which is highly rigid (each pixel is connected to its 8 neighboring pixels). CNNs exploit the rigidity and regular connectivity pattern of image data, which gives us an effective and simple way to implement the convolution operator.

The convolution operator on images gathers information from the neighboring pixels of each pixel. A similar idea is used when defining the convolution operation on graphs.

Now consider a graph with $N$ nodes, specified as a set of node features $\mathbf{x} = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_N\}$ and the connectivity information in the form of an adjacency matrix $A$. A *graph convolutional layer* then produces a transformed set of node feature vectors $\mathbf{x}' = \{\vec{x}'_1, \vec{x}'_2, \dots, \vec{x}'_N\}$ based on the structure of the graph and the initial embeddings of the entities.

The convolution operation on a graph can be summarized in two steps. First, to obtain a higher-order representation of the nodes, we apply a linear transformation parametrized by a weight matrix $\mathbf{W}$. The transformed feature vectors are given as $\mathbf{W}\vec{x}_i$. Then, to get the output features of node $i$, we aggregate the transformed features across the neighborhood of the node. The final feature vectors can be defined as:

$$\vec{x}'_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \vec{x}_j\Big)$$

where $\sigma$ is an activation function, $\mathcal{N}_i$ defines the neighborhood of node $i$, and $\alpha_{ij}$ specifies the importance of the features of node $j$ to node $i$.

In most prior works, $\alpha_{ij}$ is defined explicitly, based on structural properties of the graph, or as a learnable weight.
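As a rough illustration of the two steps above, here is a minimal NumPy sketch of a single graph convolution layer. It assumes the simplest structural choice for the importance weights (every neighbor, plus the node itself through a self-loop, counts equally) and a ReLU activation; the function name `graph_conv_layer` and the toy graph are made up for this example.

```python
import numpy as np

def graph_conv_layer(x, adj, weight):
    """Compute x'_i = ReLU(sum_j alpha_ij * W x_j) for every node i."""
    # Add self-loops so that each node also keeps its own features (an assumption).
    adj_hat = adj + np.eye(adj.shape[0])
    # alpha_ij = 1 / |N_i|: a fixed, structure-based importance (mean aggregation).
    alpha = adj_hat / adj_hat.sum(axis=1, keepdims=True)
    # Linear transformation W x_j for all nodes at once; rows are nodes.
    transformed = x @ weight.T
    # Weighted sum over each neighborhood, followed by the activation.
    return np.maximum(alpha @ transformed, 0.0)

# Toy graph: a path 0 - 1 - 2 with 2-dimensional node features.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weight = np.eye(2)  # identity transform keeps the numbers easy to follow
h = graph_conv_layer(x, adj, weight)
print(h)
```

With the identity weight matrix, each output row is simply the average of the features in that node's neighborhood, which makes the aggregation step easy to check by hand.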

### Graph Attention Networks

We just saw how graph convolution works for arbitrary graphs. Self-attention has been shown to be highly effective on the task of machine translation (Vaswani et al., 2017, the Transformer). Taking motivation from this success of the self-attention mechanism, GAT (Veličković et al., 2018) defines the value of $\alpha_{ij}$, the importance of the features of node $j$ to node $i$, **implicitly**.

The computation of $\alpha_{ij}$ is the result of an attention mechanism $a$ applied over node features. The un-normalized attention coefficients over a node pair $(i, j)$ are computed as follows:

$$e_{ij} = a(\mathbf{W}\vec{x}_i, \mathbf{W}\vec{x}_j)$$

where $\mathbf{W}\vec{x}_i$ and $\mathbf{W}\vec{x}_j$ are the transformed feature vectors of nodes $i$ and $j$, as described in the previous section.

Node $i$ is allowed to attend over its neighborhood, $j \in \mathcal{N}_i$. These un-normalized attention values cannot be used to compare the importance of different nodes, so the coefficients are typically normalized with a *softmax function*, $\alpha_{ij} = \mathrm{softmax}_j(e_{ij})$. The final output of a **Graph Attention Layer** is calculated by gathering the information from the neighborhood as shown in the following equation:

$$\vec{x}'_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \vec{x}_j\Big)$$

A multi-head attention mechanism is employed to stabilize the learning process, taking motivation from Vaswani et al., 2017. The above operations are applied independently $K$ times to the node features, and the outputs are aggregated feature-wise, by concatenating or by adding (as is the case in the last layer):

$$\vec{x}'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} \mathbf{W}^{k} \vec{x}_j\Big)$$

where $\Vert$ represents concatenation, $\sigma$ represents any non-linear activation function, $\alpha_{ij}^{k}$ are the normalized attention coefficients calculated by the $k$-th attention mechanism, and $\mathbf{W}^{k}$ represents the corresponding linear transformation matrix of the $k$-th attention mechanism. This fully specifies a Graph Attention Layer!
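The attention layer described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the attention mechanism is the single-layer form from the GAT paper (a LeakyReLU over a learned vector applied to the concatenated transformed features), `tanh` stands in for the non-linearity, the weights are random rather than trained, and the names `gat_head` and `gat_layer` are invented for this example.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_head(x, adj, weight, a):
    """One attention head. Returns (alpha, aggregated features before sigma).

    adj must contain at least one non-zero entry per row (e.g. self-loops).
    """
    wx = x @ weight.T                     # W x_i for every node, shape (N, F')
    f_out = wx.shape[1]
    # e_ij = LeakyReLU(a^T [W x_i || W x_j]); the dot product splits in two parts.
    e = leaky_relu((wx @ a[:f_out])[:, None] + (wx @ a[f_out:])[None, :])
    # Restrict attention to the neighborhood N_i by masking non-edges.
    e = np.where(adj > 0, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)  # for numerical stability
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    return alpha, alpha @ wx

def gat_layer(x, adj, weights, attn_vectors):
    """Multi-head layer: concatenate sigma(head output) over K heads (sigma = tanh here)."""
    heads = [np.tanh(gat_head(x, adj, w, a)[1]) for w, a in zip(weights, attn_vectors)]
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)   # path graph with self-loops
x = rng.normal(size=(4, 3))
weights = [rng.normal(size=(5, 3)) for _ in range(2)]     # K = 2 heads
attn_vectors = [rng.normal(size=10) for _ in range(2)]
alpha, _ = gat_head(x, adj, weights[0], attn_vectors[0])
out = gat_layer(x, adj, weights, attn_vectors)
```

Each row of `alpha` sums to one over the node's neighborhood, and concatenating two heads of width 5 gives a 10-dimensional output per node.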

Now we can work with arbitrarily structured graphs. Good! But what about Knowledge Graphs? Are GATs good enough to produce the best results on these kinds of graphs? How are Knowledge Graphs different? We discuss this in the next section and present a novel architecture to tackle some of the shortcomings of existing methods.

### Brief introduction to Knowledge Graph Embeddings

A knowledge graph is denoted by $\mathcal{G} = (\mathcal{E}, R)$, where $\mathcal{E}$ and $R$ represent the set of entities (nodes) and relations (edges), respectively. A triple $(e_s, r, e_o)$ is represented as an edge $r$ between nodes $e_s$ and $e_o$ in $\mathcal{G}$. A triple in the Knowledge Graph denotes a fact. For example, in the image, the triple *(London, capital_of, United Kingdom)* represents the fact that *London* is the *capital of* the *United Kingdom*, so *capital_of* is the relation between the two specified entities.

Embedding models try to learn an effective representation of entities, relations, and a scoring function $f$, such that for a given input triple $t = (e_s, r, e_o)$, $f(t)$ gives the likelihood of $t$ being a valid triple. For example, the above figure shows a subgraph of a KG in which missing links, represented by dashed lines, are inferred from existing triples such as *(London, capital_of, United Kingdom)*.
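To make the idea of a scoring function concrete, here is a small sketch that uses the TransE-style translational score (one of the earlier methods mentioned in the overview) as a stand-in scoring function. It is not the scoring function of our model. The embeddings are random, and the object embedding is placed by hand so that the valid triple scores perfectly; in practice all of these vectors are learned.

```python
import numpy as np

entity2id = {"London": 0, "United Kingdom": 1}
relation2id = {"capital_of": 0}

rng = np.random.default_rng(0)
entity_emb = rng.normal(size=(len(entity2id), 4))
relation_emb = rng.normal(size=(len(relation2id), 4))
# For illustration only: place the object exactly at subject + relation,
# which is the configuration a translational model tries to learn.
entity_emb[1] = entity_emb[0] + relation_emb[0]

def score(triple):
    """TransE-style f(t) = -||e_s + r - e_o||; higher means more plausible."""
    s, r, o = triple
    diff = entity_emb[entity2id[s]] + relation_emb[relation2id[r]] - entity_emb[entity2id[o]]
    return -float(np.linalg.norm(diff))

valid = score(("London", "capital_of", "United Kingdom"))
corrupted = score(("United Kingdom", "capital_of", "London"))
print(valid, corrupted)
```

The corrupted triple, obtained by swapping subject and object, receives a strictly lower score than the valid one, which is the ordering a ranking-based evaluation relies on.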

It is the relations between entities that differentiate Knowledge Graphs from other types of graphs. As the structure of Knowledge Graphs shows, relations are an integral part of these graphs. An entity can play multiple roles depending on the relation by which it is connected. For example, in the above figure, *Christopher Nolan* plays the different roles of a *brother* and a *director*. Thus, while working with Knowledge Graphs, it is simply not enough to focus on entities and ignore relations!

### KBAT (Knowledge Base Attention) Network

With the preceding information as our motivation, we define a novel neural network architecture, Knowledge Base Attention Network, which uses the self-attention mechanism to attend over the neighborhood of every node and takes into account the relations in the Knowledge Graphs.

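Consider a Knowledge Graph $\mathcal{G}$, with *entity embedding* matrix $\mathbf{H} \in \mathbb{R}^{N_e \times T}$, where the $i$-th row is the embedding of entity $e_i$, $N_e$ is the total number of entities, and $T$ is the feature dimension of each entity embedding. With a similar construction, the *relation embeddings* are represented by a matrix $\mathbf{G} \in \mathbb{R}^{N_r \times P}$. Taking these two embedding matrices as input, the layer outputs the corresponding embedding matrices, $\mathbf{H}' \in \mathbb{R}^{N_e \times T'}$ and $\mathbf{G}' \in \mathbb{R}^{N_r \times P'}$. We define an edge tuple as $t_{ij}^{k} = (e_i, r_k, e_j)$, where $e_i, e_j$ are the entities and $r_k$ is the relation joining these two entities.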

To incorporate the relations, we learn a new representation of every edge and aggregate the information by summing these representations over the neighborhood, weighted by the appropriate attention values. We learn these embeddings by performing a linear transformation over the concatenation of the entity and relation feature vectors corresponding to a particular triple, as shown:

$$\vec{c}_{ijk} = \mathbf{W}_1 [\vec{h}_i \Vert \vec{h}_j \Vert \vec{g}_k]$$

where $\vec{c}_{ijk}$ is the vector representation of a triple $t_{ij}^{k}$. Vectors $\vec{h}_i$, $\vec{h}_j$ and $\vec{g}_k$ denote the embeddings of entities $e_i$, $e_j$ and relation $r_k$, respectively. Additionally, $\mathbf{W}_1$ denotes the linear transformation matrix.

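We use a self-attention mechanism $a$ to compute the un-normalized attention coefficients $b_{ijk}$ for every edge tuple $t_{ij}^{k}$ in the neighborhood of node $e_i$:

$$b_{ijk} = a(\vec{h}_i, \vec{g}_k, \vec{h}_j)$$

where $\vec{h}_i, \vec{h}_j$ denote the embeddings of the $i$-th and $j$-th entities, and $\vec{g}_k$ denotes the embedding of the $k$-th relation.

The framework is agnostic to the attention mechanism $a$. In this work, we use a single-layer neural network as the attention mechanism, and the calculation is shown in the equation below:

$$b_{ijk} = \mathrm{LeakyReLU}\big(\mathbf{W}_2 \vec{c}_{ijk}\big)$$

where $\mathbf{W}_2$ is the linear transformation matrix and $\vec{c}_{ijk}$ is the representation of the triple $(e_i, r_k, e_j)$.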

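Similar to the GAT layer, we use the *softmax function* to normalize these attention values so that they are comparable across the neighborhood. The normalized attention values are calculated as:

$$\alpha_{ijk} = \mathrm{softmax}_{jk}(b_{ijk}) = \frac{\exp(b_{ijk})}{\sum_{n \in \mathcal{N}_i} \sum_{r \in \mathcal{R}_{in}} \exp(b_{inr})}$$

where $\mathcal{N}_i$ denotes the neighborhood of entity $e_i$ and $\mathcal{R}_{ij}$ denotes the set of relations connecting entities $e_i$ and $e_j$.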

The new embedding of the entity $e_i$ is calculated by aggregating the information from the neighborhood, summing each triple representation weighted by its attention value, as shown:

$$\vec{h}'_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \sum_{k \in \mathcal{R}_{ij}} \alpha_{ijk} \vec{c}_{ijk}\Big)$$

We employ a multi-head attention mechanism similar to GAT, first introduced by Vaswani et al., 2017, to stabilize the learning process and encapsulate more information about the neighborhood.
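The following NumPy sketch puts the steps of one attention head together: building a representation for every triple, scoring it with a single-layer network, normalizing the scores over all triples of the same entity, and aggregating. It is a simplified illustration, not the code from our repository: the weights are random, `tanh` stands in for the non-linearity, the second weight matrix is stored as a vector, and the function name `kbat_head` is made up for this example.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def kbat_head(H, G, triples, W1, W2):
    """Single KBAT attention head.

    H: (N_e, T) entity embeddings, G: (N_r, P) relation embeddings,
    triples: int array with rows (i, k, j), i.e. the edge tuple (e_i, r_k, e_j).
    """
    i, k, j = triples[:, 0], triples[:, 1], triples[:, 2]
    # c_ijk = W1 [h_i || h_j || g_k]
    c = np.concatenate([H[i], H[j], G[k]], axis=1) @ W1.T
    # b_ijk = LeakyReLU(W2 c_ijk): one scalar per triple
    b = leaky_relu(c @ W2)
    # Softmax over all triples that share the same entity e_i
    exp_b = np.exp(b - b.max())
    denom = np.zeros(H.shape[0])
    np.add.at(denom, i, exp_b)
    alpha = exp_b / denom[i]
    # h'_i = sigma(sum_j sum_k alpha_ijk c_ijk), with sigma = tanh as a placeholder
    new_H = np.zeros((H.shape[0], c.shape[1]))
    np.add.at(new_H, i, alpha[:, None] * c)
    return alpha, np.tanh(new_H)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))      # 4 entities, T = 3
G = rng.normal(size=(2, 3))      # 2 relations, P = 3
triples = np.array([[0, 0, 1], [0, 1, 2], [1, 0, 2], [2, 1, 3]])
W1 = rng.normal(size=(5, 9))     # maps [h_i || h_j || g_k] (3 + 3 + 3) to 5 dims
W2 = rng.normal(size=5)
alpha, new_H = kbat_head(H, G, triples, W1, W2)
```

In this toy graph, entity 3 has no triple of its own to aggregate over, so the sketch leaves its new embedding at zero. A multi-head version would run several such heads with separate weights and concatenate their outputs.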

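We perform a linear transformation on the input *relation embedding* matrix $\mathbf{G}$, parametrized by a weight matrix $\mathbf{W}^R$, and get the transformed relation embeddings $\mathbf{G}' = \mathbf{G}\mathbf{W}^R \in \mathbb{R}^{N_r \times T'}$, where $T'$ is the shared output dimensionality of the entity and relation embeddings.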

In Graph Convolution and Graph Attention Networks, it is good practice to add a self-loop to every entity so that the information of that entity also plays a role in its new embedding. However, we cannot do the same in Knowledge Graphs, because adding a self-loop means adding a new relation type, which does not make sense. On the other hand, ignoring the previous information stored in the embeddings does not seem like a good idea. We resolve this issue by linearly transforming $\mathbf{H}^i$ to obtain $\mathbf{H}^t$ using a weight matrix $\mathbf{W}^E \in \mathbb{R}^{T^i \times T^f}$, where $\mathbf{H}^i$ represents the input entity embeddings to our model, $\mathbf{H}^t$ represents the transformed entity embeddings, $T^i$ denotes the dimension of an initial entity embedding, and $T^f$ denotes the dimension of the final entity embedding. We add this initial entity embedding information to the entity embeddings $\mathbf{H}^f$ obtained from the final attentional layer, as shown below:

$$\mathbf{H}^t = \mathbf{H}^i \mathbf{W}^E, \qquad \mathbf{H}'' = \mathbf{H}^t + \mathbf{H}^f$$

With this, we have successfully defined a Knowledge Base Attention Layer!
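The step that replaces the self-loop can be pictured with a few lines of NumPy. This is a shape-level sketch only: the arrays are random placeholders, the names `H_in`, `H_final` and `W_E` are chosen for this example, and the transform is assumed to be a right-multiplication of the initial embeddings by the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, t_in, t_final = 4, 3, 5
H_in = rng.normal(size=(n_entities, t_in))        # initial entity embeddings
H_final = rng.normal(size=(n_entities, t_final))  # output of the last attention layer
W_E = rng.normal(size=(t_in, t_final))            # learnable in a real model

H_t = H_in @ W_E          # transformed initial embeddings, now of the final dimension
H_out = H_t + H_final     # initial entity information added back
print(H_out.shape)
```

The linear transform is needed because the initial and final embeddings generally have different dimensions, so they cannot be added directly.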

### Auxiliary Edges

In our architecture, we extend the notion of an *edge* to a *directed path* by introducing an auxiliary relation for $n$-hop neighbors between two entities. In the current model, the embedding of this auxiliary relation is the feature-wise summation of the embeddings of all the relations in the path. However, the summation operation can be replaced with a *max pooling* operation.
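As a small illustration, the embedding of an auxiliary relation can be computed from the relation embeddings along a path as follows. The helper name `auxiliary_relation` and the toy embeddings are invented for this sketch.

```python
import numpy as np

def auxiliary_relation(G, path, mode="sum"):
    """Embedding of the auxiliary relation for a path of relation ids."""
    rel = G[list(path)]                 # embeddings of the relations along the path
    if mode == "sum":
        return rel.sum(axis=0)          # feature-wise summation
    if mode == "max":
        return rel.max(axis=0)          # feature-wise max pooling
    raise ValueError(f"unknown mode: {mode}")

# Two relations with 2-dimensional embeddings; the path uses relation 0 then relation 1.
G = np.array([[1.0, -2.0], [3.0, 0.5]])
aux_sum = auxiliary_relation(G, path=(0, 1))
aux_max = auxiliary_relation(G, path=(0, 1), mode="max")
print(aux_sum, aux_max)
```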

Here is a simple example of how KBAT works. KBAT iteratively accumulates knowledge from distant neighbors of an entity. As illustrated in the image above, in the first layer of this model, all entities capture information from their *direct in-flowing neighbors*. In the second layer, *U.S* gathers information from the entities *Barack Obama, Ethan Horvath, Chevrolet, and Washington D.C*, which already possess information about their neighbors *Michelle Obama* and *Samuel L. Jackson* from the previous layer. In general, for an $n$-layer model, the incoming information is accumulated over an $n$-hop neighborhood. We found that normalizing the entity embeddings after every generalized KBAT layer and prior to the first layer was useful.
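The way information spreads layer by layer can be traced with a short breadth-first walk over in-flowing edges. The entity names below come from the example above, but the relation names and the exact edges between entities are assumptions made for this sketch, since they are only visible in the figure.

```python
from collections import defaultdict

def n_hop_sources(triples, target, n):
    """Entities whose information can reach `target` after n layers."""
    incoming = defaultdict(set)
    for head, _relation, tail in triples:
        incoming[tail].add(head)        # information flows head -> tail
    reached, frontier = set(), {target}
    for _ in range(n):
        frontier = {src for node in frontier for src in incoming[node]} - reached - {target}
        reached |= frontier
    return reached

# Hypothetical edges and relation names, for illustration only.
triples = [
    ("Barack Obama", "president_of", "U.S"),
    ("Ethan Horvath", "nationality", "U.S"),
    ("Chevrolet", "based_in", "U.S"),
    ("Washington D.C", "capital_of", "U.S"),
    ("Michelle Obama", "wife_of", "Barack Obama"),
    ("Samuel L. Jackson", "born_in", "Washington D.C"),
]
one_hop = n_hop_sources(triples, "U.S", 1)
two_hop = n_hop_sources(triples, "U.S", 2)
print(sorted(two_hop - one_hop))
```

With one layer only the four direct neighbors contribute to *U.S*; with two layers the two more distant entities are included as well.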

### Is that enough?

We use a decoder network to decode the information collected by KBAT and use the decoded information for the ranking task. In this work, ConvKB serves as the decoder. We found that while the KBAT network does a good job of collecting information from the neighborhood, that information cannot be used directly to make good predictions. This concludes the explanation of our model: KBAT (the encoder) and a decoder network (ConvKB in this case).
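For readers unfamiliar with ConvKB, the sketch below shows the core of its scoring function as we understand it: the three embeddings of a triple are stacked into a matrix with three columns, a set of 1x3 filters is slid over its rows, and the resulting feature maps are passed through a ReLU, concatenated, and projected to a single score. Biases, dropout and training are left out, the numbers are toy values chosen so the result can be checked by hand, and `convkb_score` is a name invented for this example.

```python
import numpy as np

def convkb_score(h, r, t, filters, w):
    """ConvKB-style score for one triple (no bias, no dropout)."""
    triple_matrix = np.stack([h, r, t], axis=1)                 # shape (k, 3)
    # A 1x3 filter slid over the rows is a dot product of each row with the filter.
    feature_maps = np.maximum(triple_matrix @ filters.T, 0.0)   # shape (k, n_filters)
    # Concatenate the feature maps of all filters and project to a scalar.
    return float(feature_maps.T.reshape(-1) @ w)

h = np.array([1.0, 2.0])       # subject entity embedding (k = 2)
r = np.array([0.0, 1.0])       # relation embedding
t = np.array([-1.0, 0.5])      # object entity embedding
filters = np.array([[1.0, 1.0, 1.0],
                    [1.0, 0.0, -1.0]])   # two 1x3 filters
w = np.ones(4)                 # projection weights, length k * n_filters
score = convkb_score(h, r, t, filters, w)
print(score)
```

In the full model, the entity and relation vectors fed to the decoder would be the embeddings produced by the KBAT encoder rather than hand-picked values.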

### Reproducing the results

In this section we first summarize how to use the KBAT model on new datasets (not available in our GitHub repo). Once the basic setup and initialization are done, we give the steps for reproducing the results reported in the paper.

- Let's start by cloning the GitHub repository, which contains the PyTorch implementation of the KBAT network.
`$ git clone https://github.com/deepakn97/relationPrediction.git`

- If you wish to reuse the datasets provided, please feel free to skip ahead to the training step (step 4). Otherwise, we need to create a new data directory and populate it with some important files.
  - *entity2id.txt:* contains the mapping of entity names to ids. Ids start from 0.
  - *relation2id.txt:* contains the mapping of relation names to ids. Ids start from 0.
  - *train.txt and test.txt:* contain the list of triples in the format *entity1 relation entity2*.

  For a better example, please inspect one of the data directories.

- The KBAT network requires the entity and relation embedding vectors to be initialized before it can start training. We use TransE embeddings to initialize these vectors. The set of commands given below can be used to get the embedding files. For more detailed information please refer here.

- Now we have everything in place and we can start training the model. Once training completes, the model will automatically evaluate its performance on the given test set. To train the model we need to run *main.py* using the following command.

To find out about all the available parameters, please look at the GitHub repository. We also provide the values of all the parameters used to produce the results reported in the paper.
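To make the data directory layout from the second step more concrete, the following sketch writes a tiny dataset to a temporary directory and reads it back. The tab separator and the helper names `read_id_map` and `read_triples` are assumptions made for this example; please check one of the data directories in the repository for the exact format before preparing your own files.

```python
import os
import tempfile

def read_id_map(path):
    """Read lines of the form `name<TAB>id` into a dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                name, idx = line.rstrip("\n").split("\t")
                mapping[name] = int(idx)
    return mapping

def read_triples(path, entity2id, relation2id):
    """Read lines of the form `entity1<TAB>relation<TAB>entity2` as id triples."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                e1, rel, e2 = line.rstrip("\n").split("\t")
                triples.append((entity2id[e1], relation2id[rel], entity2id[e2]))
    return triples

# Write a toy dataset with a single triple to a temporary directory.
data_dir = tempfile.mkdtemp()
files = {
    "entity2id.txt": "London\t0\nUnited Kingdom\t1\n",
    "relation2id.txt": "capital_of\t0\n",
    "train.txt": "London\tcapital_of\tUnited Kingdom\n",
}
for name, content in files.items():
    with open(os.path.join(data_dir, name), "w", encoding="utf-8") as f:
        f.write(content)

entity2id = read_id_map(os.path.join(data_dir, "entity2id.txt"))
relation2id = read_id_map(os.path.join(data_dir, "relation2id.txt"))
train = read_triples(os.path.join(data_dir, "train.txt"), entity2id, relation2id)
print(train)
```

Reading the files back this way is a quick sanity check that every entity and relation used in the triples has an id and that the ids start from 0.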

### Citation

Please cite the following paper if you use this code in your work.

```
@InProceedings{KBGAT2019,
author = "Nathani, Deepak and Chauhan, Jatin and Sharma, Charu and Kaul, Manohar",
title = "Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
year = "2019",
publisher = "Association for Computational Linguistics",
location = "Florence, Italy",
}
```

For any query or suggestion, please drop a mail at deepakn1019@gmail.com.