Homophily-Based Graph Node Classification Using GNNs

Blogs

Graph Neural Networks (GNNs) are becoming increasingly popular for tasks such as node classification, link prediction, and graph-level predictions. This project aims to classify nodes in a graph based on the homophily assumption using GNNs, with a particular focus on the CA-AstroPh dataset. The dataset represents collaborations between authors in astrophysics, and we demonstrate how to leverage graph structure for node classification.

Dataset and Graph Construction

We used the CA-AstroPh dataset, which captures collaborations between authors on papers submitted to the Arxiv Astro Physics category. The graph was constructed using NetworkX from a text file listing edges between nodes. Each node represents an author, and an edge signifies collaboration between two authors.

import networkx as nx

# Constructing the graph
G = nx.Graph()
# Assuming edges have been parsed from the dataset
edges = [(author1, author2), (author3, author4)]  # Example edge list
G.add_edges_from(edges)

Graph Statistics

To understand the complexity of the graph, we computed key statistics such as the number of nodes and edges:

print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")

Homophily Assumption

The homophily assumption states that nodes with similar labels are more likely to be connected. In the absence of ground truth labels, we assigned random labels to nodes for the purpose of simulation. The goal was to verify whether the homophily assumption holds in this scenario.

Label and Feature Assignment

We assigned random labels and features to the nodes:

import numpy as np

# Assign random labels
labels = {node: np.random.choice([0, 1]) for node in G.nodes()}

# Generate random features
features = {node: np.random.rand(10) for node in G.nodes()}  # 10-dimensional features

Random Walks and Node2Vec

To generate node embeddings, we employed random walks and trained a Node2Vec model using Word2Vec. Random walks capture the local graph structure, and these walks are treated as sentences for the Word2Vec model.

from gensim.models import Word2Vec

# Generate random walks
def generate_walks(graph, num_walks=10, walk_length=80):
    walks = []
    for node in graph.nodes():
        for _ in range(num_walks):
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if len(neighbors) > 0:
                    walk.append(np.random.choice(neighbors))
                else:
                    break
            walks.append(walk)
    return walks

walks = generate_walks(G)
# Train Node2Vec model
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, workers=4)
node_embeddings = {node: model.wv[str(node)] for node in G.nodes()}

Diagram: Random Walk Process

graph LR
  A[Start Node]
  B[Neighbor 1]
  C[Neighbor 2]
  D[Random Choice 1]
  E[Random Choice 2]
  A --> B --> D
  A --> C --> E

Training the Classifiers

Random Forest Classifier

The node embeddings were used as input features for a Random Forest classifier. Labels were predicted based on these embeddings.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Prepare data
X = np.array([node_embeddings[node] for node in G.nodes()])
y = np.array([labels[node] for node in G.nodes()])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

Probabilistic Relational Classifier

In addition to the RandomForestClassifier, a Probabilistic Relational Classifier (PRC) was employed to capture dependencies between neighboring nodes.

Model Evaluation and Results

Both classifiers were evaluated using the accuracy metric, as well as precision and recall.

Random Forest Classifier Results

Accuracy: 91.53%
Precision: 93.19%
Recall: 86.08%

Probabilistic Relational Classifier Results

Accuracy: 92.70%
Precision: 91.78%
Recall: 90.63%

pie
    title Classifier Performance
    "Random Forest Accuracy": 91.53
    "Probabilistic Relational Accuracy": 92.70

Conclusion

The project successfully demonstrated the application of GNNs to a node classification task, utilizing the homophily assumption. Both classifiers performed well, with the Probabilistic Relational Classifier slightly outperforming the Random Forest model. Future work could involve applying this methodology to real-world datasets with more specific features and labels.

Last updated on October 14, 2024

Implementation of MLOps CI/CD Pipeline Graph Neural Networks