Homophily-Based Graph Node Classification Using GNNs
Graph Neural Networks (GNNs) are becoming increasingly popular for tasks such as node classification, link prediction, and graph-level predictions. This project aims to classify nodes in a graph based on the homophily assumption using GNNs, with a particular focus on the CA-AstroPh dataset. The dataset represents collaborations between authors in astrophysics, and we demonstrate how to leverage graph structure for node classification.
Dataset and Graph Construction
We used the CA-AstroPh dataset, which captures collaborations between authors on papers submitted to the Arxiv Astro Physics category. The graph was constructed using NetworkX
from a text file listing edges between nodes. Each node represents an author, and an edge signifies collaboration between two authors.
import networkx as nx
# Constructing the graph
G = nx.Graph()
# Assuming edges have been parsed from the dataset
edges = [(author1, author2), (author3, author4)] # Example edge list
G.add_edges_from(edges)
Graph Statistics
To understand the complexity of the graph, we computed key statistics such as the number of nodes and edges:
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")
Homophily Assumption
The homophily assumption states that nodes with similar labels are more likely to be connected. In the absence of ground truth labels, we assigned random labels to nodes for the purpose of simulation. The goal was to verify whether the homophily assumption holds in this scenario.
Label and Feature Assignment
We assigned random labels and features to the nodes:
import numpy as np
# Assign random labels
labels = {node: np.random.choice([0, 1]) for node in G.nodes()}
# Generate random features
features = {node: np.random.rand(10) for node in G.nodes()} # 10-dimensional features
Random Walks and Node2Vec
To generate node embeddings, we employed random walks and trained a Node2Vec model using Word2Vec
. Random walks capture the local graph structure, and these walks are treated as sentences for the Word2Vec model.
from gensim.models import Word2Vec
# Generate random walks
def generate_walks(graph, num_walks=10, walk_length=80):
walks = []
for node in graph.nodes():
for _ in range(num_walks):
walk = [node]
while len(walk) < walk_length:
neighbors = list(graph.neighbors(walk[-1]))
if len(neighbors) > 0:
walk.append(np.random.choice(neighbors))
else:
break
walks.append(walk)
return walks
walks = generate_walks(G)
# Train Node2Vec model
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, workers=4)
node_embeddings = {node: model.wv[str(node)] for node in G.nodes()}
Diagram: Random Walk Process
graph LR A[Start Node] B[Neighbor 1] C[Neighbor 2] D[Random Choice 1] E[Random Choice 2] A --> B --> D A --> C --> E
Training the Classifiers
Random Forest Classifier
The node embeddings were used as input features for a Random Forest classifier. Labels were predicted based on these embeddings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Prepare data
X = np.array([node_embeddings[node] for node in G.nodes()])
y = np.array([labels[node] for node in G.nodes()])
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
Probabilistic Relational Classifier
In addition to the RandomForestClassifier, a Probabilistic Relational Classifier (PRC) was employed to capture dependencies between neighboring nodes.
Model Evaluation and Results
Both classifiers were evaluated using the accuracy metric, as well as precision and recall.
Random Forest Classifier Results
- Accuracy: 91.53%
- Precision: 93.19%
- Recall: 86.08%
Probabilistic Relational Classifier Results
- Accuracy: 92.70%
- Precision: 91.78%
- Recall: 90.63%
pie title Classifier Performance "Random Forest Accuracy": 91.53 "Probabilistic Relational Accuracy": 92.70
Conclusion
The project successfully demonstrated the application of GNNs to a node classification task, utilizing the homophily assumption. Both classifiers performed well, with the Probabilistic Relational Classifier slightly outperforming the Random Forest model. Future work could involve applying this methodology to real-world datasets with more specific features and labels.