
A Coding Implementation for Creating, Annotating, and Visualizing Complex Biological Knowledge Graphs

In this tutorial, we explore how to use PyBEL to build and analyze rich biological knowledge graphs directly in Google Colab. We first install the necessary packages: PyBEL, NetworkX, Matplotlib, Seaborn, and Pandas. We then show how to define proteins, biological processes, and protein modifications using the PyBEL DSL. From there, we build a disease-related pathway that encodes causal relationships, protein-protein associations, and phosphorylation events. Beyond graph construction, we cover network analysis, including centrality measures, node classification, and subgraph extraction, along with techniques for extracting citation and evidence data. By the end, you will have a fully annotated BEL graph ready for downstream visualization and enrichment analysis, laying a solid foundation for interactive exploration of biological knowledge.

!pip install pybel pybel-tools networkx matplotlib seaborn pandas -q


import pybel
import pybel.dsl as dsl
from pybel import BELGraph
from pybel.io import to_pickle, from_pickle
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')


print("PyBEL Advanced Tutorial: Biological Expression Language Ecosystem")
print("=" * 65)

We begin by installing PyBEL and its dependencies directly in Colab, so that NetworkX, Matplotlib, Seaborn, and Pandas are all available for the analysis. After installation, we import the core modules and suppress warnings to keep the notebook output clean and focused on the results.

print("\n1. Building a Biological Knowledge Graph")
print("-" * 40)


graph = BELGraph(
   name="Alzheimer's Disease Pathway",
   version="1.0.0",
   description="Example pathway showing protein interactions in AD",
   authors="PyBEL Tutorial"
)


app = dsl.Protein(name="APP", namespace="HGNC")
abeta = dsl.Protein(name="Abeta", namespace="CHEBI")
tau = dsl.Protein(name="MAPT", namespace="HGNC")
gsk3b = dsl.Protein(name="GSK3B", namespace="HGNC")
inflammation = dsl.BiologicalProcess(name="inflammatory response", namespace="GO")
apoptosis = dsl.BiologicalProcess(name="apoptotic process", namespace="GO")




graph.add_increases(app, abeta, citation="PMID:12345678", evidence="APP cleavage produces Abeta")
graph.add_increases(abeta, inflammation, citation="PMID:87654321", evidence="Abeta triggers neuroinflammation")


tau_phosphorylated = dsl.Protein(name="MAPT", namespace="HGNC",
                               variants=[dsl.ProteinModification("Ph")])
graph.add_increases(gsk3b, tau_phosphorylated, citation="PMID:11111111", evidence="GSK3B phosphorylates tau")
graph.add_increases(tau_phosphorylated, apoptosis, citation="PMID:22222222", evidence="Hyperphosphorylated tau causes cell death")
graph.add_increases(inflammation, apoptosis, citation="PMID:33333333", evidence="Inflammation promotes apoptosis")


graph.add_association(abeta, tau, citation="PMID:44444444", evidence="Abeta and tau interact synergistically")


print(f"Created BEL graph with {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges")

We initialize a BELGraph for an Alzheimer’s disease pathway with metadata and define proteins and biological processes using the PyBEL DSL. By adding causal relationships, a protein modification, and an association, we construct an integrated network that captures the interactions among key molecules.
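To make the edge model concrete, the sketch below represents each statement as a plain record with a subject, a relation, an object, a citation, and an evidence string. The `Statement` class, its field names, and the `to_line` helper are illustrative assumptions and are not part of PyBEL; the node labels are written in BEL-like notation only for readability. The point is that every causal or associative edge carries its own provenance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Hypothetical stand-in for one annotated edge (not a PyBEL class)."""
    subject: str
    relation: str
    obj: str
    citation: str
    evidence: str


# Three of the tutorial's edges, written out as plain records
statements = [
    Statement("p(HGNC:APP)", "increases", "p(CHEBI:Abeta)",
              "PMID:12345678", "APP cleavage produces Abeta"),
    Statement("p(HGNC:GSK3B)", "increases", "p(HGNC:MAPT, pmod(Ph))",
              "PMID:11111111", "GSK3B phosphorylates tau"),
    Statement("p(CHEBI:Abeta)", "association", "p(HGNC:MAPT)",
              "PMID:44444444", "Abeta and tau interact synergistically"),
]


def to_line(statement):
    """Render a statement as 'subject relation object'."""
    return f"{statement.subject} {statement.relation} {statement.obj}"


for statement in statements:
    print(f"{to_line(statement)}  [{statement.citation}]")
```

Keeping the citation and evidence on the edge, rather than on the nodes, is what later allows the literature support of each relationship to be audited separately.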

print("\n2. Advanced Network Analysis")
print("-" * 30)


degree_centrality = nx.degree_centrality(graph)
betweenness_centrality = nx.betweenness_centrality(graph)
closeness_centrality = nx.closeness_centrality(graph)


most_central = max(degree_centrality, key=degree_centrality.get)
print(f"Most connected node: {most_central}")
print(f"Degree centrality: {degree_centrality[most_central]:.3f}")

We compute degree, betweenness, and closeness centrality to quantify the importance of each node in the graph. Identifying the most connected node points to potential hubs that may drive disease mechanisms.
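As a rough illustration of what the degree centrality score means, the plain-Python sketch below divides each node's degree by n - 1, counting both incoming and outgoing edges for a directed graph. The edge list is a simplified stand-in with illustrative labels (it omits the association and the parent/variant relationships a BEL graph may contain), so the values will not match the tutorial graph exactly.

```python
from collections import Counter

# Simplified stand-in for the tutorial's causal edges (labels are illustrative)
edges = [
    ("APP", "Abeta"),
    ("Abeta", "inflammation"),
    ("GSK3B", "pTau"),
    ("pTau", "apoptosis"),
    ("inflammation", "apoptosis"),
]


def degree_centrality(edge_list):
    """Return degree / (n - 1) for every node that appears in an edge."""
    degree = Counter()
    for source, target in edge_list:
        degree[source] += 1  # outgoing edge
        degree[target] += 1  # incoming edge
    n = len(degree)
    if n <= 1:
        # Guard against division by zero for a single-node edge list
        return {node: 0.0 for node in degree}
    return {node: count / (n - 1) for node, count in degree.items()}


scores = degree_centrality(edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.2f}")
```

In this toy network several nodes tie for the highest score, which is a reminder that `max` reports only one of possibly several equally connected hubs.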

print("\n3. Biological Entity Classification")
print("-" * 35)


node_types = Counter()
for node in graph.nodes():
   node_types[node.function] += 1


print("Node distribution:")
for func, count in node_types.items():
   print(f"  {func}: {count}")

We classify each node by its function, such as protein or biological process, and count how many nodes fall into each class. This breakdown shows the composition of the network at a glance.

print("\n4. Pathway Analysis")
print("-" * 20)


proteins = [node for node in graph.nodes() if node.function == 'Protein']
processes = [node for node in graph.nodes() if node.function == 'BiologicalProcess']


print(f"Proteins in pathway: {len(proteins)}")
print(f"Biological processes: {len(processes)}")


edge_types = Counter()
for u, v, data in graph.edges(data=True):
   edge_types[data.get('relation')] += 1


print("\nRelationship types:")
for rel, count in edge_types.items():
   print(f"  {rel}: {count}")

We separate proteins from biological processes to gauge the scope and complexity of the pathway. Counting the relationship types further reveals which interactions (e.g., increases or association) dominate our model.

print("\n5. Literature Evidence Analysis")
print("-" * 32)


citations = []
evidences = []
for _, _, data in graph.edges(data=True):
   if 'citation' in data:
       citations.append(data['citation'])
   if 'evidence' in data:
       evidences.append(data['evidence'])


print(f"Total citations: {len(citations)}")
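# Citations may be stored as dict-like objects, which are not always hashable, so compare their string forms
print(f"Unique citations: {len({str(c) for c in citations})}")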
print(f"Evidence statements: {len(evidences)}")

We extract the citation identifiers and evidence strings from each edge to assess how well the graph is grounded in published research. Comparing total and unique citation counts lets us gauge the breadth of the supporting literature.
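Counting unique citations requires hashable values, and citations may arrive either as plain strings or as dict-like records. The sketch below normalizes both forms to a `(namespace, identifier)` tuple before deduplicating. The helper name `citation_key` and the dict keys `namespace` and `identifier` are assumptions made for illustration; check the actual edge data in your PyBEL version before relying on them.

```python
def citation_key(citation):
    """Return a hashable (namespace, identifier) pair for a citation.

    Accepts a string such as "PMID:123" or a dict with 'namespace' and
    'identifier' keys (the key names are assumptions for this sketch).
    """
    if isinstance(citation, dict):
        namespace = str(citation.get("namespace", "")).lower()
        return (namespace, str(citation.get("identifier", "")))
    prefix, separator, identifier = str(citation).partition(":")
    if separator:
        return (prefix.lower(), identifier)
    return ("", prefix)  # bare identifier without a prefix


# Mixed input: two identical string citations and one dict-style citation
citations = [
    "PMID:12345678",
    "PMID:12345678",
    {"namespace": "pubmed", "identifier": "87654321"},
]

unique = {citation_key(c) for c in citations}
print(f"Total citations: {len(citations)}")
print(f"Unique citations: {len(unique)}")
```

Note that this sketch does not reconcile different spellings of the same namespace (for example `pmid` versus `pubmed`); a mapping for that would be a further assumption.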

print("\n6. Subgraph Analysis")
print("-" * 22)


inflammation_nodes = [inflammation]
inflammation_neighbors = list(graph.predecessors(inflammation)) + list(graph.successors(inflammation))
inflammation_subgraph = graph.subgraph(inflammation_nodes + inflammation_neighbors)


print(f"Inflammation subgraph: {inflammation_subgraph.number_of_nodes()} nodes, {inflammation_subgraph.number_of_edges()} edges")

We isolate the inflammation subgraph by collecting the node's direct neighbors (its predecessors and successors), which gives a focused view of inflammatory crosstalk. This targeted subnetwork highlights how inflammation interfaces with other disease processes.
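The neighborhood extraction can be sketched in plain Python on a toy edge list: collect the center node, its predecessors, and its successors, then keep every edge whose endpoints both fall inside that set. The function name `one_hop_subgraph` and the node labels are illustrative assumptions. The second step mirrors how an induced subgraph behaves: edges between two neighbors are kept even when they do not touch the center node.

```python
# Simplified stand-in for the tutorial's causal edges (labels are illustrative)
edges = [
    ("APP", "Abeta"),
    ("Abeta", "inflammation"),
    ("inflammation", "apoptosis"),
    ("GSK3B", "pTau"),
    ("pTau", "apoptosis"),
]


def one_hop_subgraph(edge_list, center):
    """Return the one-hop node set around `center` and the induced edges."""
    members = {center}
    for source, target in edge_list:
        if source == center:
            members.add(target)  # successor of the center node
        elif target == center:
            members.add(source)  # predecessor of the center node
    # Induced edges: both endpoints must belong to the neighborhood
    induced = [(s, t) for s, t in edge_list if s in members and t in members]
    return members, induced


members, induced = one_hop_subgraph(edges, "inflammation")
print(sorted(members))
print(induced)
```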

print("\n7. Advanced Graph Querying")
print("-" * 28)


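try:
    # Enumerate every simple path of at most 3 edges from APP to apoptosis
    paths = list(nx.all_simple_paths(graph, app, apoptosis, cutoff=3))
    print(f"Paths from APP to apoptosis: {len(paths)}")
    if paths:
        # Paths are not guaranteed to be ordered by length, so take the minimum explicitly
        print(f"Shortest path length: {min(len(path) for path in paths) - 1}")
    else:
        print("No paths found between APP and apoptosis")
except (nx.NetworkXNoPath, nx.NodeNotFound):
    print("No paths found between APP and apoptosis")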


apoptosis_inducers = list(graph.predecessors(apoptosis))
print(f"Factors that increase apoptosis: {len(apoptosis_inducers)}")

We enumerate simple paths from APP to apoptosis to explore mechanistic routes and identify key intermediates. Listing the direct predecessors of apoptosis also shows which factors may trigger cell death.
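The effect of the `cutoff` argument, and the difference between enumerating paths and finding the shortest one, can be seen on a small stand-in graph. The sketch below uses a plain `nx.DiGraph` with illustrative node labels rather than the BEL graph itself, so the edges are assumptions chosen to give two routes of different length.

```python
import networkx as nx

# Toy directed graph with two routes from APP to apoptosis (labels are illustrative)
toy = nx.DiGraph()
toy.add_edges_from([
    ("APP", "Abeta"),
    ("Abeta", "inflammation"),
    ("inflammation", "apoptosis"),
    ("Abeta", "tau"),
    ("tau", "pTau"),
    ("pTau", "apoptosis"),
])

# cutoff limits the number of edges a path may contain
short_only = list(nx.all_simple_paths(toy, "APP", "apoptosis", cutoff=3))
all_routes = list(nx.all_simple_paths(toy, "APP", "apoptosis", cutoff=4))

# Pick the shortest enumerated path explicitly instead of relying on order
shortest = min(all_routes, key=len)

print(f"Paths with cutoff=3: {len(short_only)}")
print(f"Paths with cutoff=4: {len(all_routes)}")
print(f"Shortest route: {' -> '.join(shortest)}")
print(f"Reverse route exists: {nx.has_path(toy, 'apoptosis', 'APP')}")
```

When only the length of the shortest route matters, `nx.shortest_path_length` avoids enumerating every path, which can become expensive on larger graphs.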

print("\n8. Data Export and Visualization")
print("-" * 35)


adj_matrix = nx.adjacency_matrix(graph)
node_labels = [str(node) for node in graph.nodes()]


plt.figure(figsize=(12, 8))


plt.subplot(2, 2, 1)
pos = nx.spring_layout(graph, k=2, iterations=50)
nx.draw(graph, pos, with_labels=False, node_color="lightblue",
       node_size=1000, font_size=8, font_weight="bold")
plt.title("BEL Network Graph")


plt.subplot(2, 2, 2)
centralities = list(degree_centrality.values())
plt.hist(centralities, bins=10, alpha=0.7, color="green")
plt.title("Degree Centrality Distribution")
plt.xlabel("Centrality")
plt.ylabel("Frequency")


plt.subplot(2, 2, 3)
functions = list(node_types.keys())
counts = list(node_types.values())
plt.pie(counts, labels=functions, autopct="%1.1f%%", startangle=90)
plt.title("Node Type Distribution")


plt.subplot(2, 2, 4)
relations = list(edge_types.keys())
rel_counts = list(edge_types.values())
plt.bar(relations, rel_counts, color="orange", alpha=0.7)
plt.title("Relationship Types")
plt.xlabel("Relation")
plt.ylabel("Count")
plt.xticks(rotation=45)


plt.tight_layout()
plt.show()

We prepare the adjacency matrix and node labels for downstream use and generate a multi-panel figure showing the network structure, the degree centrality distribution, node-type proportions, and relationship-type counts. These visualizations bring our BEL graph to life and support deeper biological interpretation.
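An adjacency matrix is only useful downstream if its rows and columns stay aligned with the node labels. The plain-Python sketch below builds a small labeled matrix from an edge list and writes it as CSV text; the node order, labels, and the helper name `adjacency_rows` are illustrative assumptions, and the standard-library `csv` module stands in for whatever export format you prefer.

```python
import csv
import io

# Fixed node order: row i and column i both refer to nodes[i]
nodes = ["APP", "Abeta", "inflammation", "apoptosis"]
edges = [
    ("APP", "Abeta"),
    ("Abeta", "inflammation"),
    ("inflammation", "apoptosis"),
]


def adjacency_rows(node_order, edge_list):
    """Build a dense adjacency matrix (list of rows) in the given node order."""
    index = {node: i for i, node in enumerate(node_order)}
    matrix = [[0] * len(node_order) for _ in node_order]
    for source, target in edge_list:
        matrix[index[source]][index[target]] += 1  # count parallel edges
    return matrix


matrix = adjacency_rows(nodes, edges)

# Write the matrix with a header row and a label column
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["source"] + nodes)
for label, row in zip(nodes, matrix):
    writer.writerow([label] + row)

print(buffer.getvalue())
```

Reading row by row, a 1 in column j of row i means an edge from `nodes[i]` to `nodes[j]`, so the matrix of a directed graph is generally not symmetric.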

In this tutorial, we demonstrated PyBEL’s power and flexibility for modeling complex biological systems. We showed how to construct a curated, white-box graph of Alzheimer’s disease interactions, perform network-level analysis to identify key hub nodes, and extract biologically meaningful subgraphs for focused study. We also covered basic practices for mining literature evidence and preparing data structures for visualization. As a next step, we encourage you to extend this framework to your own pathways, integrate additional omics data, run enrichment tests, or combine the graph with machine learning workflows.




Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. He is very interested in solving practical problems, and he brings a new perspective to the intersection of AI and real-life solutions.
