Biological Knowledge Graphs for Drug Discovery

Jonathan Alles

EVOBYTE Digital Biology

Introduction

Drug discovery has a data problem, but not for lack of data. The challenge is fragmentation. Gene annotations live in one database, pathway maps in another, phenotypes and diseases in a third, and the literature sprawls everywhere. Each source uses different identifiers and schemas, which makes even simple questions—what targets connect this phenotype to this pathway?—surprisingly hard to answer.

Biological knowledge graphs (often shortened to KGs) cut through that noise. A knowledge graph represents entities such as genes, proteins, pathways, diseases, phenotypes, compounds, and even papers as nodes. It then connects those nodes with typed relationships like “encodes,” “participates_in,” “associated_with,” or “inhibits.” Because relationships are first‑class data, you can ask the graph for specific, explainable links across resources and across levels of biology. The result is a unified, queryable, and machine‑actionable view of your R&D landscape.

In this post, we’ll unpack how biological knowledge graphs unify heterogeneous bioinformatics resources, and then show how they support target prioritization, hypothesis generation, and retrieval‑augmented generation (RAG) workflows in biotech and pharma settings. Along the way we’ll define the key acronyms, offer short examples, and keep the focus on decisions you can make today.

Why biological knowledge graphs matter in drug discovery workflows

At its core, a knowledge graph makes integration a design principle rather than an afterthought. Where traditional data warehouses try to coerce diverse sources into one rigid schema, a KG embraces heterogeneity by modeling concepts and connections directly. It handles new node types and relationships without re‑architecting your whole stack, which is crucial when your inputs range from gene expression signatures to pathway diagrams and full‑text articles.

Two properties make KGs especially valuable in drug discovery. First, they align data using biomedical ontologies—controlled vocabularies that define terms and their relationships. Resources like the Gene Ontology (GO), the Human Phenotype Ontology (HPO), and disease ontologies help normalize language across datasets. Second, they preserve provenance. Every edge can carry metadata about where it came from, how it was inferred, and how confident you should be. That transparency enables explainable analytics and builds trust with wet‑lab teams and regulators alike.

Once built, the graph turns common informatics tasks into natural questions. Instead of scripting a dozen table joins, you can ask, “Which targets sit on pathways enriched for phenotypes seen in our patient cohort, and what tool compounds already modulate them?” Because the answer is a set of connected nodes and edges rather than a flat table, you immediately see the path of evidence.

From heterogeneous bioinformatics resources to a unified knowledge graph

Unifying genes, pathways, phenotypes, and literature starts with entity resolution and controlled vocabularies. Every resource favors its own identifiers: Ensembl vs. Entrez for genes, ChEMBL vs. PubChem for compounds, OMIM vs. MONDO for diseases. A robust KG normalizes these with curated cross‑references and compact URIs (CURIEs) so that “TP53,” “ENSG00000141510,” and “7157” resolve to the same gene node. Ontologies like GO and HPO sit at the center of this normalization layer, defining shared terms and parent‑child hierarchies that keep your graph semantically coherent.
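To make the normalization step concrete, here is a minimal Python sketch of alias resolution. The lookup table, the `resolve_gene` helper, and the choice of `NCBIGene:7157` as the canonical key are illustrative assumptions; a real build would load cross‑references from curated mapping files rather than a hand‑written dictionary.

```python
from typing import Optional

# Assumption: the NCBI Gene CURIE serves as the canonical key for the gene node.
CANONICAL_TP53 = "NCBIGene:7157"

# Hypothetical cross-reference table; keys are upper-cased aliases.
XREFS = {
    "TP53": CANONICAL_TP53,             # gene symbol
    "ENSG00000141510": CANONICAL_TP53,  # Ensembl gene ID
    "7157": CANONICAL_TP53,             # bare Entrez / NCBI Gene ID
    "NCBIGENE:7157": CANONICAL_TP53,    # already written as a CURIE
}


def resolve_gene(identifier: str) -> Optional[str]:
    """Return the canonical CURIE for a gene alias, or None if it is unknown."""
    return XREFS.get(identifier.strip().upper())


aliases = ["TP53", "ENSG00000141510", "7157", " tp53 "]
resolved = {alias: resolve_gene(alias) for alias in aliases}
print(resolved)
```

Returning `None` for unknown aliases, rather than guessing, lets the build pipeline log unmapped identifiers for review instead of silently creating duplicate nodes.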

Relationships come next. You’ll map edges like gene→pathway participation, gene→disease associations, phenotype→disease annotations, compound→target bioactivity, and paper→entity mentions. Importantly, edges are typed and directional. A “causes” edge is not the same as an “associated_with” edge, and neither is equivalent to “negatively_regulates.” Capturing that nuance lets you later constrain queries by mechanistic specificity rather than drowning in weak associations.

The final piece is literature. Text mining systems tag entities in abstracts and full texts, extracting candidate relationships with confidence scores and offsets that point back to the exact sentences. Those edges are noisier than curated databases, but they’re also the fastest way to surface emerging biology. Your KG should keep them, but always with provenance and a score so downstream consumers can apply thresholds. Because the literature evolves daily, the graph build should be reproducible and incremental, with pipelines that refresh specific edge types without rebuilding everything.
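As a rough illustration of keeping text‑mined edges with provenance and a score, the sketch below stores character offsets on each edge so the supporting sentence can be recovered later. The `TextMinedEdge` fields, the helper names, and all identifiers and confidence values are made up for the example; the 0.8 cutoff is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMinedEdge:
    subject: str
    predicate: str
    obj: str
    doc_id: str        # document the claim was extracted from
    start: int         # character offset where the supporting sentence starts
    end: int           # character offset just past the supporting sentence
    confidence: float  # extraction score assigned by the text-mining system


def evidence_sentence(edge, documents):
    """Recover the exact supporting sentence from the stored offsets."""
    return documents[edge.doc_id][edge.start:edge.end]


def above_threshold(edges, min_confidence):
    """Keep only edges whose extraction confidence meets the threshold."""
    return [e for e in edges if e.confidence >= min_confidence]


# Toy document and edges (hypothetical identifiers and scores).
documents = {
    "DOC:1": "We profiled microglia. GENE_A inhibits GENE_B in vitro. Further work is needed."
}
sentence = "GENE_A inhibits GENE_B in vitro."
start = documents["DOC:1"].index(sentence)

edges = [
    TextMinedEdge("GENE_A", "inhibits", "GENE_B", "DOC:1", start, start + len(sentence), 0.91),
    TextMinedEdge("GENE_A", "associated_with", "DISEASE_X", "DOC:1", 0, 22, 0.42),
]

kept = above_threshold(edges, min_confidence=0.8)
for edge in kept:
    print(edge.subject, edge.predicate, edge.obj, "<-", evidence_sentence(edge, documents))
```

Because the threshold is applied at read time, the low‑scoring edge stays in the graph and a different consumer can choose a looser cutoff.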

It’s worth noting that KGs come in two common flavors. Resource Description Framework (RDF) graphs store statements as subject‑predicate‑object “triples” and use standards like OWL (the Web Ontology Language) to represent rich semantics. Property graphs, popularized by systems like Neo4j, attach attributes to both nodes and edges and use labels to group types. Both models can represent biomedical data well. Teams usually pick based on downstream tooling: SPARQL for RDF stores and Cypher or Gremlin for property graphs. The key is not the label, but the discipline: normalized entities, typed edges, clear provenance, and repeatable ETL.
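To see how the two flavors relate, the sketch below writes one fact both ways using plain Python structures, with no graph database involved. The identifiers, property names, and the `edge_to_triples` helper are hypothetical. It uses standard RDF reification (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) as one possible way to carry edge attributes in triples; named graphs are a common alternative.

```python
# A property-graph edge: attributes live directly on the relationship.
property_edge = {
    "start": "GENE:A",
    "type": "participates_in",
    "end": "PATHWAY:P1",
    "properties": {"source": "curated_db", "confidence": 0.9},
}


def edge_to_triples(edge, statement_id):
    """Express a property-graph edge as (subject, predicate, object) triples."""
    # The core fact is a single triple.
    triples = [(edge["start"], edge["type"], edge["end"])]
    # Reification: a statement node describes the triple so metadata can attach to it.
    triples += [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", edge["start"]),
        (statement_id, "rdf:predicate", edge["type"]),
        (statement_id, "rdf:object", edge["end"]),
    ]
    # Each edge attribute becomes one more triple about the statement node.
    for key, value in edge["properties"].items():
        triples.append((statement_id, key, value))
    return triples


triples = edge_to_triples(property_edge, "STATEMENT:1")
for triple in triples:
    print(triple)
```

The same provenance survives in both models; the difference is mostly how many statements it takes to say it and which query language reads it back.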

Target prioritization with graph analytics and embeddings

Once your knowledge graph is in place, it becomes a ranking engine for drug targets. You can score nodes by their position in the graph, their connectivity to high‑value biology, and the strength and recency of evidence.

Classic graph algorithms provide strong baselines. Degree centrality highlights hub genes, which can be risky targets because highly connected genes often have broad effects, but are sometimes indispensable. PageRank favors nodes connected to other important nodes, surfacing targets that sit at the crossroads of relevant pathways. Shortest‑path and k‑hop queries reveal how close a target is to a disease via mechanistically meaningful edges such as “gene participates_in pathway,” “pathway enriched_in phenotype,” and “phenotype observed_in disease.” Random walk with restart seeds the walk on your disease or phenotype set and prioritizes genes visited more often than chance, which aligns well with the “guilt‑by‑association” principle biologists know.
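For readers who want to see random walk with restart in code, here is a minimal NumPy sketch on a four‑node toy graph. The node names, the chain‑shaped graph, and the restart probability of 0.3 are assumptions chosen for illustration; a production run would use the real adjacency matrix and a tuned restart value.

```python
import numpy as np


def random_walk_with_restart(adjacency, seed_indices, restart=0.3, tol=1e-10, max_iter=1000):
    """Return stationary visit probabilities for a walk that restarts at the seeds."""
    adjacency = np.asarray(adjacency, dtype=float)
    column_sums = adjacency.sum(axis=0)
    column_sums[column_sums == 0] = 1.0   # avoid division by zero for isolated nodes
    transition = adjacency / column_sums  # column-normalized transition matrix

    # Restart distribution: uniform over the seed nodes.
    p0 = np.zeros(adjacency.shape[0])
    p0[list(seed_indices)] = 1.0 / len(seed_indices)

    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * transition @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy undirected chain: DISEASE_X - GENE_A - GENE_B - GENE_C
nodes = ["DISEASE_X", "GENE_A", "GENE_B", "GENE_C"]
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

scores = random_walk_with_restart(adjacency, seed_indices=[0])
gene_indices = sorted(range(1, len(nodes)), key=lambda i: scores[i], reverse=True)
ranked_genes = [nodes[i] for i in gene_indices]
print(ranked_genes)
```

In this toy graph the genes rank by proximity to the seed, which is the guilt‑by‑association behavior described above. On real graphs, hub nodes can also score highly simply because they are well connected, so it is common to compare against a degree‑matched baseline.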

Embeddings push this further. Knowledge graph embeddings (KGE) like TransE, RotatE, and ComplEx learn low‑dimensional vectors for nodes and relations so that valid triples score higher than invalid ones. In practice, you can train a KGE model on your graph and then score candidate triples like (target, involved_in, disease) or (compound, inhibits, target). High‑scoring, previously unseen edges become hypotheses, and each candidate’s nearest neighbors in the embedding space hint at why the edge might hold. Because embeddings are fixed‑length numeric vectors, you can combine them with features like tissue expression, CRISPR screen results, or safety annotations in a simple ranker or even a small neural model.
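The scoring idea behind TransE can be sketched in a few lines: a triple is plausible when the head vector plus the relation vector lands close to the tail vector. The vectors below are hand‑set toy values, not trained embeddings, and the entity names are hypothetical; in practice you would obtain the vectors from a KGE training library.

```python
import numpy as np


def transe_score(head, relation, tail):
    """TransE plausibility: closer to zero when head + relation lands near tail."""
    return -float(np.linalg.norm(head + relation - tail))


# Hand-set toy vectors, NOT trained embeddings.
entity_vectors = {
    "GENE_A": np.array([0.0, 1.0]),
    "DISEASE_X": np.array([1.0, 1.0]),
    "DISEASE_Y": np.array([-2.0, 0.0]),
}
relation_vectors = {"involved_in": np.array([1.0, 0.0])}


def score_candidate(disease):
    """Score the candidate triple (GENE_A, involved_in, disease)."""
    return transe_score(
        entity_vectors["GENE_A"],
        relation_vectors["involved_in"],
        entity_vectors[disease],
    )


candidates = ["DISEASE_Y", "DISEASE_X"]
ranked = sorted(candidates, key=score_candidate, reverse=True)
print(ranked)
```

Ranking candidate tails for a fixed head and relation is the basic link‑prediction query; the top of the list is where new hypotheses come from.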

The payoff shows up in real decisions. Imagine your team is working on a neuroinflammatory indication characterized by microglial activation and synaptic loss. A graph query that starts from phenotypes such as “abnormal microglial cell morphology” and “impaired synaptic transmission,” walks through enriched pathways, and lands on targets with tool compounds could quickly surface candidates like kinases involved in microglial signaling. If those targets also connect to independent edges from the literature and human genetics, the rank climbs. Now you have a shortlist with explicit paths of evidence you can paste into a slide for the project team.

Here’s a compact example using Cypher, a query language for property graphs. It asks for targets that are connected to a disease via phenotype and pathway evidence and that have an inhibitor with a pChEMBL value of at least 6, and it returns a simple score that combines edge confidence and recency.

MATCH (d:Disease {id:"MONDO:0007149"})<-[:OBSERVED_IN]-(p:Phenotype)
MATCH (g:Gene)-[rp:PARTICIPATES_IN]->(pw:Pathway)<-[re:ENRICHED_IN]-(p)
MATCH (g)<-[i:INHIBITS]-(c:Compound)
WHERE rp.source IN ["Reactome","GO"] AND i.pchembl_value >= 6.0
WITH g, c, d, p, pw,
     // Sum the three edge confidences; 0.7 is the fallback when an edge has none
     coalesce(rp.confidence,0.7) + coalesce(re.confidence,0.7) + coalesce(i.confidence,0.7) AS econf,
     // Age of the bioactivity claim in years; edges without a year count as current
     (date().year - coalesce(i.year, date().year)) AS age
RETURN g.symbol AS target, c.name AS compound, pw.name AS pathway,
       round(econf - 0.01*age,3) AS score
ORDER BY score DESC
LIMIT 15;

Even a simple query like this beats ad hoc joins because it is explainable. You can click into any path and inspect which edges drove the score, which paper they came from, and how recent the claim is.

Hypothesis generation that stays grounded in mechanism

Target prioritization is only half the story. Hypothesis generation is where a KG earns its keep because it encourages mechanism‑first thinking. Rather than asking the model “what else might work,” you ask the graph to reveal plausible paths: a compound modulates a target, the target sits on a pathway, the pathway aligns with observed phenotypes, and those phenotypes characterize your disease. Because each hop is typed, you’re not just correlating; you’re proposing a mechanistic chain that a bench scientist can test.

Embeddings make these chains wider without making them fuzzier. A model like RotatE can flag an unobserved edge between a target and disease if similar patterns exist elsewhere in the graph. Combined with constraints—only consider targets expressed in the right tissue or with clean safety annotations—you get hypotheses that are both novel and practical. Some teams also run link‑prediction over compound–target pairs to scout repurposing options. In each case, the KG doesn’t replace biology; it narrows the search space to the handful of ideas most consistent with what we already know.

To keep hypotheses trustworthy, enforce provenance discipline. Every edge should carry enough metadata to trace it back to its source. When a model suggests a new link, immediately pull the nearest supporting paths and citations. This “evidence bundle” allows your therapeutic area experts to judge plausibility quickly and decide the next experiment.
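One way to assemble such an evidence bundle is to enumerate the short typed paths that connect the two ends of a predicted link. The sketch below does this with a breadth‑first search over a toy in‑memory edge list; the edge tuples, source labels, and the `supporting_paths` helper are hypothetical, and a real system would more likely run the equivalent path query inside the graph database.

```python
from collections import deque


def supporting_paths(edges, start, goal, max_hops=3):
    """Return all simple directed paths from start to goal with at most max_hops edges."""
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)

    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            paths.append(path)
            continue
        if len(path) == max_hops:
            continue
        visited = {start} | {edge[2] for edge in path}
        for edge in outgoing.get(node, []):
            if edge[2] not in visited:  # keep paths simple (no repeated nodes)
                queue.append((edge[2], path + [edge]))
    return paths


# Toy edges: (subject, predicate, object, source). All values are made up.
edges = [
    ("GENE_A", "participates_in", "PATHWAY_P1", "curated_db"),
    ("PATHWAY_P1", "enriched_in", "PHENOTYPE_H1", "cohort_analysis"),
    ("PHENOTYPE_H1", "observed_in", "DISEASE_X", "curated_db"),
    ("GENE_A", "associated_with", "DISEASE_X", "DOC:1"),
    ("GENE_B", "participates_in", "PATHWAY_P1", "curated_db"),
]

bundle = supporting_paths(edges, "GENE_A", "DISEASE_X")
for path in bundle:
    print(" -> ".join(f"{s} [{p}; {src}]" for s, p, _, src in path), "->", path[-1][2])
```

Breadth‑first order returns the shortest paths first, so reviewers see the most direct evidence before the longer mechanistic chains.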

Retrieval‑augmented generation (RAG) that speaks graph and text

Large language models (LLMs) shine when they can retrieve the right context. Retrieval‑augmented generation, or RAG, feeds an LLM with documents or facts relevant to a question and asks it to synthesize an answer grounded in that evidence. In life sciences, RAG works best when it blends two channels: vector search over literature and symbolic search over the knowledge graph.

A graph‑aware RAG pipeline looks like this. First, you entity‑normalize the user’s question so that “p53,” “TP53,” and “tumor protein p53” resolve to the same node. Next, you run a constrained graph query to collect high‑value facts: targets near the disease, pathways enriched for the phenotypes, compounds with strong bioactivity, and the provenance for each edge. In parallel, you run vector search over curated text chunks from papers and databases to find passages that mention those same entities in context. Finally, you pack both the symbolic facts and the best text snippets into the model’s context window with a prompt that forces citation and asks for step‑by‑step, mechanism‑first reasoning.

Because KGs structure the retrieval space, they also make RAG safer. You can filter out low‑confidence edges, prefer human‑curated sources for mechanistic claims, and bound the model’s scope to entities already vetted by your team. The LLM then becomes a narrative engine that stitches together the graph’s paths and the literature’s language into a crisp, citable answer.
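These guardrails can be sketched as a small filter that runs between the graph query and the prompt. The fact dictionaries, field names, vetted‑entity set, and the 0.7 cutoff are assumptions for illustration, not a fixed schema.

```python
def select_facts(facts, vetted_entities, min_confidence=0.7):
    """Filter graph facts for an LLM prompt and put curated sources first."""
    kept = [
        fact for fact in facts
        if fact["confidence"] >= min_confidence
        and fact["subject"] in vetted_entities
        and fact["object"] in vetted_entities
    ]
    # Curated facts first, then descending confidence within each group.
    return sorted(kept, key=lambda fact: (not fact["curated"], -fact["confidence"]))


def format_fact(fact):
    """Render one fact as a prompt line that carries its citation."""
    return f"- {fact['subject']} {fact['predicate']} {fact['object']} [source: {fact['source']}]"


# Toy facts (hypothetical identifiers, scores, and sources).
facts = [
    {"subject": "COMPOUND_C1", "predicate": "inhibits", "object": "GENE_A",
     "confidence": 0.85, "curated": False, "source": "DOC:1"},
    {"subject": "GENE_A", "predicate": "participates_in", "object": "PATHWAY_P1",
     "confidence": 0.95, "curated": True, "source": "curated_db"},
    {"subject": "GENE_A", "predicate": "associated_with", "object": "DISEASE_X",
     "confidence": 0.40, "curated": False, "source": "DOC:2"},
    {"subject": "GENE_Z", "predicate": "participates_in", "object": "PATHWAY_P1",
     "confidence": 0.90, "curated": True, "source": "curated_db"},
]
vetted = {"GENE_A", "PATHWAY_P1", "COMPOUND_C1", "DISEASE_X"}

selected = select_facts(facts, vetted)
prompt_facts = "\n".join(format_fact(fact) for fact in selected)
print(prompt_facts)
```

Keeping the source tag on every line gives the model something concrete to cite and gives reviewers a way to check each claim against the graph.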

Here’s a tiny sketch of how you might wire a KG query into a RAG step in Python. It’s deliberately minimal to highlight the flow.

import os

import numpy as np
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Connection settings are read from the environment (variable names are placeholders)
NEO4J_URI = os.environ["NEO4J_URI"]
USER = os.environ["NEO4J_USER"]
PASS = os.environ["NEO4J_PASSWORD"]

# 1) Query the knowledge graph for grounded facts
driver = GraphDatabase.driver(NEO4J_URI, auth=(USER, PASS))
cypher = """
MATCH (d:Disease {id:$disease})<-[:OBSERVED_IN]-(p:Phenotype)
MATCH (g:Gene)-[:PARTICIPATES_IN]->(:Pathway)<-[:ENRICHED_IN]-(p)
OPTIONAL MATCH (c:Compound)-[i:INHIBITS]->(g)
RETURN g.symbol AS target, collect(DISTINCT p.label)[0..5] AS phenos,
       collect(DISTINCT c.name)[0..3] AS compounds
LIMIT 20
"""
with driver.session() as session:
    rows = session.run(cypher, {"disease": "MONDO:0007149"}).data()
driver.close()

# 2) Gather literature snippets (pretend we have pre-embedded chunks)
model = SentenceTransformer("all-MiniLM-L6-v2")
question = "Which microglial targets link patient phenotypes to pathway biology and have tool compounds?"
qvec = model.encode([question])
# load_candidate_snippets is a placeholder you supply: list[{"text":..., "vector":...}]
snippets = load_candidate_snippets()
# Score every snippet against the question in one call, then keep the five best
sims = cosine_similarity(np.array([snip["vector"] for snip in snippets]), qvec)[:, 0]
top = [snippets[i] for i in np.argsort(-sims)[:5]]

# 3) Build the RAG prompt for your LLM of choice
facts = "\n".join([f"- {r['target']}: phenotypes={r['phenos']}, compounds={r['compounds']}" for r in rows])
context = "\n".join([snip["text"] for snip in top])
prompt = f"Use the facts to propose 2 testable targets.\nFacts:\n{facts}\nEvidence:\n{context}\nExplain mechanisms and cite evidence."

print(prompt[:500])

This pattern scales from notebook prototypes to production services. The KG constrains retrieval to relevant, explainable facts. Vector search pulls in the best supporting prose. The LLM writes, but the graph keeps it honest.

Summary / Takeaways

Biological knowledge graphs give drug discovery a shared map. By unifying genes, pathways, phenotypes, and literature under a common set of identifiers, ontologies, and typed relationships, they replace brittle data stitching with flexible, explainable queries. Once in place, the graph becomes a ranking engine for targets, a generator of mechanistically grounded hypotheses, and a retrieval backbone for AI assistants that need to stay anchored to real evidence.

If you’re starting fresh, begin with your program’s critical questions and work backward to the entities and edges you need. Normalize identifiers through GO, HPO, and trusted cross‑references. Preserve provenance from day one. Add graph analytics for quick wins in target ranking, then layer in embeddings and RAG to explore novel ideas with confidence. Most importantly, keep the loop tight between computational insights and bench experiments. The best knowledge graphs don’t just store what we know—they help us decide what to test next.