KEGG Pathway Analysis in R

KEGG pathway analysis is a practical way to move from a raw gene list to biological interpretation. In R, a common workflow is to use clusterProfiler for enrichment testing, enrichplot for summary plots, and pathview to paint expression changes directly onto KEGG pathway diagrams. That combination is especially useful after RNA-seq, microarray, proteomics, or CRISPR screens, where the main question is often not “which genes changed?” but “which pathways changed?” (bioconductor.org)

Steps and Expected Outcome

You will install a small set of Bioconductor packages, load a reproducible example gene list, run KEGG over-representation analysis with enrichKEGG(), inspect the result table, visualize the top pathways, and optionally run ranked-list GSEA with gseKEGG(). By the end, you should have a pathway enrichment table, publication-ready summary plots, and a rendered KEGG pathway image with your fold changes overlaid. (bioconductor.org)

Requirements

Basic R skills and familiarity with differential expression results.
A local R environment with internet access for live KEGG queries and pathway downloads; no GPU is required. (bioconductor.org)
R packages: clusterProfiler, enrichplot, pathview, DOSE, org.Hs.eg.db, dplyr, and ggplot2.

Step 1: Install the packages

Install the Bioconductor and CRAN packages once. After that, you only need to load them in each new session.

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install(c(
  "clusterProfiler",
  "enrichplot",
  "pathview",
  "DOSE",
  "org.Hs.eg.db"
))

install.packages(c("dplyr", "ggplot2"))

Load the libraries:

library(clusterProfiler)
library(enrichplot)
library(pathview)
library(DOSE)
library(org.Hs.eg.db)
library(dplyr)
library(ggplot2)
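
If one of the library() calls fails, it can help to see every missing package at once rather than fixing them one error at a time. The following is a minimal sketch using only base R; the variable names are illustrative and not part of any package.

```r
# Packages this tutorial relies on
required_pkgs <- c(
  "clusterProfiler", "enrichplot", "pathview",
  "DOSE", "org.Hs.eg.db", "dplyr", "ggplot2"
)

# requireNamespace() returns TRUE or FALSE without attaching the package
is_installed <- vapply(
  required_pkgs, requireNamespace, logical(1), quietly = TRUE
)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```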

Step 2: Load a reproducible example gene list

To keep the tutorial runnable, use the built-in geneList object from DOSE. It is a named numeric vector, which is exactly what you need for a ranked analysis, and its names can also be reused as input IDs for over-representation analysis. (yulab-smu.top)

data(geneList, package = "DOSE")

# Make sure the vector is ranked from high to low
geneList <- sort(geneList, decreasing = TRUE)

# Inspect the first few entries
head(geneList)

Example output structure:

# Named numeric vector:
#   4312    8318   10874    55143 ...
#  4.57    4.51    4.42     4.14  ...

Create two objects:

sig_genes: the genes you want to test in over-representation analysis
bg_genes: the background universe of all measured genes

sig_genes <- names(geneList)[abs(geneList) > 2]
bg_genes  <- names(geneList)

length(sig_genes)
length(bg_genes)

Using a defined background is good practice because it makes the enrichment test reflect the genes that were actually measured in your experiment.
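
To see why the background matters, consider the hypergeometric test that over-representation analysis is typically based on. The sketch below uses base R only, and every count in it is made up for illustration. As a simplification, the pathway size is held fixed while the universe size changes; in a real analysis both change together.

```r
# Hypothetical counts, chosen only for illustration
pathway_size <- 100   # pathway genes present in the universe
selected     <- 200   # genes in the significant list
overlap      <- 10    # significant genes that fall in the pathway

# P(overlap >= observed) under the hypergeometric distribution
ora_pvalue <- function(overlap, pathway_size, selected, universe_size) {
  phyper(
    q = overlap - 1,
    m = pathway_size,
    n = universe_size - pathway_size,
    k = selected,
    lower.tail = FALSE
  )
}

# Universe = genes actually measured vs. a larger genome-wide universe
p_measured <- ora_pvalue(overlap, pathway_size, selected, universe_size = 10000)
p_genome   <- ora_pvalue(overlap, pathway_size, selected, universe_size = 20000)

c(measured = p_measured, genome = p_genome)
```

With the larger universe the same overlap looks more surprising, so the p-value shrinks. An inflated background can therefore make pathways appear more significant than the experiment supports.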

Step 3: Run KEGG over-representation analysis

enrichKEGG() performs the classical pathway enrichment test. The key inputs are the gene IDs, the organism code, and optionally the background universe. The function supports several KEGG ID types, including KEGG IDs, NCBI gene IDs, NCBI protein IDs, and UniProt IDs. (bioconductor.org)

kk <- enrichKEGG(
  gene           = sig_genes,
  universe       = bg_genes,
  organism       = "hsa",
  keyType        = "ncbi-geneid",
  pvalueCutoff   = 0.05,
  pAdjustMethod  = "BH",
  qvalueCutoff   = 0.20,
  minGSSize      = 10,
  maxGSSize      = 500
)

kk

What the main arguments mean:

organism = "hsa" means human
keyType = "ncbi-geneid" tells the function your IDs are Entrez-style NCBI gene IDs
universe = bg_genes sets the tested background
minGSSize and maxGSSize filter out very small or very large pathways
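
The pAdjustMethod = "BH" setting refers to Benjamini-Hochberg correction, which base R exposes through p.adjust(). A small sketch with made-up p-values shows how the adjustment can move a pathway across the 0.05 line:

```r
# Hypothetical raw p-values for five pathways
raw_p <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Benjamini-Hochberg adjustment
adj_p <- p.adjust(raw_p, method = "BH")

data.frame(
  raw_p      = raw_p,
  adj_p      = adj_p,
  passes_raw = raw_p < 0.05,
  passes_adj = adj_p < 0.05
)
```

In this toy example four pathways pass on raw p-values, but only two remain after adjustment.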

If your result is empty, the most common causes are incorrect ID type, the wrong organism code, or a gene list that is too small.
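
A quick way to rule out the ID-type problem is to check whether the input looks like NCBI gene IDs, which are purely numeric. This base-R sketch uses a hypothetical mixed vector; it only tests the format and does not confirm that the IDs exist in KEGG.

```r
# Hypothetical input that mixes ID types by accident
ids <- c("4312", "8318", "TP53", "ENSG00000141510", "10874")

# NCBI gene IDs consist of digits only
is_entrez_like <- grepl("^[0-9]+$", ids)

# How many IDs have the expected format?
table(is_entrez_like)

# Which IDs need converting before enrichment?
ids[!is_entrez_like]
```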

Step 4: Convert the result to a tidy table

The enrichment object is easy to inspect directly, but converting it to a data frame makes filtering and reporting simpler.

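kegg_table <- as.data.frame(kk) %>%
  arrange(p.adjust) %>%
  # Call dplyr::select() explicitly: AnnotationDbi (attached with
  # org.Hs.eg.db) also defines select(), and load order decides which wins
  dplyr::select(
    ID, Description, GeneRatio, BgRatio,
    pvalue, p.adjust, qvalue, Count, geneID
  )

# Show the ten most significant pathways
head(kegg_table, 10)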

Extract the genes for the top pathway:

top_pathway_genes <- strsplit(kegg_table$geneID[1], "/")[[1]]
top_pathway_genes

This is useful when you want to inspect the exact drivers behind a significant pathway. You can also pull genes directly from the enrichment object by pathway ID. (yulab-smu.top)
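
The same splitting idea extends to the whole table. The sketch below builds a small stand-in for kegg_table with made-up values, then turns the slash-separated columns into a named gene list and a numeric gene ratio. It assumes only the column layout shown above.

```r
# Toy stand-in for kegg_table; all values are hypothetical
toy_table <- data.frame(
  ID        = c("hsa04110", "hsa04114"),
  GeneRatio = c("11/93", "10/93"),
  geneID    = c("8318/991/9133", "991/9133/983"),
  stringsAsFactors = FALSE
)

# Named list: pathway ID -> character vector of gene IDs
genes_by_pathway <- setNames(
  strsplit(toy_table$geneID, "/", fixed = TRUE),
  toy_table$ID
)

# Genes shared by every listed pathway
shared_genes <- Reduce(intersect, genes_by_pathway)

# GeneRatio is stored as text such as "11/93"; convert it for sorting
ratio_parts <- strsplit(toy_table$GeneRatio, "/", fixed = TRUE)
toy_table$GeneRatioNum <- vapply(
  ratio_parts,
  function(p) as.numeric(p[1]) / as.numeric(p[2]),
  numeric(1)
)

genes_by_pathway[["hsa04110"]]
shared_genes
```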

Step 5: Visualize the enriched pathways

For a fast overview, barplot() and dotplot() are usually enough. If you want to see which genes connect to which pathways, use cnetplot(). The enrichment visualization methods in enrichplot support KEGG results directly. (yulab-smu.top)

barplot(kk, showCategory = 10, title = "Top KEGG pathways")

dotplot(kk, showCategory = 10, title = "KEGG enrichment")

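# Argument names for cnetplot() have changed between enrichplot releases.
# If circular or colorEdge are deprecated or ignored in your installed
# version, check ?cnetplot for the current equivalents.
cnetplot(
  kk,
  showCategory = 5,
  foldChange   = geneList,
  circular     = FALSE,
  colorEdge    = TRUE
)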

A practical interpretation pattern is:

use dotplot() to rank the top pathways
use cnetplot() to see shared driver genes across pathways
then move to pathway rendering for one or two top hits

Step 6: Render a KEGG pathway map with expression values

pathview downloads the KEGG pathway graph, maps your numeric values to pathway nodes, and renders an image. This is often the most intuitive figure in the whole workflow. (bioconductor.org)

Pick the top enriched pathway:

top_id <- kegg_table$ID[1]          # for example "hsa04110"
top_pathway <- sub("^hsa", "", top_id)
top_pathway

Render the pathway:

pv_out <- pathview(
  gene.data  = geneList,
  pathway.id = top_pathway,
  species    = "hsa",
  out.suffix = "demo"
)

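After the command finishes, look in your working directory. You should see the rendered image, named after the pathway ID and your suffix (for example hsa04110.demo.png), next to the raw KEGG files that pathview downloaded (hsa04110.png and hsa04110.xml). The rendered image shows your fold changes painted onto the KEGG map.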
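
To confirm the rendering from a script, you can build the expected file name and test for it. This is a small base-R sketch: the pathway ID is a hypothetical example, and the naming pattern assumes pathview's default native KEGG output with a single set of values.

```r
# Hypothetical values matching the call above
top_id     <- "hsa04110"
out_suffix <- "demo"

# Pattern assumed here: <pathway ID>.<suffix>.png
expected_png <- paste0(top_id, ".", out_suffix, ".png")

if (file.exists(expected_png)) {
  message("Rendered pathway found: ", expected_png)
} else {
  message("Not found in ", getwd(), ": ", expected_png)
}
```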

Step 7: Run KEGG GSEA on the full ranked list

Use gseKEGG() when you have a ranked vector for all genes and do not want to choose an arbitrary significance cutoff first. This is usually the better option when small but coordinated shifts matter. gseKEGG() expects a named numeric vector sorted in decreasing order, with gene IDs as the names. (bioconductor.org)

gkk <- gseKEGG(
  geneList      = geneList,
  organism      = "hsa",
  keyType       = "ncbi-geneid",
  pvalueCutoff  = 0.05,
  minGSSize     = 10,
  maxGSSize     = 500,
  verbose       = FALSE
)

gsea_table <- as.data.frame(gkk) %>%
  arrange(p.adjust) %>%
  dplyr::select(ID, Description, enrichmentScore, NES, pvalue, p.adjust)

# Show the ten most significant gene sets
head(gsea_table, 10)

Plot the GSEA result:

dotplot(gkk, showCategory = 10, title = "KEGG GSEA")

A simple rule:

use enrichKEGG() for a discrete gene list
use gseKEGG() for a fully ranked vector
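
Before passing your own vector to gseKEGG(), it can save time to verify its shape. The helper below is a hypothetical base-R sketch, not part of clusterProfiler. It checks the properties a ranked list is generally expected to have: numeric values, gene IDs as names, no missing values, no duplicated IDs, and decreasing order.

```r
# Hypothetical helper: returns a character vector of problems (empty if none)
check_ranked_vector <- function(x) {
  problems <- character(0)
  if (!is.numeric(x)) {
    problems <- c(problems, "not numeric")
  }
  if (is.null(names(x))) {
    problems <- c(problems, "no names")
  }
  if (anyNA(x)) {
    problems <- c(problems, "contains NA values")
  }
  if (anyDuplicated(names(x)) > 0) {
    problems <- c(problems, "duplicated gene IDs")
  }
  # A decreasing vector is increasing once reversed
  if (isTRUE(is.unsorted(rev(x), na.rm = TRUE))) {
    problems <- c(problems, "not sorted in decreasing order")
  }
  problems
}

# Toy vectors with made-up values
good <- c("4312" = 4.57, "8318" = 4.51, "10874" = -1.2)
bad  <- c("4312" = 1.0, "4312" = 3.2, "8318" = NA)

check_ranked_vector(good)
check_ranked_vector(bad)
```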

Step 8: Adapt the workflow to your own data

In real projects, your input often starts as gene symbols or Ensembl IDs rather than NCBI gene IDs. A common first step is ID conversion. bitr() translates IDs using an OrgDb, while bitr_kegg() uses the KEGG API and is especially helpful for species without a convenient annotation database package. search_kegg_organism() helps find the correct KEGG organism code. (bioconductor.org)

Convert human gene symbols to Entrez IDs

de <- data.frame(
  SYMBOL = c("TP53", "EGFR", "BRCA1", "CDK1"),
  logFC  = c(2.4, 1.9, -1.8, 2.2)
)

id_map <- bitr(
  de$SYMBOL,
  fromType = "SYMBOL",
  toType   = "ENTREZID",
  OrgDb    = org.Hs.eg.db
)

de2 <- inner_join(de, id_map, by = "SYMBOL")

user_geneList <- de2$logFC
names(user_geneList) <- de2$ENTREZID
user_geneList <- sort(user_geneList, decreasing = TRUE)

user_geneList
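
ID conversion is not always one-to-one, so the joined table can contain the same Entrez ID more than once, which leaves duplicated names in the ranked vector. One common convention, shown here as a base-R sketch on made-up data, is to keep the row with the largest absolute fold change per Entrez ID. Other rules, such as averaging, are equally defensible.

```r
# Hypothetical mapped table in which two symbols resolve to the same Entrez ID
mapped <- data.frame(
  SYMBOL   = c("GENE_A", "GENE_A_ALIAS", "GENE_B", "GENE_C"),
  ENTREZID = c("1001", "1001", "1002", "1003"),
  logFC    = c(1.2, -2.5, 0.8, -0.3),
  stringsAsFactors = FALSE
)

# Put the strongest effect first, then keep the first row per Entrez ID
mapped <- mapped[order(abs(mapped$logFC), decreasing = TRUE), ]
dedup  <- mapped[!duplicated(mapped$ENTREZID), ]

# Rebuild the named, decreasing vector
ranked <- setNames(dedup$logFC, dedup$ENTREZID)
ranked <- sort(ranked, decreasing = TRUE)

ranked
```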

Find the KEGG organism code

search_kegg_organism("Homo sapiens", by = "scientific_name")
search_kegg_organism("Oryza sativa", by = "scientific_name")

Convert IDs with the KEGG API

bitr_kegg(
  geneID   = c("10458", "7157"),
  fromType = "ncbi-geneid",
  toType   = "kegg",
  organism = "hsa"
)

Once your own ranked vector is ready, it fits the same pattern shown above: pass the names of the genes that meet your significance cutoff to enrichKEGG(), and pass the full sorted vector to gseKEGG().

Recap

A clean KEGG workflow in R usually has four stages: prepare gene IDs, run enrichKEGG() or gseKEGG(), summarize the result with enrichplot, and render individual pathways with pathview. If you keep your ID type, organism code, and background universe consistent, the analysis is straightforward and highly reusable across RNA-seq and other omics projects. (bioconductor.org)

FAQ

Why are many of my genes missing from the KEGG result?

Usually this is an ID mapping issue or a coverage issue. KEGG does not annotate every gene to a pathway, so some genes are dropped even when the IDs are valid. A quick check is to test the mapping with bitr_kegg() before running enrichment. (yulab-smu.top)

Should I use enrichKEGG() or gseKEGG()?

Use enrichKEGG() when you already have a filtered gene list, such as significantly up- and down-regulated genes. Use gseKEGG() when you have a ranked vector for all genes and want to avoid choosing a hard threshold first. (bioconductor.org)

How do I make KEGG analyses more reproducible?

Online KEGG annotations can change over time. If you need stronger reproducibility, save a local snapshot of KEGG data and reuse it later, for example through a local GSON object or local internal data workflow instead of relying only on live queries. (yulab-smu.top)
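
Even without a local snapshot, recording when and with which software an analysis was run makes later discrepancies easier to explain. The sketch below uses base R only; the file name and list fields are illustrative assumptions, and packages that are not installed are skipped.

```r
# Packages whose versions are worth recording for this workflow
tracked_pkgs <- c("clusterProfiler", "enrichplot", "pathview", "DOSE")

# Keep only those that are actually installed
available <- tracked_pkgs[
  vapply(tracked_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

run_info <- list(
  run_date  = format(Sys.Date(), "%Y-%m-%d"),
  r_version = R.version.string,
  packages  = vapply(
    available,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
)

# Hypothetical output location; store it next to your result tables instead
info_file <- file.path(tempdir(), "kegg_run_info.rds")
saveRDS(run_info, info_file)

# Reload later to see how the analysis was produced
restored <- readRDS(info_file)
str(restored)
```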