KEGG pathway analysis is a practical way to move from a raw gene list to biological interpretation. In R, a common workflow is to use clusterProfiler for enrichment testing, enrichplot for summary plots, and pathview to paint expression changes directly onto KEGG pathway diagrams. That combination is especially useful after RNA-seq, microarray, proteomics, or CRISPR screens, where the main question is often not “which genes changed?” but “which pathways changed?” (bioconductor.org)
Steps and Expected Outcome
You will install a small set of Bioconductor packages, load a reproducible example gene list, run KEGG over-representation analysis with enrichKEGG(), inspect the result table, visualize the top pathways, and optionally run ranked-list GSEA with gseKEGG(). By the end, you should have a pathway enrichment table, publication-ready summary plots, and a rendered KEGG pathway image with your fold changes overlaid. (bioconductor.org)
Requirements
- Basic R skills and familiarity with differential expression results.
- A local R environment with internet access for live KEGG queries and pathway downloads; no GPU is required. (bioconductor.org)
- R packages: clusterProfiler, enrichplot, pathview, DOSE, org.Hs.eg.db, dplyr, and ggplot2.
Step 1: Install the packages
Install the Bioconductor packages once. After that, you can simply load them in future sessions.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c(
  "clusterProfiler",
  "enrichplot",
  "pathview",
  "DOSE",
  "org.Hs.eg.db"
))
install.packages(c("dplyr", "ggplot2"))
Load the libraries:
library(clusterProfiler)
library(enrichplot)
library(pathview)
library(DOSE)
library(org.Hs.eg.db)
library(dplyr)
library(ggplot2)
Step 2: Load a reproducible example gene list
To keep the tutorial runnable, use the built-in geneList object from DOSE. It is a named numeric vector, which is exactly what you need for a ranked analysis, and its names can also be reused as input IDs for over-representation analysis. (yulab-smu.top)
data(geneList, package = "DOSE")
# Make sure the vector is ranked from high to low
geneList <- sort(geneList, decreasing = TRUE)
# Inspect the first few entries
head(geneList)
Example output structure:
# Named numeric vector (names are Entrez gene IDs, values rounded here):
#  4312  8318 10874 55143 ...
#  4.57  4.51  4.42  4.14 ...
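Before reusing this pattern on your own data, it can help to check that a vector really has the structure described above. The following is a minimal sketch in base R; check_ranked_list() is a hypothetical helper written for this tutorial, not a function from DOSE or clusterProfiler, and the toy IDs and scores are invented.

```r
# Hypothetical helper: checks the structure a ranked gene list is expected to have.
check_ranked_list <- function(x) {
  stopifnot(
    is.numeric(x),             # values are scores such as log2 fold changes
    !is.null(names(x)),        # names carry the gene IDs
    !anyNA(x),                 # no missing scores
    !anyDuplicated(names(x)),  # one score per gene ID
    !is.unsorted(rev(x))       # sorted from high to low
  )
  invisible(TRUE)
}

# Toy vector with invented Entrez-style IDs and scores
toy <- c("1001" = 2.5, "1002" = 0.3, "1003" = -1.7)
check_ranked_list(toy)
```

If any check fails, stopifnot() stops with a message naming the failed condition, which is usually enough to locate the problem.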
Create two objects:
- sig_genes: the genes you want to test in over-representation analysis
- bg_genes: the background universe of all measured genes
sig_genes <- names(geneList)[abs(geneList) > 2]
bg_genes <- names(geneList)
length(sig_genes)
length(bg_genes)
Using a defined background is good practice because it makes the enrichment test reflect the genes that were actually measured in your experiment.
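With your own results, the same two objects usually come from a differential expression table rather than from geneList. The sketch below assumes a hypothetical data frame with columns named entrez, log2FC, and padj; the column names, cutoffs, and values are illustrative only.

```r
# Hypothetical differential expression table; all values are invented
de_res <- data.frame(
  entrez = c("1001", "1002", "1003", "1004", "1005"),
  log2FC = c(2.6, -2.3, 0.4, 1.1, -0.2),
  padj   = c(0.001, 0.004, 0.60, 0.03, 0.90),
  stringsAsFactors = FALSE
)

# Background: every gene that was measured and tested
bg_demo <- unique(de_res$entrez)

# Foreground: genes passing both an effect-size and a significance cutoff
keep <- abs(de_res$log2FC) > 2 & de_res$padj < 0.05
sig_demo <- unique(de_res$entrez[keep])

# The foreground should always be a subset of the background
stopifnot(all(sig_demo %in% bg_demo))
sig_demo
```

Deriving both vectors from the same table makes it hard to end up with test genes that fall outside the universe.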
Step 3: Run KEGG over-representation analysis
enrichKEGG() performs the classical pathway enrichment test. The key inputs are the gene IDs, the organism code, and optionally the background universe. The keyType argument tells the function which kind of ID you are supplying: "kegg", "ncbi-geneid", "ncbi-proteinid", or "uniprot". (bioconductor.org)
kk <- enrichKEGG(
  gene = sig_genes,
  universe = bg_genes,
  organism = "hsa",
  keyType = "ncbi-geneid",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  qvalueCutoff = 0.20,
  minGSSize = 10,
  maxGSSize = 500
)
kk
What the main arguments mean:
- organism = "hsa" means human
- keyType = "ncbi-geneid" tells the function your IDs are Entrez-style NCBI gene IDs
- universe = bg_genes sets the tested background
- minGSSize and maxGSSize filter out very small or very large pathways
If your result is empty, the most common causes are incorrect ID type, the wrong organism code, or a gene list that is too small.
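To make the "classical pathway enrichment test" concrete, over-representation for a single pathway can be worked by hand as a one-sided hypergeometric test, followed by the BH adjustment named in pAdjustMethod. The sketch below uses base R only; all counts and the two extra p-values are invented, and it illustrates the statistics rather than reproducing the internals of enrichKEGG().

```r
# Invented counts for one pathway
N <- 8000  # genes in the background universe
M <- 100   # background genes annotated to the pathway
n <- 200   # genes in the test list
k <- 10    # test-list genes that fall in the pathway

# By chance we would expect n * M / N = 2.5 overlapping genes.
# P(overlap >= k) under the hypergeometric distribution:
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same p-value from a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Benjamini-Hochberg adjustment across several pathways
# (the other two p-values are placeholders)
p_adj <- p.adjust(c(p_hyper, 0.02, 0.40), method = "BH")

c(hypergeometric = p_hyper, fisher = p_fisher)
p_adj
```

The calculation shows why the universe matters: N and M both depend on which genes count as background, so changing the universe changes every p-value.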
Step 4: Convert the result to a tidy table
The enrichment object is easy to inspect directly, but converting it to a data frame makes filtering and reporting simpler.
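kegg_table <- as.data.frame(kk) %>%
  arrange(p.adjust) %>%
  # dplyr::select is spelled out because AnnotationDbi also exports select()
  dplyr::select(
    ID, Description, GeneRatio, BgRatio,
    pvalue, p.adjust, qvalue, Count, geneID
  )
# Show the ten most significant pathways
head(kegg_table, 10)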
Extract the genes for the top pathway:
top_pathway_genes <- strsplit(kegg_table$geneID[1], "/")[[1]]
top_pathway_genes
This is useful when you want to inspect the exact drivers behind a significant pathway. You can also pull genes directly from the enrichment object by pathway ID. (yulab-smu.top)
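Two small table manipulations come up often at this stage: turning the GeneRatio and BgRatio strings into a numeric fold enrichment, and looking up genes by pathway ID. The sketch below works on a toy data frame that mimics the column layout of kegg_table; the pathway IDs, counts, and gene IDs are placeholders, and ratio_to_num() is a hypothetical helper.

```r
# Toy table mimicking the columns used above; all entries are invented
toy_tab <- data.frame(
  ID        = c("hsa00001", "hsa00002"),
  GeneRatio = c("10/200", "4/200"),
  BgRatio   = c("100/8000", "160/8000"),
  geneID    = c("1001/1002/1003", "1002/1004"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn a "k/n" string into a number
ratio_to_num <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

# Fold enrichment = share of test genes in the pathway / share of background genes
toy_tab$fold_enrichment <-
  ratio_to_num(toy_tab$GeneRatio) / ratio_to_num(toy_tab$BgRatio)

# Named list: pathway ID -> character vector of gene IDs
genes_by_pathway <- setNames(
  strsplit(toy_tab$geneID, "/", fixed = TRUE),
  toy_tab$ID
)

toy_tab[, c("ID", "fold_enrichment")]
genes_by_pathway[["hsa00002"]]
```

The same two steps should apply to kegg_table unchanged, since it carries the same column names.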
Step 5: Visualize the enriched pathways
For a fast overview, barplot() and dotplot() are usually enough. If you want to see which genes connect to which pathways, use cnetplot(). The enrichment visualization methods in enrichplot support KEGG results directly. (yulab-smu.top)
barplot(kk, showCategory = 10, title = "Top KEGG pathways")
dotplot(kk, showCategory = 10, title = "KEGG enrichment")
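# Note: argument names for cnetplot() have changed across enrichplot releases.
# If circular or colorEdge are rejected or trigger warnings, check ?cnetplot
# for the names used by your installed version.
cnetplot(
  kk,
  showCategory = 5,
  foldChange = geneList,
  circular = FALSE,
  colorEdge = TRUE
)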
A practical interpretation pattern is:
- use dotplot() to rank the top pathways
- use cnetplot() to see shared driver genes across pathways
- then move to pathway rendering for one or two top hits
Step 6: Render a KEGG pathway map with expression values
pathview downloads the KEGG pathway graph, maps your numeric values to pathway nodes, and renders an image. This is often the most intuitive figure in the whole workflow. (bioconductor.org)
Pick the top enriched pathway:
top_id <- kegg_table$ID[1] # for example "hsa04110"
top_pathway <- sub("^hsa", "", top_id)
top_pathway
Render the pathway:
pv_out <- pathview(
  gene.data = geneList,
  pathway.id = top_pathway,
  species = "hsa",
  out.suffix = "demo"
)
After the command finishes, look in your working directory. You should see pathway image files generated by pathview. Those images show your fold changes painted onto the KEGG map.
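To locate the rendered image programmatically, you can build the expected file name and check for it. This sketch assumes pathview's default native KEGG rendering, which names the output as species code plus pathway ID plus the out.suffix value; expected_pathview_png() is a hypothetical helper, and "04110" is just the example ID mentioned earlier.

```r
# Hypothetical helper: file name pathview is expected to write with
# default settings (native KEGG PNG output)
expected_pathview_png <- function(pathway_id, species = "hsa", suffix = "demo") {
  paste0(species, pathway_id, ".", suffix, ".png")
}

png_name <- expected_pathview_png("04110")
png_name

# TRUE only if pathview() has already written the file to the working directory
file.exists(png_name)
```

If the file is missing, list.files(pattern = "demo") shows whatever pathview did write under that suffix.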
Step 7: Run KEGG GSEA on the full ranked list
Use gseKEGG() when you have a ranked vector for all genes and do not want to choose an arbitrary significance cutoff first. This is usually the better option when small but coordinated shifts matter. gseKEGG() expects an ordered named numeric vector. (bioconductor.org)
gkk <- gseKEGG(
  geneList = geneList,
  organism = "hsa",
  keyType = "ncbi-geneid",
  pvalueCutoff = 0.05,
  minGSSize = 10,
  maxGSSize = 500,
  verbose = FALSE
)
gsea_table <- as.data.frame(gkk) %>%
  arrange(p.adjust) %>%
  # dplyr::select is spelled out because AnnotationDbi also exports select()
  dplyr::select(ID, Description, enrichmentScore, NES, pvalue, p.adjust)
# Show the ten most significant gene sets
head(gsea_table, 10)
Plot the GSEA result:
dotplot(gkk, showCategory = 10, title = "KEGG GSEA")
A simple rule:
- use enrichKEGG() for a discrete gene list
- use gseKEGG() for a fully ranked vector
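The reason a ranked vector is enough for GSEA is that the method walks down the ranking and tracks a running sum: it rises when a gene belongs to the pathway and falls otherwise. The sketch below is a simplified base R version of that weighted running-sum statistic on invented data. running_es() is a hypothetical helper; it omits the permutation testing and the normalization that produces the NES column, so it is an illustration and not the algorithm gseKEGG() runs.

```r
# Hypothetical helper: simplified GSEA-style enrichment score.
# ranked   : named numeric vector sorted from high to low
# gene_set : character vector of gene IDs
# p        : weight exponent applied to the absolute scores
running_es <- function(ranked, gene_set, p = 1) {
  in_set <- names(ranked) %in% gene_set
  stopifnot(any(in_set), any(!in_set))
  w <- abs(ranked)^p
  # Cumulative weighted fraction of set genes seen so far
  hit <- cumsum(ifelse(in_set, w, 0)) / sum(w[in_set])
  # Cumulative fraction of non-set genes seen so far
  miss <- cumsum(!in_set) / sum(!in_set)
  running <- hit - miss
  # Enrichment score: the largest deviation from zero, sign kept
  running[which.max(abs(running))]
}

# Invented ranked list of six genes
toy_rank <- c(g1 = 3, g2 = 2, g3 = 1, g4 = -1, g5 = -2, g6 = -3)

es_top    <- running_es(toy_rank, c("g1", "g2"))  # set at the top of the ranking
es_bottom <- running_es(toy_rank, c("g5", "g6"))  # set at the bottom
es_top
es_bottom
```

A set concentrated at the top of the ranking gets a positive score and one concentrated at the bottom gets a negative score, which is how the sign of enrichmentScore and NES should be read in gsea_table.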
Step 8: Adapt the workflow to your own data
In real projects, your input often starts as gene symbols or Ensembl IDs rather than NCBI gene IDs. A common first step is ID conversion. bitr() translates IDs using an OrgDb, while bitr_kegg() uses the KEGG API and is especially helpful for species without a convenient annotation database package. search_kegg_organism() helps find the correct KEGG organism code. (bioconductor.org)
Convert human gene symbols to Entrez IDs
de <- data.frame(
  SYMBOL = c("TP53", "EGFR", "BRCA1", "CDK1"),
  logFC = c(2.4, 1.9, -1.8, 2.2)
)
id_map <- bitr(
  de$SYMBOL,
  fromType = "SYMBOL",
  toType = "ENTREZID",
  OrgDb = org.Hs.eg.db
)
de2 <- inner_join(de, id_map, by = "SYMBOL")
user_geneList <- de2$logFC
names(user_geneList) <- de2$ENTREZID
user_geneList <- sort(user_geneList, decreasing = TRUE)
user_geneList
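ID conversion is not always one-to-one, so after a join the same Entrez ID can appear more than once, and a ranked vector with duplicated names is ambiguous. One possible policy, sketched below on invented data, is to keep the entry with the largest absolute logFC for each ID. This is an assumption chosen for illustration; averaging duplicates or keeping the most significant entry are equally defensible.

```r
# Hypothetical mapping result in which one Entrez ID appears twice
# (IDs and values are invented)
mapped <- data.frame(
  ENTREZID = c("1001", "1001", "1002", "1003"),
  logFC    = c(1.2, -2.0, 0.5, -0.8),
  stringsAsFactors = FALSE
)

# Put the strongest effects first, then keep the first row per Entrez ID
ord   <- order(abs(mapped$logFC), decreasing = TRUE)
dedup <- mapped[ord, ]
dedup <- dedup[!duplicated(dedup$ENTREZID), ]

# Build the named vector and rank it from high to low
ranked <- setNames(dedup$logFC, dedup$ENTREZID)
ranked <- sort(ranked, decreasing = TRUE)
ranked
```

Whichever policy you choose, apply it before sorting so the final vector has unique names.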
Find the KEGG organism code
search_kegg_organism("Homo sapiens", by = "scientific_name")
search_kegg_organism("Oryza sativa", by = "scientific_name")
Convert IDs with the KEGG API
bitr_kegg(
  geneID = c("10458", "7157"),
  fromType = "ncbi-geneid",
  toType = "kegg",
  organism = "hsa"
)
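Once your own ranked vector is ready, it fits the same pattern shown above: pass the full vector to gseKEGG(), and pass the names of the genes that meet your cutoff to enrichKEGG(), with all measured genes as the universe. The four-gene example here only demonstrates the conversion; a real analysis needs a much larger list.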
Recap
A clean KEGG workflow in R usually has four stages: prepare gene IDs, run enrichKEGG() or gseKEGG(), summarize the result with enrichplot, and render individual pathways with pathview. If you keep your ID type, organism code, and background universe consistent, the analysis is straightforward and highly reusable across RNA-seq and other omics projects. (bioconductor.org)
Further Reading
- clusterProfiler package page — package overview, installation, and documentation.
- KEGG enrichment chapter — practical KEGG examples, organism lookup, and ID conversion.
- KEGGREST vignette — KEGG REST access from R. (bioconductor.org)
- pathview package page — pathway rendering and package details. (bioconductor.org)
- enrichplot chapter — visualization patterns for enrichment results. (yulab-smu.top)
FAQ
Why are many of my genes missing from the KEGG result?
Usually this is an ID mapping issue or a coverage issue. KEGG does not annotate every gene to a pathway, so some genes are dropped even when the IDs are valid. A quick check is to test the mapping with bitr_kegg() before running enrichment. (yulab-smu.top)
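A quick way to tell a mapping problem from a coverage problem is to measure how many input IDs survived conversion. The sketch below uses two invented character vectors in place of real conversion output; in practice mapped_ids would come from a column of the data frame returned by your conversion step.

```r
# Invented example: IDs you submitted and IDs that came back mapped
input_ids  <- c("1001", "1002", "1003", "1004")
mapped_ids <- c("1001", "1003")

# Fraction of input IDs that were mapped
mapped_frac <- mean(input_ids %in% mapped_ids)

# IDs that were lost, worth inspecting by hand
unmapped <- setdiff(input_ids, mapped_ids)

mapped_frac
unmapped
```

A very low fraction points to a wrong ID type or organism code, while a high fraction with few pathway hits points to limited KEGG coverage.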
Should I use enrichKEGG() or gseKEGG()?
Use enrichKEGG() when you already have a filtered gene list, such as significantly up- and down-regulated genes. Use gseKEGG() when you have a ranked vector for all genes and want to avoid choosing a hard threshold first. (bioconductor.org)
How do I make KEGG analyses more reproducible?
Online KEGG annotations can change over time. If you need stronger reproducibility, save a local snapshot of KEGG data and reuse it later, for example through a local GSON object or local internal data workflow instead of relying only on live queries. (yulab-smu.top)
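Independently of how you snapshot the KEGG annotation itself, it is cheap to store each result together with the date and R version that produced it. The sketch below uses base R only and a placeholder result table; the file name and list fields are arbitrary choices, and it does not replace the local annotation snapshot described above.

```r
# Placeholder result; in practice this would be kegg_table or gsea_table
result_tab <- data.frame(
  ID = "hsa00001",
  p.adjust = 0.01,
  stringsAsFactors = FALSE
)

# Bundle the result with basic provenance information
snapshot <- list(
  created   = Sys.time(),
  r_version = R.version.string,
  result    = result_tab
)

# Write to a temporary directory here; use a project path in real work
snap_path <- file.path(tempdir(), "kegg_result_snapshot.rds")
saveRDS(snapshot, snap_path)

# Reload later and confirm the table round-trips unchanged
restored <- readRDS(snap_path)
identical(restored$result, snapshot$result)
```

Saving the output of sessionInfo() next to the snapshot records the package versions as well.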
