enrichlite - methods

1. What it does

enrichlite runs over-representation analysis (ORA). You paste a list of genes; it finds the gene sets, pathways, and Gene Ontology terms that your list overlaps more than you would expect if the genes had been picked at random. The idea is simple: if your genes cluster into "cell cycle" far more than chance would predict, that is a signal worth noticing.

Everything runs in your browser. The gene-set data is downloaded once as static files and the computation happens locally in a Web Worker. Your gene list is never uploaded; no data leaves your machine.

2. The statistic

For each gene set, the question is: of my query genes, how many landed in this set, and is that more than chance? enrichlite answers with a one-tailed hypergeometric test - the upper tail, P(X >= k) - the probability of seeing at least as large an overlap as observed, drawing without replacement from the background.

The contingency is defined by four numbers:

N - genes in the background universe
K - genes in this term that fall within the universe
n - query genes that fall within the universe
k - the overlap (query genes within the universe that are also in the term)

This is exactly the one-sided Fisher's exact test on the corresponding 2x2 table. The effect size reported alongside the p-value is the fold enrichment:

fold enrichment = (k / n) / (K / N)

A fold of 1 means the overlap equals the chance expectation of n x K / N genes; larger means enriched.

Numerically, the tail probability is summed in log space. Binomial coefficients are evaluated through a Lanczos approximation of the log-gamma function, which keeps them stable at universe sizes around 20,000, and successive terms of the sum are accumulated via the probability-mass recurrence to avoid overflow.
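The test and the fold enrichment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code the app runs: the function names are hypothetical, and the standard library's math.lgamma stands in for the Lanczos approximation mentioned above.

```python
import math

def log_choose(n, r):
    """Log of the binomial coefficient C(n, r), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)

def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N, K of them in the term."""
    lo = max(0, n - (N - K))  # smallest overlap that is possible at all
    hi = min(K, n)            # largest overlap that is possible at all
    if k <= lo:
        return 1.0
    if k > hi:
        return 0.0
    # First term P(X = k), evaluated in log space so the coefficients never overflow.
    log_pmf = log_choose(K, k) + log_choose(N - K, n - k) - log_choose(N, n)
    pmf = math.exp(log_pmf)   # assumption: extreme tails may underflow to 0.0 here
    total = pmf
    # Probability-mass recurrence: P(i + 1) / P(i) = (K - i)(n - i) / ((i + 1)(N - K - n + i + 1))
    for i in range(k, hi):
        pmf *= (K - i) * (n - i) / ((i + 1) * (N - K - n + i + 1))
        total += pmf
    return min(1.0, total)

def fold_enrichment(N, K, n, k):
    """Observed overlap fraction divided by the fraction expected by chance."""
    return (k / n) / (K / N)
```

For example, hypergeom_upper_tail(20000, 200, 100, 10) gives the p-value for a 10-gene overlap whose chance expectation is one gene, and fold_enrichment with the same arguments is about 10.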

3. Background / universe

The background is the pool of genes the test assumes your list was drawn from. It is the single most consequential choice you make, because it sets what "by chance" means.

Three options:

Annotated in collection (default) - the universe is every gene that has any annotation in the selected collection. This is the conventional choice for ORA: a gene with no annotation in the collection could never have been a hit, so it is excluded.
All protein-coding - the universe is the full protein-coding gene count for the species (human ~20,593; mouse ~26,310). This universe is larger than the annotated one, and the resulting test is usually more permissive: the same overlap tends to get a smaller p-value.
Custom - you supply your own background gene list (for example, the genes expressed/detected in your experiment).
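Whichever option is chosen, the four counts of the test are read off after restricting both the query and the term to the universe. A minimal sketch, assuming genes are plain strings already normalised to one species (the function name and the toy data are hypothetical):

```python
def contingency_counts(query, term_genes, universe):
    """Return (N, K, n, k) after restricting the query and the term to the universe."""
    universe = set(universe)
    query_in = set(query) & universe      # query genes outside the universe are ignored
    term_in = set(term_genes) & universe  # term genes outside the universe are ignored
    N = len(universe)
    K = len(term_in)
    n = len(query_in)
    k = len(query_in & term_in)
    return N, K, n, k

# Toy data: "Y" and "Z" are not in the background, so they drop out of n and K.
counts = contingency_counts(
    query={"A", "C", "Y"},
    term_genes={"A", "B", "Z"},
    universe={"A", "B", "C", "D", "E"},
)
```

With the toy data, counts is (5, 2, 2, 1): five background genes, two of them in the term, two usable query genes, and one shared gene.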

Changing the background changes N (and which genes count toward K and n), so it changes every p-value. A larger universe generally makes overlaps look more significant, because the overlap expected by chance shrinks; a smaller, more specific universe makes them look less so. The app always shows the N it used. Per-collection annotated-universe sizes:

Collection	Human N	Mouse N
Reactome	11,963	8,851
Hallmark	4,384	4,291
GO-BP	17,318	17,371
GO-MF	15,274	14,668
GO-CC	10,789	11,327
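To get a feel for how much N alone moves a p-value, the sketch below scores one made-up overlap against two of the human universe sizes quoted in this section. It holds K, n, and k fixed, which is a simplification: in practice a different background also changes which genes count toward K and n. The function name and the example counts are assumptions for illustration.

```python
from math import comb

def upper_tail(N, K, n, k):
    """Exact P(X >= k) for the hypergeometric distribution, using integer arithmetic."""
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return numerator / comb(N, n)

# Hypothetical result: a 150-gene term, 100 query genes, 5 of them in the term.
K, n, k = 150, 100, 5

p_reactome = upper_tail(11963, K, n, k)  # human Reactome annotated universe
p_coding = upper_tail(20593, K, n, k)    # human all-protein-coding universe

print(f"N=11963: p={p_reactome:.3g}")
print(f"N=20593: p={p_coding:.3g}")
```

With K, n, and k unchanged, the larger universe lowers the overlap expected by chance (n x K / N), so the same five genes yield a smaller p-value.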

4. Multiple testing

A collection contains hundreds or thousands of gene sets, and testing all of them inflates the chance of false positives. enrichlite corrects for this and reports adjusted values.

Benjamini-Hochberg FDR (default) - controls the expected proportion of false positives among the terms you call significant. The default cutoff is FDR < 0.05.
Bonferroni - controls the probability of any false positive at all; much stricter, and conservative when many terms are tested.

The correction family is the set of terms tested in the current run. Because only one collection is analysed at a time, the family is that collection's terms - not all collections combined.
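Both corrections are short enough to show in full. The sketch below uses the textbook definitions, with the family taken to be the list of p-values passed in; the function names are hypothetical and this is not claimed to be the app's own code.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the family size, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value can be capped
    # by the ones above it, which keeps the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
fwer = bonferroni(raw)
```

A term is then called significant when its adjusted value falls below the chosen cutoff, for example fdr[i] < 0.05.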

5. Data sources, versions, and licenses

All gene-set data comes from authoritative public sources, pinned to specific versions. None of it is hand-curated or invented here.

Gene Ontology - release 2026-05-19, concept DOI 10.5281/zenodo.1205166. License CC BY 4.0.
MSigDB Hallmark - version 2026.1. License CC BY 4.0; attribution to the Broad Institute, MIT, and the Regents of the University of California.
Reactome - current release. License CC0.

Human and mouse are kept as entirely separate universes with separate symbol tables. Human uses HGNC symbols (conventionally upper-case, e.g. BRAF); mouse uses MGI symbols (conventionally title-case, e.g. Braf). There is no on-the-fly ortholog conversion: a human gene list is tested only against human sets, and likewise for mouse. Within a species, matching is case-insensitive, so pasting braf still matches.
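Case-insensitive matching within one species can be pictured as a lookup keyed on lower-cased symbols. The sketch below is an assumption about how such a lookup might work, not a description of the app's symbol tables; the function names are hypothetical, and dropping duplicate inputs is a choice made here for illustration.

```python
def build_symbol_index(symbols):
    """Map each lower-cased symbol to its canonical spelling for one species."""
    return {symbol.lower(): symbol for symbol in symbols}

def match_query(raw_genes, index):
    """Split pasted genes into canonical matches and unmatched tokens."""
    matched, unmatched, seen = [], [], set()
    for raw in raw_genes:
        token = raw.strip()
        if not token:
            continue  # skip blank lines in the pasted list
        canonical = index.get(token.lower())
        if canonical is None:
            unmatched.append(token)
        elif canonical not in seen:  # assumption: repeated genes count once
            seen.add(canonical)
            matched.append(canonical)
    return matched, unmatched

human_index = build_symbol_index(["BRAF", "TP53"])
mouse_index = build_symbol_index(["Braf", "Trp53"])

human_hits, human_misses = match_query(["braf", "Tp53", "BRAF", "xyz"], human_index)
mouse_hits, mouse_misses = match_query(["BRAF", "TP53"], mouse_index)
```

Here human_hits is ["BRAF", "TP53"], while the mouse lookup matches only Braf: TP53 is not a mouse symbol and no ortholog conversion is attempted.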

6. Gene Ontology specifics

GO terms are built from the ontology graph plus per-species annotation files, with the standard filtering and propagation that ORA expects.

Ontology: go-basic (the filtered version safe for propagation), split into the three namespaces GO-BP, GO-MF, GO-CC.
Annotations: the goa_human and mgi GAF files.
Evidence policy: annotations with evidence code ND are always dropped; annotations carrying a NOT qualifier are dropped; electronic annotations (IEA) are excluded by default. An "Include IEA" toggle in the app switches to a separate IEA-included variant of the selected GO collection (which adds coverage, especially for less-studied genes, at lower confidence); the default remains IEA-excluded. The chosen variant is recorded in the figure legend, methods text, and exported file provenance.
True-path-rule propagation over is_a and part_of: a gene annotated to a specific term also counts toward every ancestor of that term. Propagation stays within a namespace.
Size bounds: only terms with between 5 and 500 genes (after propagation) are kept in the shipped data.
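The filtering, propagation, and size-bound steps above can be sketched on a toy ontology. Everything below is illustrative: the function names, the tuple layout of an annotation, and the parents mapping (assumed to hold only same-namespace is_a / part_of edges) are assumptions, not the build script's actual data structures.

```python
from collections import defaultdict

def ancestors(term, parents, cache):
    """All ancestors of a term over the given parent edges (memoised)."""
    if term not in cache:
        found = set()
        for parent in parents.get(term, ()):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[term] = found
    return cache[term]

def build_go_sets(annotations, parents, include_iea=False, min_size=5, max_size=500):
    """annotations: iterable of (gene, term, evidence_code, qualifier) tuples."""
    gene_sets = defaultdict(set)
    cache = {}
    for gene, term, evidence, qualifier in annotations:
        if evidence == "ND":
            continue  # no biological data available: always dropped
        if "NOT" in qualifier.split("|"):
            continue  # negated annotation: dropped
        if evidence == "IEA" and not include_iea:
            continue  # electronic annotation: excluded unless toggled on
        gene_sets[term].add(gene)
        for ancestor in ancestors(term, parents, cache):
            gene_sets[ancestor].add(gene)  # true-path rule
    # Apply the size bounds after propagation.
    return {t: g for t, g in gene_sets.items() if min_size <= len(g) <= max_size}

toy_parents = {"GO:child": ["GO:parent"], "GO:parent": ["GO:root"]}
toy_annotations = [
    ("g1", "GO:child", "IDA", "involved_in"),
    ("g2", "GO:child", "IDA", "involved_in"),
    ("g3", "GO:parent", "IEA", "involved_in"),
    ("g4", "GO:child", "ND", "involved_in"),
    ("g5", "GO:child", "IDA", "NOT|involved_in"),
]
no_iea = build_go_sets(toy_annotations, toy_parents, min_size=1, max_size=2)
with_iea = build_go_sets(toy_annotations, toy_parents, include_iea=True, min_size=1, max_size=2)
```

In the toy run, g1 and g2 propagate from the child to both ancestors. Turning IEA on adds g3 to the parent and root, which pushes both past the toy upper bound of 2, so only the child term survives.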

7. Redundancy collapse and the GO hierarchy view

Two display-time tools help make sense of long, overlapping result lists - especially for GO, where a term and its parent are often both significant on the same genes. Neither changes the statistics or the multiple-testing family; they only reorganise terms that were already computed.

Redundancy collapse - groups terms that are significant because of the same query genes and shows one representative per group, with the collapsed terms expandable. Grouping is greedy and significance-first: the most significant term becomes a representative, and any less-significant term whose overlapping query genes are mostly contained in it (at a similarity threshold you choose) joins its group. It compares the overlapping genes, not term labels, so it de-duplicates parent/child redundancy and works on any collection.
GO hierarchy (tree) view - arranges the significant GO terms by their is_a / part_of relationships as a collapsible indented outline, colored by FDR and sized by overlap. A solid edge is a direct GO parent; a dashed edge marks an ancestor with intervening non-significant terms omitted, which you can expand to splice those terms in as faded nodes. It is built from a nearest-shipped-ancestor reduction of the ontology and is available for GO collections only.
Why displayed roots are not the ontology roots - because GO terms are size-bounded to 5-500 genes (Section 6), the broad terms near the top of each namespace are not shipped, so the tree cannot show the true ontology roots. Terms that have no shipped ancestor appear as roots and are flagged as such. Every solid edge that is drawn is a genuine direct GO parent: the reduction never hides an intermediate term between a shipped term and its nearest shipped ancestor.
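The greedy, significance-first grouping used by redundancy collapse can be sketched as follows. The source does not spell out the similarity measure, so this sketch assumes one plausible reading: the fraction of a term's overlapping query genes that also appear in the representative's overlap. The function name, the tuple layout, and the default threshold are hypothetical.

```python
def collapse_redundant(terms, threshold=0.7):
    """terms: list of (term_id, p_value, overlapping_query_genes).

    Returns groups as dicts with a representative, its genes, and collapsed members.
    """
    # Most significant first; ties broken by term id so the result is deterministic.
    ordered = sorted(terms, key=lambda t: (t[1], t[0]))
    groups = []
    for term_id, _p_value, genes in ordered:
        genes = set(genes)
        for group in groups:
            # Assumed similarity: share of this term's overlap contained in the representative's.
            contained = len(genes & group["genes"]) / len(genes) if genes else 0.0
            if contained >= threshold:
                group["members"].append(term_id)
                break
        else:
            # No existing representative covers this term, so it starts a new group.
            groups.append({"rep": term_id, "genes": genes, "members": []})
    return groups

groups = collapse_redundant([
    ("A", 1e-6, {"g1", "g2", "g3", "g4"}),
    ("B", 1e-4, {"g1", "g2", "g3"}),
    ("C", 1e-3, {"g5", "g6"}),
    ("D", 1e-2, {"g1", "g5", "g7"}),
])
```

In the toy input, B is fully contained in A and collapses under it, while C and D share too little with any earlier representative and stay separate. Only the grouping changes; no p-value is touched.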

8. Reproducibility

The analysis is fully reproducible: the same input, settings, and data version always give the same output, and the published data can be regenerated from the pinned sources.

A Python build script downloads the source files at the versions listed above, applies the filtering and propagation described, and writes compact JSON. The build is deterministic - rebuilding from the same cached sources produces byte-identical data files. The web app never computes gene-set data; it only reads those static JSON files and runs the statistics in the browser. Source code and the build script are in the repository: github.com/robinson-vidva/enrichlite.
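Byte-identical output usually comes down to removing every source of ordering and formatting variation before writing. The sketch below shows one common way to do that with the standard library; it is an assumption about the general technique, not the repository's build script, and the function name is hypothetical.

```python
import hashlib
import json

def serialize_gene_sets(gene_sets):
    """Return canonical compact JSON bytes for a {term_id: genes} mapping."""
    # Sort the genes in each set and the keys of the mapping, and fix the
    # separators, so the bytes do not depend on insertion or hash order.
    canonical = {term: sorted(genes) for term, genes in gene_sets.items()}
    text = json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")

first = serialize_gene_sets({"T2": {"B", "A"}, "T1": ["C"]})
second = serialize_gene_sets({"T1": ["C"], "T2": ["A", "B"]})

# Identical bytes give identical digests, which is a cheap way to check a rebuild.
first_digest = hashlib.sha256(first).hexdigest()
second_digest = hashlib.sha256(second).hexdigest()
```

Both calls produce the same bytes even though the inputs differ in key order and container type, so comparing digests of a rebuilt file against the published one would reveal any drift.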

9. Caveats

Small input lists are fragile. With only a handful of genes, an enrichment often rests on a 1-2 gene overlap. Such results are unstable and should be read with caution, not as firm conclusions.
Background choice strongly affects significance. The same gene list can look highly significant or not, depending on the universe. Choose a background that reflects what could have been detected, and report it.
The size filter is display-only. The min/max set-size filter on the results table changes what is shown; it does not re-run the test or change any p-value or FDR.
Redundancy collapse and the tree view are display-only too. They regroup already-computed terms - collapse by shared overlapping genes, the GO tree by ontology relationships - and never change any p-value, FDR, or the multiple-testing family. In the tree, displayed roots are not the ontology roots, because broad size-excluded ancestors are not shown.
ORA ignores ranking. Every query gene is treated equally; fold changes, effect sizes, and ordering are not used. This is over-representation analysis, not gene-set enrichment analysis (GSEA).
Results reflect a snapshot. They depend on the data versions above and on current annotation knowledge, which is incomplete and changes over time.