enrichlite methods

How the analysis works, what the numbers mean, and where the data comes from.

1. What it does

enrichlite runs over-representation analysis (ORA). You paste a list of genes; it finds the gene sets, pathways, and Gene Ontology terms that your list overlaps more than you would expect if the genes had been picked at random. The idea is simple: if your genes cluster into "cell cycle" far more than chance would predict, that is a signal worth noticing.

Everything runs in your browser. The gene-set data is downloaded once as static files and the computation happens locally in a Web Worker. Your gene list is never uploaded; no data leaves your machine.

2. The statistic

For each gene set, the question is: of my query genes, how many landed in this set, and is that more than chance? enrichlite answers with a one-tailed hypergeometric test - the upper tail, P(X >= k) - the probability of seeing at least as large an overlap as observed, drawing without replacement from the background.

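The contingency is defined by four numbers:

N - genes in the background (universe)
K - genes in the gene set that are also in the background
n - query genes that are in the background
k - query genes that fall in the gene set (the observed overlap)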

This is exactly the one-sided Fisher's exact test on the corresponding 2x2 table. The effect size reported alongside the p-value is the fold enrichment:

fold enrichment = (k / n) / (K / N)

A fold of 1 means the overlap is what you would expect by chance; larger means enriched, smaller means depleted.

The tail probability is computed in log space. A Lanczos approximation of the log-gamma function keeps the binomial coefficients stable at universe sizes around 20,000, and the sum is accumulated via the probability-mass recurrence to avoid overflow.
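A minimal Python sketch of the upper-tail computation described above. It is an illustration under stated assumptions, not the code enrichlite ships: the function names are hypothetical, Python's built-in math.lgamma stands in for the Lanczos log-gamma, and the symbols are read as k = overlap, n = query genes in the background, K = set genes in the background, N = background size.

```python
import math


def log_choose(a: int, b: int) -> float:
    """Natural log of the binomial coefficient C(a, b), via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N, K of which are in the set."""
    lowest = max(0, n + K - N)   # smallest overlap that is possible at all
    highest = min(n, K)          # largest overlap that is possible at all
    if k <= lowest:
        return 1.0
    if k > highest:
        return 0.0
    # First term P(X = k) is evaluated in log space so the large
    # binomial coefficients never have to be formed directly.
    log_p = log_choose(K, k) + log_choose(N - K, n - k) - log_choose(N, n)
    term = math.exp(log_p)       # may underflow to 0.0 for extremely small p
    total = term
    # Remaining terms follow from the probability-mass recurrence:
    # P(i + 1) / P(i) = (K - i)(n - i) / ((i + 1)(N - K - n + i + 1))
    for i in range(k, highest):
        term *= (K - i) * (n - i) / ((i + 1) * (N - K - n + i + 1))
        total += term
    return min(total, 1.0)


def fold_enrichment(k: int, n: int, K: int, N: int) -> float:
    """Observed fraction of the query in the set over the background fraction."""
    return (k / n) / (K / N)
```

For example, hypergeom_upper_tail(5, 5, 5, 10) is the chance that all five drawn genes come from a five-gene set in a ten-gene universe, which is 1 / C(10, 5) = 1/252.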

3. Background / universe

The background is the pool of genes the test assumes your list was drawn from. It is the single most consequential choice you make, because it sets what "by chance" means.

Three options:

Changing the background changes N (and which genes count toward K and n), so it changes every p-value. With the other counts held fixed, a larger universe lowers the expected overlap and so generally makes overlaps look more significant; a smaller, more specific universe makes them look less so. The app always shows the N it used. Per-collection annotated-universe sizes:
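To make the dependence on the background concrete, here is a small Python sketch of how the four counts could be derived once a background is chosen. The function name, the gene labels, and the rule of simply dropping genes outside the background are assumptions for illustration, not a description of enrichlite's internals.

```python
def contingency(query: set, gene_set: set, background: set) -> tuple:
    """Return (k, n, K, N) after restricting both lists to the background."""
    q = query & background      # query genes that count toward n
    s = gene_set & background   # set genes that count toward K
    return len(q & s), len(q), len(s), len(background)


# Hypothetical gene labels g0..g99; the gene set is g0..g9.
genes = [f"g{i}" for i in range(100)]
gene_set = set(genes[:10])
query = set(genes[:5]) | set(genes[50:55])

wide = contingency(query, gene_set, set(genes))         # all 100 genes as background
narrow = contingency(query, gene_set, set(genes[:20]))  # only g0..g19 as background
```

The same query and gene set give (k, n, K, N) = (5, 10, 10, 100) against the wide background and (5, 5, 10, 20) against the narrow one, so every downstream statistic differs between the two runs.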

Collection   Human N   Mouse N
Reactome      11,963     8,851
Hallmark       4,384     4,291
GO-BP         17,318    17,371
GO-MF         15,274    14,668
GO-CC         10,789    11,327

4. Multiple testing

A collection contains hundreds or thousands of gene sets, and testing all of them inflates the chance of false positives. enrichlite corrects for this and reports adjusted values.

The correction family is the set of terms tested in the current run. Because only one collection is analysed at a time, the family is that collection's terms - not all collections combined.
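The text here does not name the correction method, so the following Python sketch is an assumption: it shows the Benjamini-Hochberg false-discovery-rate adjustment, a common choice for ORA, applied to one family of p-values (one collection's terms in a single run). The function name is hypothetical.

```python
def benjamini_hochberg(pvalues: list) -> list:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest p to largest
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    # so an adjusted value never exceeds that of a larger raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Because m is the number of terms tested in the run, the same raw p-value receives a different adjusted value in a collection with hundreds of terms than in one with thousands.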

5. Data sources, versions, and licenses

All gene-set data comes from authoritative public sources, pinned to specific versions. None of it is hand-curated or invented here.

Human and mouse are kept as entirely separate universes with separate symbol tables. Human uses HGNC symbols (all upper-case, e.g. BRAF); mouse uses MGI symbols (title-case, e.g. Braf). There is no on-the-fly ortholog conversion: a human gene list is tested only against human sets, and likewise for mouse. Within a species, matching is case-insensitive, so pasting braf still matches.
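One way to get case-insensitive matching within a species while keeping the two symbol tables separate is sketched below in Python. The function names and the tiny symbol lists are hypothetical, and the sketch ignores the possibility of two symbols that differ only by case.

```python
def build_symbol_index(symbols: list) -> dict:
    """Map an upper-cased key to the canonical symbol for one species."""
    return {symbol.upper(): symbol for symbol in symbols}


def match_genes(pasted: list, index: dict) -> tuple:
    """Split pasted tokens into (canonical matches, unmatched tokens)."""
    matched, unmatched = [], []
    for raw in pasted:
        token = raw.strip()
        if not token:
            continue  # skip blank lines in the pasted list
        canonical = index.get(token.upper())
        if canonical is None:
            unmatched.append(token)
        else:
            matched.append(canonical)
    return matched, unmatched


# Separate tables per species; there is no cross-species lookup.
human_index = build_symbol_index(["BRAF", "TP53"])
mouse_index = build_symbol_index(["Braf", "Trp53"])
```

With these tables, match_genes(["braf"], human_index) returns the canonical BRAF and match_genes(["BRAF"], mouse_index) returns Braf, while a human-only symbol pasted into a mouse run ends up in the unmatched list.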

6. Gene Ontology specifics

GO terms are built from the ontology graph plus per-species annotation files, with the standard filtering and propagation that ORA expects.
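The propagation step can be illustrated with a short Python sketch: every gene annotated to a term is also counted for all of that term's ancestors in the ontology graph. The function names and the toy term labels are hypothetical, and the filtering rules applied by the real build are not reproduced here.

```python
from collections import defaultdict


def ancestors(term: str, parents: dict, cache: dict) -> set:
    """All terms reachable from `term` by following parent links."""
    if term in cache:
        return cache[term]
    found = set()
    for parent in parents.get(term, ()):
        found.add(parent)
        found |= ancestors(parent, parents, cache)
    cache[term] = found
    return found


def propagate(direct: dict, parents: dict) -> dict:
    """Copy each term's directly annotated genes to every ancestor term."""
    cache = {}
    propagated = defaultdict(set)
    for term, genes in direct.items():
        propagated[term] |= genes
        for ancestor in ancestors(term, parents, cache):
            propagated[ancestor] |= genes
    return dict(propagated)


# Toy graph: C is a child of B, and B and D are children of A.
toy_parents = {"C": ["B"], "B": ["A"], "D": ["A"]}
toy_direct = {"C": {"g1"}, "B": {"g2"}, "D": {"g3"}}
toy_sets = propagate(toy_direct, toy_parents)
```

In the toy graph the root term A ends up with all three genes even though none was annotated to it directly, which is why parent and child terms so often share their overlap genes.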

7. Redundancy collapse and the GO hierarchy view

Two display-time tools help make sense of long, overlapping result lists - especially for GO, where a term and its parent are often both significant on the same genes. Neither changes the statistics or the multiple-testing family; they only reorganise terms that were already computed.
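The collapse rule itself is not specified in this text, so the Python sketch below is only one plausible approach, assumed for illustration: greedily group terms whose overlap genes are highly similar (Jaccard index at or above a hypothetical threshold of 0.7), keeping the most significant term as the group lead. It reads already-computed results and changes no p-values.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def collapse(results: list, threshold: float = 0.7) -> list:
    """Group (term, p_value, overlap_genes) rows; each group lists its lead term first."""
    groups = []  # each entry: (lead term's overlap genes, [member terms])
    for term, _p, genes in sorted(results, key=lambda row: row[1]):
        for lead_genes, members in groups:
            if jaccard(genes, lead_genes) >= threshold:
                members.append(term)  # redundant with an existing lead
                break
        else:
            groups.append((genes, [term]))  # starts a new group as its lead
    return [members for _lead_genes, members in groups]


rows = [
    ("parent", 1e-5, {"a", "b", "c", "d"}),
    ("child", 1e-6, {"a", "b", "c"}),
    ("other", 1e-3, {"x", "y"}),
]
grouped = collapse(rows)
```

Here the child and parent terms share three of four genes (Jaccard 0.75), so they fall into one group led by the more significant child, and the unrelated term stays on its own.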

8. Reproducibility

The analysis is fully reproducible: the same input always gives the same output, and the published data can be regenerated from the pinned sources.

A Python build script downloads the source files at the versions listed above, applies the filtering and propagation described, and writes compact JSON. The build is deterministic - rebuilding from the same cached sources produces byte-identical data files. The web app never computes gene-set data; it only reads those static JSON files and runs the statistics in the browser. Source code and the build script are in the repository: github.com/robinson-vidva/enrichlite.
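Byte-identical output usually comes down to removing every source of ordering and formatting variation when writing the JSON. The Python sketch below shows one way to do that; it is an assumption about technique, not an excerpt from the repository's build script, and the function name and sample data are hypothetical.

```python
import hashlib
import json


def serialize(gene_sets: dict) -> bytes:
    """Write gene sets as compact JSON with a fixed key and member order."""
    payload = {name: sorted(genes) for name, genes in gene_sets.items()}
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


# The same content built in a different order yields the same bytes.
first = serialize({"b": {"y", "x"}, "a": {"z"}})
second = serialize({"a": {"z"}, "b": {"x", "y"}})
same_digest = hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest()
```

Comparing a checksum of the rebuilt file against the published one is a quick way to confirm that a rebuild reproduced the data exactly.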

9. Caveats