From Reads to Expression
RNA-seq measures gene expression by sequencing the mRNA molecules in a biological sample. After alignment to a reference genome, the number of reads mapping to each gene serves as a proxy for its expression level. However, raw counts must be normalized for sequencing depth and gene length before comparison across samples, using methods like TPM (transcripts per million) or the size factors employed by DESeq2.
The Negative Binomial Model
Gene expression counts follow a negative binomial distribution, which accounts for both the Poisson sampling noise of sequencing and the biological variability between replicates. The key parameter is the dispersion φ, which quantifies how much the variance exceeds the mean. DESeq2 estimates dispersion for each gene using empirical Bayes shrinkage, borrowing information across genes to stabilize estimates when sample sizes are small.
Statistical Testing and Multiple Correction
For each gene, a statistical test compares expression between conditions, producing a p-value. With 20,000+ genes tested simultaneously, multiple testing correction is essential — at p < 0.05, we would expect 1,000 false positives by chance alone. The Benjamini-Hochberg procedure controls the false discovery rate (FDR), ensuring that among all genes called significant, at most a specified fraction (typically 5%) are false positives.
Experimental Design Matters
The most impactful design choice in RNA-seq is the number of biological replicates. Adding replicates improves power far more than increasing sequencing depth beyond ~20 million reads. The volcano plot — plotting log2 fold change versus -log10 p-value — reveals the tradeoff: large fold changes are easy to detect even with few replicates, while small but biologically important changes require substantial replication to distinguish from noise.