NotFromMe:Simulating genes and counts for DESeq2 analysis

February 02, 2017

Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.

# Get all human gene symbols from biomaRt

library("biomaRt")

mart <- useMart(biomart="ensembl", dataset = "hsapiens_gene_ensembl")

my_results <- getBM(attributes = c("hgnc_symbol"), mart=mart)

head(my_results)

# Simulate 100 gene names to be used for our cnts matrix

set.seed(32268)

my_genes <- with(my_results, sample(hgnc_symbol, size=100, replace=FALSE))

head(my_genes)

# Simulate a cnts matrix

cnts = matrix(runif(600, min=0, max=1200), ncol=6)

cnts = apply(cnts, c(1,2), as.integer)

head(cnts)

dim(cnts)

Now, say we run DESeq2 to look for differentially expressed genes between our two simulated groups.

# Running DESEQ2 based on https://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf

library("DESeq2")

grp.idx <- rep(c("KO", "WT"), each=3)

coldat=DataFrame(grp=factor(grp.idx, levels=c("WT", "KO")))

# Add the column names and gene names

colnames(cnts) <- paste(grp.idx, 1:6, sep="_")

rownames(cnts) <- my_genes

head(cnts)

# Run DESeq2 analysis on the simulated counts

dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)

dds <- DESeq(dds)

deseq2.res <- results(dds)

deseq2.fc=deseq2.res$log2FoldChange

names(deseq2.fc)=rownames(deseq2.res)

exp.fc=deseq2.fc

head(exp.fc)

#  SDAD1 SVOPL SRGAP2C MTND1P2 CNN2P8 IL13

# -0.48840808 0.32122109 -0.55584857 0.00184246 -0.15371042 0.11555792

Now let’s see how many simulated genes had a log2 fold change greater than 1 by chance.

# Load the fold changes from DESeq2 analysis and order in decreasing order

geneList = sort(exp.fc, decreasing = TRUE) # log FC is shown

head(geneList)

gene <- geneList[abs(geneList) >= 1]

head(gene)

# C1orf216

#-1.129836

Now it’s your turn! What other probability distributions could we simulate data from to perform a mock RNA seq experiment to determine how many genes could be different by chance? You can even use a bootstrap approach to calculate the p-value after running 1000 permutations of the code. Of course, to circumvent these problems we use adjusted p values but it is always nice to go back to basics and stress the importance of applying statistical methods when looking at differentially expressed genes. I encourage you all to leave your answers in the comment section below to inspire others.

Happy R programming!

References:

https://rjbioinformatics.com/2017/01/31/simulating-genes-and-counts-for-deseq2-analysis/

https://rjbioinformatics.com/

Search This Blog

Omics Academy

NotFromMe:Simulating genes and counts for DESeq2 analysis

Comments

Post a Comment

Popular posts from this blog

gspread error:gspread.exceptions.SpreadsheetNotFound

Miniconda installation problem: concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

P and q values in RNA Seq