NotFromMe:Simulating genes and counts for DESeq2 analysis



Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.

Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Get all human gene symbols from biomaRt
library("biomaRt")
mart <- useMart(biomart="ensembl", dataset = "hsapiens_gene_ensembl")
my_results <- getBM(attributes = c("hgnc_symbol"), mart=mart)
head(my_results)
# Simulate 100 gene names to be used for our cnts matrix
set.seed(32268)
my_genes <- with(my_results, sample(hgnc_symbol, size=100, replace=FALSE))
head(my_genes)
# Simulate a cnts matrix
cnts = matrix(runif(600, min=0, max=1200), ncol=6)
cnts = apply(cnts, c(1,2), as.integer)
head(cnts)
dim(cnts)

Now, say we run DESeq2 to look for differentially expressed genes between our two simulated groups.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
library("DESeq2")
grp.idx <- rep(c("KO", "WT"), each=3)
coldat=DataFrame(grp=factor(grp.idx, levels=c("WT", "KO")))
# Add the column names and gene names
colnames(cnts) <- paste(grp.idx, 1:6, sep="_")
rownames(cnts) <- my_genes
head(cnts)
# Run DESeq2 analysis on the simulated counts
dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)
dds <- DESeq(dds)
deseq2.res <- results(dds)
deseq2.fc=deseq2.res$log2FoldChange
names(deseq2.fc)=rownames(deseq2.res)
exp.fc=deseq2.fc
head(exp.fc)
#  SDAD1 SVOPL SRGAP2C MTND1P2 CNN2P8 IL13
# -0.48840808 0.32122109 -0.55584857 0.00184246 -0.15371042 0.11555792
Now let’s see how many simulated genes had a log2 fold change greater than 1 by chance.
1
2
3
4
5
6
7
8
9
# Load the fold changes from DESeq2 analysis and order in decreasing order
geneList = sort(exp.fc, decreasing = TRUE) # log FC is shown
head(geneList)
gene <- geneList[abs(geneList) >= 1]
head(gene)
# C1orf216
#-1.129836
Now it’s your turn!  What other probability distributions could we simulate data from to perform a mock RNA seq experiment to determine how many genes could be different by chance? You can even use a bootstrap approach to calculate the p-value after running 1000 permutations of the code. Of course, to circumvent these problems we use adjusted p values but it is always nice to go back to basics and stress the importance of applying statistical methods when looking at differentially expressed genes. I encourage you all to leave your answers in the comment section below to inspire others.
Happy R programming!

References:
https://rjbioinformatics.com/2017/01/31/simulating-genes-and-counts-for-deseq2-analysis/
https://rjbioinformatics.com/


Comments

Popular posts from this blog

gspread error:gspread.exceptions.SpreadsheetNotFound

Miniconda installation problem: concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

转载:彻底搞清楚promoter, exon, intron, and UTR