NotFromMe:Simulating genes and counts for DESeq2 analysis
Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.
Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| # Get all human gene symbols from biomaRt library ( "biomaRt" ) mart <- useMart (biomart= "ensembl" , dataset = "hsapiens_gene_ensembl" ) my_results <- getBM (attributes = c ( "hgnc_symbol" ), mart=mart) head (my_results) # Simulate 100 gene names to be used for our cnts matrix set.seed (32268) my_genes <- with (my_results, sample (hgnc_symbol, size=100, replace= FALSE )) head (my_genes) # Simulate a cnts matrix cnts = matrix ( runif (600, min=0, max=1200), ncol=6) cnts = apply (cnts, c (1,2), as.integer) head (cnts) dim (cnts) |
Now, say we run DESeq2 to look for differentially expressed genes between our two simulated groups.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
| # Running DESEQ2 based on https://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf library ( "DESeq2" ) grp.idx <- rep ( c ( "KO" , "WT" ), each=3) coldat= DataFrame (grp= factor (grp.idx, levels= c ( "WT" , "KO" ))) # Add the column names and gene names colnames (cnts) <- paste (grp.idx, 1:6, sep= "_" ) rownames (cnts) <- my_genes head (cnts) # Run DESeq2 analysis on the simulated counts dds <- DESeqDataSetFromMatrix (cnts, colData=coldat, design = ~ grp) dds <- DESeq (dds) deseq2.res <- results (dds) deseq2.fc=deseq2.res$log2FoldChange names (deseq2.fc)= rownames (deseq2.res) exp.fc=deseq2.fc head (exp.fc) # SDAD1 SVOPL SRGAP2C MTND1P2 CNN2P8 IL13 # -0.48840808 0.32122109 -0.55584857 0.00184246 -0.15371042 0.11555792 |
Now let’s see how many simulated genes had a log2 fold change greater than 1 by chance.
1
2
3
4
5
6
7
8
9
| # Load the fold changes from DESeq2 analysis and order in decreasing order geneList = sort (exp.fc, decreasing = TRUE ) # log FC is shown head (geneList) gene <- geneList[ abs (geneList) >= 1] head (gene) # C1orf216 #-1.129836 |
Now it’s your turn! What other probability distributions could we simulate data from to perform a mock RNA seq experiment to determine how many genes could be different by chance? You can even use a bootstrap approach to calculate the p-value after running 1000 permutations of the code. Of course, to circumvent these problems we use adjusted p values but it is always nice to go back to basics and stress the importance of applying statistical methods when looking at differentially expressed genes. I encourage you all to leave your answers in the comment section below to inspire others.
Happy R programming!
References:
https://rjbioinformatics.com/2017/01/31/simulating-genes-and-counts-for-deseq2-analysis/
https://rjbioinformatics.com/
Comments
Post a Comment