Posts

Showing posts from September, 2016

NotFromMe: stringsAsFactors: An unauthorized biography

This was not written by me; I have just copied and pasted it here. The original link is: http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like ‘read.table()’ and ‘read.csv()’ in R convert columns that are detected to be character/strings to be factor variables. This led to the spontaneous outcry from one colleague of "Why does stringsAsFactors not default to FALSE????" The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in ‘read.table()’ and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, ‘strin...

R tip: read.table or read.csv a table with quotes

Error message:

    Warning message:
    In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
      EOF within quoted string

Answer: You need to disable quoting.

    cit <- read.csv("citations.CSV", quote = "", row.names = NULL, stringsAsFactors = FALSE)
    str(cit)
    ## 'data.frame': 112543 obs. of 13 variables:
    ## $ row.names : chr "10.2307/675394" "10.2307/30007362" "10.2307/4254931" "10.2307/20537934" ...
    ## $ id : chr "10.2307/675394\t" "10.2307/30007362\t" "10.2307/4254931\t" "10.2307/20537934\t" ...
    ## $ doi : chr "Archaeological Inference and Inductive Confirmation\t" "Sound and Sense in Cath Almaine\t" "Oak Galls Preserved by the Eruption of Mount Vesuvius in A.D. 79_ and Their Probable Use\t" "The Arts Four Thousand Years Ago\t" ....

R: DESeq2 analysis: outliers and refitting

When I was running DESeq2, I got the messages shown below:

    converting counts to integer mode
    estimating size factors
    estimating dispersions
    gene-wise dispersion estimates
    mean-dispersion relationship
    final dispersion estimates
    fitting model and testing
    -- replacing outliers and refitting for 47 genes
    -- DESeq argument 'minReplicatesForReplace' = 7
    -- original counts are preserved in counts(dds)
    estimating dispersions
    fitting model and testing

I hadn't encountered this before. Here are the reasons:

Answers: The count outlier flagging is useful when there are a minority of outliers in the dataset, but as you have noted, something else is going on here with so many genes flagged. There are two reasons for so many genes being flagged as outliers: either the method for flagging outliers is not appropriate for the distribution of counts in your data and should be turned off (by setting minReplicatesForReplace=Inf and cooksCutoff=FALSE), or you have a sample...

PerlHowTo: How to sort perl hash on values and order the keys correspondingly (in two arrays maybe)?

First sort the keys by the associated value. Then get the values (e.g. by using a hash slice).

    my @keys = sort { $h{$a} <=> $h{$b} } keys(%h);
    my @vals = @h{@keys};

Or, if you have a hash reference:

    my @keys = sort { $h->{$a} <=> $h->{$b} } keys(%$h);
    my @vals = @{$h}{@keys};

References: http://stackoverflow.com/questions/10901084/how-to-sort-perl-hash-on-values-and-order-the-keys-correspondingly-in-two-array

Bioinformatics Tips: Contrast file for cuffdiff

Cuffdiff, by default, compares each pair of conditions in your experiment. If you have many conditions, this can create a lot of additional work for the program. These extra conditions can cause Cuffdiff's output files to be very large, which can slow down CummeRbund and other downstream analysis software. Often, you are not interested in all pairwise contrasts. Rather, you'd like to compare all conditions to a common control, or only look at matched pairs of samples. You can specify the contrasts Cuffdiff should perform using a contrast file. Contrast files are simple, tab-delimited text files. They should have a single header line as the first line in the file, followed by one line for each contrast you'd like to perform. The files should have two columns, as specified below:

    Column number   Column name   Example   Description
    1               condition_A   Ctrl      A condition label. Must match one of the labels specified through -L or in the sample sheet.
    2               condition_B   Ctrl      A condition...
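
The layout described above (one header line, then one tab-delimited line per contrast) can be sketched with a short Python script. This is only an illustration under stated assumptions: the function name write_contrasts and the labels Ctrl, Mutant1, and Mutant2 are hypothetical, and putting the control in the condition_A column is an arbitrary choice, not something the excerpt specifies. In real use, the labels would have to match those given through -L or in the sample sheet.

```python
import csv
import io


def write_contrasts(handle, control, conditions):
    """Write one contrast per non-control condition, each paired with `control`."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Single header line naming the two columns.
    writer.writerow(["condition_A", "condition_B"])
    for cond in conditions:
        if cond != control:
            # Assumption: control goes in column 1, the other condition in column 2.
            writer.writerow([control, cond])


# Hypothetical condition labels, used only for this example.
buf = io.StringIO()
write_contrasts(buf, "Ctrl", ["Ctrl", "Mutant1", "Mutant2"])
contrast_text = buf.getvalue()
print(contrast_text, end="")
```

Replacing the in-memory buffer with a file opened for writing would put the same lines on disk; comparing every condition to one control this way yields one contrast per non-control condition instead of every pairwise combination.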
