Posts

Showing posts from March, 2016

Linux Tip: SORT of linux

If you want to sort a tab limited file in  ascending  order according to one column (like 5th column), you can use: sort -k5,5n input However, if you want to sort the file in descending order ccording to one column (like 5th column), you cann't use: sort -r -k5,5n input  You can use: sort -r n-k5,5 input  -n, --numeric-sort compare according to string numerical value -r, --reverse reverse the result of comparisons

NGS tips: FPKM for short genes in cufflinks

Don't trust Cufflinks FPKM for short genes Here is what Cole (author of Cufflinks) commented to the observation of " very high RPKM values from Cufflink ": This issue has been discussed elsewhere on this board. As Nicholas points out, RNA-Seq really isn't reliable for very short transcripts. The reason is that all the fragments that map to these transcripts come from the "tail" of the distribution of library fragment lengths. That is, fragments that map to microRNAs are much, much shorter than most fragments in the library - by design in the RNA-Seq protocol, which size selects away very short inserts. Thus,  Cufflinks infers that even though relatively few fragments actually mapped to the microRNAs, there were probably TONS of individual microRNA molecules in the transcriptome before all of the various size selection parts of the protocol kicked in. Cufflinks accordingly increases the FPKM of these short transcripts to compensate for the bias agai

Biology: Full list of Arabidopsis database

Image
AtGenExpress AtGenExpress is a multinational effort designed to uncover the transcriptome of the multicellular model organism Arabidopsis thaliana. The project is coordinated by Detlef Weigel, Thomas Altmann and Lutz Nover with funding from the German Arabidopsis Functional Genomics Network (AFGN), and includes contributions from Germany, supported by DFG, as well as substantial contributions by RIKEN (Japan), NSF (USA; via funding of TAIR and the 2010 program), BBSRC (UK; via funding of the GARNET initiative), and the Max-Planck-Society.  PLncDB: Plant  Long noncoding RNA Database The authors have identified a large number of Arabidopsis long noncoding RNAs by analysis of 200 tiling array datasets and RNA-seq data derived from 4 libraries. More than 13,000 RNAs were found transcribed from intergenic regions of the  Arabidopsis thaliana  genome. These intergenic transcription units (TUs) could be further classified into following groups: Long noncoding RNAs are ex

NGS tips: Useful information for duplicate reads

When I want to remove duplicate reads from bam files, I'm always confused about the differences between samtools and Picard in terms of duplicate reads removal.  I think here is the answer: Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base  have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15.  samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates. By lh3: I

NGS tips: Picard MarkDuplicates - whole bam file marked as duplicates?

Image
If you used Picard MarkDuplicates to mark duplicate reads in your bam files, you may notice that each line has a tag "PG:Z:MarkDuplicates".  You would be confused. Each read has this mark. Does this mean that every read is duplicate?  The short answer is NO.  In SAM/BAM file, whether a read is duplicate is determined by FLAG (the 2nd column).  You can decode SAM flags under the link: https://broadinstitute.github.io/picard/explain-flags.html For example, I have a read which has a tag "PG:Z:MarkDuplicates". You can type 83 in the box and press "Expain" button. Then you can see the summary shown in the left.  Each read has 12 properties. The 11th is to determine whether this is duplicated or not. As you can see, for 83 the 11th property is not marked meaning that this one is not a duplicate one.  You can also use REMOVE_DUPLICATES = true to remove duplicate reads in the bam files.