NGS tips: Useful information for duplicate reads
When I want to remove duplicate reads from bam files, I'm always confused about the differences between samtools and Picard in terms of duplicate reads removal.
I think here is the answer:
Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15.
samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates.
By lh3:
Pierre Lindenbaum:
I looked at the sources.
I think here is the answer:
Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15.
samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates.
By lh3:
I believe Picard is doing the same for SE and this is the right thing to do. PCR duplicates can have different sequences. Optical duplicates have identical sequences.
|
Pierre Lindenbaum:
I looked at the sources.
SamTools rmdup 'only' compare two reads on chrom and pos (which could be wrong if two reads come from two different libraries) and **removes** reads from the BAM: information is lost.
picard set the sam flag 1024 but do not delete the reads. two pairs of reads are compared , as far as I know, using the chrom, the pos, the group-id (sample...) + (flowcell , lane, X,Y for optical dups) (,and the cigar string ?).
lh3:
In Illumina sequencing, there are typically two types of duplicates: PCR duplicates and optical duplicates. Optical duplicates are sequences from one cluster in fact but identified by software from multiple adjacent clusters. They can be identified without alignment. You just check sequence and the coordinates on the image.
PCR duplicates are usually identified after alignment. Although there are alignment-free approaches, they are less powerful when alignment is available. Dedup typically works by identifying read pairs having identical 5'-end coordinates (3'-end coordinates are not considered). The chance of two 5'-end coordinates being identical depends on the coverage, but usually small. You can calculate the theoretical false dedup rate using my formula (0.28*m/s/L, where m is the number of read pairs, s is the standard deviation of insert size distribution and L is the length of the genome; sharp insert size distribution (small s) increases false dedup rate unfortunately). If the empirical duplicate rate is much higher than this 0.28*m/s/L, they are true PCR duplicates.
Dedup for single-end reads has high false positive rate given deep data. This is fine for SNP calling because the 40X coverage is enough for SNP calling. However, it is dangerous to dedup SE reads for RNA-seq or ChIP-seq where read count matters. As mrawlins has suggested, it would be better to account for duplicates in your read counting model rather than run a dedup program.
The rate of PCR duplicates is 0.5*m/N (m is the number of sequenced reads and N is the number of DNA molecules before amplification). It is not affected by the PCR cycles, at least not greatly. The key to reducing PCR duplicates is to get enough DNA (get large N). Also, the dup rate is proportional to M. The more you sequence, the higher dup rate you get.
From: http://seqanswers.com/forums/showthread.php?t=6854PCR duplicates are usually identified after alignment. Although there are alignment-free approaches, they are less powerful when alignment is available. Dedup typically works by identifying read pairs having identical 5'-end coordinates (3'-end coordinates are not considered). The chance of two 5'-end coordinates being identical depends on the coverage, but usually small. You can calculate the theoretical false dedup rate using my formula (0.28*m/s/L, where m is the number of read pairs, s is the standard deviation of insert size distribution and L is the length of the genome; sharp insert size distribution (small s) increases false dedup rate unfortunately). If the empirical duplicate rate is much higher than this 0.28*m/s/L, they are true PCR duplicates.
Dedup for single-end reads has high false positive rate given deep data. This is fine for SNP calling because the 40X coverage is enough for SNP calling. However, it is dangerous to dedup SE reads for RNA-seq or ChIP-seq where read count matters. As mrawlins has suggested, it would be better to account for duplicates in your read counting model rather than run a dedup program.
The rate of PCR duplicates is 0.5*m/N (m is the number of sequenced reads and N is the number of DNA molecules before amplification). It is not affected by the PCR cycles, at least not greatly. The key to reducing PCR duplicates is to get enough DNA (get large N). Also, the dup rate is proportional to M. The more you sequence, the higher dup rate you get.
Alec:
First of all, in Picard MarkDuplicates, unless you set REMOVE_DUPLICATES=true, it will not remove anything, but rather will set the 0x0400 flag in the alignment record to indicate that the record is a duplicate.
Are you saying that Picard will mark (or remove) one end of a mate pair but not the other end? That is not the way it should work. It should either mark both ends of the mate pair as being duplicates, or not mark either end. If you could provide an example of it doing otherwise, I'll try to figure out what is going on.
Heng Li:
Samtools rmdup always removes a pair but it does not work two ends mapped to different chromosomes. Nonetheless it is much faster than MarkDuplicates due to this simplification. Samtools works for paired reads only, not for unpaired reads.
Comments
Post a Comment