Bioinfor tips: How do I get transcript sequences of different plant accessions with RNA-seq data?
Alternative questions:
How can I extract transcript sequences based on a GTF file?
How do I get transcript sequences of different plant accessions with RNA-seq data?
Dear everyone:
I am working on RNA-seq data from several new Arabidopsis thaliana accessions whose genome sequences are not available.
I aligned the RNA-seq reads to the Col-0 reference genome, generated a GTF gene structure annotation file with Cufflinks, and called variants from the RNA-seq data with GATK, which gave me VCF files.
I want to get the transcript sequences of these new accessions. How can I do this using the GTF annotation file and the Col-0 genome sequence, while also taking into account the variation between the new accessions and Col-0 (the VCF files)?
I agree with Upendra Kumar Devisetty. Detailed information about gffread is available here: http://cole-trapnell-lab.github.io/cufflinks/file_formats/#extracting-transcript-sequences
The gffread executable is usually located in the same directory as cufflinks.
Extracting transcript sequences
The gffread utility can be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation, a FASTA file with the genomic sequences has to be provided as well. For example, one might want to extract the sequences of all transfrags assembled in a Cufflinks assembly session. This can be accomplished with a command line like this:
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
The file genome.fa in this example would be a multi-FASTA file with the genomic sequences of the target genome. This also requires that every contig or chromosome name found in the first column of the input GFF file (transcripts.gtf in this example) has a corresponding sequence entry in genome.fa. This should be the case in our example if genome.fa is the file corresponding to the same genome (index) that was used for mapping the reads with TopHat. Note that retrieving the transcript sequences this way will be quicker if a FASTA index file (genome.fa.fai in this example) is found in the same directory as the genomic FASTA file. Such an index file can be created with samtools prior to running gffread, like this:
samtools faidx genome.fa
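The commands above return Col-0 sequence, so they do not yet use the VCF files from the question. One possible way to bring the variants in, offered as a rough sketch rather than a tested pipeline, is to substitute an accession's SNPs into the Col-0 genome with bcftools consensus and then run gffread on that modified genome. The sketch keeps SNPs only, on the assumption that indels would shift coordinates away from the ones in the GTF. The function names missing_contigs and accession_transcripts, the output file names, and the assumption of one single-sample VCF per accession are all made up for illustration. missing_contigs is a plain-shell check of the contig-name requirement described above.

```shell
# Hypothetical helper: print contig names used in a GTF/GFF file that have no
# matching sequence name in a FASTA file. No output means all are present.
missing_contigs() {
    gtf="$1"; fasta="$2"
    gtf_names=$(mktemp); fasta_names=$(mktemp)
    # Column 1 of every non-comment, non-empty annotation line is the contig name.
    awk -F'\t' '!/^#/ && NF > 0 { print $1 }' "$gtf" | LC_ALL=C sort -u > "$gtf_names"
    # A FASTA sequence name is the header text after '>' up to the first whitespace.
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | LC_ALL=C sort -u > "$fasta_names"
    # Keep the names that occur only in the annotation.
    LC_ALL=C comm -23 "$gtf_names" "$fasta_names"
    rm -f "$gtf_names" "$fasta_names"
}

# Hypothetical helper: build a SNP-substituted copy of the reference for one
# accession and extract transcript sequences from it.
# Assumes bcftools, samtools and gffread are on PATH, and that the VCF holds
# a single sample (all ALT alleles in the file are applied).
accession_transcripts() {
    if [ "$#" -ne 4 ]; then
        echo "usage: accession_transcripts genome.fa calls.vcf transcripts.gtf out_prefix" >&2
        return 2
    fi
    genome="$1"; vcf="$2"; gtf="$3"; prefix="$4"
    # Keep SNPs only, so the reference coordinates used by the GTF stay valid.
    bcftools view -v snps -O z -o "$prefix.snps.vcf.gz" "$vcf" || return 1
    # bcftools consensus needs a compressed, indexed VCF.
    bcftools index "$prefix.snps.vcf.gz" || return 1
    # Write the reference with the ALT alleles substituted in.
    bcftools consensus -f "$genome" "$prefix.snps.vcf.gz" > "$prefix.genome.fa" || return 1
    samtools faidx "$prefix.genome.fa" || return 1
    # Same gffread call as above, but against the accession-specific genome.
    gffread -w "$prefix.transcripts.fa" -g "$prefix.genome.fa" "$gtf"
}
```

With these definitions, running missing_contigs transcripts.gtf genome.fa first and then accession_transcripts genome.fa calls.vcf transcripts.gtf acc1 should leave the transcripts in acc1.transcripts.fa. Positions without RNA-seq coverage have no variant calls and therefore keep the Col-0 bases, so the result is an approximation of the accession's transcripts.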