Posts

Showing posts from April, 2019

(By Cyriac Kandoth) Tutorial: Working with MAF files (Mutation Annotation Format) from the TCGA (The Cancer Genome Atlas)

Update (2/8/2017): This tutorial applies to TCGA MAFs in the GDC Legacy Archive. Most of this tutorial is still valid, but I'll need to update some notes and broken links.

Purpose

For folks familiar with the VCF format, TCGA's MAF files can be quite a pain to work with. You might just download the latest MAFs, pull loci and alleles for each variant, and redo annotations with ANNOVAR, snpEff, or Ensembl's VEP. Problem solved, right? Nope. You don't know the half of it! There are lots of caveats you should know about, and I try to document them below. Most of these caveats are handled with safe solutions in the MAFs at this page, and the specificity of variant calls is made more comparable across MAFs at this page.

How TCGA MAFs are made

Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover...
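The naive first step described in the excerpt above (pulling loci and alleles for each variant from a MAF) might look roughly like the sketch below. It assumes a tab-separated MAF with the standard column names from the public MAF specification; the function name `read_maf_loci` and the sample rows are made up for illustration, and the sketch deliberately ignores the caveats the tutorial goes on to document.

```python
import csv
import io


def read_maf_loci(handle):
    """Yield (chrom, start, end, ref, alt, sample) tuples from a MAF file handle.

    Hypothetical helper for illustration; not part of any TCGA tooling.
    """
    # MAFs may start with '#' comment lines (e.g. a version line); skip them.
    lines = (line for line in handle if not line.startswith("#"))
    # Disable quote handling so stray quote characters in annotations are kept as-is.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        ref = row["Reference_Allele"]
        allele1 = row["Tumor_Seq_Allele1"]
        allele2 = row["Tumor_Seq_Allele2"]
        # Assumption: report whichever tumor allele differs from the reference,
        # preferring Tumor_Seq_Allele2.
        alt = allele2 if allele2 != ref else allele1
        yield (
            row["Chromosome"],
            int(row["Start_Position"]),
            int(row["End_Position"]),
            ref,
            alt,
            row["Tumor_Sample_Barcode"],
        )


# Made-up two-variant MAF, used only to exercise the parser.
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\t"
    "Reference_Allele\tTumor_Seq_Allele1\tTumor_Seq_Allele2\tTumor_Sample_Barcode\n"
    "GENE1\t1\t100\t100\tC\tC\tT\tSAMPLE-01\n"
    "GENE2\t2\t200\t201\t-\t-\tA\tSAMPLE-01\n"
)
loci = list(read_maf_loci(example))
for locus in loci:
    print(locus)
```

The tuples carry enough information to hand each variant to an annotator, but nothing here checks reference builds, duplicate calls, or allele conventions.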

Math land

Desmos: https://www.desmos.com
Wolfram|Alpha: https://www.wolframalpha.com/

TCGA-related notes

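Only the raw sequence data (BAM and FASTQ) is in CGHub; everything else is in the TCGA DCC. The DCC has two entry points. The controlled-access one holds all data that contains sequence information, including genotyping, variant calls, and some PCR sequencing. Mutations (tumor vs. normal), CNA, and the like are all open access. CNA data comes mainly from Affymetrix SNP6, with some from other platforms, including low-coverage WGS; all of these are open access. The catch is that TCGA claims "no platform left behind", so although there are many methods, not every disease has every data type.

I usually look for TCGA data in three places: the DCC, TCGA Assembler (to pull the data), and firebrowse.org, where the common open-access data can be downloaded. The advantage of firebrowse is that data with QC problems has already been discarded and everything is combined into matrix form. If you pull data from the DCC yourself, you must check the TCGA annotation database to find out which data has to be dropped for QC reasons. (This is very important!!! I have talked with many people who have done TCGA data analysis, and most of them do not know what this annotation is.)

ICGC is a different matter. They focus mainly on sequencing data, and they analyze only the subset of BAMs that they consider easy to analyze or of good quality. It is very useful if you want non-TCGA data, but for TCGA it is very incomplete.

Borrowing this space to post an ad. I posted one two weeks ago with a link, and it was probably taken down by 老刑. The Center for Data Intensive Science at the University of Chicago is hiring bioinformaticians. Google it yourself; I won't post the link. Someone asked why I am still posting: the group keeps growing, so we are always hiring. We are exactly the kind of group the original poster described, one that specifically takes special funding/contracts. For the past few years we worked on the Genomic Data Commons (GDC), and now we have many new projects, mostly in biological data, with partners including several institutes under NIH, NOAA, NASA...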

Data cleanup and summary statistics with R (By Jasleen Grewal)

Reference: https://jasgrewal.github.io/common/seminars/teaching/r_stats_beginners_12022019/ggplot_basicstats.html

Cancerscope Tutorial

This tutorial goes through the use of cancerscope to predict the cancer type from a) an input file, or b) pre-loaded RNA-Seq data. We will be using some example RNA-Seq data from TCGA. You can download the data file, which has been pre-collated for you, here.

(Optional) Collating example data yourself

You can also prepare this data yourself. Download the data using the gdc-rnaseq-tool and this TCGA query. The example data used is then sourced as follows: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

gdc-client download -m gdc_manifest_age22.txt

Getting started

Please install cancerscope, and download all files in this directory. In particular, make sure you have downloaded the file combined_tcga_fpkm.txt.

Package import and setup

Start by importing the package into your Python instance.

>>> import cancerscope as cs

If this is your first time importing cancerscope, you will be greeted with the following...