Omics Academy

Posts

Showing posts from April, 2019

(By Cyriac Kandoth) Tutorial: Working with MAF files (Mutation Annotation Format) from the TCGA (The Cancer Genome Atlas)

April 30, 2019

Update (2/8/2017): This tutorial applies to TCGA MAFs in the GDC Legacy Archive. Most of this tutorial is still valid, but I'll need to update some notes and broken links.

Purpose

For folks familiar with the VCF format, TCGA's MAF files can be quite a pain to work with. You might just download the latest MAFs, pull loci and alleles for each variant, and redo annotations with ANNOVAR, snpEff, or Ensembl's VEP. Problem solved, right? Nope. You don't know the half of it! There are lots of caveats you should know about, and I try to document them below. Most of these caveats are handled with safe solutions in the MAFs at this page, and the specificity of variant calls is made more comparable across MAFs at this page.

How TCGA MAFs are made

Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover...
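A minimal sketch of the "pull loci and alleles for each variant" step mentioned above, using only the Python standard library. The column names (Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2) come from the public MAF specification. The sample rows, gene names, barcodes, and the function name extract_loci_and_alleles are made up for illustration, and the rule for picking the alternate allele is a simplifying assumption, not the tutorial author's method. It ignores the caveats the tutorial goes on to describe.

```python
import csv

# Hypothetical MAF excerpt: genes, positions, and barcodes are invented.
# Only the column names follow the public MAF specification.
SAMPLE_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\t"
    "Reference_Allele\tTumor_Seq_Allele1\tTumor_Seq_Allele2\t"
    "Tumor_Sample_Barcode\n"
    "GENE_A\t1\t100\t100\tG\tG\tA\tSAMPLE-0001\n"
    "GENE_B\t2\t200\t200\tC\tT\tC\tSAMPLE-0002\n"
)


def extract_loci_and_alleles(maf_text):
    """Return (chrom, start, end, ref, alt) tuples from MAF-formatted text."""
    # MAF files may begin with '#' comment lines such as '#version 2.4'.
    lines = [ln for ln in maf_text.splitlines()
             if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    variants = []
    for row in reader:
        ref = row["Reference_Allele"]
        # Assumption: take Tumor_Seq_Allele2 as the alternate allele,
        # falling back to Tumor_Seq_Allele1 when Allele2 equals the reference.
        alt = row["Tumor_Seq_Allele2"]
        if alt == ref:
            alt = row["Tumor_Seq_Allele1"]
        variants.append((
            row["Chromosome"],
            int(row["Start_Position"]),
            int(row["End_Position"]),
            ref,
            alt,
        ))
    return variants


variants = extract_loci_and_alleles(SAMPLE_MAF)
for chrom, start, end, ref, alt in variants:
    print(f"{chrom}:{start}-{end} {ref}>{alt}")
```

For a real file, the same function could be fed the result of open(path).read(); the tuples it returns are the kind of input a re-annotation tool would need after conversion to that tool's own format.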

Math land

April 30, 2019

Desmos: https://www.desmos.com
WolframAlpha: https://www.wolframalpha.com/

TCGA Related

April 30, 2019

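Only the raw sequence BAM and FASTQ files are on CGHub; everything else is on the TCGA DCC. The DCC has two entry points. The controlled one holds all data that contains sequence, including genotyping, variants, and some PCR sequencing. Data such as mutations (tumor vs. normal) and CNA are all open access. CNA comes mainly from AFFY SNP6, with some other platforms as well, including low-coverage WGS; these are all open access. The problem is that TCGA claims "no platform left behind", so although there are many methods, not every disease has every data type.

I usually look for TCGA data in three places: the DCC is one, or I pull it with TCGA Assembler, and common open data can also be downloaded from firebrowse.org. The advantage of firebrowse is that data with QC problems has already been thrown out and everything is combined into matrix form. If you want to find DCC data yourself, you must go to the TCGA annotation database to find out which data has to be discarded for QC reasons (this is extremely important!!! I have talked with many people who have done TCGA data analysis, and most of them do not know what this annotation is).

ICGC is another matter. They mainly look at sequencing data, and they only analyze the subset of BAMs that they consider easy to analyze or of good quality. It is useful if you want to look at non-TCGA data, but for TCGA it is very incomplete.

Borrowing this space to post an ad. I posted one two weeks ago with a link, and it was probably removed by Lao Xing. The Center for Data Intensive Science at the University of Chicago is hiring bioinformaticians. Google it yourself; I won't post the link. Someone asked why I am still posting: the group keeps growing, so we are always hiring. We are exactly the kind of group the original poster described, one that specifically takes special funding/contracts. In previous years we worked on the Genomic Data Commons (GDC); now we have many new projects, mostly in biological data, with partners including several agencies under NIH, NOAA, NASA...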

Data cleanup and summary statistics with R (By Jasleen Grewal)

April 29, 2019

Reference: https://jasgrewal.github.io/common/seminars/teaching/r_stats_beginners_12022019/ggplot_basicstats.html

Cancerscope Tutorial

April 29, 2019

This tutorial will go through the use of cancerscope to predict the cancer type from a) an input file, or b) pre-loaded RNA-Seq data. We will be using some example RNA-Seq data from TCGA. You can download the data file, which has been pre-collated for you, here.

(Optional) Collating example data yourself

You can also prepare this data yourself. Download the data using the gdc-rnaseq-tool and this TCGA query. The example data used is then sourced as follows:

https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
gdc-client download -m gdc_manifest_age22.txt

Getting started

Please install cancerscope, and download all files in this directory. In particular, make sure you have downloaded the file combined_tcga_fpkm.txt.

Package import and setup

Start by importing the package into your Python instance.

>>> import cancerscope as cs

If this is your first time importing cancerscope, you will be greeted with the following...