Cancerscope Tutorial

This tutorial will go through the use of cancerscope to predict the cancer type from a) an input file, or b) from pre-loaded RNA-Seq data.
We will be using some example RNA-Seq data from TCGA. You can download the data file which has been pre-collated for you here.

(Optional) Collating example data yourself

You can also prepare this data yourself. Download the data using the gdc-rnaseq-tool and this TCGA query
The example data used is then sourced as follows:
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool gdc-client download -m gdc_manifest_age22.txt

Getting started

Please install cancerscope, and download all files in this directory. Particularly, make sure you have downloaded the file combined_tcga_fpkm.txt

Package import and setup

Start by importing the package into your python instance.
>>> import cancerscope as cs
If this is your first time importing cancerscope, You will be greeted with the following output:
Thankyou for using cancerscope. The initial run requires download of dependent model files. Proceeding with download now...
 Models will be downloaded to /PATH_TO_PTYHON_LIB/lib/python2.7/site-packages/cancerscope/data/

Downloading model files for v1_rm500dropout 
 Data Downloaded at: /PATH_TO_PYTHON_LIB/lib/python2.7/site-packages/cancerscope/data/
Downloading model files for v1_none17kdropout 
 Data Downloaded at: /PATH_TO_PYTHON_LIB/lib/python2.7/site-packages/cancerscope/data/
Downloading model files for v1_rm500 
 Data Downloaded at: /PATH_TO_PYTHON_LIB/lib/python2.7/site-packages/cancerscope/data/
Downloading model files for v1_smotenone17k 
 Data Downloaded at: /PATH_TO_PYTHON_LIB/lib/python2.7/site-packages/cancerscope/data/
Downloading model files for v1_none17k 
 Data Downloaded at: /PATH_TO_PYTHON_LIB/lib/python2.7/site-packages/cancerscope/data/

Fitting the models

Set up the predictor...
>>> scope_obj=cs.scope()

Obtaining predictions

Process data from file...
>>> pfromfile = scope_obj.get_predictions_from_file("/PATH/TO/combined_tcga_fpkm.txt")
You should see the following output:
Your genenames are of format ENSG.version. Truncating to ENSG ID only

Read in sample file /PATH/TO/combined_tcga_fpkm.txt, 
 Data shape (26, 60483)
 Number of samples 26
 Number of genes in input 60483, with gene code ENSEMBL
...Only 17638 of input features mapped to expected number of features. Setting the rest to 0.0...Normalization function being applied: rastminmax
norm result shape (26, 17688)

...Only 17638 of input features mapped to expected number of features. Setting the rest to 0.0...Normalization function being applied: none
norm result shape (26, 17688)

...Only 17638 of input features mapped to expected number of features. Setting the rest to 0.0...Normalization function being applied: rastminmax
norm result shape (26, 17688)

...Only 17638 of input features mapped to expected number of features. Setting the rest to 0.0...Normalization function being applied: none
norm result shape (26, 17688)

...Only 17638 of input features mapped to expected number of features. Setting the rest to 0.0...Normalization function being applied: none
norm result shape (26, 17688)
In some cases, the provided gene names will not map over exactly to the list of gene names used by SCOPE (thankyou 100 different ways of naming a gene). The count of matching genes will be output to you for every model.
The output object will look like this:
>>> pfromfile
    sample_ix    label      pred  freq                                             models  rank_pred             sample_name
0           0  SKCM_TS  0.977220     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_SKCM_65312630
1           1  THCA_TS  0.999638     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_THCA_b2016510
2           2  THCA_NS  0.912180     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_THCA_ffb8427a
3           3  TGCT_TS  0.853217     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_TGCT_78264e6b
4           4  THCA_TS  0.943195     3         v1_rm500dropout,v1_none17kdropout,v1_rm500          1      TCGA_THCA_4869e2a4
5           4  LUAD_TS  0.570270     2                         v1_smotenone17k,v1_none17k          2      TCGA_THCA_4869e2a4
6           5  THCA_TS  0.900277     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_THCA_73451252
7           6  HNSC_TS  0.993280     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_HNSC_26019321
8           7  SKCM_TS  0.915794     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_SKCM_22632bc1
9           8  PCPG_TS  0.998625     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_PCPG_cf680d44
10          9  THCA_TS  0.987204     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_THCA_34826584
11         10  THCA_TS  0.994746     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_THCA_5fedc450
12         11  PCPG_TS  0.999457     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_PCPG_4f16b358
13         12  THCA_TS  0.992644     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_THCA_4640600f
14         13  TGCT_TS  0.821062     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_TGCT_d8ad327f
15         14  THCA_NS  0.915730     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_THCA_e32c7fe0
16         15  LIHC_TS  0.996771     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_LIHC_abe89868
17         16  PCPG_TS  0.987906     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_PCPG_1182a295
18         17  TGCT_TS  0.947846     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_TGCT_a51c7a87
19         18  SKCM_TS  0.863664     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_SKCM_bcc52bb7
20         19  THCA_TS  0.856618     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1  TCGA_THCA_NHL_a7229653
21         20  TGCT_TS  0.824202     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_TGCT_af5c9e80
22         21  SKCM_TS  0.935001     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1      TCGA_SKCM_074f955e
23         22   LGG_TS  0.998371     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1       TCGA_LGG_9cd81de0
24         23  THCA_TS  0.975574     3         v1_rm500dropout,v1_none17kdropout,v1_rm500          1      TCGA_THCA_7429a1e0
25         23  LUAD_TS  0.691227     2                         v1_smotenone17k,v1_none17k          2      TCGA_THCA_7429a1e0
26         24  LUAD_TS  0.765155     3       v1_none17kdropout,v1_smotenone17k,v1_none17k          1      TCGA_THCA_94cda1d7
27         24  THCA_TS  0.994180     2                           v1_rm500dropout,v1_rm500          2      TCGA_THCA_94cda1d7
28         25   LGG_TS  0.998748     5  v1_rm500dropout,v1_none17kdropout,v1_rm500,v1_...          1       TCGA_LGG_cfd39475

Interpreting the output

The sample index is the order the samples were read in. The label is the predicted class. \_[N,T]S indicates a healthy normal (N) or a tumour type (T). The tumor codes follow TCGA naming guidelines. Please refer to the publication for detailed names.
The pred is the prediction confidence for the label, from the models listed in the column models. The freq is the count of models whose top-prediction was label. The freq column matches the number of columns listed in models.
The rank_pred ranks the predictions per sample. In cases where the voting was unanimous, you see that the sample has a single rank (rank_pred == 1), and freq == 5.
In some instances in the example above, the prediction is split between two classes. The user can pick the highest ranked prediction for each case (rank_pred == 1), or use their better judgement to infer the results.

Why don't I just use the top-ranked prediction?

The intrepretation of confidence can be nuanced in some cases.
- For example, consider the sample "TCGA_THCA_94cda1d7" (sample_ix = 24). Here, 3 models predicted that the cancertype was LUAD (lung adenocarcinoma), with an average confidence of 0.765155. As a higher number of models predicted this, the rank of prediction is 1 (rank_pred = 1).  

- However, 2 models also predicted the THCA (thyroid carcinoma) tumor type, which is correct. The rank of prediction in this case is 2, because a fewer number of models predicted this (even though the average confidence was much higher, at 0.994180). The two results are provided to the user to facilitate a judgement call.     

- In contrast, see the case "TCGA_THCA_7429a1e0". Here, the average confidence *and* the number of models contributing to the call of 'THCA_TS' is higher, than those calling the case 'LUAD_TS'. In such a scenario, the user can simply choose to discard the 2nd ranked prediction, and go only with the 1st one, since both measures of certainty are higher in rank_pred == 1.  

Plotting the output, or saving the output as txt file

If you provide the prediction call with an output directory, it will generate the following:
  1. A txt file listing the dataframe returned above.
  2. A txt file listing the prediction confidence from each model, across all 66 classes, for all samples.
  3. An individual svg file labelled with the sample name, for each sample.
>>> pfromfile = scope_obj.get_predictions_from_file("/PATH/TO/combined_tcga_fpkm.txt", outdir="/PATH/TO/DESIRED/OUTPUT/FOLDER/")

Only generating predictions from a subset of models

If you like to play favorites, or generally dislike the sound of a particular model, you can tell SCOPE which models to consider when assessing an input.
>>> pfromfile = scope_obj.get_predictions_from_file("/PATH/TO/combined_tcga_fpkm.txt", modelnames=[LIST OF MODELS TO KEEP])
The model names for Version 1.0 of cancerscope are as follows:
v1_rm500
v1_rm500dropout
v1_smotenone17k
v1_none17k
v1_none17kdropout
Please refer to the supplementary data in the publication for details on each model.

FAQs

What is the GENE_IDENTIFIER? The Gene Identifier has to be uniform across the list (don't provide a mix of ENSG, HUGO, and ENTREZ ids, for example). If passing in a file, the column name for the first column ('GENE_IDENTIFIER') should indicate the type of genenames. Options are as follows:
SCOPEENTREZHUGOGENENAMEENSEMBLHGNCGSC1GSC2HUGO_ENSGSCOPE_ENSG
A1BG1A1BGA1BGENSG000001214105A1BG|1_calculatedmerged_AIBG|1_calculatedA1BG_ENSG00000121410A1BG_ENSG00000121410
WBSCR1764409GALNT17GALNT17ENSG0000018527416347WBSCR17|64409_calculatedmerged_WBSCR17|64409_calculatedGALNT17_ENSG00000185274WBSCR17_ENSG00000185274

Comments

Popular posts from this blog

gspread error:gspread.exceptions.SpreadsheetNotFound

Miniconda installation problem: concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

转载:彻底搞清楚promoter, exon, intron, and UTR