TCGA barcode
接触和分析过TCGA数据的朋友肯定会经常处理TCGA barcode的前15位(有时12位),实际从上图可以看出TCGA的barcode设计总共有28位之多。
每一个短横杠衔接的都是含不同意义的序列,如下图
Create Barcode
具体的解释如下表:
Label | Identifier for | Value | Value Description | Possible Values |
---|---|---|---|---|
Analyte | Molecular type of analyte for analysis | D | The analyte is a DNA sample | See Code Tables Report |
Plate | Order of plate in a sequence of 96-well plates | 182 | The 182nd plate | 4-digit alphanumeric value |
Portion | Order of portion in a sequence of 100 - 120 mg sample portions | 1 | The first portion of the sample | 01-99 |
Vial | Order of sample in a sequence of samples | C | The third vial | A to Z |
Project | Project name | TCGA | TCGA project | TCGA |
Sample | Sample type | 1 | A solid tumor | Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29. See Code Tables Report for a complete list of sample codes |
Center | Sequencing or characterization center that will receive the aliquot for analysis | 1 | The Broad Institute GCC | See Code Tables Report |
Participant | Study participant | 1 | The first participant from MD Anderson for GBM study | Any alpha-numeric value |
TSS | Tissue source site | 2 | GBM (brain tumor) sample from MD Anderson | See Code Tables Report |
另外,将barcode的组成从层次结构(树)来看,是这样的:
barcode tree
参考 https://docs.gdc.cancer.gov/Encyclopedia/pages/images/TCGA-TCGAbarcode-080518-1750-4378.pdf
可以看到同一个样本(一个病人的某一个组织块),在实际的实验处理中是分了很多分析试样的,特别是plate部分。这也就导致在实际的分析中有可能会出现多个barcode对应同一个样本(即前15位是一致的),那么分析的时候用哪个呢?
之前我做TCGA的相关分析一般是用UCSC Xena与Broad研究所的数据(属于level 4了),它们已经对这种问题进行了比较妥善的处理,然而最近处理从GDC下载的数据确实碰到了这样的问题,需要自己解决。
通过谷歌引擎找到Biostars上有人对这个问题加以讨论,我按照着提供的链接找到了Broad研究所进行barcode去重的策略:
主要内容如下:
In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter. Analyte Replicate Filter The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number. Sort Replicate Filter The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
翻译成中文,大致有以下3点:
另外,分析不使用福尔马林处理的样本(DNA与RNA分析数据失真,但这一点TCGA已经考虑了)
因此我写了个函数来处理这个问题: