「R」TCGA barcodes (sample IDs) and filtering duplicate samples

王诗翔呀

Published 2020-07-03 18:06:34

Included in the column: 优雅R

TCGA barcode

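Anyone who has worked with TCGA data has probably handled the first 15 characters of a TCGA barcode many times (sometimes only the first 12). The first 12 characters identify the participant and the first 15 identify the sample, but as the figure above shows, a full TCGA barcode is 28 characters long.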

Each hyphen-separated segment carries a different meaning, as shown in the figure below.

Create Barcode

The fields are explained in the following table (for the example barcode TCGA-02-0001-01C-01D-0182-01):

Label	Identifier for	Value	Value Description	Possible Values
Project	Project name	TCGA	TCGA project	TCGA
TSS	Tissue source site	02	GBM (brain tumor) sample from MD Anderson	See Code Tables Report
Participant	Study participant	0001	The first participant from MD Anderson for GBM study	Any alpha-numeric value
Sample	Sample type	01	A solid tumor	Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29. See Code Tables Report for a complete list of sample codes
Vial	Order of sample in a sequence of samples	C	The third vial	A to Z
Portion	Order of portion in a sequence of 100 - 120 mg sample portions	01	The first portion of the sample	01-99
Analyte	Molecular type of analyte for analysis	D	The analyte is a DNA sample	See Code Tables Report
Plate	Order of plate in a sequence of 96-well plates	0182	The 182nd plate	4-digit alphanumeric value
Center	Sequencing or characterization center that will receive the aliquot for analysis	01	The Broad Institute GCC	See Code Tables Report
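
To make the table concrete, here is a minimal R sketch that splits a full 28-character barcode into the fields listed above. The helper name `parse_tcga_barcode` is hypothetical (it is not from any package), and the sketch assumes a complete aliquot barcode with all seven hyphen-separated segments.

```r
# Split a full TCGA aliquot barcode into the fields from the table.
# `parse_tcga_barcode` is an illustrative helper, not part of any package.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  # Assumption: a complete aliquot barcode has exactly seven segments.
  stopifnot(length(parts) == 7)

  sample_code <- as.integer(substr(parts[4], 1, 2))
  # Ranges taken from the "Possible Values" column of the table.
  sample_class <- if (sample_code %in% 1:9) {
    "tumor"
  } else if (sample_code %in% 10:19) {
    "normal"
  } else if (sample_code %in% 20:29) {
    "control"
  } else {
    NA_character_
  }

  list(
    project      = parts[1],
    tss          = parts[2],
    participant  = parts[3],
    sample       = substr(parts[4], 1, 2),  # sample type code
    vial         = substr(parts[4], 3, 3),
    portion      = substr(parts[5], 1, 2),
    analyte      = substr(parts[5], 3, 3),
    plate        = parts[6],
    center       = parts[7],
    sample_class = sample_class
  )
}

fields <- parse_tcga_barcode("TCGA-02-0001-01C-01D-0182-01")
str(fields)
```

Every field is kept as a character string so that leading zeros such as `0182` are not lost.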

Viewed as a hierarchy (a tree), the barcode components look like this:

barcode tree

Reference: https://docs.gdc.cancer.gov/Encyclopedia/pages/images/TCGA-TCGAbarcode-080518-1750-4378.pdf

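As you can see, a single sample (one tissue block from one patient) is split into many analysis aliquots during laboratory processing, most visibly at the plate level. As a result, several barcodes can map to the same sample in practice, meaning their first 15 characters are identical. Which one should be used in the analysis?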
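
As a quick check for this situation, the following sketch groups barcodes by their first 15 characters and reports the sample IDs that occur more than once. The barcodes in `barcodes` are made up for illustration; only the first follows the example from the table.

```r
# Illustrative barcodes; the second and third are hypothetical.
barcodes <- c(
  "TCGA-02-0001-01C-01D-0182-01",
  "TCGA-02-0001-01C-01D-0200-01",  # same sample, different plate
  "TCGA-02-0003-01A-01D-0182-01"
)

# The first 15 characters identify the sample (project-TSS-participant-sample type).
sample_id <- substr(barcodes, 1, 15)

# Sample IDs shared by more than one barcode.
dup_ids <- unique(sample_id[duplicated(sample_id)])

# All barcodes belonging to each duplicated sample ID.
dup_groups <- split(barcodes, sample_id)[dup_ids]
print(dup_groups)
```

If `dup_ids` is empty, every sample is represented by a single barcode and no filtering is needed.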

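In the past I mostly did TCGA analyses with data from UCSC Xena and the Broad Institute (already level 4), where this problem has been handled reasonably well. Recently, however, I ran into it while processing data downloaded from GDC and had to solve it myself.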

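A Google search turned up a Biostars discussion of this question, and following the link given there I found the Broad Institute's strategy for de-duplicating barcodes: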

The main content is as follows:

In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.

Analyte Replicate Filter

The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.

Sort Replicate Filter

The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.

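In short, it comes down to three points:

For RNA analyses, analyte precedence is H > R > T
For DNA analyses, analyte precedence is D > G, W, X (unless the G, W, or X aliquot has a higher plate number)
If duplicates still remain after the filters above, compare the portion and plate fields and choose the larger value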

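In addition, formalin-fixed samples should not be used in the analysis (their DNA and RNA data are distorted, but TCGA has already taken this into account).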

So I wrote a function to handle this problem.
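
The rules above can be sketched in R roughly as follows. This is a minimal illustration, not the original function: the name `filter_tcga_replicates`, the `type` argument, and the example barcodes are all assumptions. It expects full 28-character aliquot barcodes from a single data type and groups them by their first 15 characters.

```r
# A sketch of the Broad precedence rules; not the author's original function.
# `type` tells the function whether the barcodes come from DNA or RNA data.
filter_tcga_replicates <- function(barcodes, type = c("DNA", "RNA")) {
  type <- match.arg(type)
  # Assumption: every input is a full 28-character aliquot barcode.
  stopifnot(is.character(barcodes), all(nchar(barcodes) == 28))
  barcodes <- unique(barcodes)

  pick_one <- function(bc) {
    analyte <- substr(bc, 20, 20)  # analyte letter, e.g. "D" or "R"
    plate   <- substr(bc, 22, 25)  # 4-character plate ID, compared as text

    if (type == "RNA") {
      # Analyte filter for RNA: H > R > T, then the latest plate.
      best <- intersect(c("H", "R", "T"), analyte)[1]
      if (!is.na(best)) {
        keep <- analyte == best
        keep <- keep & plate == max(plate[keep])
        bc <- bc[keep]
      }
    } else {
      # Analyte filter for DNA: prefer D unless a G/W/X aliquot
      # sits on a higher plate than every D aliquot.
      is_d <- analyte == "D"
      if (any(is_d) && !all(is_d)) {
        newer_wga <- !is_d & plate > max(plate[is_d])
        bc <- if (any(newer_wga)) bc[newer_wga] else bc[is_d]
      }
    }

    # Sort filter: highest lexicographical value (radix sort uses C-locale order).
    sort(bc, method = "radix", decreasing = TRUE)[1]
  }

  groups <- split(barcodes, substr(barcodes, 1, 15))
  vapply(groups, pick_one, character(1))
}

# Hypothetical aliquots for illustration.
dna <- c(
  "TCGA-02-0001-01C-01D-0182-01",
  "TCGA-02-0001-01C-01W-0200-01",
  "TCGA-02-0001-01C-01D-0150-01",
  "TCGA-02-0003-01A-01D-0182-01"
)
kept_dna <- filter_tcga_replicates(dna, type = "DNA")

rna <- c(
  "TCGA-02-0001-01C-01T-0179-07",
  "TCGA-02-0001-01C-01R-0177-07",
  "TCGA-02-0001-01C-01R-0178-07"
)
kept_rna <- filter_tcga_replicates(rna, type = "RNA")
```

The result is a character vector named by the 15-character sample ID, so it can be used directly to subset the columns of an expression or mutation matrix. Grouping by the first 15 characters follows the question posed in this post; the Broad text itself speaks of one aliquot per individual, platform, and data type, so a different grouping key may fit other use cases better.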

Originally published: 2020-03-24

Shared from the 优雅R WeChat public account.
