
[R] TCGA barcodes (sample IDs) and filtering duplicated samples

王诗翔呀
Published 2020-07-03 18:06:34
Collected in the column: 优雅R

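[Figure: TCGA barcode]

Anyone who has worked with TCGA data will have handled the first 15 characters of the TCGA barcode (the sample level) many times, and sometimes the first 12 (the participant level). As the figure above shows, though, a full TCGA barcode is 28 characters long.

Each hyphen-separated segment carries its own meaning, as the following figure shows.

[Figure: Create Barcode]

The table below explains each field: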

| Label | Identifier for | Value | Value Description | Possible Values |
| --- | --- | --- | --- | --- |
| Analyte | Molecular type of analyte for analysis | D | The analyte is a DNA sample | See Code Tables Report |
| Plate | Order of plate in a sequence of 96-well plates | 0182 | The 182nd plate | 4-digit alphanumeric value |
| Portion | Order of portion in a sequence of 100 - 120 mg sample portions | 01 | The first portion of the sample | 01-99 |
| Vial | Order of sample in a sequence of samples | C | The third vial | A to Z |
| Project | Project name | TCGA | TCGA project | TCGA |
| Sample | Sample type | 01 | A solid tumor | Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29. See Code Tables Report for a complete list of sample codes |
| Center | Sequencing or characterization center that will receive the aliquot for analysis | 01 | The Broad Institute GCC | See Code Tables Report |
| Participant | Study participant | 0001 | The first participant from MD Anderson for GBM study | Any alpha-numeric value |
| TSS | Tissue source site | 02 | GBM (brain tumor) sample from MD Anderson | See Code Tables Report |
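
To make the field layout concrete, here is a minimal Python sketch that splits a full 28-character barcode into the fields listed in the table. The names `BARCODE_RE`, `parse_barcode`, and `sample_class` are made up for this illustration, and the character classes in the pattern are assumptions based on the "Possible Values" column rather than an official specification.

```python
import re

# Assumed layout of a full aliquot barcode, e.g. TCGA-02-0001-01C-01D-0182-01:
# project-TSS-participant-sample+vial-portion+analyte-plate-center
BARCODE_RE = re.compile(
    r"^(?P<project>TCGA)"
    r"-(?P<tss>[A-Z0-9]{2})"
    r"-(?P<participant>[A-Z0-9]{4})"
    r"-(?P<sample>\d{2})(?P<vial>[A-Z])"
    r"-(?P<portion>\d{2})(?P<analyte>[A-Z])"
    r"-(?P<plate>[A-Z0-9]{4})"
    r"-(?P<center>[A-Z0-9]{2})$"
)


def parse_barcode(barcode):
    """Return a dict of barcode fields, or raise ValueError if the layout differs."""
    match = BARCODE_RE.match(barcode)
    if match is None:
        raise ValueError(f"not a full 28-character TCGA barcode: {barcode!r}")
    return match.groupdict()


def sample_class(sample_code):
    """Map the two-digit sample code to the ranges given in the table."""
    code = int(sample_code)
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


fields = parse_barcode("TCGA-02-0001-01C-01D-0182-01")
print(fields["analyte"], fields["plate"], sample_class(fields["sample"]))
```

Shortened barcodes (12 or 15 characters) are rejected by this pattern on purpose; handling them would need a looser expression.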

Viewed as a hierarchy (a tree), the components of the barcode look like this:

[Figure: barcode tree]

Reference: https://docs.gdc.cancer.gov/Encyclopedia/pages/images/TCGA-TCGAbarcode-080518-1750-4378.pdf


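As you can see, the same sample (one tissue block from one patient) is split into many analytical aliquots during actual processing, most notably at the plate level. As a result, several barcodes may map to the same sample, that is, share the same first 15 characters. Which one should be used for analysis?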
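
A quick way to see whether a dataset has this problem is to group barcodes by their first 15 characters. The sketch below is only an illustration: `find_duplicated_samples` is a hypothetical helper, and every barcode in the example list except the first is invented for the demonstration.

```python
from collections import defaultdict


def find_duplicated_samples(barcodes, prefix_len=15):
    """Group barcodes by sample ID (first `prefix_len` characters) and
    return only the groups that contain more than one barcode."""
    groups = defaultdict(list)
    for barcode in barcodes:
        groups[barcode[:prefix_len]].append(barcode)
    return {sample: items for sample, items in groups.items() if len(items) > 1}


# Hypothetical input: two aliquots of one sample that differ only in plate,
# plus one sample that occurs once.
barcodes = [
    "TCGA-02-0001-01C-01D-0182-01",
    "TCGA-02-0001-01C-01D-0183-01",
    "TCGA-02-0002-01A-01D-0182-01",
]

for sample, items in find_duplicated_samples(barcodes).items():
    print(sample, items)
```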

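In the past I mostly ran TCGA analyses on data from UCSC Xena and the Broad Institute (which count as level 4 data), where this problem has already been handled reasonably well. Recently, however, I ran into exactly this problem with data downloaded from GDC and had to solve it myself.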

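A Google search turned up a Biostars thread discussing the issue, and by following the link given there I found the Broad Institute's strategy for de-duplicating barcodes. The key passage reads: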

In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.

Analyte Replicate Filter

The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.

Sort Replicate Filter

The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.

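In short, this comes down to three points:

  1. For RNA analyses, the analyte precedence is H > R > T.
  2. For DNA analyses, the analyte precedence is D > G, W, X (unless the G, W, or X aliquot has a higher plate number).
  3. If duplicated samples remain after the filters above, compare the portion and plate fields and choose the larger one.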

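In addition, formalin-fixed samples should not be used for analysis because their DNA and RNA data are distorted, although TCGA has already taken this into account.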

So I wrote a function to handle this problem.
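
The function itself is not reproduced here. As a stand-in, the following Python sketch shows one way the Broad rules quoted above could be applied; it is not the author's R implementation. It assumes that all barcodes are full 28-character aliquot barcodes from a single platform and data type, that samples are identified by the first 15 characters, and that plate values can be compared as strings. The names `filter_replicates`, `_analyte`, `_plate`, and `_analyte_filter` are invented for this example.

```python
from collections import defaultdict

# Lower rank wins: H is preferred over R, and R over T.
RNA_RANK = {"H": 0, "R": 1, "T": 2}
DNA_ANALYTES = {"D", "G", "W", "X"}


def _analyte(barcode):
    # Fifth hyphen-separated field is portion + analyte, e.g. "01D".
    return barcode.split("-")[4][2]


def _plate(barcode):
    # Sixth hyphen-separated field is the plate, e.g. "0182".
    return barcode.split("-")[5]


def _analyte_filter(group):
    """Apply the analyte precedence rules to the aliquots of one sample."""
    analytes = {_analyte(b) for b in group}
    if len(analytes) <= 1:
        return group  # the rules only apply when analytes differ
    if analytes <= set(RNA_RANK):
        best = min(analytes, key=RNA_RANK.get)
        kept = [b for b in group if _analyte(b) == best]
        top_plate = max(_plate(b) for b in kept)
        return [b for b in kept if _plate(b) == top_plate]
    if analytes <= DNA_ANALYTES:
        native = [b for b in group if _analyte(b) == "D"]
        amplified = [b for b in group if _analyte(b) != "D"]
        if not native:
            return group
        native_top = max(_plate(b) for b in native)
        newer = [b for b in amplified if _plate(b) > native_top]
        # Native DNA wins unless an amplified aliquot sits on a later plate.
        return newer if newer else native
    return group  # mixed DNA/RNA analytes: leave it to the sort filter


def filter_replicates(barcodes):
    """Keep one barcode per sample, in order of first appearance."""
    groups = defaultdict(list)
    for barcode in barcodes:
        groups[barcode[:15]].append(barcode)
    # Sort replicate filter: highest lexicographical value among survivors.
    return [max(_analyte_filter(group)) for group in groups.values()]


# Hypothetical barcodes, invented for illustration only.
example = [
    "TCGA-AA-0001-01A-01R-0100-07",
    "TCGA-AA-0001-01A-01H-0050-07",
    "TCGA-AA-0002-01A-01D-0100-01",
    "TCGA-AA-0002-01A-01W-0200-01",
]
print(filter_replicates(example))
```

Comparing plates as strings is a simplification that holds for zero-padded numeric plates; alphanumeric plate IDs may need a different ordering.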

Originally published 2020-03-24 on the 优雅R WeChat official account.
