前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >生物信息学入门~利用购买的云服务器学习有参转录组数据处理(fastq到差异表达)

生物信息学入门~利用购买的云服务器学习有参转录组数据处理(fastq到差异表达)

作者头像
用户7010445
发布2024-04-30 19:40:11
980
发布2024-04-30 19:40:11
举报

拟南芥有参转录组数据处理,基本流程是

  • 1 转录组数据过滤
  • 2 转录组数据与参考基因组比对
  • 3 基于比对结果计算count值
  • 4 基于count值进行差异表达分析

准备数据

参考基因组 拟南芥基因组 来源于 论文 The genetic and epigenetic landscape of the Arabidopsis centromeres。下载链接 https://github.com/schatzlab/Col-CEN/tree/main/v1.2

只保留染色体序列,我使用Heliex这个工具对蛋白编码基因进行了注释,获得蛋白编码基因注释的gff格式文件

转录组数据来源于论文 Rapid and Dynamic Alternative Splicing Impacts the Arabidopsis Cold Response Transcriptome

下载 时间点8 和时间点 14 的转录组数据

ERR1886187,ERR1886352,ERR1886283 timepoint8 ERR1886289,ERR1886358,ERR1886193 timepoint14

这里的数据值选取了总数据的百分之一

准备好三个输入数据

  • 1 参考基因组的fasta文件
  • 2 参考基因组的蛋白编码基因注释文件 gff格式
  • 3 转录组测序数据 fastq 格式

安装软件

这里安装软件都用conda来安装,需要用到的软件有

  • fastp
  • hisat2
  • samtools
  • stringtie
  • R语言 tidyverse包(整理数据)DEseq2 包 (差异表达分析)

数据处理

第一步:原始数据过滤

代码语言:javascript
复制
fastp -i reads/ERR1886187_subset_1.fastq.gz -I reads/ERR1886187_subset_2.fastq.gz -o 01.clean.reads/ERR1886187_1.fq.gz -O 01.clean.reads/ERR1886187_2.fq.gz -j 01.fastp.output/ERR1886187.json -h 01.fastp.output/ERR1886187.html -w 1
fastp -i reads/ERR1886193_subset_1.fastq.gz -I reads/ERR1886193_subset_2.fastq.gz -o 01.clean.reads/ERR1886193_1.fq.gz -O 01.clean.reads/ERR1886193_2.fq.gz -j 01.fastp.output/ERR1886193.json -h 01.fastp.output/ERR1886193.html -w 1
fastp -i reads/ERR1886283_subset_1.fastq.gz -I reads/ERR1886283_subset_2.fastq.gz -o 01.clean.reads/ERR1886283_1.fq.gz -O 01.clean.reads/ERR1886283_2.fq.gz -j 01.fastp.output/ERR1886283.json -h 01.fastp.output/ERR1886283.html -w 1

fastp -i reads/ERR1886289_subset_1.fastq.gz -I reads/ERR1886289_subset_2.fastq.gz -o 01.clean.reads/ERR1886289_1.fq.gz -O 01.clean.reads/ERR1886289_2.fq.gz -j 01.fastp.output/ERR1886289.json -h 01.fastp.output/ERR1886289.html -w 1
fastp -i reads/ERR1886352_subset_1.fastq.gz -I reads/ERR1886352_subset_2.fastq.gz -o 01.clean.reads/ERR1886352_1.fq.gz -O 01.clean.reads/ERR1886352_2.fq.gz -j 01.fastp.output/ERR1886352.json -h 01.fastp.output/ERR1886352.html -w 1
fastp -i reads/ERR1886358_subset_1.fastq.gz -I reads/ERR1886358_subset_2.fastq.gz -o 01.clean.reads/ERR1886358_1.fq.gz -O 01.clean.reads/ERR1886358_2.fq.gz -j 01.fastp.output/ERR1886358.json -h 01.fastp.output/ERR1886358.html -w 1

第二步:对参考基因组构建索引

代码语言:javascript
复制
hisat2-build fa/at.col0.chr.fna index/at

第三步:转录组数据与参考基因组进行比对

代码语言:javascript
复制
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886187_1.fq.gz -2 01.clean.reads/ERR1886187_2.fq.gz -S 02.sam/ERR1886187.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886193_1.fq.gz -2 01.clean.reads/ERR1886193_2.fq.gz -S 02.sam/ERR1886193.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886283_1.fq.gz -2 01.clean.reads/ERR1886283_2.fq.gz -S 02.sam/ERR1886283.sam

hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886289_1.fq.gz -2 01.clean.reads/ERR1886289_2.fq.gz -S 02.sam/ERR1886289.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886352_1.fq.gz -2 01.clean.reads/ERR1886352_2.fq.gz -S 02.sam/ERR1886352.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886358_1.fq.gz -2 01.clean.reads/ERR1886358_2.fq.gz -S 02.sam/ERR1886358.sam


samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886187.sorted.bam 02.sam/ERR1886187.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886193.sorted.bam 02.sam/ERR1886193.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886283.sorted.bam 02.sam/ERR1886283.sam

samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886289.sorted.bam 02.sam/ERR1886289.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886352.sorted.bam 02.sam/ERR1886352.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886358.sorted.bam 02.sam/ERR1886358.sam

第四步:计算基因表达量

代码语言:javascript
复制
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886187/ERR1886187.gtf -A 04.stringtie.output/ERR1886187/ERR1886187.gene_abund.tsv 03.sorted.bam/ERR1886187.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886193/ERR1886193.gtf -A 04.stringtie.output/ERR1886193/ERR1886193.gene_abund.tsv 03.sorted.bam/ERR1886193.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886283/ERR1886283.gtf -A 04.stringtie.output/ERR1886283/ERR1886283.gene_abund.tsv 03.sorted.bam/ERR1886283.sorted.bam

stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886289/ERR1886289.gtf -A 04.stringtie.output/ERR1886289/ERR1886289.gene_abund.tsv 03.sorted.bam/ERR1886289.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886352/ERR1886352.gtf -A 04.stringtie.output/ERR1886352/ERR1886352.gene_abund.tsv 03.sorted.bam/ERR1886352.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886358/ERR1886358.gtf -A 04.stringtie.output/ERR1886358/ERR1886358.gene_abund.tsv 03.sorted.bam/ERR1886358.sorted.bam

第五步:合并基因表达量

代码语言:javascript
复制
Rscript getCounts.R

getCounts.R 的内容

代码语言:javascript
复制
library(tidyverse)

myfun<-function(x){
  return(read_tsv(x) %>% 
           select(`Gene ID`,`Coverage`) %>% 
           mutate(sampleName=str_extract(x,pattern = "ERR[0-9]+")))
}
list.files("04.stringtie.output",
           pattern = "gene_abund.tsv",
           recursive = TRUE,
           full.names = TRUE) %>% 
  map(myfun) %>% 
  bind_rows() %>% 
  pivot_wider(names_from = "sampleName",values_from = "Coverage") %>% 
  write_tsv("counts.tsv")

第六步:差异表达分析

代码语言:javascript
复制
library(tidyverse)
library(DESeq2)

mycounts<-read_tsv("D:/Jupyter/practice/RNAseqAT/counts.tsv") %>% 
  column_to_rownames("Gene ID") %>% 
  round()

mycounts %>% 
  head()

mycounts %>% dim()

mycounts_1<-mycounts[rowSums(mycounts) != 0,]
dim(mycounts_1)

mymeta<-read_csv("D:/Jupyter/practice/RNAseqAT/mymeta.csv")

colnames(mycounts_1) == mymeta$id

mycounts_1 %>% 
  select(mymeta$id) -> mycounts_1

colnames(mycounts_1) == mymeta$id

dds <- DESeqDataSetFromMatrix(countData=mycounts_1, 
                              colData=mymeta, 
                              design=~dex)

dds <- DESeq(dds)
res <- results(dds)
res


res %>% 
  as.data.frame() %>% 
  mutate(group = case_when(
    log2FoldChange >= 1 & padj <= 0.05 ~ "UP",
    log2FoldChange <= -1 & padj <= 0.05 ~ "DOWN",
    TRUE ~ "NOT_CHANGE"
  )) %>% 
  pull(group) %>% table()


res %>% 
  as.data.frame() %>% head()

res %>% 
  as.data.frame() %>% 
  mutate(group = case_when(
    log2FoldChange >= 1 & padj <= 0.05 ~ "UP",
    log2FoldChange <= -1 & padj <= 0.05 ~ "DOWN",
    TRUE ~ "NOT_CHANGE"
  )) %>% 
  ggplot(aes(x=log2FoldChange,y=-log10(padj)))+
  geom_point(aes(fill=group),size=8,shape=21,
             color="gray",
             alpha=0.6)+
  scale_fill_manual(values = c("UP"="red",
                                "DOWN"="blue",
                                "NOT_CHANGE"="gray"))+
  theme_bw(base_size = 20)+
  theme(panel.grid = element_blank(),
        legend.position = "top")
本文参与?腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2024-04-26,如有侵权请联系?cloudcommunity@tencent.com 删除

本文分享自 小明的数据分析笔记本 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与?腾讯云自媒体分享计划? ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 准备数据
  • 安装软件
  • 数据处理
相关产品与服务
云服务器
云服务器(Cloud Virtual Machine,CVM)提供安全可靠的弹性计算服务。 您可以实时扩展或缩减计算资源,适应变化的业务需求,并只需按实际使用的资源计费。使用 CVM 可以极大降低您的软硬件采购成本,简化 IT 运维工作。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档
http://www.vxiaotou.com