前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >转录组实战01: 从数据下载到定量fastp+STAR

转录组实战01: 从数据下载到定量fastp+STAR

原创
作者头像
生信探索
发布2023-02-11 16:01:05
6620
发布2023-02-11 16:01:05
举报
文章被收录于专栏:生信探索生信探索
生信交流与合作请关注公众号@生信探索
生信交流与合作请关注公众号@生信探索

01.建立工作目录

代码语言:shell
复制
cd ~
mkdir -p Project/Human_16_Asthma_Bulk
cd Project/Human_16_Asthma_Bulk

# 建立数据存放目录 data
mkdir -p data/rawdata  data/cleandata/fastp

# 建立STAR目录
mkdir STAR

02. 环境准备

代码语言:shell
复制
micromamba create -n RNA
micromamba activate RNA 
micromamba install -y -c hcc aspera-cli
micromamba install -y -c bioconda fastqc multiqc fastp samtools star

03.准备文件

https://www.gencodegenes.org/human/

代码语言:shell
复制
mkdir -p ~/DataHub/Genomics/GENCODE/release_42
cd ~/DataHub/Genomics/GENCODE/release_42

# 下载参考基因组及注释文件
#>>>downGENCODE.sh
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/gencode.v42.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.p13.genome.fa.gz

mv gencode.v42.annotation.gtf.gz HS.gencode.v42.annotation.gtf.gz
mv GRCh38.p13.genome.fa.gz HS.GRCh38.p13.genome.fa.gz

gzip -kd HS.gencode.v42.annotation.gtf.gz
gzip -kd HS.GRCh38.p13.genome.fa.gz
#<<<downGENCODE.sh

nohup zsh downGENCODE.sh &> downGENCODE.sh.log &

04.建立索引

  • 构建star的索引
代码语言:shell
复制
mkdir -p ~/DataHub/Genomics/star_index/human
cd ~/DataHub/Genomics/star_index/human
n_jobs=12
#>>>star_index.sh>>>
STAR \
--runMode genomeGenerate \
--genomeDir ~/DataHub/Genomics/star_index/human \
--genomeFastaFiles ~/DataHub/Genomics/GENCODE/release_42/HS.GRCh38.p13.genome.fa \
--sjdbOverhang 100 \
--sjdbGTFfile ~/DataHub/Genomics/GENCODE/release_42/HS.gencode.v42.annotation.gtf \
--runThreadN ${n_jobs} 
#<<<star_index.sh<<<

nohup sh star_index.sh &> star_index.sh.log &

05.下载原始数据

https://www.ebi.ac.uk/ena/browser/view/PRJNA229998,下载TSV文件

代码语言:shell
复制
cd ~/Project/Human_16_Asthma_Bulk/data/rawdata
cut -f 9 filereport_read_run_PRJNA229998_tsv.txt | tr ';' '\n' |grep '_[12].fastq.gz' > fq.txt

#>>>downloadFQ.sh
cat fq.txt |while read i
do
ascp -QT -l 300m -P33001 \
-i ~/micromamba/envs/RNA/etc/asperaweb_id_dsa.openssh \
era-fasp@$i \
.
done
#<<<downloadFQ.sh
nohup sh downloadFQ.sh &> downloadFQ.log & 

#制作了check.md5文件
cut -f 7 filereport_read_run_PRJNA229998_tsv.txt | sed '1d' | awk -F ';' '{print $(NF-1)"\n"$NF}' > md5.txt
cut -f 7 -d '/' fq.txt | paste md5.txt - > check.md5
rm md5.txt

#检查md5值是否一样,文件是否下载完整
d5sum -c check.md5 

05.质控

质量评估

代码语言:shell
复制
#fastqc
nohup fastqc -t 6 -o .  SRR*.fastq.gz > fastqc.log &
#multiqc
multiqc *.zip

fastp数据过滤

代码语言:shell
复制
#>>>fastp.sh>>>
rawdata_dir=~/Project/Human_16_Asthma_Bulk/data/rawdata
cleandata_dir=~/Project/Human_16_Asthma_Bulk/data/cleandata/fastp
n_jobs=8
cut -f 4 ${rawdata_dir}/filereport_read_run_PRJNA229998_tsv.txt | sed '1d' > ${rawdata_dir}/run_accession.txt

cat ${rawdata_dir}/run_accession.txt | while read i
do
fastp \
--in1 ${rawdata}/${i}_1.fastq.gz \
--in2 ${rawdata}/${i}_2.fastq.gz  \
--out1 ${cleandata}/${i}_1.fastp.fq.gz \
--out2 ${cleandata}/${i}_2.fastp.fq.gz \
--json ${cleandata}/${i}.fastp.json \
--html ${cleandata}/${i}.fastp.html \
--report_title ${cleandata}/${i} \
--thread ${n_jobs}
done
#<<<fastp.sh<<<

cd ~/Project/Human_16_Asthma_Bulk/data/cleandata/fastp
nohup sh fastp.sh &> fastp.sh.log &

06.STAR数据比对和定量

  • 统计对比结果
代码语言:shell
复制
#>>>star.sh>>>
rawdata_dir=~/Project/Human_16_Asthma_Bulk/data/rawdata
index_dir=~/DataHub/Genomics/star_index/human
input_dir=~/Project/Human_16_Asthma_Bulk/data/cleandata/fastp
out_dir=~/Project/Human_16_Asthma_Bulk/STAR
n_jobs=8

cat ${rawdata_dir}/run_accession.txt | while read i
do
STAR \
--readFilesIn ${input_dir}/${i}_1.fastp.fq.gz ${input_dir}/${i}_2.fastp.fq.gz \
--outSAMattrRGline ID:sample SM:sample PL:ILLUMINA \
--genomeDir ${index_dir} \
--readFilesCommand zcat \
--runThreadN ${n_jobs} \
--twopassMode Basic \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outFilterType BySJout \
--outFilterScoreMinOverLread 0.33 \
--outFilterMatchNminOverLread 0.33 \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix ${out_dir}/${i} \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs None \
--alignSoftClipAtReferenceEnds Yes \
--quantMode TranscriptomeSAM GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--genomeLoad NoSharedMemory \
--chimSegmentMin 15 \
--chimJunctionOverhangMin 15 \
--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \
--chimOutJunctionFormat 1 \
--chimMainSegmentMultNmax 1 \
--outSAMattributes NH HI AS nM NM ch
done
#<<<star.sh<<<

nohup sh star.sh &> star.sh.log &

整理count数据,想不出怎么用linux合并数据框比较优雅,所以用Python吧还是

代码语言:Python
复制
from pathlib import Path
import pandas as pd
import datatable as dt

dir="STAR"
count_list = []
tpm_list = []
paths = Path(dir).glob("*ReadsPerGene.out.tab")

for x,y in enumerate(list(paths)):
    if x < 1:
        Geneid =  pd.read_csv(y,usecols=[0],sep='\t',skiprows=4,header=None)
        Geneid.rename(columns={0:'Geneid'},inplace=True)
    _count_df = pd.read_csv(y,usecols=[1],sep='\t',skiprows=4,header=None)
    count_list.append(_count_df.rename(columns={1:y.name.split('Reads')[0]}))

count_df = pd.concat(count_list,axis=1)
count_df.insert(0,column='Geneid',value=Geneid)
dt.Frame(count_df).to_csv('star_count.csv.gz')
  • 对比质量
代码语言:shell
复制
#>>>star_qc.sh>>>
cd ~/Human_16_Asthma_Bulk/STAR
n_jobs=12
ls *sortedByCoord.out.bam | while read i
do
samtools flagstat -@ ${n_jobs} $i > ${i/bam/flagstat}
done
multiqc -o ./  *.flagstat
#<<<star_qc.sh<<<
nohup zsh star_qc.sh &> star_qc.sh.log &

Reference

https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

https://github.com/OpenGene/fastp

https://samtools.github.io/

https://www.jianshu.com/p/2fa4f5b870f5

https://www.jianshu.com/p/6da36135e2d1

https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

https://zhuanlan.zhihu.com/p/360427232

https://www.jianshu.com/p/2fa4f5b870f5

https://zhuanlan.zhihu.com/p/345896914

https://mp.weixin.qq.com/s/OJ1MpHhW08BcZSjMvFs3eg

https://mp.weixin.qq.com/s/OJ1MpHhW08BcZSjMvFs3eg

环境

代码语言:shell
复制
List of packages in environment: "~/micromamba/envs/RNA"

  Name                           Version       Build             Channel                
──────────────────────────────────────────────────────────────────────────────────────────
  _libgcc_mutex                  0.1           main              anaconda/pkgs/main     
  _openmp_mutex                  5.1           1_gnu             anaconda/pkgs/main     
  aspera-cli                     3.9.6         h5e1937b_0        hcc                    
  backports                      1.1           pyhd3eb1b0_0      anaconda/pkgs/main     
  backports.functools_lru_cache  1.6.4         pyhd3eb1b0_0      anaconda/pkgs/main     
  backports_abc                  0.5           py_1              anaconda/pkgs/main     
  blas                           1.0           mkl               anaconda/pkgs/main     
  bz2file                        0.98          py27_1            anaconda/pkgs/main     
  bzip2                          1.0.8         h7b6447c_0        anaconda/pkgs/main     
  c-ares                         1.18.1        h7f8727e_0        anaconda/pkgs/main     
  ca-certificates                2022.10.11    h06a4308_0        anaconda/pkgs/main     
  certifi                        2020.6.20     pyhd3eb1b0_3      anaconda/pkgs/main     
  click                          7.1.2         pyhd3eb1b0_0      anaconda/pkgs/main     
  curl                           7.87.0        h5eee18b_0        anaconda/pkgs/main     
  cutadapt                       1.18          py27h14c3975_1    anaconda/cloud/bioconda
  cycler                         0.10.0        py27_0            anaconda/pkgs/main     
  dbus                           1.13.18       hb2f20db_0        anaconda/pkgs/main     
  expat                          2.4.9         h6a678d5_0        anaconda/pkgs/main     
  fastp                          0.22.0        h2e03b76_0        anaconda/cloud/bioconda
  fastqc                         0.11.9        hdfd78af_1        anaconda/cloud/bioconda
  font-ttf-dejavu-sans-mono      2.37          hd3eb1b0_0        anaconda/pkgs/main     
  fontconfig                     2.14.1        h52c9d5c_1        anaconda/pkgs/main     
  freetype                       2.12.1        h4a9f257_0        anaconda/pkgs/main     
  functools32                    3.2.3.2       py27_1            anaconda/pkgs/main     
  futures                        3.3.0         py27_0            anaconda/pkgs/main     
  gdbm                           1.18          hd4cb3f1_4        anaconda/pkgs/main     
  gettext                        0.21.0        hf68c758_0        anaconda/pkgs/main     
  git                            2.34.1        pl5262hc120c5b_0  anaconda/pkgs/main     
  glib                           2.69.1        h4ff587b_1        anaconda/pkgs/main     
  gst-plugins-base               1.14.0        h8213a91_2        anaconda/pkgs/main     
  gstreamer                      1.14.0        h28cd5cc_2        anaconda/pkgs/main     
  hisat2                         2.1.0         py27hc9558a2_4    anaconda/cloud/bioconda
  htslib                         1.13          h9093b5e_0        anaconda/cloud/bioconda
  icu                            58.2          he6710b0_3        anaconda/pkgs/main     
  intel-openmp                   2022.1.0      h9e868ea_3769     anaconda/pkgs/main     
  jinja2                         2.11.3        pyhd3eb1b0_0      anaconda/pkgs/main     
  jpeg                           9e            h7f8727e_0        anaconda/pkgs/main     
  kiwisolver                     1.1.0         py27he6710b0_0    anaconda/pkgs/main     
  krb5                           1.19.2        hac12032_0        anaconda/pkgs/main     
  libcurl                        7.87.0        h91b91d3_0        anaconda/pkgs/main     
  libdeflate                     1.7           h27cfd23_5        anaconda/pkgs/main     
  libedit                        3.1.20210714  h7f8727e_0        anaconda/pkgs/main     
  libev                          4.33          h7f8727e_1        anaconda/pkgs/main     
  libffi                         3.3           he6710b0_2        anaconda/pkgs/main     
  libgcc-ng                      11.2.0        h1234567_1        anaconda/pkgs/main     
  libgfortran-ng                 7.5.0         ha8ba4b0_17       anaconda/pkgs/main     
  libgfortran4                   7.5.0         ha8ba4b0_17       anaconda/pkgs/main     
  libgomp                        11.2.0        h1234567_1        anaconda/pkgs/main     
  libnghttp2                     1.46.0        hce63b2e_0        anaconda/pkgs/main     
  libpng                         1.6.37        hbc83047_0        anaconda/pkgs/main     
  libssh2                        1.10.0        h8f2d780_0        anaconda/pkgs/main     
  libstdcxx-ng                   11.2.0        h1234567_1        anaconda/pkgs/main     
  libuuid                        1.41.5        h5eee18b_0        anaconda/pkgs/main     
  libxcb                         1.15          h7f8727e_0        anaconda/pkgs/main     
  libxml2                        2.9.14        h74e7548_0        anaconda/pkgs/main     
  markupsafe                     1.1.1         py27h7b6447c_0    anaconda/pkgs/main     
  matplotlib                     2.2.3         py27hb69df0a_0    anaconda/pkgs/main     
  mkl                            2020.2        256               anaconda/pkgs/main     
  mkl-service                    2.3.0         py27he904b0f_0    anaconda/pkgs/main     
  mkl_fft                        1.0.15        py27ha843d7b_0    anaconda/pkgs/main     
  mkl_random                     1.1.0         py27hd6b4f25_0    anaconda/pkgs/main     
  multiqc                        0.9.1a0       py27_4            anaconda/cloud/bioconda
  ncurses                        6.2           he6710b0_1        anaconda/pkgs/main     
  numpy                          1.16.6        py27hbc911f0_0    anaconda/pkgs/main     
  numpy-base                     1.16.6        py27hde5b4d6_0    anaconda/pkgs/main     
  openjdk                        11.0.13       h87a67e3_0        anaconda/pkgs/main     
  openssl                        1.1.1s        h7f8727e_0        anaconda/pkgs/main     
  pcre                           8.45          h295c915_0        anaconda/pkgs/main     
  pcre2                          10.37         he7ceb23_1        anaconda/pkgs/main     
  perl                           5.34.0        h5eee18b_2        anaconda/pkgs/main     
  pigz                           2.6           h27cfd23_0        anaconda/pkgs/main     
  pip                            19.3.1        py27_0            anaconda/pkgs/main     
  pyparsing                      2.4.7         pyhd3eb1b0_0      anaconda/pkgs/main     
  pyqt                           5.9.2         py27h05f1152_2    anaconda/pkgs/main     
  python                         2.7.18        ha1903f6_2        anaconda/pkgs/main     
  python-dateutil                2.8.2         pyhd3eb1b0_0      anaconda/pkgs/main     
  pytz                           2021.3        pyhd3eb1b0_0      anaconda/pkgs/main     
  pyyaml                         5.2           py27h7b6447c_0    anaconda/pkgs/main     
  qt                             5.9.7         h5867ecd_1        anaconda/pkgs/main     
  readline                       8.1           h27cfd23_0        anaconda/pkgs/main     
  samtools                       1.13          h8c37831_0        anaconda/cloud/bioconda
  setuptools                     44.0.0        py27_0            anaconda/pkgs/main     
  simplejson                     3.8.1         py27_0            anaconda/cloud/bioconda
  singledispatch                 3.7.0         pyhd3eb1b0_1001   anaconda/pkgs/main     
  sip                            4.19.8        py27hf484d3e_0    anaconda/pkgs/main     
  six                            1.16.0        pyhd3eb1b0_1      anaconda/pkgs/main     
  sqlite                         3.40.1        h5082296_0        anaconda/pkgs/main     
  star                           2.7.10b       h9ee0642_0        anaconda/cloud/bioconda
  subprocess32                   3.5.4         py27h7b6447c_0    anaconda/pkgs/main     
  subread                        2.0.1         h5bf99c6_1        anaconda/cloud/bioconda
  tbb                            2020.3        hfd86e86_0        anaconda/pkgs/main     
  tk                             8.6.12        h1ccaba5_0        anaconda/pkgs/main     
  tornado                        5.1.1         py27h7b6447c_0    anaconda/pkgs/main     
  trim-galore                    0.6.7         hdfd78af_0        anaconda/cloud/bioconda
  wheel                          0.37.1        pyhd3eb1b0_0      anaconda/pkgs/main     
  xopen                          0.7.3         py_0              anaconda/cloud/bioconda
  xz                             5.2.10        h5eee18b_1        anaconda/pkgs/main     
  yaml                           0.1.7         had09818_2        anaconda/pkgs/main     
  zlib                           1.2.13        h5eee18b_0        anaconda/pkgs/main     

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 01.建立工作目录
  • 02. 环境准备
  • 03.准备文件
  • 04.建立索引
  • 05.下载原始数据
  • 05.质控
    • 质量评估
      • fastp数据过滤
      • 06.STAR数据比对和定量
      • Reference
      • 环境
      领券
      问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档
      http://www.vxiaotou.com