利用BLAST进行序列比对和寻找同源基因

dpsh_6

43374人浏览 · 2019-03-10 21:46:04

dpsh_6 · 2019-03-10 21:46:04 发布

NCBI BLAST是最常用的短序列局部比对软件之一，学习一下软件参数，并这记录一下我的软件操作的命令行。

NR 库分割参考Gu Kai老师的笔记：http://www.bioinfo-scrounger.com/archives/673

1. 软件版本

我部署在服务器上的是version 2.8.1 ，因为其支持物种注释以及分割的NR子库的索引作为比对的库，使用比较方便，但是需要对应使用新版的NR数据库 (BLASTDBv5，ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/ )；另外官方也给了常用子NR库：

env_nr_v5
swissprot_v5
taxdb
tsa_nr_v5

2. 使用makeblastdb格式化数据库

makeblastdb的参数有很多，具体参数如下：

$ makeblastdb -help
USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-blastdb_version version] [-max_file_sz number_of_bytes]
    [-logfile File_Name] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-version]

DESCRIPTION
   Application to create BLAST databases, version 2.8.1+

REQUIRED ARGUMENTS
 -dbtype <String, `nucl', `prot'>
   Molecule type of target db

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments

 *** Input options
 -in <File_In>
   Input file/database name
   Default = `-'
 -input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>
   Type of the data specified in input_file
   Default = `fasta'

 *** Configuration options
 -title <String>
   Title for BLAST database
   Default = input file name provided to -in argument
 -parse_seqids
   Option to parse seqid for FASTA input if set, for all other input types
   seqids are parsed automatically
 -hash_index
   Create index of sequence hash values.

 *** Sequence masking options
 -mask_data <String>
   Comma-separated list of input files containing masking data as produced by
   NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
 -mask_id <String>
   Comma-separated list of strings to uniquely identify the masking algorithm
    * Requires:  mask_data
    * Incompatible with:  gi_mask
 -mask_desc <String>
   Comma-separated list of free form strings to describe the masking algorithm
   details
    * Requires:  mask_id
 -gi_mask
   Create GI indexed masking data.
    * Requires:  parse_seqids
    * Incompatible with:  mask_id
 -gi_mask_name <String>
   Comma-separated list of masking data output files.
    * Requires:  mask_data, gi_mask

 *** Output options
 -out <String>
   Name of BLAST database to be created
   Default = input file name provided to -in argumentRequired if multiple
   file(s)/database(s) are provided as input
 -blastdb_version <Integer, 4..5>
   Version of BLAST database to be created
   Default = `4'
 -max_file_sz <String>
   Maximum file size for BLAST database files
   Default = `1GB'
 -logfile <File_Out>
   File to which the program log should be redirected

 *** Taxonomy options
 -taxid <Integer, >=0>
   Taxonomy ID to assign to all sequences
    * Incompatible with:  taxid_map
 -taxid_map <File_In>
   Text file mapping sequence IDs to taxonomy IDs.
   Format:<SequenceId> <TaxonomyId><newline>
    * Requires:  parse_seqids
    * Incompatible with:  taxid

然后举个例子，来说明参数。输入文件是：1.）待格式化的序列；2）序列的物种信息表（可选），物种信息是NCBI taxid，可以使用get_species_taxids.sh脚本或者提取accession2taxid文件获得

$ cat test.fsa 
>seq1
MSFSTKPLDMATWPDFAALVERHNGVWGGCWCMAFHAKGSGAVGNREAKEARVREGSTHAALVFDGSACVGWCQFGPTGE
LPRIKHLRAYEDGQAVLPDWRITCFFSDKAFRGKGVAAAALAGALAEIGRLGGGTVESYPEDAQGRTVAGAFLHNGTLAM
>seq2
MKAIDLKAEEKKRLIEGIQDFFYEERNEEIGIIAAEKALDFFLSGVGKLIYNKALDESKIWFSRRLEDISLDYELLYK
>seq3 
MTLAAAAQSATWTFIDGDWYEGNVAILGPRSHAMWLGTSVFDGARWFEGVAPDLELHAARVNASAIALGLAPNMTPEQIV
GLTWDGLKKFDGKTAVYIRPMYWAEHGGYMGVPADPASTRFCLCLYESPMISPTGFSVTVSPFRRPTIETMPTNAKAGCL
YPNNGRAILEAKARGFDNALVLDMLGNVAETGSSNIFLVKDGHVLTPAPNGTFLSGITRSRTMTLLGDYGFRTTEKTLSV
RDFLEADEIFSTGNHSKVVPITRIEGRDLQPGPVAKKARELYWDWAHSASVG
>seq4
MRSFFHHVAAADPASFGVAQRVLTIPIKRAHIEVTHHLTKAEVDALIAAPNPRTSRGRRDRTFLLFLARTGARVSEATGV
NANDLQLERSHPQVLLRGKGRRDRVIPIPQDLARALTALLAEHGIANHEPRPIFIGARQERLTRFGATHIVRRAAAQAVT
IKPALAHKPISPHIFRHSLAMKLLQSGVDLLTIQAWLGHAQVATTHRYAAADVEMMRKGLEKAGVSGDLGLRFRPNDAVL
QLLTSI
>seq5
MTISRVCGSRTEAMLTNGQEIAMTSILKSTGAVALLLLYTLTANATSLMISPSSIERVAPDRAAVFHLRNQMDRPISIKV
RVFRWSQKGGVEKLEPTGDVVASPISAQLSPNGNRAVRVVRVSKEPLRSEEGYRVVIDEADPTRNTPEAESLSARHVLPV
LFRPPDVLGPEIELSLTRSDGWLMLVVENKGASRLRRSDVTLAQGSAGIARREGFVGYVLPGLTRHWRVGREDSYSGGIV
TVSANSSGGAIGEQLVVSGR
>seq6
TTLLLQVPIGWGVLHQGGALVVLGFAIAHWRGFVGTYTRDTAIEMRD

$ cat test_map.txt
seq1 68287
seq2 2382161
seq3 68287
seq4 382
seq5 382
seq6 382

执行格式化命令：

$ makeblastdb -in test.fsa -parse_seqids -blastdb_version 5 -taxid_map test_map.txt -title "test_demo" -dbtype prot -out test_db


Building a new DB, current time: 03/09/2019 17:08:14
New DB name:   test_db
New DB title:  test_demo
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6 sequences in 0.00222588 seconds.

如果不需要物种信息的话，可以执行：（-parse_seqids 参数需要加上，根据序列ID来分裂，可以使用)

$ makeblastdb -in test.fsa -parse_seqids -title test_demo -dbtype prot -out test_db -blastdb_version 5

3. 使用blastp, blastx或者blastn进行序列比对

blast的-outfmt参数的选择与说明，可以使用blastp -help打印每个输出格式的信息：

*** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report

   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
        qseqid means Query Seq-id
           qgi means Query GI
          qacc means Query accesion
       qaccver means Query accesion.version
          qlen means Query sequence length
        sseqid means Subject Seq-id
     sallseqid means All subject Seq-id(s), separated by a ';'
           sgi means Subject GI
        sallgi means All subject GIs
          sacc means Subject accession
       saccver means Subject accession.version
       sallacc means All subject accessions
          slen means Subject sequence length
        qstart means Start of alignment in query
          qend means End of alignment in query
        sstart means Start of alignment in subject
          send means End of alignment in subject
          qseq means Aligned part of query sequence
          sseq means Aligned part of subject sequence
        evalue means Expect value
      bitscore means Bit score
         score means Raw score
        length means Alignment length
        pident means Percentage of identical matches
        nident means Number of identical matches
      mismatch means Number of mismatches
      positive means Number of positive-scoring matches
       gapopen means Number of gap openings
          gaps means Total number of gaps
          ppos means Percentage of positive-scoring matches
        frames means Query and subject frames separated by a '/'
        qframe means Query frame
        sframe means Subject frame
          btop means Blast traceback operations (BTOP)
        staxid means Subject Taxonomy ID
      ssciname means Subject Scientific Name
      scomname means Subject Common Name
    sblastname means Subject Blast Name
     sskingdom means Subject Super Kingdom
       staxids means unique Subject Taxonomy ID(s), separated by a ';'
             (in numerical order)
     sscinames means unique Subject Scientific Name(s), separated by a ';'
     scomnames means unique Subject Common Name(s), separated by a ';'
    sblastnames means unique Subject Blast Name(s), separated by a ';'
             (in alphabetical order)
    sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
             (in alphabetical order) 
        stitle means Subject Title
    salltitles means All Subject Title(s), separated by a '<>'
       sstrand means Subject Strand
         qcovs means Query Coverage Per Subject
       qcovhsp means Query Coverage Per HSP
        qcovus means Query Coverage Per Unique Subject (blastn only)

一般常用的就是-outfmt 5, 6 或者 7 参数，“5”输出XML格式，“6”输出TAB分割格式，“7”输出带注释的TAB分割格式；

TAB格式每列信息如下，或者使用“7”输出带注释的格式

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

如果像输出其他信息，比如比对覆盖度信息（qcovs：Query Coverage Per Subject），需要如在outfmt 6基础上加上 “qcovs”（覆盖度信息），添加位置和顺序就是输出TAB文件中列的排列形式：

-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"

如果需要blast比对返回一个最优的比对结果，需要控制-max_target_seqs , -num_alignments 和 -max_hsps 选项：

-max_target_seqs <Integer, >=1>Maximum number of aligned sequences to keepNot applicable for outfmt <= 4* Incompatible with: num_descriptions, num_alignments 
-num_alignments <Integer, >=0>Number of database sequences to show alignments for* Incompatible with: max_target_seqs

$ blastp -query query.fas -db /NAS2/bio_db/test/test_pep -outfmt 6 -max_hsps 1  -num_alignments 1 -evalue 1e-5 -num_threads 20
Warning: [blastp] Examining 5 or more matches is recommended
NP_001262865.1  XP_027222871.1  59.500  200     59      2       424     622     282     460     1.27e-71        241
APW79906.1      XP_027222871.1  32.765  528     258     17      206     705     2       460     4.30e-61        213
ASV48202.1      XP_027222871.1  54.592  196     73      3       227     421     280     460     9.60e-63        211

$ blastp -query query.fas -db /NAS2/bio_db/test/test_pep -outfmt 6 -max_target_seqs 5 -evalue 1e-5 -num_threads 20
NP_001262865.1  XP_027222871.1  59.500  200     59      2       424     622     282     460     1.27e-71        241
NP_001262865.1  XP_027222871.1  43.038  79      32      4       104     172     89      164     4.20e-07        53.5
NP_001262865.1  XP_027222869.1  59.500  200     59      2       424     622     323     501     4.44e-71        241
NP_001262865.1  XP_027222870.1  59.500  200     59      2       424     622     322     500     4.56e-71        241
APW79906.1      XP_027222871.1  32.765  528     258     17      206     705     2       460     4.30e-61        213
APW79906.1      XP_027222869.1  54.872  195     70      4       515     705     321     501     1.96e-60        212
APW79906.1      XP_027222869.1  54.167  48      20      1       294     341     133     178     6.30e-06        49.7
APW79906.1      XP_027222870.1  54.872  195     70      4       515     705     320     500     2.14e-60        212
APW79906.1      XP_027222870.1  54.167  48      20      1       294     341     132     177     6.29e-06        49.7
ASV48202.1      XP_027222871.1  54.592  196     73      3       227     421     280     460     9.60e-63        211
ASV48202.1      XP_027222870.1  54.592  196     73      3       227     421     320     500     2.51e-62        210
ASV48202.1      XP_027222869.1  54.592  196     73      3       227     421     321     501     2.62e-62        210

3. 分割NR子库

NCB blast-2.8版本可支持用NCBI自带代码分割的NR子库的索引作为比对的库，使用比较方便

Support for a new version of the BLAST database that allows you to limit search by taxonomy as well some other improvements.

当然如果用这个版本的话，NR库也要重新下载了ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/

使用方式也比较简单（至少比之前的方法方便了），如果只想比对单一物种（如人:9606的话），命令如下：

    blastp –db nr –query query.fasta –taxids 9606 –outfmt 6 –out blast.outfm6

如果想比对NR子库哺乳动物的话，需要先建个哺乳动物子库tax_id索引

    get_species_taxids.sh -t 40674 > 40674.txids

然后再将序列比对至NR哺乳动物子库

    blastp –db nr –query query.fasta –taxidlist 40674.txids –outfmt 6 –out blast.outfm6

具体说明可看：https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf

AI大模型社区

聚焦前沿AI与大模型技术探索，汇聚开发者及爱好者，共享开源项目、学习资源与行业资讯。

更多推荐

智能攻防元年：渗透测试Agent迎来大考，AI如何从“能打”走向“可控”

AI大模型社区

解析200万次对话数据：ChatGPT引用内容的核心特征与优化策略

AI大模型社区

AI时代技术管理者新定位：在秩序与混沌的边缘掌舵

AI大模型社区

所有评论(0)

查看更多评论

dpsh_6

@dpsh_6

已为社区贡献1条内容

利用BLAST进行序列比对和寻找同源基因

dpsh_6

1. 软件版本

2. 使用makeblastdb格式化数据库

3. 使用blastp, blastx或者blastn进行序列比对

3. 分割NR子库

所有评论(0)

温馨提示：您尚未绑定手机号

dpsh_6