Yesterday, we talked about NCBI BLAST parsing. I shared two PERL script than can parsed BLAST results in tabulated manner so that you can access your NCBI BLAST output easily. Finally you can extract out those queries without hits or with hits. By the help of those PERL script you can find you best hits also. Suppose you have identified those queries which have no hits in BLAST and now want to remove those queried from main FASTA sequence file then it would not be easy if number of those queries are high. In that case, this PERL script can save your day. this PERL codelet was written by my friend HARRY
Extract Seq.pl
#!/usr/local/bin/perl
print "Enter Input code file name.\n";
$f=<STDIN>;
$f=~s/\n//g;
#f="test.out";
print "Enter Input Sequence file name.\n";
$fs=<STDIN>;
$fs=~s/\n//g;
print "Output codefile name.\n";
$fw=<STDIN>;
$fw=~s/\n//g;
$flag=0;
print "Output mainfile name.\n";
$fw1=<STDIN>;
$fw1=~s/\n//g;
$flag1=0;
unless (open(DA,$f))
{
print "Input file is not open.\n";
}
@da=<DA>;
close DA;
unless (open(DATA,$fs))
{
print "Input file is not open.\n";
}
@daseq=<DATA>;
close DATA;
unless (open(WR,">$fw"))
{
print "Output file is not open.\n";
}
unless (open(WR1,">$fw1"))
{
print "Output file is not open.\n";
}
foreach $fl(@daseq)
{
$ID="ZZZZZZZZZZZZZZZZZZZZZZZZ";
if($fl=~/^\>(.........)/)
{
$ID=$1;
foreach $line(@da)
{
if($line=~/(.........)/)
{
if($ID eq $1)
{
$flag1=1;
}
}
}
}
if($flag1==1)
{
print WR "$fl";
$flag++;
}
else
{
print WR1 "$fl";
}
if($flag==2)
{
$flag=0;
$flag1=0;
}
}
close WR;
close WR1;
Uses
it is very easy to use this PERL script. Just run the script from your command prompt, it will ask four file name -
- File name with accession number
- File with FASTA sequences
- File name where you want to write the FASTA sequences of your queries
- File name where you would like to write rest of sequences
Limitations
I run this PERL script using these kind of accession numbers but hope it will work for other kind of queries also>Vocar20001498m
Brachypodium distachyon
AT1G51370
Instead of being a wonderful tool, this script has one limitation. You sequence must be in single line FASTA file format not in multi - line FASTA format. But it should not be a problem as I have already shared a PERL script to convert multi - line FASTA format to single line FASTA format.
>right format
MVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGLDLDPYASSNTNTIVSFVESMVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGLDLDPYASSNTNTIVSFVES
>wrong format
MVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSAL
STKWRYLWQSVPGLDLDPYASSNTNTIVSFVESMVGGKKKTKICDKV
SHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGL
DLDPYASSNTNTIVSFVES
Hope this Bioinformatics tutorial would be useful to extract out the sequence from your multi-fasta files in sequence analysis studies.
Post a Comment