Perl scripts are useful in almost every area of bioinformatics. A Perl script can run BLAST, parse output, predict genes and much more from the command line, without visiting a particular website. Here is another useful script by Dr. David Rosenkranz, Institute of Anthropology, Johannes Gutenberg University Mainz, Germany. Suppose you have thousands of genes in a FASTA file and want to delete, or save to another file, those sequences that are shorter than 300 bp or amino acids. This Perl script saves you from editing the file manually.
Script name | Download |
---|---|
Seq_filter.pl |
This Perl script generates three files:
sequence length ok - Sequences within the defined length range
sequence too short - Sequences shorter than the minimum length
sequence too long - Sequences longer than the maximum length
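The three-way split can be pictured with a minimal sketch. This is not Dr. Rosenkranz's code: the sub names `parse_fasta` and `classify_length` and the sample records are assumptions made for illustration, and both length limits are assumed to be inclusive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into [header, sequence] pairs.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $chunk (split /^>/m, $text) {
        next unless $chunk =~ /\S/;           # skip the empty piece before the first ">"
        my ($header, @lines) = split /\n/, $chunk;
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;                     # drop stray whitespace
        push @records, [$header, $seq];
    }
    return @records;
}

# Hypothetical helper: decide which of the three bins a length falls into.
# Assumption: both limits are inclusive.
sub classify_length {
    my ($len, $min, $max) = @_;
    return 'too_short' if $len < $min;
    return 'too_long'  if $len > $max;
    return 'ok';
}

# Made-up sample data: sequences of length 4, 12 and 20.
my $fasta = ">seq1\nACGT\n>seq2\nACGTACGTAC\nGT\n>seq3\nACGTACGTACGTACGTACGT\n";
for my $rec (parse_fasta($fasta)) {
    my ($header, $seq) = @$rec;
    printf "%s\t%d\t%s\n", $header, length($seq), classify_length(length($seq), 5, 15);
}
```

With a minimum of 5 and a maximum of 15, the sample records land in the too short, ok and too long bins respectively.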
Options
-i = File name with sequences
-min = Minimum length of sequences
-max = Maximum length of sequences
-0 = Remove sequences that are too long and write them to the output file sequences_too_long.fas
-1 = Cut the end of sequences that are too long and save them to the file sequences_ok.fas
-I = File containing a list of files with sequences
Usage
If my sequences are stored in INPUT.TXT and I want to extract sequences with a minimum length of 30 and a maximum length of 60, my command will be:
perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60
If I want to cut the end of the sequences and save them to the file sequences_ok.fas, my command will be:
perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -1
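As a rough illustration of what cutting the end of a sequence could look like, here is a minimal sketch. It assumes that trimming means keeping only the first `max` characters; the sub name `trim_to_max` is hypothetical and is not taken from Seq_filter.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the first $max characters of a sequence.
# Assumption: "cutting the end" means truncating to the maximum length.
sub trim_to_max {
    my ($seq, $max) = @_;
    return length($seq) > $max ? substr($seq, 0, $max) : $seq;
}

my $long    = 'ACGT' x 20;              # an 80-character sequence
my $trimmed = trim_to_max($long, 60);
print length($trimmed), "\n";           # prints 60
```

Sequences that are already within the limit pass through unchanged.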
If I want to remove sequences and write them to the output file sequences_too_long.fas, my command will be:
perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -0
If I have many files with thousands of sequences, I write the names of those files (one file name per line) in a text file (file_name.txt, for example). Then my command will be:
perl Seq_filter.pl -I file_name.txt -min 30 -max 60
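The list file is plain text with one file name per line. A minimal sketch of reading such a list is shown here; the sub name `read_file_list` and the file names `genes_A.fas`, `genes_B.fas` and `genes_C.fas` are made up for illustration, and this is not necessarily how the script itself parses the list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read file names, one per line, from an open filehandle.
sub read_file_list {
    my ($fh) = @_;
    my @files;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;        # strip surrounding whitespace and line endings
        next if $line eq '';            # ignore blank lines
        push @files, $line;
    }
    return @files;
}

# In-memory stand-in for file_name.txt so the sketch runs without extra files.
my $list_text = "genes_A.fas\n\ngenes_B.fas\r\n  genes_C.fas  \n";
open my $list_fh, '<', \$list_text or die "Cannot open in-memory list: $!";
my @files = read_file_list($list_fh);
close $list_fh;

print "$_\n" for @files;
```

Stripping whitespace on both sides also copes with list files saved with Windows line endings.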
I can also combine the options, like this:
perl Seq_filter.pl -i INPUT.TXT -I file_name.txt -min 30 -max 60 -1
I hope this bioinformatics tutorial helps you extract sequences from your multi-FASTA files in sequence analysis studies.
Update
You can also use this Perl script to remove FASTA sequences shorter than a specified length:
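#!/usr/bin/perl
use strict;
use warnings;

# The first argument is the minimum sequence length to keep.
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
    # Read one FASTA record at a time by using ">" as the record separator.
    local $/ = ">";
    while (<>) {
        chomp;               # remove the trailing ">" separator
        next unless /\w/;    # skip the empty chunk before the first header
        s/>$//gs;            # drop any leftover ">" at the end of the record
        my @chunk  = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        # Print the whole record if the sequence is long enough.
        print ">$_" if ($seqlen >= $minlen);
    }
}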
Usage
perl remove_small.pl 200 input.fasta > result.fasta
Here 200 is the minimum sequence length to keep. Sequences shorter than 200 are dropped, and the remaining sequences are written to result.fasta. The file input.fasta itself is left unchanged.
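To check that a filtered file contains only sequences of the expected length, the lengths can be listed per record. The following is a minimal sketch that reuses the same ">" record separator idea; the sub name `fasta_lengths` and the sample records are assumptions for illustration, and the input is held in memory so the sketch runs without any files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [header, length] pairs for every record read from $fh.
sub fasta_lengths {
    my ($fh) = @_;
    local $/ = ">";                       # read one FASTA record per iteration
    my @lengths;
    while (my $record = <$fh>) {
        chomp $record;                    # remove the trailing ">" separator
        next unless $record =~ /\S/;      # skip the empty chunk before the first header
        my ($header, @lines) = split /\n/, $record;
        my $seq = join '', @lines;
        $seq =~ s/\s//g;                  # ignore stray whitespace such as "\r"
        push @lengths, [$header, length($seq)];
    }
    return @lengths;
}

# Made-up stand-in for the contents of result.fasta.
my $result = ">geneA\nACGTACGT\nACGT\n>geneB\nACGTAC\n";
open my $in, '<', \$result or die "Cannot open in-memory FASTA: $!";
for my $pair (fasta_lengths($in)) {
    printf "%s\t%d\n", @$pair;
}
close $in;
```

To run it on a real file, open result.fasta instead of the in-memory string.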