How to Filter Sequences by Their Length

Perl scripts are wonderful tools because they can help with almost every aspect of bioinformatics. A Perl script can run BLAST, parse output, predict genes and do much more for you in a very simple way, without even visiting a particular website. Here let me share another beautiful piece of work by Dr. David Rosenkranz, Institute of Anthropology, Johannes Gutenberg University Mainz, Germany. Suppose you have thousands of genes in a FASTA file and want to delete, or save to another file, those sequences that are shorter than 300 bp or amino acids. This Perl script will save you from editing the file manually.

Perl Script

Script name	Download
Seq_filter.pl

This Perl script generates three output files:

sequence length ok - sequences whose length falls within the defined range

sequence too short - sequences shorter than the defined minimum length

sequence too long - sequences longer than the defined maximum length
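
The three-way split described above can be sketched roughly as follows. This is not the code of Seq_filter.pl, only an illustration of the idea. The sub name `classify_by_length` is hypothetical, and treating both limits as inclusive is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not taken from Seq_filter.pl): decide which of the
# three groups a sequence belongs to. Both limits are assumed to be inclusive.
sub classify_by_length {
    my ($seq, $min, $max) = @_;
    my $len = length($seq);
    return 'too short' if $len < $min;
    return 'too long'  if $len > $max;
    return 'ok';
}

# Small in-memory demonstration with made-up sequences.
my %demo = (
    seq_a => 'ACGT',          # 4 bases
    seq_b => 'ACGTACGTAC',    # 10 bases
    seq_c => 'ACGT' x 5,      # 20 bases
);
for my $id (sort keys %demo) {
    printf "%s\t%d\t%s\n", $id, length($demo{$id}),
        classify_by_length($demo{$id}, 5, 15);
}
```

With a minimum of 5 and a maximum of 15, the demonstration puts one made-up sequence in each group.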

Options


-i   = File name with sequences
-min = Minimum length of sequences
-max = Maximum length of sequences
-0   = Remove sequences and write to output file sequences_too_long.fas
-1   = Cut the end of the sequence and save to file sequences_ok.fas
-I   = File containing a list of files with sequences

Uses

If my sequences are stored in INPUT.TXT and I want to extract sequences with a minimum length of 30 and a maximum length of 60, then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60

If I want to cut the end of the sequence and save it to the file sequences_ok.fas, then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -1
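
To picture what cutting the end of a sequence might look like, here is a minimal sketch. It assumes that "cutting" means keeping the first `max` residues of a sequence that is too long, which the description of -1 does not state explicitly. The sub name `trim_to_max` is hypothetical and is not part of Seq_filter.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep at most $max residues from the start of $seq.
# Assumption: "cutting the end" means dropping everything after position $max.
sub trim_to_max {
    my ($seq, $max) = @_;
    return length($seq) > $max ? substr($seq, 0, $max) : $seq;
}

print trim_to_max('ACGTACGTAC', 6), "\n";   # longer than the limit: prints ACGTAC
print trim_to_max('ACGT', 6), "\n";         # already short enough: prints ACGT
```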

If I want to remove sequences and write them to the output file sequences_too_long.fas, then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -0

If I have many files, each with thousands of sequences, I will write the names of those files (one file name per line) in a text file (file_name.txt, for example). Then my command will be

perl Seq_filter.pl -I file_name.txt -min 30 -max 60

I can also combine these options, like this

perl Seq_filter.pl -i INPUT.TXT -I file_name.txt -min 30 -max 60 -1

I hope this bioinformatics tutorial is useful for extracting sequences from your multi-FASTA files in sequence analysis studies.

Update

You can also use this Perl script to remove FASTA sequences shorter than a length you specify


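#!/usr/bin/perl
use strict;
use warnings;

# First argument: minimum sequence length to keep
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
    # Read one FASTA record at a time by using ">" as the record separator
    local $/ = ">";
    while (<>) {
        chomp;                       # strip the trailing ">" separator
        next unless /\w/;            # skip the empty chunk before the first ">"
        s/>$//gs;                    # remove any stray ">" left at the end
        my @chunk  = split /\n/;
        my $header = shift @chunk;   # first line of the record is the header
        my $seqlen = length join "", @chunk;
        # Print the whole record only if the sequence is long enough
        print ">$_" if ($seqlen >= $minlen);
    }
}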

Uses

perl remove_small.pl 200 input.fasta > result.fasta

Here 200 is the minimum sequence length you want to keep. All FASTA sequences shorter than 200 are left out, and the remaining sequences are written to result.fasta. The file input.fasta itself is not modified.
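
To check that a filtered file really contains only sequences of the expected length, one option is to list the length of every record. The sketch below is one possible way to do that. The sub name `fasta_lengths` is hypothetical, and the demonstration uses made-up in-memory data rather than a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text from a filehandle and return a list
# of [header, sequence length] pairs, one per record.
sub fasta_lengths {
    my ($fh) = @_;
    my (@records, $header, $len);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                 # strip newline and carriage return
        next if $line eq '';                # ignore blank lines
        if ($line =~ /^>(.*)/) {
            # A new header starts: store the previous record, if any
            push @records, [$header, $len] if defined $header;
            ($header, $len) = ($1, 0);
        }
        elsif (defined $header) {
            $len += length($line);          # add this sequence line
        }
    }
    push @records, [$header, $len] if defined $header;   # last record
    return @records;
}

# Demonstration on in-memory FASTA text (made-up data).
my $fasta = ">seq1\nACGTACGT\nACGT\n>seq2\nACG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
printf "%s\t%d\n", @$_ for fasta_lengths($fh);
close $fh;
```

To inspect a real file, the in-memory string can be replaced by opening result.fasta for reading and passing that filehandle to the sub.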

Sanjay Singh
