Sort FASTA File by Sequence Name

Problem

I got a multi fasta file containing several isoform of the same locus. So my task was to get rid of the multiple version of same gene and retain the longest isoform. To do so I decided to sort the FASTA sequences alphanumerically by their header and then discard the smaller versions of the same locus. Previously I share a method to sort FASTA file by sequence length but in this case I was interested to filter the FASTA file by sequence name.

Solution

I found on PERL script for FASTA file sorting from Dr. Naoki Takebayashi's lab from University of Alaska. This perl script depends upon to different modules of Bioperl: Bio::SeqIO and Getopt::Std

How to install Bioperl HERE

PERL script

#!/usr/bin/perl -w

my $usage="\nUsage: $0 [-hrg] [fastaFileName1 ...]\n".
    "  -h: help\n".
    "  -r: reverse\n" .
    "  -g: remove gaps '-' from the sequence\n".
    "Sort FASTA sequences alphabetically by names.  If multiple files are \n".
    "given, sequences in all files are marged before sorting.  If no \n".
    "argument is given, it will take STDIN as the input.\n" .
    "Note that the entire sequence label including spaces is used as\n".
    "the name.\n";

our($opt_h, $opt_g, $opt_r);

use Bio::SeqIO;

use Getopt::Std;
getopts('hgr') || die "$usage\n";
die "$usage\n" if (defined($opt_h));

my $format = "fasta";
my @seqArr = ();

@ARGV = ('-') unless @ARGV;
while (my $file = shift) {
    my $seqio_obj = Bio::SeqIO->new(-file => $file, -format => $format);
    while (my $seq = $seqio_obj->next_seq()) {
       # need to deal with spaces
 $seq->desc( $seq->id . " ". $seq->desc);

 push(@seqArr, $seq);
    }
}

if (defined($opt_r)) {
    @seqArr = sort { - ($a->desc() cmp $b->desc()) } @seqArr;
} else {
    @seqArr = sort { $a->desc() cmp $b->desc() } @seqArr;
}

my $seqOut = Bio::SeqIO->new(-fs => \*STDOUT, -format => $format);
foreach my $s (@seqArr) {
    # prints "id desc", and desc was modified, returning it to original
    my $thisDesc = $s->desc;
    $thisDesc =~ s/^\S+ //; # remove the first word.
    $s->desc($thisDesc);

    if(defined($opt_g)) {
 my $tmp = $s->seq();
 $tmp =~ s/-//g;
 $s->seq($tmp);
    }
    $seqOut->write_seq($s);
}

exit;

Uses

Suppose you save the PERL script as Sort_fasta.pl then you can use it as follow

Sort_fasta.pl -h to know more about script
Sort_fasta.pl -r My_input     to sort fasta file in descending order
Sort_fasta.pl -g My_input     remove gaps '-' from the sequence
Sort_fasta.pl [options] My_input

Note that sorting in ascending order is default.

Extract Part of a FASTA Sequences with Position by python script HERE

Sanjay Singh

Sort FASTA File by Sequence Name

Problem

Solution

Comments

Post a Comment