Ronald Robertson

Sanjay Singh

Scientist/Writer
  • Emailsanjaysingh765@gmail.com
  • Socail@lampatlex
  • VisitorSince 1982
  • LocationKentucky, USA



Sort FASTA File by Sequence Name




Problem


I got a multi fasta file containing several isoform of the same locus. So my task was to get rid of the multiple version of same gene and retain the longest isoform. To do so I decided to sort the FASTA sequences alphanumerically by their header and then discard the smaller versions of the same locus. Previously I share a method to sort FASTA file by sequence length but in this case I was interested to filter the FASTA file by sequence name.  




Solution


I found on PERL script for FASTA file sorting from Dr. Naoki Takebayashi's lab from University of Alaska. This perl script depends upon to different modules of Bioperl: Bio::SeqIO and Getopt::Std



How to install Bioperl HERE

PERL script


#!/usr/bin/perl -w

my $usage="\nUsage: $0 [-hrg] [fastaFileName1 ...]\n".
" -h: help\n".
" -r: reverse\n" .
" -g: remove gaps '-' from the sequence\n".
"Sort FASTA sequences alphabetically by names. If multiple files are \n".
"given, sequences in all files are marged before sorting. If no \n".
"argument is given, it will take STDIN as the input.\n" .
"Note that the entire sequence label including spaces is used as\n".
"the name.\n";

our($opt_h, $opt_g, $opt_r);

use Bio::SeqIO;

use Getopt::Std;
getopts('hgr') || die "$usage\n";
die "$usage\n" if (defined($opt_h));

my $format = "fasta";
my @seqArr = ();

@ARGV = ('-') unless @ARGV;
while (my $file = shift) {
my $seqio_obj = Bio::SeqIO->new(-file => $file, -format => $format);
while (my $seq = $seqio_obj->next_seq()) {
# need to deal with spaces
$seq->desc( $seq->id . " ". $seq->desc);

push(@seqArr, $seq);
}
}

if (defined($opt_r)) {
@seqArr = sort { - ($a->desc() cmp $b->desc()) } @seqArr;
} else {
@seqArr = sort { $a->desc() cmp $b->desc() } @seqArr;
}

my $seqOut = Bio::SeqIO->new(-fs => \*STDOUT, -format => $format);
foreach my $s (@seqArr) {
# prints "id desc", and desc was modified, returning it to original
my $thisDesc = $s->desc;
$thisDesc =~ s/^\S+ //; # remove the first word.
$s->desc($thisDesc);

if(defined($opt_g)) {
my $tmp = $s->seq();
$tmp =~ s/-//g;
$s->seq($tmp);
}
$seqOut->write_seq($s);
}

exit;


Uses


Suppose you save the PERL script as Sort_fasta.pl then you can use it as follow


Sort_fasta.pl -h to know more about script
Sort_fasta.pl -r My_input to sort fasta file in descending order
Sort_fasta.pl -g My_input remove gaps '-' from the sequence
Sort_fasta.pl [options] My_input

Note that sorting in ascending order is default.



Extract Part of a FASTA Sequences with Position by python script HERE

Comments