Ronald Robertson

Sanjay Singh

Scientist/Writer
  • Emailsanjaysingh765@gmail.com
  • Socail@lampatlex
  • VisitorSince 1982
  • LocationKentucky, USA



Extracting FASTA Sequences Based on Position WITH PERL Script








Previously, I have shared two different methods to extract fasta sequences based on position. For example, if you have coordinates for the nucleotide or amino acid positions and you want obtained the nucleotide or amino acid sequence from that position then you can easily use these methods. I generally predict the domain boundry from several protein sequences from NCBI's CDD database and then use these methods to isolate the amino acid sequence from those domain from several sequences.


Extract Part of a FASTA Sequences with Position by python script HERE

 Extract Part of a FASTA Sequences with Position online HERE

Here I am going to share some PERL scrip that can extract the nucleotide or amino acid sequences for the user defined positions.



PERL Script : sub-seq.pl


#!/usr/bin/env perl

#Uses: perl sub-seq.pl input.txt range

use strict;
use warnings;

my $end = pop;
my $start = pop;
local $/ = '>';

while (<>) {
chomp;
next unless /(.+)/;
my ($header) = "$/$1_$start-$end\n";
my $seq = ${^POSTMATCH};
$seq =~ s/\s//g;
print $header;
print +( substr $seq, $start - 1, $end ) . "\n";
}




HOW TO SAVE THE PERL SCRIPT HERE


Uses


perl sub-seq.pl input.txt 10 15

Here

input.txt

is the file which contains your nucleotide or amino acid sequences while


10 15

is the range of which you want to extract the sequence. So this script will count the 15 bases/amino acid from position 10 and print them in your result file.


input file


>Seq1
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC
>Seq2
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT




Result


>seq1_10-15
AGCTGAAAATATGTA
>seq2_10-15
AGCTGAAAATATGTA




Comments