Previously, I shared two different methods to extract FASTA sequences based on position. For example, if you have coordinates for nucleotide or amino acid positions and want to obtain the sequence at those positions, you can easily use these methods. I generally predict the domain boundaries for several protein sequences using NCBI's CDD database and then use these methods to isolate the amino acid sequences of those domains.
Extract Part of a FASTA Sequences with Position by python script HERE
Extract Part of a FASTA Sequences with Position online HERE
Here I am going to share a Perl script that can extract the nucleotide or amino acid sequences at user-defined positions.
Perl script: sub-seq.pl
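#!/usr/bin/env perl
# Usage: perl sub-seq.pl input.txt START LENGTH
use strict;
use warnings;

# The last two command-line arguments are the 1-based start position and the
# number of residues to extract; what remains in @ARGV is the input file.
my $end   = pop;
my $start = pop;

local $/ = '>';    # read one FASTA record (header plus sequence) at a time

while (<>) {
    chomp;                   # remove the trailing '>'
    next unless /(.+)/p;     # first line is the header; /p enables ${^POSTMATCH}
    my $header = "$/$1_$start-$end\n";
    my $seq    = ${^POSTMATCH};    # everything after the header line
    $seq =~ s/\s//g;               # join the sequence lines into one string
    print $header;
    # substr takes an offset and a length, so $end acts as a length here
    print substr( $seq, $start - 1, $end ), "\n";
}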
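The script depends on setting the input record separator $/ to '>' so that each read returns a whole FASTA record rather than a single line. As a rough illustration of that idea, here is a standalone sketch (not the script itself); the parse_fasta name and the small in-memory sample are assumptions made only for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into [header, sequence] pairs
# by reading '>'-delimited records, in the same spirit as sub-seq.pl.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '>';                      # one record per read, not one line
    while ( my $record = <$fh> ) {
        chomp $record;                   # removes the trailing '>'
        next unless $record =~ /\S/;     # skip the empty chunk before the first '>'
        my ( $header, @lines ) = split /\n/, $record;
        my $seq = join '', @lines;
        $seq =~ s/\s//g;                 # drop any remaining whitespace
        push @records, [ $header, $seq ];
    }
    close $fh;
    return @records;
}

# Made-up sample data, shortened for illustration.
my $fasta = ">Seq1\nTTTCAACATT\nATGAAGCCCT\n>Seq2\nTTATATATTT\n";
for my $rec ( parse_fasta($fasta) ) {
    my ( $header, $seq ) = @$rec;
    print "$header has ", length($seq), " residues\n";
}
```

Reading record by record means the sequence lines of each entry can be joined before any position is counted, so line breaks in the file do not shift the coordinates.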
HOW TO SAVE THE PERL SCRIPT HERE
Usage
perl sub-seq.pl input.txt 10 15
Here input.txt is the file that contains your nucleotide or amino acid sequences, while 10 15 gives the start position and the number of residues to extract. So this script will take 15 bases/amino acids starting from position 10 and print them to standard output, which you can redirect to a result file.
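Because the second number acts as a count rather than an end position, it may help to see both interpretations side by side. The following is a minimal sketch and not part of sub-seq.pl; the helper names by_length and by_range are hypothetical, introduced here only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: $length residues starting at 1-based position $start.
# This mirrors how sub-seq.pl treats its two numbers.
sub by_length {
    my ( $seq, $start, $length ) = @_;
    return substr $seq, $start - 1, $length;
}

# Hypothetical helper: residues from 1-based $start through $end, inclusive.
sub by_range {
    my ( $seq, $start, $end ) = @_;
    return substr $seq, $start - 1, $end - $start + 1;
}

my $seq = 'TTTCAACATTATGAAGCCCTTTTTATATAT';    # first 30 bases of Seq1
print by_length( $seq, 10, 15 ), "\n";    # 15 bases starting at position 10
print by_range( $seq, 10, 15 ), "\n";     # the 6 bases at positions 10 to 15
```

If your coordinates come as start and end positions (for example, domain boundaries), the range form is probably what you want; with the script as written you would pass the length, end - start + 1, as the second number.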
Input file
>Seq1
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC
>Seq2
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT
Result
>Seq1_10-15
TATGAAGCCCTTTTT
>Seq2_10-15
TTGATTCTGCATCAA