Ronald Robertson

Sanjay Singh

Scientist/Writer
  • Email: sanjaysingh765@gmail.com
  • Social: @lampatlex
  • Visitor: Since 1982
  • Location: Kentucky, USA



How to Extract Multiple Sequences from a Multi-FASTA File with Perl - II





Previously I shared a Perl script to extract multiple sequences from a multi-FASTA file. If you have accession numbers stored in one file
and sequences in another, you can fetch the sequences with that script. Here the situation is different. We have FASTA sequences in one file (sequence.txt)
and accession numbers/IDs in another file (ID.txt), but the IDs are grouped in rows. We want to extract the FASTA sequences for each row of IDs
and store them in separate files, one per row (out_1, out_2, out_3).
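The row-to-file mapping described above can be sketched in a few lines. This is only an illustration, assuming the three ID rows shown in the Input section; the `@rows` array and the `%file_for` hash are hypothetical names used here for clarity and do not appear in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory copy of ID.txt: one group of IDs per row.
my @rows = ( 'Seq1 Seq2 Seq3', 'Seq4 Seq5', 'Seq6' );

my %file_for;    # ID => name of the output file for its row
for my $i ( 0 .. $#rows ) {
    my $file = 'out_' . ( $i + 1 );    # rows are numbered from 1
    $file_for{$_} = $file for split ' ', $rows[$i];
}

# Show which output file each ID is assigned to.
print "$_ -> $file_for{$_}\n" for sort keys %file_for;
```

Storing the mapping in a hash means each FASTA header needs only one lookup to find its output file, however many rows the ID file has.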



SCRIPT 1: extract-seq.pl


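#!/usr/bin/perl
use strict;
use warnings;

my ( %list, %FHs, $id );

# First file (ID.txt): map every ID on row N to the output file "out_N".
while (<>) {
    $list{$_} = "out_$." for split;
    last if eof;    # stop at the end of the first file
}

# Second file (sequence.txt): read one FASTA record at a time.
local $/ = '>';
while (<>) {
    chomp;    # strip the trailing '>' record separator
    # The first line of the record is the header; it must match an ID exactly.
    if ( ($id) = /(.+)/ and exists $list{$id} ) {
        # Open each output file once, on first use.
        unless ( defined $FHs{ $list{$id} } ) {
            open $FHs{ $list{$id} }, '>', $list{$id}
                or die "Cannot open $list{$id}: $!";
        }
        print { $FHs{ $list{$id} } } ">$_";
    }
}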
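The second loop of the script relies on setting the input record separator `$/` to '>', so that each read returns one whole FASTA record instead of one line. A minimal sketch of that step in isolation, assuming a made-up two-record FASTA string read through an in-memory filehandle; the `$fasta` text and the `@records` array are hypothetical and exist only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-record FASTA text, read through an in-memory filehandle.
my $fasta = ">Seq1\nACGT\nTTGA\n>Seq2\nGGCC\n";
open my $in, '<', \$fasta or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '>';    # read up to the next '>' instead of the next newline
    while ( my $rec = <$in> ) {
        chomp $rec;                     # removes the trailing '>' separator
        next unless $rec =~ /(.+)/;     # skip the empty chunk before the first '>'
        push @records, [ $1, $rec ];    # [ header line, header plus sequence lines ]
    }
}
close $in;

# Print each record the way the script writes it: '>' followed by the record.
print ">$_->[1]" for @records;
```

Because `.` does not match a newline, the pattern `/(.+)/` captures only the first line of the record, which is the header. The same read works whether the sequence is on one line or wrapped over several.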


Usage


perl extract-seq.pl ID.txt sequence.txt


Input



Sequences


>Seq1
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGC
CAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAAC
ACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCC
AGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGC
ATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTG
AAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCA
AGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCT
TCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGG
GGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq2
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq3
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq4
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq5
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATAT
>Seq6
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT


IDs


Seq1 Seq2 Seq3
Seq4 Seq5
Seq6




Results



out_1


>Seq1
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGC
CAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAAC
ACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCC
AGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGC
ATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTG
AAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCA
AGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCT
TCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGG
GGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq2
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq3
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT



out_2


>Seq4
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq5
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATAT


out_3


>Seq6
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT




Advantages





  • An ID listed in more than one row of ID.txt is written to one file only (the last row that lists it wins)

  • IDs are matched case-sensitively against the FASTA headers

  • Works with both multi-line and single-line FASTA
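Because the script compares the whole header line with the ID and the comparison is case sensitive, an ID written as seq1, or a header such as ">Seq1 sample description", would not match. If that is too strict for your data, one possible approach is to normalize both sides before the lookup. The sketch below assumes the ID is the first whitespace-delimited word of the header; the `header_key` helper and the sample headers are hypothetical and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce a FASTA header line to a normalized lookup key.
sub header_key {
    my ($header) = @_;
    my ($id) = $header =~ /^>?\s*(\S+)/;    # first word, optional leading '>'
    return defined($id) ? lc($id) : undef;  # lower-case for case-insensitive lookup
}

# Store the wanted IDs lower-cased so both sides use the same form.
my %list;
$list{ lc $_ } = 'out_1' for qw(Seq1 Seq2);

for my $header ( '>Seq1 sample description', '>SEQ2', '>Seq9' ) {
    my $key  = header_key($header);
    my $file = ( defined($key) && exists $list{$key} ) ? $list{$key} : 'no match';
    print "$header => $file\n";
}
```

The trade-off is that IDs differing only in case can no longer be told apart, so this variant suits data sets where that cannot happen.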

