Previously I shared a Perl script to extract multiple sequences from a multi-FASTA file. With that script, if you have accession numbers stored in one file
and sequences in another, you can fetch the matching sequences. Here the situation is different. We have FASTA sequences in one file (sequence.txt)
and accession numbers/IDs in another file (id.txt), but the IDs are grouped by row. We want to extract the FASTA sequences for the IDs on each row
and store each group in its own file (out_1, out_2, out_3).
SCRIPT 1: extract-seq.pl
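#!/usr/bin/perl
use strict;
use warnings;

my ( %list, %FHs, $id );

# First file (IDs): every ID on row N is mapped to the output file "out_N".
while (<>) {
    $list{$_} = "out_$." for split;
    last if eof;    # end of the ID file; the next <> continues with the FASTA file
}

# Second file (FASTA): read one record at a time, using '>' as the record separator.
local $/ = '>';
while (<>) {
    chomp;    # remove the trailing '>'
    # The whole header line must match an ID from the list.
    if ( ($id) = /(.+)/ and exists $list{$id} ) {
        # Open each output file once, on first use.
        open $FHs{ $list{$id} }, '>', $list{$id} or die $!
            unless defined $FHs{ $list{$id} };
        print { $FHs{ $list{$id} } } ">$_";
    }
}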
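The second loop works because setting `$/` to `>` makes each read return one whole FASTA record instead of one line. The sketch below isolates that step so it can be tried on its own. The helper name `read_fasta_records` and the in-memory sample data are made up for illustration. Unlike the script above, it takes the ID to be the first word of the header line, which is an assumption that may suit headers carrying a description.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a file handle and
# return a list of [id, record] pairs.
sub read_fasta_records {
    my ($fh) = @_;
    my @records;
    local $/ = '>';    # each read now ends at the next '>'
    while ( my $rec = <$fh> ) {
        chomp $rec;                     # drop the trailing '>' separator
        next unless $rec =~ /\S/;       # skip the empty chunk before the first '>'
        my ($id) = $rec =~ /^(\S+)/;    # assumption: ID = first word of the header
        push @records, [ $id, ">$rec" ];
    }
    return @records;
}

# Demo on a small in-memory FASTA string (sample data, not from the post)
my $fasta = ">Seq1 sample\nACGT\nTTGA\n>Seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
for my $r ( read_fasta_records($fh) ) {
    my ( $id, $rec ) = @$r;
    print "ID: $id\n$rec";
}
close $fh;
```

Because `$/` is localized inside the subroutine, line-by-line reading elsewhere in a program is not affected.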
Usage
perl extract-seq.pl id.txt sequence.txt
Input
Sequences
>Seq1
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGC
CAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAAC
ACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCC
AGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGC
ATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTG
AAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCA
AGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCT
TCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGG
GGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq2
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq3
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq4
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq5
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATAT
>Seq6
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
IDs
Seq1 Seq2 Seq3
Seq4 Seq5
Seq6
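For the three ID rows above, the first loop of the script builds a lookup from each ID to an output file named after its row number. The following is a minimal sketch of that step alone. The helper name `id_to_outfile` is hypothetical, and the IDs are read from an in-memory string rather than from id.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map every ID on row N of the ID file to "out_N".
sub id_to_outfile {
    my ($fh) = @_;
    my %list;
    while ( my $line = <$fh> ) {
        # $. is the current line number of the handle being read
        $list{$_} = "out_$." for split ' ', $line;
    }
    return %list;
}

# Demo with the sample IDs, read from an in-memory file handle
my $ids = "Seq1 Seq2 Seq3\nSeq4 Seq5\nSeq6\n";
open my $fh, '<', \$ids or die "Cannot open in-memory file: $!";
my %list = id_to_outfile($fh);
close $fh;

print "$_ => $list{$_}\n" for sort keys %list;
```

With this sample, Seq1 to Seq3 map to out_1, Seq4 and Seq5 to out_2, and Seq6 to out_3.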
Results
out_1
Seq1 Seq2 Seq3
>Seq1
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGC
CAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAAC
ACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCC
AGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGC
ATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTG
AAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCA
AGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCT
TCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGG
GGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq2
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq3
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
out_2
Seq4 Seq5
>Seq4
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq5
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATAT
out_3
Seq6
>Seq6
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT