Update: 5.29.2018
Bedtools is another good tool for extracting the sequence of defined regions from a FASTA file. Install Bedtools on your Ubuntu machine with these commands:
sudo apt-get update
sudo apt-get install bedtools
Then extract the sequences with this command:
bedtools getfasta -fi input_fasta -bed id_file
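The formats of the FASTA file and the ID file are the same as described below. Note that bedtools reads the ID file as BED, so it must be tab-delimited, with a 0-based start and an end that is not included in the extracted region.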
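Because bedtools reads the ID file as tab-delimited BED, an ID file typed with spaces may need converting first. Here is a minimal sketch of that conversion in Python. The function name `ids_to_bed` is hypothetical, and it assumes each line holds an ID, a start, and an end separated by any whitespace.

```python
def ids_to_bed(lines):
    """Convert whitespace-separated 'id start end' lines to tab-delimited BED lines."""
    bed_lines = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        # int() makes sure the coordinates really are whole numbers
        seq_id, start, end = fields[0], int(fields[1]), int(fields[2])
        bed_lines.append(f"{seq_id}\t{start}\t{end}")
    return bed_lines
```

You could pass it the lines of your ID file and write the returned lines to a new file, then give that file to the -bed option.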
=======================================================================
I have hundreds of protein sequences and have identified the conserved domain in each of them. I now know the location of every domain and want to extract the exact sequence at those locations. This is easy for a single protein with one or more domains, but doing it by hand for many proteins from their domain coordinates is very tedious. I found a simple Python script for extracting FASTA sequences based on position. I have also shared an online program originally written by Dr Pierre Lindenbaum HERE.
Example FASTA file with protein sequences
>AT1G01250
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL
>AT1G03800
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRPWGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSATPVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAVAGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLDLNASP
Example ID file with domain locations
AT1G01250 45 102
AT1G03800 65 109
Script name | Download |
---|---|
domainseq.py |
Usage
python domainseq.py input.fasta ids.txt > result.fasta
Results
>AT1G01250:45-102
IREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPR
>AT1G03800:65-109
AEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFP
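For readers who want to see how this kind of extraction can work, here is a minimal sketch in Python. It is not the domainseq.py script itself. The function names `read_fasta` and `extract_regions` are hypothetical, and the sketch assumes the start is 0-based and the end is not included, the same convention bedtools uses. If your coordinates are 1-based, subtract one from the start.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping sequence ID to sequence."""
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the ID
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequences may be wrapped over several lines
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def extract_regions(records, id_lines):
    """Yield (header, subsequence) for each 'id start end' line."""
    for line in id_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        seq_id, start, end = fields[0], int(fields[1]), int(fields[2])
        if seq_id not in records:
            continue  # ID not present in the FASTA file
        # Assumption: 0-based start, end excluded (BED-style)
        yield f"{seq_id}:{start}-{end}", records[seq_id][start:end]
```

To use it, read both files into lists of lines, call `read_fasta` on the FASTA lines, and print a ">" plus the header and then the sequence for each pair that `extract_regions` yields.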