Ronald Robertson

Sanjay Singh

Scientist/Writer
  • Email: sanjaysingh765@gmail.com
  • Social: @lampatlex
  • Visitor: Since 1982
  • Location: Kentucky, USA



How to Remove Duplicate Sequences from a Multi-FASTA File



One of the most common problems in sequence analysis is the presence of duplicate sequences in a data set. Even protein or nucleotide sequences downloaded from NCBI or similar databases may contain duplicate entries. Some FASTA files contain records with different IDs but identical sequences, and such duplicates may lead to incorrect results. So let's look at some utilities for removing duplicate sequences from a multi-FASTA file.
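To make the idea concrete, here is a minimal Python sketch of sequence-based de-duplication. It is not the code of the scripts described in this post; the function names (`read_fasta`, `remove_duplicates`) and the example records are made up for illustration. It assumes that two records are duplicates when their sequences are identical after joining wrapped lines and ignoring case, and that the first record seen is the one to keep.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicates(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Hypothetical example: seq1 and seq2 have different IDs but the same sequence.
example = """>seq1
ATGCATGC
>seq2
ATGC
ATGC
>seq3
GGGTTT
""".splitlines()

unique = remove_duplicates(read_fasta(example))
for header, seq in unique:
    print(f">{header}\n{seq}")
```

Here seq2 is dropped because its sequence matches seq1 once the wrapped lines are joined. Holding every sequence in a set is fine for small files; for very large data sets, storing a hash of each sequence instead would use less memory.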










1. RemoveRep.pl








Dependencies
Bio::Perl
Bio::SeqIO

Usage


removerep.pl input.txt output.txt










2. RemoveRep2.pl










Usage


Your input sequences should be in input.txt.


3. Other options


Here are some other nice free bioinformatics programs you may want to try for removing duplicate or redundant sequences from your data sets.



 I. CD-HIT



II. DNA Baser



III. Duplicates Finder
