Advanced Bioinformatics Services

Removing duplicated sequences from FASTA files

Many FASTA reference files (e.g. those downloaded from the UCSC and NCBI FTP servers) contain duplicated sequences. Such duplicates not only decrease the number of uniquely mapped reads but may also interfere with downstream processing by other software packages (e.g. read quantification with eXpress).
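A quick way to check whether a reference is affected is to list the sequence IDs that occur more than once. The sketch below assumes the ID is the first whitespace-delimited word of each header line; the function name duplicated_ids is only illustrative.

```shell
# List sequence IDs that appear more than once in a FASTA file.
# Assumption: the ID is the header text between '>' and the first whitespace.
duplicated_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}
```

Running duplicated_ids reference.fa prints one duplicated ID per line; empty output means every ID is unique.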

Currently, the faFilter utility offers the most reliable way to remove duplicated reference sequences from a FASTA file.

# Download the faFilter binary and make it executable:

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFilter
chmod +x faFilter

# Create a link in your $PATH (e.g. /usr/local/bin):

sudo ln -s /path/to/faFilter/faFilter /usr/local/bin/faFilter

# Apply to a FASTA reference file (-uniq keeps the first record for each duplicated sequence ID):

faFilter -uniq reference.fa reference_no_duplicates.fa
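On systems where the precompiled faFilter binary cannot be run, the same idea (keep only the first record for each sequence ID) can be sketched with awk. This is an illustrative stand-in, not faFilter's implementation, and its output is not guaranteed to be byte-identical to faFilter's. The function name dedup_fasta_by_id is hypothetical, and the ID is assumed to be the first whitespace-delimited word of the header.

```shell
# Keep only the first record for each sequence ID and write the result to stdout.
# Hypothetical helper; not part of the faFilter package.
dedup_fasta_by_id() {
    # On a header line, keep the record only if its ID has not been seen before;
    # the sequence lines that follow inherit that decision.
    awk '/^>/ { keep = !seen[$1]++ } keep' "$1"
}
```

For example, dedup_fasta_by_id reference.fa > reference_no_duplicates.fa writes the filtered reference to a new file.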