Removing duplicated sequences from FASTA files

Many FASTA reference files (e.g. those downloaded from the UCSC and NCBI FTP servers) contain duplicated sequences. These duplicates not only decrease the number of uniquely mapped reads but may also interfere with downstream processing by other software packages (e.g. read quantification with eXpress).

The faFilter utility, distributed with the UCSC command-line tools, offers a reliable way to remove duplicated reference sequences from a FASTA file.

# Download the faFilter binary and make it executable:

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFilter
chmod +x faFilter

# Create a link in a directory on your $PATH (e.g. /usr/local/bin):

sudo ln -s /path/to/faFilter/faFilter /usr/local/bin/faFilter

# Apply to a FASTA reference file (-uniq keeps the first record for each duplicated sequence ID):

faFilter -uniq reference.fa reference_no_duplicates.fa
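
To check whether a reference contains duplicates at all, and to confirm that the filtered file is clean, the sequence IDs can be counted directly. The following is a minimal sketch, assuming duplicates are identified by sequence ID (the first word after ">" on a header line). The function name list_duplicate_ids is hypothetical and not part of faFilter.

```shell
# Hypothetical helper (not part of faFilter): print every sequence ID
# that occurs more than once in a FASTA file, one ID per line.
# Assumption: the ID is the first whitespace-delimited word after '>'.
list_duplicate_ids() {
    # Each ID is printed once, at the moment its second occurrence is seen.
    awk '/^>/ { id = substr($1, 2); if (++seen[id] == 2) print id }' "$1"
}

# Example usage (file names follow the commands above):
#   list_duplicate_ids reference.fa                 # lists duplicated IDs, if any
#   list_duplicate_ids reference_no_duplicates.fa   # expected to print nothing
```

Comparing IDs only does not detect records that carry identical sequences under different names; those would need a separate, content-based check.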
