Extracting specific sequences from FASTA files

The faFilter utility also offers a reliable way to extract specific sequences from a FASTA reference file based on the information in the header (sequence ID). For instance, faFilter can generate separate FASTA reference files for particular RNA types (rRNA, tRNA, snRNA, snoRNA, miRNA, etc.) from a FASTA file containing the total reference transcriptome. Finally, faFilter can be used to remove sequences with unwanted IDs from any FASTA file.

# Download the faFilter binary and make it executable:

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFilter
chmod +x faFilter

# Create a symbolic link in a directory on your $PATH (e.g. /usr/local/bin):

sudo ln -s /path/to/faFilter/faFilter /usr/local/bin/faFilter

# The following command extracts only the sequences whose IDs start with "hg38" (the wildcard is quoted so that the shell does not expand it):

faFilter -name='hg38*' original_fasta.fa fasta_with_hg38_IDs_only.fa
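
The same wildcard can also drive the removal case mentioned in the introduction. The following is a minimal sketch, not a definitive recipe. It assumes that your faFilter build supports the -v option (invert match), which is worth confirming by running faFilter without arguments to print its usage message. The file names demo_genomes.fa and demo_without_hg38.fa are hypothetical and serve only as an illustration.

```shell
# Hypothetical demo input: three records, two of them with IDs starting with "hg38".
cat > demo_genomes.fa <<'EOF'
>hg38_chr1
ACGT
>mm10_chr1
GGCC
>hg38_chr2
TTAA
EOF

# Preview how many headers the pattern would hit before filtering anything.
grep -c '^>hg38' demo_genomes.fa > hg38_header_count.txt
echo "headers starting with hg38: $(cat hg38_header_count.txt)"

# Remove the matching sequences instead of keeping them.
# Assumption: -v inverts the match; this step is skipped if faFilter is not installed.
if command -v faFilter >/dev/null 2>&1; then
    faFilter -v -name='hg38*' demo_genomes.fa demo_without_hg38.fa
    echo "records left after removal: $(grep -c '^>' demo_without_hg38.fa)"
fi
```

Counting the headers first is a cheap sanity check: if the count is zero or equals the total number of records, the pattern is probably not what you intended.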

# The following command extracts only the sequences whose IDs match an entry in a custom list (ID_list.txt, one ID or pattern per line):

faFilter -namePatList=ID_list.txt original_fasta.fa fasta_containing_only_sequences_with_IDs_from_ID_list.fa
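
The ID list itself can be built from the FASTA headers, which is one way to approach the per-RNA-type split described in the introduction. The following is a minimal sketch under stated assumptions: the file demo_transcriptome.fa and its header layout (ID first, then an annotation naming the RNA type) are hypothetical, and the ID is taken to be the first whitespace-delimited word of the header. Real transcriptome headers differ between sources, so the match expression will need adjusting.

```shell
# Hypothetical demo input: the RNA type appears in each header after the ID.
cat > demo_transcriptome.fa <<'EOF'
>tx1 gene_biotype:rRNA
ACGTACGT
>tx2 gene_biotype:tRNA
GGCCGGCC
>tx3 gene_biotype:rRNA
TTAACCGG
EOF

# For every header line that mentions rRNA, print the first word without the leading ">".
awk '/^>/ && /rRNA/ { print substr($1, 2) }' demo_transcriptome.fa > rRNA_ID_list.txt
cat rRNA_ID_list.txt

# Use the list exactly like ID_list.txt above; skipped if faFilter is not installed.
if command -v faFilter >/dev/null 2>&1; then
    faFilter -namePatList=rRNA_ID_list.txt demo_transcriptome.fa rRNA_only.fa
fi
```

Repeating the awk step with a different keyword (tRNA, snRNA and so on) gives one list, and therefore one output FASTA file, per RNA type.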