MACSE

This is the documentation for pipelines based on MACSE v2.

1. Overview

MACSE (Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons) provides a complete toolkit dedicated to the multiple alignment of coding sequences that can be leveraged via both the command line and a Graphical User Interface (GUI).

The MACSE toolkit can be combined into different strategies to handle datasets of different sizes and containing different types of sequences (contigs, pseudogenes, barcoding sequences).

We share here some of the pipelines that we have built so far using MACSE. These bash pipelines are encapsulated into Singularity containers so that you don’t need to deal with dependencies or configuration issues.

2. Getting started

If you are new to Singularity, you should probably start here:

3. The alignment pipelines

  • AlFiX: this pipeline uses MACSE and HmmCleaner to produce a high-quality alignment of nucleotide (NT) coding sequences using their amino acid (AA) translations. It is well suited for datasets containing a few dozen sequences of a few kb each.
  • OMM_MACSE: this pipeline also produces a codon-aware alignment thanks to MACSE, which can optionally be filtered by HmmCleaner, but it can handle larger datasets by relying on MAFFT, MUSCLE or PRANK to scale up.

4. The barcoding related pipelines

For barcoding, if you have tens of thousands of sequences to align (e.g. COI-5P or matK), we suggest the following steps: (i) using a reference sequence that you have checked by eye, identify similar sequences in your dataset and reverse complement them when needed, then select a subset of about 100 sequences representative of this dataset; (ii) align these representative sequences; (iii) align (in parallel) each sequence against this representative subset. This can be done in three command lines using our dedicated pipelines:

  • getRepresentativeSequences: this pipeline uses MACSE, MMseqs2 and seqtk to identify a small subset of representative sequences, filter out non-homologous sequences and reverse complement homologous ones when needed.
  • OMM_MACSE (see above): this pipeline aligns the representative sequences found in the previous step.
  • MACSE_barcode_align: this pipeline parallelizes the alignment of your barcoding sequences using Nextflow.

We have successfully run this barcoding pipeline on several taxonomic groups. The corresponding data are available on this page.

5. OMM_MACSE and AlFiX pipelines share several common steps and options

Mandatory input and output file options

Both pipelines produce several output files: two alignment files (at the NT and AA levels), a CSV file with filtering statistics per sequence, and a FASTA file with filtering details (where nucleotides of the input sequences are in upper case if present in the final alignment and in lower case otherwise). All output files are stored in a new folder and named with a common prefix so that they do not mix with your own files. Both pipelines therefore have three mandatory options:

option_name | used for | example
--in_seq_file | specifying the INput SEQuence FILE containing the coding nucleotide sequences to be aligned, in fasta format | --in_seq_file LOC_48720_NT.fasta
--out_dir | specifying the OUTput DIRectory in which result files will be stored | --out_dir RES_LOC_48720
--out_file_prefix | specifying the common part of output file names | --out_file_prefix LOC_48720

Basic usage examples

As there are only three mandatory options, you can launch these pipelines with default options using the following commands (see Singularity quick start if needed):

  • ./OMM_MACSE_v10.01.sif --in_seq_file LOC_48720.fasta --out_dir RES_LOC_48720 --out_file_prefix LOC_48720
  • ./MACSE_ALFIX_v01.sif --in_seq_file LOC_48720.fasta --out_dir RES_LOC_48720 --out_file_prefix LOC_48720
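
When several loci have to be aligned, the same command can be generated for each input file. The following is a minimal sketch, assuming a hypothetical folder named loci that contains one *.fasta file per locus; the folder name and the build_cmd helper are illustrative and not part of the pipelines. Options are written with two plain hyphens. The commands are only printed, so the loop can be checked before anything is launched.

```shell
#!/usr/bin/env bash
# Sketch: prepare one OMM_MACSE command per FASTA file of an input folder.
set -euo pipefail

PIPELINE="${PIPELINE:-./OMM_MACSE_v10.01.sif}"  # container used in the examples above
IN_DIR="${IN_DIR:-loci}"                        # hypothetical folder of *.fasta files

build_cmd() {
    # Print the command line for one locus.
    # The output prefix is the file name without its .fasta extension.
    local fasta="$1"
    local prefix
    prefix="$(basename "$fasta" .fasta)"
    printf '%s --in_seq_file %s --out_dir RES_%s --out_file_prefix %s\n' \
        "$PIPELINE" "$fasta" "$prefix" "$prefix"
}

for fasta in "$IN_DIR"/*.fasta; do
    [ -e "$fasta" ] || continue   # nothing matched: the pattern stays literal, skip it
    cmd="$(build_cmd "$fasta")"
    echo "$cmd"
    # Uncomment the next line to actually launch the pipeline
    # (assumes file names without spaces):
    # $cmd
done
```

Each locus gets its own output directory (RES_ followed by the locus name), which keeps the results of different runs separate.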

Optional options for saving intermediary files (mostly for debugging purposes)

option_name | used for | example
--out_detail_dir | specifying the output directory that will contain all intermediary files | --out_detail_dir DETAILS_LOC_48720
--save_details | turning ON when intermediary files need to be saved | --save_details
--debug | turning ON to keep the temporary folder created in /tmp/ | --debug

Optional options related to genetic codes and less reliable sequences set

Both pipelines allow you to specify the genetic code that should be used to translate your sequences, and to provide a second input file that contains less reliable sequences (e.g. newly assembled contigs or pseudogenes) in which frameshifts and stop codons are expected to be more frequent:

option_name | used for | usage example
--genetic_code_number code_number | selecting the relevant genetic code | --genetic_code_number 5
--in_seq_lr_file | specifying the INput SEQuence FILE containing the Less Reliable sequences | --in_seq_lr_file less_reliable_seq_file.fasta
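
These two options can be combined with the mandatory ones. The sketch below builds the argument list for AlFiX in a bash array and adds the less reliable sequences only when the corresponding file exists and is not empty. File names are the ones used in the examples of this page; the conditional logic is an illustration, not something the pipeline requires. The command is printed, and launched only if the container file is present and executable.

```shell
#!/usr/bin/env bash
set -euo pipefail

PIPELINE="./MACSE_ALFIX_v01.sif"
LR_FILE="less_reliable_seq_file.fasta"   # example name taken from the table above

# Mandatory options plus the genetic code.
args=(--in_seq_file LOC_48720_NT.fasta
      --out_dir RES_LOC_48720
      --out_file_prefix LOC_48720
      --genetic_code_number 5)

# Add the less reliable sequences only when the file exists and is not empty.
if [ -s "$LR_FILE" ]; then
    args+=(--in_seq_lr_file "$LR_FILE")
fi

# Show the command, then launch it only if the container is available.
echo "$PIPELINE ${args[*]}"
if [ -x "$PIPELINE" ]; then
    "$PIPELINE" "${args[@]}"
fi
```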

Optional options related to filtering steps

Both pipelines include four optional filtering steps:

  • a prefiltering step, performed before sequence alignment, that masks non-homologous sequence fragments using trimNonHomologousFragments.
  • a filtering of the amino acid alignment to mask residues that seem to be misaligned. This is done using HmmCleaner at the amino acid level and reported at the codon level using reportMaskAA2NT.
  • a post-processing filtering that further masks isolated codons and patchy sequences (if 80% of a sequence has been masked, it is probably better to remove it completely). This step is performed by setting the options of reportMaskAA2NT accordingly.
  • finally, the extremities of the alignment are trimmed until a site with a minimal proportion of nucleotides is reached (using trimAlignment).

All these filtering steps are active by default but can be individually turned OFF and the minimal percentage of nucleotides used for the final trimming step can be adjusted:

option_name | used for | default value
--no_prefiltering | turning OFF the pre-filtering step | ON
--no_FS_detection | turning OFF the detection of frameshifts (only relevant for the OMM_MACSE pipeline) | ON
--no_filtering | turning OFF the HmmCleaner alignment filtering | ON
--no_postfiltering | turning OFF the post-filtering of the alignment that masks isolated AA | ON
--min_percent_NT_at_ends | setting the minimal proportion of nucleotides that should be present at the first and last site of the final alignment | 0.7
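
To make the effect of the final trimming threshold concrete, here is a simplified illustration of the rule, written with awk. It is not the code of trimAlignment: it assumes a toy input with one aligned sequence per line (no FASTA headers, all lines of equal length), counts every character other than "-" as a nucleotide, and ignores codon boundaries. The trim_ends function name is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Simplified illustration of end trimming (NOT the trimAlignment implementation).
# $1: minimal proportion of nucleotides required at the first and last site
# stdin: aligned sequences of equal length, one per line, "-" for gaps
trim_ends() {
    awk -v min="$1" '
        { seq[NR] = $0; if (length($0) > len) len = length($0) }
        END {
            # Mark the sites where enough sequences carry a nucleotide.
            for (c = 1; c <= len; c++) {
                n = 0
                for (r = 1; r <= NR; r++)
                    if (substr(seq[r], c, 1) != "-") n++
                ok[c] = (n / NR >= min)
            }
            # Move inwards from both ends until a well-covered site is found.
            first = 1;  while (first <= len && !ok[first]) first++
            last = len; while (last >= first && !ok[last]) last--
            for (r = 1; r <= NR; r++)
                print substr(seq[r], first, last - first + 1)
        }'
}

# Toy alignment of three sequences and seven sites.
printf '%s\n' '--ATGC-' '-AATGCC' 'TAATG--' | trim_ends 0.7
```

In this toy alignment, the two outermost sites on each side are covered by at most two sequences out of three (about 0.67), which is below 0.7, so each sequence is reduced to its three central sites (ATG). Lowering the threshold keeps more of the alignment ends.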

Optional option to allocate more memory (for large datasets)

For datasets containing numerous long sequences, MACSE may need more memory than the default amount allocated to the Java virtual machine (JVM). This can be set using the following option:

option_name | used for | usage example
--java_mem | setting the maximum memory of the JVM (the value is passed to its -Xmx option) | --java_mem 600m

Optional options to specify how frameshifting codons should be exported

The output directory contains several files (see details in the readme_output.txt file in the output directory). The final alignment files (NT and AA) are obtained after replacing stop codons with "NNN", and frameshifting codons with either "NNN" (default) or "---", using exportAlignment:

option_name | used for | default
--replace_FS_by_gaps | replacing frameshifting codons by "---" instead of "NNN" | OFF ("NNN")

Three options specific to the OMM_MACSE pipeline

Because OMM_MACSE relies on an external tool to rapidly align the amino acid sequences after having detected frameshifts thanks to a draft alignment performed by MACSE, three additional options are available. The first selects the amino acid alignment tool (MAFFT, MUSCLE or PRANK); the second passes extra parameters to the alignment tool; the third turns OFF the detection of frameshifts by MACSE. Note that if frameshift detection is turned OFF while some sequences actually do contain frameshifts, the resulting alignment will be meaningless, since it will be based on an erroneous translation of the nucleotide sequences.

option_name | used for | usage example
--alignAA_soft | specifying the software to use, MAFFT (default), MUSCLE or PRANK, for aligning the frameshift-corrected amino acid sequences | --alignAA_soft MUSCLE
--aligner_extra_option | specifying the options to provide to the alignment software | --aligner_extra_option "--localpair --maxiterate 1000"
--no_FS_detection | turning OFF the detection of frameshifts by MACSE | --no_FS_detection

MUSCLE and PRANK are used with default parameters. MAFFT (the default aligner) is launched with "--localpair --maxiterate 2000", which corresponds to the L-INS-i algorithm and is usually better adapted to coding sequences.
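
Because the extra aligner options contain spaces, they must reach the pipeline as a single quoted argument. The sketch below, which reuses the file names and MAFFT parameters of the examples above, stores the arguments in a bash array so that the quoting is preserved. The MAFFT_OPTS variable is only an illustrative name. The command is printed in a shell-escaped form, and launched only if the container file is present and executable.

```shell
#!/usr/bin/env bash
set -euo pipefail

PIPELINE="./OMM_MACSE_v10.01.sif"
MAFFT_OPTS="--localpair --maxiterate 1000"   # passed as ONE argument

args=(--in_seq_file LOC_48720.fasta
      --out_dir RES_LOC_48720
      --out_file_prefix LOC_48720
      --alignAA_soft MAFFT
      --aligner_extra_option "$MAFFT_OPTS")

# Print the command with shell escaping, so the single-argument quoting is visible.
printf '%q ' "$PIPELINE" "${args[@]}"
echo

# Launch only if the container is available.
if [ -x "$PIPELINE" ]; then
    "$PIPELINE" "${args[@]}"
fi
```

Writing the options without quotes would split them into separate words, and the pipeline would then receive --localpair and --maxiterate as if they were its own options.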