
                                   cons 



Function

   Creates a consensus from multiple alignments

Description

   cons calculates a consensus sequence from a multiple sequence
   alignment. To obtain the consensus, the sequence weights and a scoring
   matrix are used to calculate a score at each position in the
   alignment.

   Each residue (or nucleotide) i in an alignment column is compared to
   all other residues j in the column. The score for i is the sum, over
   all residues j other than i itself, of score(ij)*weight(j), where
   score(ij) is taken from a nucleotide or protein scoring matrix (see
   the -datafile qualifier) and weight(j) is the weighting given to
   sequence j in the alignment file.
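
   The scoring rule described above can be sketched in Python as
   follows. This is a minimal illustration, not the cons source code:
   the function names are hypothetical, the match/mismatch values are a
   toy stand-in for a real matrix file, and scoring gap characters as 0
   is an assumption.

```python
def toy_score(a, b):
    """Toy nucleotide score (assumed values, not read from a matrix file)."""
    if a in ".-" or b in ".-":
        return 0.0  # assumption: gap characters contribute nothing
    return 5.0 if a.upper() == b.upper() else -4.0


def column_scores(column, weights, score=toy_score):
    """Score each residue i as the sum over j != i of score(i, j) * weight(j)."""
    scores = []
    for i, res_i in enumerate(column):
        total = 0.0
        for j, res_j in enumerate(column):
            if i == j:
                continue  # a residue is never compared with itself
            total += score(res_i, res_j) * weights[j]
        scores.append(total)
    return scores


# One column from three sequences, each with weight 1.00.
print(column_scores(["A", "A", "C"], [1.0, 1.0, 1.0]))
# prints [1.0, 1.0, -8.0]
```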

   The highest scoring type of residue is then found in the column. If
   the number of positive matches for this residue is greater than the
   "plurality value" then this residue is the consensus. The positive
   matches for a residue i are calculated as the sum of the weights of
   all the residues j whose score against i is positive, that is, all
   those that increase the score of residue i.

   Where no consensus is found at a position, an 'n' (for a nucleotide
   sequence) or an 'x' (for a protein sequence) is output.
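
   The decision for a single column might look like the sketch below.
   It is one reading of the description above, not the cons
   implementation: the function names and the toy scores are
   hypothetical, gaps are assumed to score 0, and ties between equally
   scoring residues are resolved in favour of the first sequence.

```python
def toy_score(a, b):
    """Toy nucleotide score (assumed values, not read from a matrix file)."""
    if a in ".-" or b in ".-":
        return 0.0  # assumption: gap characters contribute nothing
    return 5.0 if a.upper() == b.upper() else -4.0


def call_column(column, weights, plurality, is_protein=False, score=toy_score):
    """Return the consensus character for one column, or 'n'/'x' if none."""
    no_consensus = "x" if is_protein else "n"
    best_index, best_score = None, None
    for i, res_i in enumerate(column):
        # Score of residue i: sum over j != i of score(i, j) * weight(j).
        total = sum(score(res_i, res_j) * weights[j]
                    for j, res_j in enumerate(column) if j != i)
        if best_score is None or total > best_score:
            best_index, best_score = i, total
    best = column[best_index]
    # Positive matches: summed weights of the other residues that score
    # positively against the best residue.
    positive = sum(weights[j] for j, res_j in enumerate(column)
                   if j != best_index and score(best, res_j) > 0)
    return best if positive > plurality else no_consensus


weights = [1.0, 1.0, 1.0]
plurality = sum(weights) / 2  # half the total weight, i.e. 1.5
print(call_column(["A", "A", "A"], weights, plurality))  # prints A
print(call_column(["A", ".", "A"], weights, plurality))  # prints n
```

   With three sequences of weight 1.00 and a plurality of 1.5, a column
   in which only two sequences agree yields one positive match for the
   best residue and therefore no consensus.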

   The "plurality" qualifier allows the user to set a cut-off for the
   number of positive matches below which there is no consensus.

   The "identity" qualifier sets the number of identities required at a
   site for it to give a consensus at that position. If it is set to
   the number of sequences in the alignment, only columns of identities
   contribute to the consensus.
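
   As a rough sketch of the identity test, the function below counts
   how often the most common residue occurs in a column. The function
   name is hypothetical, and counting sequences (rather than summing
   their weights) and ignoring gap characters are assumptions.

```python
from collections import Counter


def passes_identity(column, identity):
    """True if the most common residue occurs at least `identity` times."""
    residues = [r.upper() for r in column if r not in ".-"]
    if not residues:
        return identity <= 0  # an all-gap column has no identities
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count >= identity


# With identity set to the number of sequences (3), only fully
# identical columns pass.
print(passes_identity(["A", "A", "A"], 3))  # prints True
print(passes_identity(["A", "A", "C"], 3))  # prints False
```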

   The "setcase" qualifier sets the threshold for the positive matches
   above which the consensus is in upper-case and below which the
   consensus is in lower-case.
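
   The case rule can be sketched as below. The function name is
   hypothetical, and treating a value exactly equal to the threshold as
   lower-case is an assumption, since the description only covers
   values above and below it.

```python
def apply_case(residue, positive_matches, setcase):
    """Upper-case above the setcase threshold, lower-case otherwise."""
    if positive_matches > setcase:
        return residue.upper()
    return residue.lower()


print(apply_case("a", 2.0, 1.5))  # prints A
print(apply_case("A", 1.0, 1.5))  # prints a
```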

Usage

   Here is a sample session with cons


% cons 
Creates a consensus from multiple alignments
Input sequence set: dna.msf
Output sequence [dna.fasta]: aligned.cons
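
   The same run can be scripted without prompts. The sketch below only
   builds the argument list from qualifiers documented on this page
   (-sequence, -outseq, -plurality, -identity, -setcase, -name, -auto);
   the helper name build_cons_command is hypothetical, and actually
   running the command assumes that cons is installed and on the PATH.

```python
import shlex


def build_cons_command(alignment, outseq, plurality=None, identity=None,
                       setcase=None, name=None):
    """Build a non-interactive cons command line as a list of arguments."""
    cmd = ["cons", "-sequence", alignment, "-outseq", outseq]
    if plurality is not None:
        cmd += ["-plurality", str(plurality)]
    if identity is not None:
        cmd += ["-identity", str(identity)]
    if setcase is not None:
        cmd += ["-setcase", str(setcase)]
    if name is not None:
        cmd += ["-name", name]
    cmd.append("-auto")  # turn off prompts
    return cmd


cmd = build_cons_command("dna.msf", "aligned.cons")
print(shlex.join(cmd))
# prints cons -sequence dna.msf -outseq aligned.cons -auto

# To execute it (requires an EMBOSS installation):
#   import subprocess
#   subprocess.run(cmd, check=True)
```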


Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqset     File containing a sequence alignment.
  [-outseq]            seqout     Output sequence USA

   Additional (Optional) qualifiers:
   -datafile           matrix     This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.
   -plurality          float      Set a cut-off for the number of positive
                                  matches below which there is no consensus.
                                  The default plurality is taken as half the
                                  total weight of all the sequences in the
                                  alignment.
   -identity           integer    Provides the facility of setting the
                                  required number of identities at a site for
                                  it to give a consensus at that position.
                                  Therefore, if this is set to the number of
                                  sequences in the alignment only columns of
                                  identities contribute to the consensus.
   -setcase            float      Sets the threshold for the positive matches
                                  above which the consensus is in upper-case
                                  and below which the consensus is in
                                  lower-case.
   -name               string     Name of the consensus sequence

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1             integer    Start of each sequence to be used
   -send1               integer    End of each sequence to be used
   -sreverse1           boolean    Reverse (if DNA)
   -sask1               boolean    Ask for begin/end/reverse
   -snucleotide1        boolean    Sequence is nucleotide
   -sprotein1           boolean    Sequence is protein
   -slower1             boolean    Make lower case
   -supper1             boolean    Make upper case
   -sformat1            string     Input sequence format
   -sdbname1            string     Database name
   -sid1                string     Entryname
   -ufo1                string     UFO features
   -fformat1            string     Features format
   -fopenfile1          string     Features file name

   "-outseq" associated qualifiers
   -osformat2           string     Output seq format
   -osextension2        string     File name extension
   -osname2             string     Base file name
   -osdirectory2        string     Output directory
   -osdbname2           string     Database name to add
   -ossingle2           boolean    Separate file for each entry
   -oufo2               string     UFO features
   -offormat2           string     Features format
   -ofname2             string     Features file name
   -ofdirectory2        string     Output directory

   General qualifiers:
   -auto                boolean    Turn off prompts
   -stdout              boolean    Write standard output
   -filter              boolean    Read standard input, write standard output
   -options             boolean    Prompt for standard and additional values
   -debug               boolean    Write debug output to program.dbg
   -verbose             boolean    Report some/full command line options
   -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning             boolean    Report warnings
   -error               boolean    Report errors
   -fatal               boolean    Report fatal errors
   -die                 boolean    Report deaths


   Standard (Mandatory) qualifiers:

   [-sequence] (Parameter 1)
      Description:     File containing a sequence alignment.
      Allowed values:  Readable set of sequences
      Default:         Required

   [-outseq] (Parameter 2)
      Description:     Output sequence USA
      Allowed values:  Writeable sequence
      Default:         <sequence>.format

   Additional (Optional) qualifiers:

   -datafile
      Description:     This is the scoring matrix file used when comparing
                       sequences. By default it is the file 'EBLOSUM62'
                       (for proteins) or the file 'EDNAFULL' (for nucleic
                       sequences). These files are found in the 'data'
                       directory of the EMBOSS installation.
      Allowed values:  Comparison matrix file in EMBOSS data path
      Default:         EBLOSUM62 for protein, EDNAFULL for DNA

   -plurality
      Description:     Set a cut-off for the number of positive matches
                       below which there is no consensus. The default
                       plurality is taken as half the total weight of all
                       the sequences in the alignment.
      Allowed values:  Any numeric value
      Default:         Half the total sequence weighting

   -identity
      Description:     Provides the facility of setting the required
                       number of identities at a site for it to give a
                       consensus at that position. Therefore, if this is
                       set to the number of sequences in the alignment
                       only columns of identities contribute to the
                       consensus.
      Allowed values:  Integer 0 or more
      Default:         0

   -setcase
      Description:     Sets the threshold for the positive matches above
                       which the consensus is in upper-case and below
                       which the consensus is in lower-case.
      Allowed values:  Any numeric value
      Default:         @( $(sequence.totweight) / 2)

   -name
      Description:     Name of the consensus sequence
      Allowed values:  Any string is accepted
      Default:         An empty string is accepted

   Advanced (Unprompted) qualifiers: (none)

Input file format

   The input is the USA (Uniform Sequence Address) of a set of aligned
   sequences.

  Input files for usage example

  File: dna.msf

!!NA_MULTIPLE_ALIGNMENT

 dna.msf  MSF: 120  Type: N  January 01, 1776  12:00  Check: 3196 ..

 Name: MSFM1          Len:   120  Check:  8587  Weight:  1.00
 Name: MSFM2          Len:   120  Check:  6178  Weight:  1.00
 Name: MSFM3          Len:   120  Check:  8431  Weight:  1.00

//

        MSFM1  ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC
        MSFM2  ACGTACGTAC GTACGTACGT ....ACGTAC GTACGTACGT ACGTACGTAC
        MSFM3  ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT CGTACGTACG

        MSFM1  GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
        MSFM2  GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
        MSFM3  TACGTACGTA CGTACGTACG TACGTACGTA ACGTACGTAC GTACGTACGT

        MSFM1  ACGTACGTAC GTACGTACGT
        MSFM2  ACGTACGTTG CAACGTACGT
        MSFM3  ACGTACGTAC GTACGTACGT
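
   The sequence weights used in the scoring come from the "Name:" lines
   of the MSF header shown above. The sketch below pulls them out with
   a regular expression; the function name and the pattern are
   illustrative assumptions based on this one example file, not a full
   MSF parser.

```python
import re

# Matches header lines such as:
#  Name: MSFM1          Len:   120  Check:  8587  Weight:  1.00
NAME_LINE = re.compile(
    r"^\s*Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def read_msf_weights(text):
    """Return {sequence name: weight} from the header of an MSF alignment."""
    weights = {}
    for line in text.splitlines():
        if line.strip() == "//":
            break  # the header ends here; sequence blocks follow
        match = NAME_LINE.match(line)
        if match:
            weights[match.group(1)] = float(match.group(4))
    return weights


header = """\
 Name: MSFM1          Len:   120  Check:  8587  Weight:  1.00
 Name: MSFM2          Len:   120  Check:  6178  Weight:  1.00
 Name: MSFM3          Len:   120  Check:  8431  Weight:  1.00

//
"""
weights = read_msf_weights(header)
print(weights)
# Half the total weight, the documented default for -plurality.
print(sum(weights.values()) / 2)
```

   For this file the three weights are all 1.00, so half the total
   weight is 1.5.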

Output file format

   The output consists of a sequence file holding the consensus sequence.

  Output files for usage example

  File: aligned.cons

>EMBOSS_001
ACGTACGTACGTACGTACGTnnnnACGTACGTACGTACGTnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnACGTACGTACGTACGTACGTACGTACGTnnnnACGTACGT

Data files

   It uses the standard set of scoring matrix data files.

   EMBOSS data files are distributed with the application and stored in
   the standard EMBOSS data directory, which is defined by the EMBOSS
   environment variable EMBOSS_DATA.

   To see the available EMBOSS data files, run:

% embossdata -showall

   To fetch one of the data files (for example 'Exxx.dat') into your
   current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

   Users can provide their own data files in their own directories.
   Project specific files can be put in the current directory, or for
   tidier directory listings in a subdirectory called ".embossdata".
   Files for all EMBOSS runs can be put in the user's home directory, or
   again in a subdirectory called ".embossdata".

   The directories are searched in the following order:
     * . (your current directory)
     * .embossdata (under your current directory)
     * ~/ (your home directory)
     * ~/.embossdata
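
   The search order listed above can be sketched as follows. The
   function name is hypothetical, and the sketch covers only the four
   user directories; it does not look in the EMBOSS data directory
   itself.

```python
from pathlib import Path


def find_data_file(name, cwd=None, home=None):
    """Return the first match for `name` in the user search order, or None."""
    cwd = Path(cwd) if cwd else Path.cwd()
    home = Path(home) if home else Path.home()
    search_order = [
        cwd,                   # . (current directory)
        cwd / ".embossdata",   # .embossdata under the current directory
        home,                  # ~/ (home directory)
        home / ".embossdata",  # ~/.embossdata
    ]
    for directory in search_order:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


# Example: report which user copy of a data file, if any, would be found.
print(find_data_file("Exxx.dat"))
```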

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

   Program name                    Description
   megamerger   Merge two large overlapping nucleic acid sequences
   merger       Merge two overlapping nucleic acid sequences

Author(s)

   Tim Carver (tcarver  rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None.
