
                            SIGGEN documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Generates a sparse protein signature from an alignment

2.0 INPUTS & OUTPUTS

   SIGGEN reads a directory of DAF files (domain alignment files) and,
   optionally, a directory of CON files (contact files) containing a CON
   file for each aligned domain. It generates a sparse protein signature
   of a specified sparsity for each alignment. The base name of a
   signature file is the unique identifier (an integer) of the family,
   superfamily, etc., if one is specified in the DAF file; otherwise the
   base name of the input DAF file is used. The paths of the input and
   output files are specified by the user, and the file extensions are
   specified in the ACD file.

3.0 INPUT FILE FORMAT

   The format of the domain alignment file is described in the
   DOMAINALIGN documentation.

4.0 OUTPUT FILE FORMAT

   The output file (Figure 1) uses the following records. It includes
   the domain classification records for the SCOP or CATH node from
   which the input alignment, and therefore the signature, was derived.
   In this example, the four records taken from the DAF (input) file are
   CL, FO, SF and FA.
     * TY - Signature type, either SCOP or CATH for domain signatures, or
       LIGAND for ligand signatures.
     * TS - Signature data type, either 1D or 3D, for sequence or
       structure-based signatures respectively.
     * CL - Domain class.
     * FO - Domain fold.
     * SF - Domain superfamily.
     * FA - Domain family.
     * SI - Unique identifier of the node in question, e.g. SCOP Sunid of
       a domain family.
     * NP - Number of signature positions.
     * NN - Signature position number. The number given in brackets
       indicates the start of the data for the relevant signature
       position.
     * IN - Informative line about signature position. The number of
       different observed amino acid residues is given after 'NRES', the
       number of different sizes of gap follows 'NGAP', and the window
       size after 'WSIZ'. When a signature is aligned to a protein
       sequence, the permissible gaps between two signature positions
       are determined by the empirical gaps and the window size for the
       C-terminal-most position of the pair.

   Two rows of data for the empirical residues and gaps are then given:
     * AA - The identifier of a residue seen in this position and the
       frequency of its occurrence, delimited by ';'.
     * GA - The size of a gap seen in this position and the frequency of
       its occurrence, delimited by ';'.
     * // - used to delimit data for each signature. The last line of a
       file always contains '//' only.
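
   The record layout above can be read with a few lines of code. The
   following is a minimal sketch only, not the parser used by SIGSCAN or
   any EMBOSS program: the function name parse_signature and the
   dictionary layout are assumptions made for illustration, and the
   example text is an abbreviated, hand-shortened entry in the style of
   the files shown below.

```python
def parse_signature(text):
    """Parse one signature entry into a dict (illustrative sketch)."""
    sig = {"positions": []}
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("XX"):
            continue                      # XX lines are spacers
        if line.startswith("//"):
            break                         # end of this signature
        code, _, value = line.partition(" ")
        value = value.strip()
        if code in ("TY", "TS", "CL", "FO", "SF", "FA"):
            sig[code] = value
        elif code in ("SI", "NP"):
            sig[code] = int(value)
        elif code == "NN":
            # "[3]" starts the data for signature position 3
            current = {"number": int(value.strip("[]")),
                       "residues": [], "gaps": []}
            sig["positions"].append(current)
        elif code == "IN":
            # e.g. "NRES 1 ; NGAP 1 ; WSIZ 0"
            for field in value.split(";"):
                name, number = field.split()
                current[name] = int(number)
        elif code == "AA":
            residue, count = (part.strip() for part in value.split(";"))
            current["residues"].append((residue, int(count)))
        elif code == "GA":
            size, count = (part.strip() for part in value.split(";"))
            current["gaps"].append((int(size), int(count)))
    return sig


# Abbreviated entry (a single position) used only to exercise the sketch.
example = """TY   SCOP
XX
TS   1D
XX
SI   54894
XX
NP   1
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   H ; 2
XX
GA   12 ; 2
//
"""
sig = parse_signature(example)
```

   Each signature position ends up as one dictionary holding its NRES,
   NGAP and WSIZ values together with the observed residues and gaps.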

  Output files for usage example

  File: 54894.sig

TY   SCOP
XX
TS   1D
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
FA   Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
SI   54894
XX
NP   15
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   H ; 2
XX
GA   12 ; 2
XX
NN   [2]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   1 ; 2
XX
NN   [3]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   26 ; 2
XX
NN   [4]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   F ; 2
XX
GA   16 ; 2
XX
NN   [5]
XX


  [Part of this file has been deleted for brevity]

XX
GA   4 ; 2
XX
NN   [10]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 2
XX
GA   2 ; 2
XX
NN   [11]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   N ; 2
XX
GA   0 ; 2
XX
NN   [12]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   Y ; 2
XX
GA   0 ; 2
XX
NN   [13]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   G ; 2
XX
GA   3 ; 2
XX
NN   [14]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   3 ; 2
XX
NN   [15]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   2 ; 2
//

  File: 55074.sig

TY   SCOP
XX
TS   1D
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
SI   55074
XX
NP   38
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 2
XX
GA   13 ; 2
XX
NN   [2]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   V ; 2
XX
GA   1 ; 2
XX
NN   [3]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   F ; 2
XX
GA   3 ; 2
XX
NN   [4]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 2
XX
GA   1 ; 2
XX
NN   [5]
XX


  [Part of this file has been deleted for brevity]

XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   L ; 2
XX
GA   4 ; 2
XX
NN   [34]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   E ; 1
AA   D ; 1
XX
GA   8 ; 2
XX
NN   [35]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   V ; 2
XX
GA   0 ; 2
XX
NN   [36]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   I ; 1
AA   V ; 1
XX
GA   17 ; 2
XX
NN   [37]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   F ; 1
AA   Y ; 1
XX
GA   2 ; 2
XX
NN   [38]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   I ; 1
AA   L ; 1
XX
GA   1 ; 2
//

5.0 DATA FILES

   SIGGEN requires a residue substitution matrix.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-algpath]           dirlist    This option specifies the location of DAF
                                  files (domain alignment files) (input). A
                                  'domain alignment file' contains a sequence
                                  alignment of domains belonging to the same
                                  SCOP or CATH family (or other node in the
                                  structural hierarchies). The file is in DAF
                                  format (CLUSTAL-like) and is annotated with
                                  domain family classification information.
                                  The files generated by using SCOPALIGN will
                                  contain a structure-based sequence alignment
                                  of domains of known structure only. Such
                                  alignments can be extended with sequence
                                  relatives (of unknown structure) by using
                                  SEQALIGN.
   -mode               menu       This option specifies the mode of signature
                                  generation. There are 3 modes for signature
                                  generation: (1) Use positions specified
                                  in alignment file. The alignment file must
                                  contain a line beginning with the text
                                  'Positions' for each line of the alignment.
                                  A '1' in the 'Positions' line indicates that
                                  the signature should include data from the
                                  corresponding alignment site. The signature
                                  will only include the positions that are
                                  marked with a '1'. (2) Use a scoring method.
                                  The alignment is scored (see 'Algorithm')
                                  and the signature of a specified sparsity is
                                  sampled from high scoring positions. (3)
                                  Generate a randomised signature. A signature
                                  of a specified sparsity is sampled at
                                  random from the alignment.
*  -conoption          menu       This option specifies the structure-based
                                  scoring scheme. SIGGEN provides 2
                                  structure-based scoring schemes (plus a
                                  combination method) that are used to score
                                  the input alignment.
*  -conpath            directory  This option specifies the location of CON
                                  files (contact files) (input). A 'contact
                                  file' contains contact data for a protein or
                                  a domain from SCOP or CATH, in the CON
                                  format (EMBL-like). The contacts may be
                                  intra-chain residue-residue, inter-chain
                                  residue-residue or residue-ligand. The files
                                  are generated by using CONTACTS, INTERFACE
                                  and SITES.
*  -cpdbpath           directory  This option specifies the location of domain
                                  CCF files (clean coordinate files) (input).
                                  A 'clean coordinate file' contains protein
                                  coordinate and derived data for a single PDB
                                  file ('protein clean coordinate file') or a
                                  single domain from SCOP or CATH ('domain
                                  clean coordinate file'), in CCF format
                                  (EMBL-like). The files, generated by using
                                  PDBPARSE (PDB files) or DOMAINER (domains),
                                  contain 'cleaned-up' data that is
                                  self-consistent and error-corrected. Records
                                  for residue solvent accessibility and
                                  secondary structure are added to the file by
                                  using PDBPLUS.
*  -seqoption          menu       This option specifies the sequence-based
                                  scoring scheme. SIGGEN provides 2
                                  sequence-based scoring schemes that are used
                                  to score the input alignment.
*  -datafile           matrixf    This option specifies the substitution
                                  matrix. The substitution matrix is used by
                                  the sequence-based scoring schemes.
*  -sparsity           integer    This option specifies the % sparsity of
                                  signature. The signature sparsity is a
                                  user-defined parameter that determines how
                                  many residues the final signature will
                                  contain. For example, if the average
                                  sequence length of the proteins in the
                                  alignment is 250 residues, then a signature
                                  of sparsity 10% (default value) will contain
                                  25 key residues or signature positions,
                                  which correspond to the 25 highest scoring
                                  alignment positions.
   -wsiz               integer    This option specifies the window size. When
                                  a signature is aligned to a protein
                                  sequence, the permissible gaps between two
                                  signature positions are determined by the
                                  empirical gaps and the window size. The user
                                  is prompted for a window size that is used
                                  for every position in the signature. This
                                  is unlikely to be optimal. A future
                                  implementation will provide a range of
                                  methods for generating values of window size
                                  depending upon the alignment (window size is
                                  identified by the WSIZ record in the
                                  signature output file).
*  -filtercon          toggle     This option specifies whether to disregard
                                  positions that form only a few contacts when
                                  selecting signature positions.
*  -conthresh          integer    This option specifies the threshold contact
                                  number. This controls the selection of key
                                  positions for the structure-based scoring
                                  scheme (number of contacts).
*  -[no]filterpsim     boolean    This option specifies whether to disregard
                                  alignment sites that were not aligned
                                  satisfactorily (STAMP alignments only).
  [-sigoutdir]         outdir     This option specifies the location of
                                  signature files (output). A 'signature file'
                                  contains a sparse sequence signature
                                  suitable for use with the SIGSCAN and SIGGEN
                                  programs. The files are generated by using
                                  SIGGEN & SIGGENLIG.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers: (none)
   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Allowed values and defaults (qualifier descriptions are as given in
   the listing above):

   Standard (Mandatory) qualifiers
   Qualifier          Allowed values                                Default
   [-algpath]         Directory with files                          ./
   (Parameter 1)
   -mode              1 (Use positions specified in alignment file) 1
                      2 (Use a scoring method)
                      3 (Generate a randomised signature)
   -conoption         1 (Number)                                    5
                      2 (Conservation)
                      3 (Number and conservation)
                      4 (None (structural data available))
                      5 (None (no structural data available))
   -conpath           Directory                                     ./
   -cpdbpath          Directory                                     ./
   -seqoption         1 (Substitution matrix)                       3
                      2 (Residue class)
                      3 (None)
   -datafile          Comparison matrix file in EMBOSS data path    EBLOSUM62
   -sparsity          Any integer value                             10
   -wsiz              Any integer value                             0
   -filtercon         Toggle value Yes/No                           No
   -conthresh         Any integer value                             10
   -[no]filterpsim    Boolean value Yes/No                          Yes
   [-sigoutdir]       Output directory                              ./
   (Parameter 2)

   Additional (Optional) qualifiers
   (none)

   Advanced (Unprompted) qualifiers
   (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of SIGGEN is shown below.


% siggen 
Generates a sparse protein signature from an alignment.
Location of DAF files (domain alignment files) (input) [./]: ../domainalign-keep/daf
Specify mode of signature generation
         1 : Use positions specified in alignment file
         2 : Use a scoring method
         3 : Generate a randomised signature
Select number [1]: 2
Residue contacts scoring method
         1 : Number
         2 : Conservation
         3 : Number and conservation
         4 : None (structural data available)
         5 : None (no structural data available)
Select number [5]: 5
Sequence variability scoring method
         1 : Substitution matrix
         2 : Residue class
         3 : None
Select number [3]: 1
Substitution matrix to be used [EBLOSUM62]: EBLOSUM62
The % sparsity of signature [10]: 15
Window size [0]: 0
Ignore alignment postitions with post_similar value of 0 [Y]: Y
Location of signature files (output) [./]: 

   The output files for this example are shown in section 4.0 (OUTPUT
   FILE FORMAT).

7.0 KNOWN BUGS & WARNINGS

   Handling of missing residues in domain alignment files
   The alignment in the DAF file (domain alignment file) may be generated
   by using STAMP via DOMAINALIGN. STAMP will omit from an alignment any
   residue that either completely lacks electron density, and so does not
   appear in the ATOM records of the PDB file, or lacks a CA atom. Such
   residues will of course not be present in the DAF file. This means
   that accurate gap distances (distance, in residues, between any two
   residues) for residues from two different alignment positions cannot
   reliably be found by simply counting residues.
   To overcome this problem, data from the domain CCF files (clean
   coordinate files) are used. These data should be used where available,
   i.e. the conoption ACD option should be set to a value of 1, 2, 3 or 4
   if possible.
   The function embPdbAtomIndexICA is used to create an array which gives
   the index into the full-length protein sequence for structured
   residues, i.e. residues for which electron density was determined,
   EXCLUDING those residues for which CA atoms are missing. The array
   length is of course equal to the number of structured residues. This
   array is used for calculating the correct gap distances between
   residues in the alignment. The domain CCF files MUST be derived from
   protein CCF files in which residues with a single atom only are
   omitted. Such files can be generated by using PDBPARSE with the
   atommask option set to True. This requirement will no longer apply
   when a new version of embPdbAtomIndexICA, which also excludes residues
   with a single atom only, becomes available.
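
   The effect of such an index array on gap distances can be shown with
   a small sketch. This is not the embPdbAtomIndexICA function itself;
   the helper gap_between and the example array are hypothetical, and
   0-based indices are assumed.

```python
def gap_between(index, i, j):
    """Residues separating structured residues i and j (i < j) in the
    full-length sequence, given a structured-residue index array."""
    return index[j] - index[i] - 1


# Hypothetical index array: full-sequence positions of the residues that
# appear in the alignment. Full-sequence residues 3 and 4 are missing
# (no electron density or no CA atom), so they are absent from the array.
index = [0, 1, 2, 5, 6, 7]

# Counting residues directly in the alignment suggests adjacent residues.
naive_gap = 3 - 2 - 1

# Using the index array recovers the two missing residues.
true_gap = gap_between(index, 2, 3)
```

   Here the naive count gives a gap of 0, whereas the index array gives
   the correct gap of 2.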
   Manually generated signatures
   If a signature file is generated by hand, it is essential that the
   gap data are listed in order of increasing gap size.
   Window size
   The user is prompted for a window size that is used for every position
   in the signature. This is unlikely to be optimal. A future
   implementation will provide a range of methods for generating values
   of window size depending upon the alignment (window size is identified
   by the WSIZ record in the signature output file).

8.0 NOTES

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE    Clean coordinate file (for domain)
   FORMAT       CCF format (EMBL-like).
   DESCRIPTION  Protein coordinate and derived data for a single domain
                from SCOP or CATH. The data are 'cleaned-up':
                self-consistent and error-corrected.
   CREATED BY   DOMAINER
   SEE ALSO     Records for residue solvent accessibility and secondary
                structure are added to the file by using PDBPLUS.

   FILE TYPE    Contact file (intra-chain residue-residue contacts)
   FORMAT       CON format (EMBL-like).
   DESCRIPTION  Intra-chain residue-residue contact data for a protein
                or a domain from SCOP or CATH.
   CREATED BY   CONTACTS
   SEE ALSO     N.A.

   FILE TYPE    Domain alignment file
   FORMAT       DAF format (CLUSTAL-like).
   DESCRIPTION  Sequence alignment of domains belonging to the same SCOP
                or CATH family (or other node in the structural
                hierarchies). The file is annotated with domain family
                classification information.
   CREATED BY   DOMAINALIGN (structure-based sequence alignment of
                domains of known structure).
   SEE ALSO     DOMAINALIGN alignments can be extended with sequence
                relatives (of unknown structure) to the family in
                question by using SEQALIGN.

   FILE TYPE    Signature file
   FORMAT       SIG format
   DESCRIPTION  Contains a sparse sequence signature suitable for use
                with the SIGSCAN program.
   CREATED BY   SIGGENLIG, LIBGEN
   SEE ALSO     The files are generated by using SIGGEN.

   None

9.0 DESCRIPTION

   Protein signatures are useful for characterising protein families and
   have been generated manually in the past (Ison et al, 2000). SIGGEN
   provides various methods to generate protein signatures automatically.
   There are 3 modes for signature generation:
   (1) Use positions specified in alignment file. The alignment file must
   contain a line beginning with the text 'Positions' for each line of
   the alignment. A '1' in the 'Positions' line indicates that the
   signature should include data from the corresponding alignment site.
   The signature will only include the positions that are marked with a
   '1'.
   (2) Use a scoring method. The alignment is scored (see 'Algorithm')
   and the signature of a specified sparsity is sampled from high scoring
   positions.
   (3) Generate a randomised signature. A signature of a specified
   sparsity is sampled at random from the alignment.
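
   Modes (1) and (3) amount to simple column selection, sketched below.
   The function names are hypothetical, the 'Positions' string is
   assumed to be the per-column markers with the line label already
   removed, and uniform sampling without replacement is an assumption
   about how the random mode behaves.

```python
import random


def positions_from_marker(positions_line):
    """Mode 1: alignment columns marked '1' in a 'Positions' line."""
    return [i for i, ch in enumerate(positions_line) if ch == "1"]


def random_positions(n_columns, n_positions, seed=None):
    """Mode 3: sample alignment columns at random, returned in order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_columns), n_positions))


marked = positions_from_marker("0010010001")
sampled = random_positions(n_columns=10, n_positions=3, seed=0)
```

   With the marker string above, columns 2, 5 and 9 (0-based) are
   selected in mode 1.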

10.0 ALGORITHM

   Algorithm
   Signature generation proceeds in three stages as follows: (i) Read
   data and write residue-residue contact maps. (ii) Apply selected
   scoring methods to potential signature positions. (iii) Select
   residues to form the signature and write residue identity and residue
   gap data into the signature output file.
   Data Parsing
   SIGGEN reads DAF files (domain alignment files) and, optionally,
   domain CCF files (clean coordinate files) and CON files (contact
   files) corresponding to domains in the alignments. If CON files are
   specified, a contact map for each domain in an input alignment is
   required. A contact map is an N by N matrix (where N is the length of
   the sequence): a '1' at any element of the matrix indicates contact
   between the two residues at the corresponding positions, and a '0'
   indicates no contact (see CONTACTS for more information). The data
   from the DAF files are parsed, including the Post_Similar line (if
   available, e.g. for DAF files generated by using STAMP via
   DOMAINALIGN). The data from the Post_Similar line are fundamental: the
   user specifies whether only alignment positions with a Post_Similar
   value of '1' are considered to be potential signature positions or
   whether all positions are potential candidates. If the Post_Similar
   line is not available then all positions are potential candidates.
   Alignment positions where the Post_Similar value is represented by a
   '-' are not considered, because one or more of the proteins in the
   alignment were assigned a gap by the STAMP program that was used to
   generate the alignment.
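
   The filtering of candidate positions described above might look like
   the following sketch. The function name and arguments are
   hypothetical; the Post_Similar data are assumed to be available as
   one character ('1', '0' or '-') per alignment column.

```python
def candidate_positions(n_columns, post_similar=None, ignore_zero=True):
    """Return the alignment columns that may become signature positions.

    post_similar: string of '1', '0' and '-' (one per column), or None
                  when the DAF file has no Post_Similar line.
    ignore_zero:  if True, only columns marked '1' are candidates;
                  otherwise columns marked '0' are candidates as well.
    """
    if post_similar is None:
        return list(range(n_columns))      # every column is a candidate
    allowed = {"1"} if ignore_zero else {"1", "0"}
    # Columns marked '-' contain a gap and are never candidates.
    return [i for i, ch in enumerate(post_similar) if ch in allowed]


line = "11-0101-"
strict = candidate_positions(len(line), line, ignore_zero=True)
relaxed = candidate_positions(len(line), line, ignore_zero=False)
unfiltered = candidate_positions(len(line))
```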
   Residue Scoring Schemes
   The algorithm provides four scoring schemes that can be applied to
   aligned positions (i.e. positions with a Post_Similar value that is
   not '-' and, optionally, not '0' either), to enable key residues to be
   selected for the final signature. The schemes are split into two
   groups: sequence based and structure based. Each position in the
   alignment is scored on the basis of a single scheme or a combination
   of 2 scoring schemes, one from each group, thus providing a method of
   refining/improving the generation of signatures. Every aligned
   position is allocated a normalised score based on one or more of the
   following schemes.
   Sequence Based Scoring - Residue Identity (ResId)
   This scoring function simply takes every residue at a particular
   aligned position and calculates a score for the substitution of each
   residue pair using a residue substitution matrix. The average residue
   substitution score for the position is then normalised and the score
   assigned to the score array for that alignment position.
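
   A minimal sketch of this average pairwise score is given below. The
   matrix values are toy numbers chosen for illustration (they are not
   read from EBLOSUM62), the function names are hypothetical, and the
   min-max scaling is only one possible normalisation, since the
   normalisation used by SIGGEN is not specified here. Columns are
   assumed to contain at least two residues and no gap characters.

```python
from itertools import combinations


def mean_pair_score(column, matrix):
    """Average substitution score over all residue pairs in one column."""
    pairs = list(combinations(column, 2))
    total = sum(matrix[tuple(sorted(pair))] for pair in pairs)
    return total / len(pairs)


def normalise(scores):
    """Min-max scaling to the range 0..1 (an assumed normalisation)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Toy symmetric matrix keyed by sorted residue pair (illustrative only).
toy_matrix = {
    ("L", "L"): 4, ("L", "V"): 1, ("V", "V"): 4,
    ("K", "L"): -2, ("K", "V"): -2, ("K", "K"): 5,
}

columns = ["LLL", "LLV", "LVK"]          # three aligned positions
raw = [mean_pair_score(col, toy_matrix) for col in columns]
scores = normalise(raw)
```

   The fully conserved column receives the highest score and the most
   variable column the lowest.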
   Sequence Based Scoring - Residue Variability (ResVar)
   This scoring scheme implements the residue variability function of
   Mirny & Shakhnovich (2001):
   s(l) = - sum for i=1 to i=6 ( p_i(l) x log p_i(l) )
   where s(l) is the variability at position l, and p_i(l) is the
   frequency of residues from class i at position l. Six classes of
   residue are defined which reflect their physical-chemical properties
   and natural pattern of substitution as follows: (i) Aliphatic (A, V,
   L, I, M, C); (ii) Aromatic (F, W, Y, H); (iii) Polar (S, T, N, Q);
   (iv) Basic (K, R); (v) Acidic (D, E); (vi) Special (G, P). The special
   class represents the special conformational properties of glycine and
   proline. As a result of this classification mutations within a class
   are ignored e.g. L to V, whereas mutations that change the residue
   class are taken into account. Thus each aligned position is given a
   normalised score that reflects the variability of all the residues in
   that particular position.
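
   The variability function can be sketched as follows, using the six
   residue classes listed above. The natural logarithm is an assumption
   (the base is not stated), classes with zero frequency are skipped
   (taking 0 x log 0 as 0), and the final normalisation step is omitted.
   The names used are hypothetical.

```python
import math
from collections import Counter

RESIDUE_CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "basic": "KR",
    "acidic": "DE",
    "special": "GP",
}
# Map each residue letter to the name of its class.
CLASS_OF = {res: name
            for name, members in RESIDUE_CLASSES.items()
            for res in members}


def class_variability(column):
    """s(l) = -sum_i p_i(l) * log p_i(l) over the residue classes."""
    counts = Counter(CLASS_OF[res] for res in column)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values())


same_class = class_variability("LVIM")   # all aliphatic
two_classes = class_variability("LK")    # aliphatic and basic, 50:50
```

   A column whose residues all fall in one class has zero variability,
   so a substitution such as L to V does not change the score.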
   Structure Based Scoring - Number of Residue-Residue Contacts (N-Con)
   The contact scoring scheme provides a score based purely on structural
   information, i.e. the identity and nature of the residues are not
   considered. The structural information used is the number of
   residue-residue contacts and the contact maps generated in the first
   phase of the algorithm are used to derive the number of contacts made
   by residues at aligned positions. Each residue from an aligned
   position is noted, and the position that residue occupies in its
   original protein sequence is determined. The column of the contact map
   that corresponds to the position of the residue in its original
   sequence is identified, the occurrence of a '1' anywhere in that
   column of the matrix is recorded, and the total number of '1's
   indicates the total number of contacts that residue makes. The number
   of contacts for each residue at a particular aligned position is
   determined, the average number of contacts is calculated and the
   resulting value is normalised. This procedure is then repeated for
   every aligned position.
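
   The contact count for one aligned position can be sketched as below.
   Contact maps are represented as nested lists of 0 and 1, the function
   names are hypothetical, and the normalisation step is left out.

```python
def contacts_per_residue(contact_map):
    """Column sums of an N x N 0/1 contact map (contacts per residue)."""
    n = len(contact_map)
    return [sum(contact_map[row][col] for row in range(n))
            for col in range(n)]


def mean_contacts(residue_indices, contact_maps):
    """Average contact count at one aligned position.

    residue_indices[k] is the position, in sequence k, of the residue
    that sequence k contributes to the aligned position.
    """
    counts = [contacts_per_residue(contact_maps[k])[idx]
              for k, idx in enumerate(residue_indices)]
    return sum(counts) / len(counts)


# Two toy domains of three residues each.
map_a = [[0, 1, 1],
         [1, 0, 0],
         [1, 0, 0]]
map_b = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]

# The aligned position holds residue 0 of both domains.
score = mean_contacts([0, 0], [map_a, map_b])
```

   Residue 0 makes two contacts in the first domain and one in the
   second, giving an average of 1.5 before normalisation.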
   Structure Based Scoring - Conservation of Residue Contacts (C-Con)
   This scoring scheme extends the concept of the number of contacts
   residues at aligned positions make, by also determining which residues
   are contacted and their position in the alignment, thus providing a
   score representing how conserved the contacts made by residues at an
   aligned position are. The initial stage of the process is identical to
   that for determining the number of contacts, except every time a
   contact is found in the contact map, the position of the contacted
   residue is recorded and its position in the alignment determined. Each
   residue in an aligned position therefore has associated with it a list
   of positions in the alignment with which it makes contact. For example
   if all the residues at position 25 of the alignment make contact with
   the residues at position 79 of the alignment, a conserved contact is
   defined and a maximum score is allocated to the residues at position
   25. This procedure is repeated for all the contacts made by the
   residues at position 25 and an average normalised conservation of
   contact score calculated.
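
   One way to express this conservation score is sketched below. It is
   an interpretation of the description above, not the SIGGEN source:
   for every alignment column contacted by at least one residue at the
   aligned position, the fraction of sequences making that contact is
   computed, and the fractions are averaged. All names are hypothetical
   and every sequence is assumed to have a residue at the position.

```python
def contacted_columns(contact_map, seq_to_col, residue_index):
    """Alignment columns contacted by one residue of one sequence."""
    return {seq_to_col[j]
            for j in range(len(contact_map))
            if contact_map[j][residue_index]}


def contact_conservation(residue_indices, contact_maps, seq_to_cols):
    """Mean fraction of sequences sharing each contact (0..1)."""
    per_seq = [contacted_columns(contact_maps[k], seq_to_cols[k], idx)
               for k, idx in enumerate(residue_indices)]
    partners = set().union(*per_seq)
    if not partners:
        return 0.0
    fractions = [sum(p in cols for cols in per_seq) / len(per_seq)
                 for p in partners]
    return sum(fractions) / len(fractions)


map_a = [[0, 1, 1],
         [1, 0, 0],
         [1, 0, 0]]
map_b = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
identity = [0, 1, 2]      # residue i sits in alignment column i

score = contact_conservation([0, 0], [map_a, map_b],
                             [identity, identity])
```

   The contact with column 1 is made by both sequences (fraction 1.0)
   and the contact with column 2 by one of them (fraction 0.5), so the
   position scores 0.75. A position whose contacts are all shared by
   every sequence would score 1.0.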
   Selection of Signature Positions
   The final phase of the algorithm involves selecting the residues that
   will make up the signature. Following the scoring phase SIGGEN will
   have created an array of scores for each scoring scheme employed, i.e.
   a score will have been allocated for every position in the alignment
   with a Post_Similar value of '1' and optionally '0' also (depending on
   the Post_Similar option selected, see below). If more than one scoring
   scheme was used then the scores for each alignment position from the
   different scoring methods are added together, to give a final array
   (total score array) of the total scores for each position. It is these
   final scores that determine which positions will make up the
   signature.
   Signature Sparsity
   The signature sparsity is a user-defined parameter that determines how
   many residues the final signature will contain. For example, if the
   average sequence length of the proteins in the alignment is 250
   residues, then a signature of sparsity 10% (default value) will
   contain 25 key residues, or signature positions, which correspond to
   the 25 highest-scoring alignment positions.
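   The arithmetic can be sketched in Python as follows. The function
   name is hypothetical, and truncating a fractional result to a whole
   number of positions is an assumption; the rounding rule is not
   stated above.

```python
def signature_size(sequence_lengths, sparsity_percent=10):
    """Number of signature positions for a given sparsity (in percent)."""
    mean_length = sum(sequence_lengths) / len(sequence_lengths)
    # Assumption: fractional results are truncated.
    return int(mean_length * sparsity_percent / 100)
```

   For three sequences of 240, 250 and 260 residues (average 250) and
   the default sparsity of 10%, this gives 25 signature positions.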
   Key Residue Selection
   Assuming that a signature of 10% sparsity is desired and the average
   sequence length of the proteins is 250 residues, the total score array
   is re-arranged into descending order of score. The top (highest
   scoring) 25 alignment positions (equal to 10% sparsity) are then
   selected; it is these 25 positions which will make up the final
   signature. These 25 highest scoring alignment positions are then
   traced back to the original protein sequences, the residue identities
   determined and gap data (number of residues between signature
   positions) calculated. The signature output file is then written. It
   specifies, for each of the 25 signature positions, the residues that
   are observed at that position in the alignment, and the gap (in
   residues) between that position and the next. In the case of the first
   signature position the gap data corresponds to the number of residues
   between the beginning of the sequence and the first position.
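   The selection and trace-back steps might look like the following
   Python sketch. The function names, the 0-based column numbering and
   the '-' gap character are assumptions. The gap recorded with each
   signature position here is the number of residues preceding it (since
   the previous signature position, or since the start of the sequence
   for the first one), which is one reading of the description above.

```python
def select_positions(total_scores, n_positions):
    """Return the n highest-scoring alignment positions, in alignment order."""
    ranked = sorted(total_scores, key=total_scores.get, reverse=True)
    return sorted(ranked[:n_positions])


def trace_back(aligned_seq, positions, gap_char="-"):
    """Return (residue, gap) pairs for one aligned sequence.

    Columns are numbered from 0. A signature column that holds a gap
    character in this sequence is skipped.
    """
    wanted = set(positions)
    result = []
    residues_since_last = 0
    for column, symbol in enumerate(aligned_seq):
        if symbol == gap_char:
            continue  # alignment gaps are not residues
        if column in wanted:
            result.append((symbol, residues_since_last))
            residues_since_last = 0
        else:
            residues_since_last += 1
    return result
```

   Running trace_back over every sequence in the alignment yields the
   set of residues and the range of gaps observed at each signature
   position.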
   Signature Generating Parameters
   The SIGGEN algorithm incorporates several options that can be selected
   when generating a signature. The first is the signature sparsity,
   which has been introduced above and affects the amount of information
   encoded in the signature. In addition to the four scoring schemes
   described above, there are two further options to be considered when
   generating a signature.
   Post_Similar Option
   This option determines which alignment positions should be considered
   as putative signature positions. As mentioned above, the Post_Similar
   line represents aligned positions by either a '1', a '0' or a '-'.
   SIGGEN gives the option of either considering positions with values of
   both '1' and '0', or ignoring positions represented by '0', which
   STAMP considers to be less structurally equivalent, and therefore
   using just positions with a Post_Similar value of '1'.
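   A minimal Python sketch of this filter is given below. The function
   and parameter names are hypothetical, and the Post_Similar line is
   assumed to be available as a plain string with one character per
   alignment column (numbered from 0).

```python
def candidate_positions(post_similar, include_zero=False):
    """Columns eligible as signature positions under the Post_Similar option."""
    allowed = {"1", "0"} if include_zero else {"1"}
    # Columns marked '-' are never eligible.
    return [i for i, flag in enumerate(post_similar) if flag in allowed]
```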
   Contact Filtering Option
   This option also determines which aligned positions should be
   considered as putative key residues for inclusion in the signature.
   However, the criterion in this case is whether or not the average
   number of contacts that the residues at that position make is above a
   defined threshold (the contact threshold). The default value is 10
   contacts, i.e. only aligned positions that make on average 10 or more
   residue-residue contacts will be considered as potential key residues.
   All of the SIGGEN parameters can be used in combination. For example,
   with the contact threshold set to 10, the residue identity and
   conservation of contact scoring schemes selected, the Post_Similar
   option set to ignore positions with values of '0', and the signature
   sparsity set to 15%, the SIGGEN algorithm would proceed as follows:
   (i) determine the positions with a Post_Similar value of '1'; (ii)
   determine which of those positions make on average 10 or more residue
   contacts; (iii) apply the residue identity and conservation of contact
   scoring schemes to the positions remaining after the previous two
   filtering steps; (iv) select the top-scoring positions, up to the
   number given by 15% sparsity, to make up the signature; (v) write the
   signature file.
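   Steps (i) to (iv) can be put together in a short Python sketch. It is
   an illustration under stated assumptions, not the SIGGEN
   implementation: all names are hypothetical, per-position scores and
   mean contact counts are assumed to be precomputed lists indexed by
   alignment column, the threshold test is taken to be "10 or more", and
   step (v), writing the signature file, is omitted.

```python
def signature_positions(post_similar, mean_contacts, scheme_scores,
                        mean_length, sparsity_percent=10,
                        contact_threshold=10, include_zero=False):
    """Return the alignment columns chosen for the signature."""
    allowed = {"1", "0"} if include_zero else {"1"}
    # (i) and (ii): Post_Similar filter, then contact filter.
    candidates = [i for i, flag in enumerate(post_similar)
                  if flag in allowed and mean_contacts[i] >= contact_threshold]
    # (iii): total score per candidate, summed over the chosen schemes.
    totals = {i: sum(scores[i] for scores in scheme_scores)
              for i in candidates}
    # (iv): keep the top-scoring positions; truncation is an assumption.
    n_positions = int(mean_length * sparsity_percent / 100)
    ranked = sorted(totals, key=totals.get, reverse=True)
    return sorted(ranked[:n_positions])
```

   Because both filters are applied before scoring, a column with a high
   score but a Post_Similar value of '0' or too few contacts can never
   enter the signature under these settings.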

11.0 RELATED APPLICATIONS

See also

   Program name                       Description
   contactcount Count specific versus non-specific contacts
   contacts     Generate intra-chain CON files from CCF files
   domainalign  Generate alignments (DAF file) for nodes in a DCF file
   domainrep    Reorder DCF file to identify representative structures
   domainreso   Remove low resolution domains from a DCF file
   interface    Generate inter-chain CON files from CCF files
   libgen       Generate discriminating elements from alignments
   matgen3d     Generate a 3D-1D scoring matrix from CCF files
   psiphi       Phi and psi torsion angles from protein coordinates
   rocon        Generates a hits file from comparing two DHF files
   rocplot      Performs ROC analysis on hits files
   scorecmapdir Contact scores for cleaned protein chain contact files
   seqalign     Extend alignments (DAF file) with sequences (DHF file)
   seqfraggle   Removes fragment sequences from DHF files
   seqsearch    Generate PSI-BLAST hits (DHF file) from a DAF file
   seqsort      Remove ambiguous classified sequences from DHF files
   seqwords     Generates DHF files from keyword search of UniProt
   siggenlig    Generate ligand-binding signatures from a CON file
   sigscan      Generate hits (DHF file) from a signature search
   sigscanlig   Search ligand-signature library & write hits (LHF file)

12.0 DIAGNOSTIC ERROR MESSAGES

   None.

13.0 AUTHORS

   Matt Blades (mblades@rfcgr.mrc.ac.uk)
   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.
   
   See also http://emboss.sourceforge.net/
   Automatic generation and evaluation of sparse protein signatures for
   families of protein structural domains. MJ Blades, JC Ison, R
   Ranasinghe, and JBC Findlay. Protein Science. 2005 (accepted)
   A key residues approach to the definition of protein families and
   analysis of sparse family signatures. JC Ison, AJ Bleasby, MJ Blades,
   SC Daniel, JH Parish, JBC Findlay. PROTEINS: Structure, Function &
   Genetics. 2000, 40:330-341
   Alignment of a sparse protein signature with protein sequences:
   application to fold prediction for three small globulins. SC Daniel,
   JH Parish, JC Ison, MJ Blades & JBC Findlay. FEBS Letters. 1999,
   459:349-352.

  14.1 Other useful references

   LA Mirny and EI Shakhnovich. Evolutionary conservation of the folding
   nucleus. Journal of Molecular Biology. 2001, 308:123-129.
