
                           DOMAINNR documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Removes redundant domains from a DCF file.

2.0 INPUTS & OUTPUTS

   DOMAINNR reads a DCF file (domain classification file) containing
   domain sequence information and writes a DCF file in which the
   redundant domains are removed from each node (e.g. family, superfamily
   etc). Optionally, the redundant domains are written to a second DCF
   output file. The node of operation and input and output files are
   specified by the user. A log file is also written.

3.0 INPUT FILE FORMAT

   The format of the DCF file is described in the SCOPPARSE
   documentation.

  Input files for usage example

  File: ../domainseqs-keep/all_s.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
DS   SEQUENCE    52 AA;   5817 MW;  47362A43 CRC32;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
DS   SEQUENCE    65 AA;   7396 MW;  0CFB92A3 CRC32;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
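
   The record layout shown above (a two-letter code, a value, 'XX' spacer
   lines and a '//' terminator) can be read with a few lines of Python.
   The following is a minimal sketch only, not the SCOPPARSE or EMBOSS
   parser. The function name parse_dcf and the "sequence" key are
   hypothetical, and the sketch assumes that indented lines only ever
   carry sequence residues following the DS record.

```python
SAMPLE = """\
ID   D1CS4A_
XX
EN   1CS4
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
DS   SEQUENCE    52 AA;   5817 MW;  47362A43 CRC32;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
//
"""


def parse_dcf(text):
    """Return a list of dicts, one per DCF entry (hypothetical helper)."""
    entries, entry, seq = [], {}, []
    for line in text.splitlines():
        if line.startswith("//"):            # end of one entry
            if entry:
                entry["sequence"] = "".join(seq)
                entries.append(entry)
            entry, seq = {}, []
        elif line.startswith("  "):          # assumed: sequence continuation
            seq.append(line.replace(" ", ""))
        elif line.strip() and not line.startswith("XX"):
            # columns 1-2 hold the record code, the value starts at column 6
            entry[line[:2]] = line[5:].strip()
    return entries


entries = parse_dcf(SAMPLE)
print(entries[0]["ID"], len(entries[0]["sequence"]))
```

   The length of the parsed sequence can be checked against the residue
   count given in the DS record.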

4.0 OUTPUT FILE FORMAT

   The format of the DCF file is described in the SCOPPARSE
   documentation.

  Output files for usage example

  File: all_nr.scop

ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
DS   SEQUENCE    65 AA;   7396 MW;  0CFB92A3 CRC32;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//

  File: domainnr.log

Classes are non-redundant
5% redundancy threshold
// Alpha and beta proteins (a+b)
Retained
D1II7A_
Rejected
D1CS4A_

5.0 DATA FILES

   DOMAINNR requires a residue substitution matrix.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-dcfinfile]         infile     This option specifies the name of the DCF
                                  file (domain classification file) (input). A
                                  'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -retain             toggle     This option specifies whether to write
                                  redundant domains to a separate file. If
                                  this option is selected, redundant domains
                                  are written to a separate output file.
   -node               menu       This option specifies the node for
                                  redundancy removal. Redundancy can be
                                  removed at any specified node in the SCOP or
                                  CATH hierarchies. For example by selecting
                                  'Class' entries belonging to the same Class
                                  will be non-redundant.
   -mode               menu       This option specifies whether to remove
                                  redundancy at a single threshold % sequence
                                  similarity or remove redundancy outside a
                                  range of acceptable threshold % similarity.
                                  All pairwise sequence alignments are
                                  calculated for each node in turn using the
                                  EMBOSS implementation of the Needleman and
                                  Wunsch global alignment algorithm. Redundant
                                  sequences are removed in one of two modes as
                                  follows: (i) If a pair of proteins achieves
                                  greater than a threshold percentage sequence
                                  similarity (specified by the user) the
                                  shorter sequence is discarded. (ii) If a
                                  pair of proteins has a percentage sequence
                                  similarity that lies outside an acceptable
                                  range (specified by the user) the shorter
                                  sequence is discarded.
*  -threshold          float      This option specifies the % sequence
                                  identity redundancy threshold used in
                                  single-threshold mode. If a pair of
                                  proteins achieves greater than this
                                  threshold the shorter sequence is
                                  discarded.
*  -threshlow          float      This option specifies the lower limit of
                                  the acceptable range of % sequence
                                  identity used in range mode. If a pair of
                                  proteins has a percentage sequence
                                  similarity that lies outside the acceptable
                                  range the shorter sequence is discarded.
*  -threshup           float      This option specifies the upper limit of
                                  the acceptable range of % sequence
                                  identity used in range mode. If a pair of
                                  proteins has a percentage sequence
                                  similarity that lies outside the acceptable
                                  range the shorter sequence is discarded.
  [-dcfoutfile]        outfile    This option specifies the name of
                                  non-redundant DCF file (domain
                                  classification file) (output). A 'domain
                                  classification file' contains classification
                                  and other data for domains from SCOP or
                                  CATH, in DCF format (EMBL-like). The files
                                  are generated by using SCOPPARSE and
                                  CATHPARSE. Domain sequence information can
                                  be added to the file by using DOMAINSEQS.
*  -redoutfile         outfile    This option specifies the name of DCF file
                                  (domain classification file) for redundant
                                  sequences (output). A 'domain classification
                                  file' contains classification and other
                                  data for domains from SCOP or CATH, in DCF
                                  format (EMBL-like). The files are generated
                                  by using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -logfile            outfile    This option specifies the name of the log
                                  file for the build. The log file contains
                                  messages about any errors arising while
                                  domainnr ran.

   Additional (Optional) qualifiers:
   -datafile           matrixf    This option specifies the residue
                                  substitution matrix. This is used for
                                  sequence comparison.
   -gapopen            float      This option specifies the gap insertion
                                  penalty. This is the score taken away when a
                                  gap is created. The best value depends on
                                  the choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
   -gapextend          float      This option specifies the gap extension
                                  penalty. This is added to the standard gap
                                  penalty for each base or residue in the gap.
                                  This is how long gaps are penalized.
                                  Usually you will expect a few long gaps
                                  rather than many short gaps, so the gap
                                  extension penalty should be lower than the
                                  gap penalty.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-dcfoutfile" associated qualifiers
   -odirectory2        string     Output directory

   "-redoutfile" associated qualifiers
   -odirectory         string     Output directory

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Allowed values and defaults (the qualifiers are described above):

   Standard (Mandatory) qualifiers
   Qualifier       Allowed values                              Default
   [-dcfinfile]    Input file                                  Required
   (Parameter 1)
   -retain         Toggle value Yes/No                         No
   -node           1 (Class (SCOP))                            1
                   2 (Fold (SCOP))
                   3 (Superfamily (SCOP))
                   4 (Family (SCOP))
                   5 (Class (CATH))
                   6 (Architecture (CATH))
                   7 (Topology (CATH))
                   8 (Homologous Superfamily (CATH))
                   9 (Family (CATH))
   -mode           1 (Remove redundancy at a single            1
                     threshold % sequence similarity)
                   2 (Remove redundancy outside a range of
                     acceptable threshold % similarity)
   -threshold      Any numeric value                           95.0
   -threshlow      Any numeric value                           30.0
   -threshup       Any numeric value                           90.0
   [-dcfoutfile]   Output file                                 test.scop
   (Parameter 2)
   -redoutfile     Output file                                 <sequence>.domainnr
   -logfile        Output file                                 domainnr.log

   Additional (Optional) qualifiers
   Qualifier       Allowed values                              Default
   -datafile       Comparison matrix file in EMBOSS data path  EBLOSUM62
   -gapopen        Floating point number from 1.0 to 100.0     10.0 for any sequence
   -gapextend      Floating point number from 0.0 to 10.0      0.5 for any sequence

   Advanced (Unprompted) qualifiers
   (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of DOMAINNR is shown below.


% domainnr 
Removes redundant domains from a DCF file.
Name of DCF file (domain classification file) (input).: ../domainseqs-keep/all_s.scop
Write redundant domains to separate file. [N]: 
Node at which to remove redundancy
         1 : Class (SCOP)
         2 : Fold (SCOP)
         3 : Superfamily (SCOP)
         4 : Family (SCOP)
         5 : Class (CATH)
         6 : Architecture (CATH)
         7 : Topology (CATH)
         8 : Homologous Superfamily (CATH)
         9 : Family (CATH)
Select number. [1]: 1
Redundancy removal options
         1 : Remove redundancy at a single threshold % sequence similarity
         2 : Remove redundancy outside a range of acceptable threshold % similarity
Select number. [1]: 1
The % sequence identity redundancy threshold. [95.0]: 5
Name of non-redundant DCF file (domain classification file) (output) [test.scop]: all_nr.scop
Name of log file for the build. [domainnr.log]: 
Warning: Bad args passed to ajDomainWrite

// Alpha and beta proteins (a+b)
D1CS4A_
D1II7A_

   The input files for this example are shown in section 3.0 and the
   output files in section 4.0.
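
   The same run can in principle be made without prompts by giving the
   answers as qualifiers. The sketch below only builds and prints the
   command line; it does not run DOMAINNR. It assumes that the menu
   qualifiers accept the numeric values listed under "Allowed values",
   that -auto suppresses the remaining prompts, and that the program is
   installed as 'domainnr'. The helper name domainnr_command is
   hypothetical.

```python
import shlex


def domainnr_command(dcfinfile, dcfoutfile, node=1, mode=1,
                     threshold=95.0, logfile="domainnr.log"):
    """Build an argument list for a single-threshold DOMAINNR run."""
    return [
        "domainnr",
        "-dcfinfile", dcfinfile,
        "-node", str(node),            # 1 = Class (SCOP)
        "-mode", str(mode),            # 1 = single threshold
        "-threshold", str(threshold),
        "-dcfoutfile", dcfoutfile,
        "-logfile", logfile,
        "-auto",                       # general qualifier: turn off prompts
    ]


cmd = domainnr_command("../domainseqs-keep/all_s.scop", "all_nr.scop",
                       threshold=5)
print(shlex.join(cmd))
# To execute (requires an EMBOSS installation providing domainnr):
#     import subprocess; subprocess.run(cmd, check=True)
```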

7.0 KNOWN BUGS & WARNINGS

   None.

8.0 NOTES

   If for example the user selected the node to be "Family (SCOP)" then
   DOMAINNR removes redundancy at the level of the SCOP family, i.e.
   entries belonging to the same family will be non-redundant.
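
   The effect of the node choice can be illustrated by grouping the two
   example domains on the identifiers in their SI records. This is a
   sketch under assumptions: the document does not state whether
   DOMAINNR groups on the SI identifiers or on the node names, and the
   helper names parse_si and group_by_node are hypothetical.

```python
from collections import defaultdict

# SI records copied from the example input file.
DOMAINS = {
    "D1CS4A_": "53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;",
    "D1II7A_": "53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;",
}


def parse_si(si_line):
    """Turn '53931 CL; 54861 FO;' into {'CL': '53931', 'FO': '54861'}."""
    ids = {}
    for part in si_line.split(";"):
        fields = part.split()
        if len(fields) == 2:
            ids[fields[1]] = fields[0]
    return ids


def group_by_node(domains, node_code):
    """Group domain identifiers by the identifier of the chosen node."""
    groups = defaultdict(list)
    for dom_id, si_line in domains.items():
        groups[parse_si(si_line)[node_code]].append(dom_id)
    return dict(groups)


by_class = group_by_node(DOMAINS, "CL")    # both domains share one class
by_family = group_by_node(DOMAINS, "FA")   # each domain is in its own family
print(by_class)
print(by_family)
```

   Under this grouping the two domains are compared with each other only
   when the node is 'Class'; at the 'Family' node each would be alone in
   its group.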

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE     Domain classification file (for SCOP)
   FORMAT        DCF format (EMBL-like format for domain classification
                 data).
   DESCRIPTION   Classification and other data for domains from SCOP.
   CREATED BY    SCOPPARSE
   SEE ALSO      Domain sequence information can be added to the file by
                 using DOMAINSEQS.

   FILE TYPE     Domain classification file (for CATH)
   FORMAT        DCF format (EMBL-like format for domain classification
                 data).
   DESCRIPTION   Classification and other data for domains from CATH.
   CREATED BY    CATHPARSE
   SEE ALSO      Domain sequence information can be added to the file by
                 using DOMAINSEQS.


9.0 DESCRIPTION

   The inclusion of very similar sequences in certain analyses will
   introduce undesirable bias. For example, a family may possess 100
   sequences in the sequence database, but 90 of these might be
   essentially the same sequence, e.g. very close relatives or mutations
   of a single sequence. Although 100 sequences are known, the family
   only contains 11 sequences that are essentially unique. For many
   methods it is desirable to use sets of sequences that are truly
   representative of the larger family. DOMAINNR reads a DCF file (domain
   classification file) and writes a DCF file with redundant domains
   removed from each node in the domain classification hierarchy, e.g.
   family, superfamily or class.

10.0 ALGORITHM

   All pairwise sequence alignments are calculated for each node (family
   etc) in turn using the EMBOSS implementation of the Needleman and
   Wunsch global alignment algorithm. Redundant sequences are removed in
   one of two modes as follows: (i) If a pair of proteins achieves
   greater than a threshold percentage sequence similarity (specified by
   the user) the shorter sequence is discarded. (ii) If a pair of
   proteins has a percentage sequence similarity that lies outside an
   acceptable range (specified by the user) the shorter sequence is
   discarded. The alignments use gap insertion and extension penalties
   and a residue substitution matrix, which the user may specify
   (defaults: gap insertion penalty 10.0, gap extension penalty 0.5,
   EBLOSUM62 matrix). % sequence similarity is calculated by using the
   EMBOSS function embAlignCalcSimilarity.
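
   The two removal modes can be sketched as follows. This is an
   illustration only, not the DOMAINNR source. The similarity function
   is a stand-in based on Python's difflib and is NOT a Needleman and
   Wunsch alignment scored with a substitution matrix. The order in
   which pairs are visited, the skipping of already rejected sequences
   and the handling of equal-length pairs are all assumptions, and all
   function names are hypothetical.

```python
from difflib import SequenceMatcher


def toy_similarity(a, b):
    """Stand-in % similarity (assumption: not alignment based)."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def single_threshold(threshold):
    """Mode (i): a pair is redundant above the threshold."""
    return lambda sim: sim > threshold


def outside_range(low, high):
    """Mode (ii): a pair is redundant outside the range low..high."""
    return lambda sim: sim < low or sim > high


def remove_redundant(seqs, is_redundant, similarity=toy_similarity):
    """seqs maps identifier -> sequence for ONE node.

    Returns (retained, rejected) identifier lists in input order.
    """
    ids = list(seqs)
    rejected = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if a in rejected or b in rejected:
                continue                   # assumed: skip discarded entries
            if is_redundant(similarity(seqs[a], seqs[b])):
                # discard the shorter sequence (b on a tie, by assumption)
                rejected.add(a if len(seqs[a]) < len(seqs[b]) else b)
    retained = [i for i in ids if i not in rejected]
    return retained, [i for i in ids if i in rejected]


NODE = {"A": "ACDEFGHIKL", "B": "ACDEFGHIK", "C": "WWWWYYYY"}
mode1 = remove_redundant(NODE, single_threshold(90.0))
mode2 = remove_redundant(NODE, outside_range(30.0, 90.0))
print(mode1)
print(mode2)
```

   In mode (i) only the near-duplicate B is discarded. In mode (ii) the
   unrelated sequence C is discarded as well, because its similarity to A
   falls below the lower limit of the range.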

11.0 RELATED APPLICATIONS

See also

    Program name                        Description
   aaindexextract Extract data from AAINDEX
   allversusall   Sequence similarity data from all-versus-all comparison
   cathparse      Generates DCF file from raw CATH files
   cutgextract    Extract data from CUTG
   domainer       Generates domain CCF files from protein CCF files
   domainseqs     Adds sequence records to a DCF file
   domainsse      Add secondary structure records to a DCF file
   hetparse       Converts heterogen group dictionary to EMBL-like format
   pdbparse       Parses PDB files and writes protein CCF files
   pdbplus        Add accessibility & secondary structure to a CCF file
   pdbtosp        Convert swissprot:PDB codes file to EMBL-like format
   printsextract  Extract data from PRINTS
   prosextract    Build the PROSITE motif database for use by patmatmotifs
   rebaseextract  Extract data from REBASE
   scopparse      Generate DCF file from raw SCOP files
   seqnr          Removes redundancy from DHF files
   sites          Generate residue-ligand CON files from CCF files
   ssematch       Search a DCF file for secondary structure matches
   tfextract      Extract data from TRANSFAC

12.0 DIAGNOSTIC ERROR MESSAGES

   DOMAINNR generates a log file, an excerpt of which is shown in Figure
   1. The first two lines give the level in the SCOP or CATH hierarchy at
   which redundancy was removed (e.g. 'Families') and the value of the
   redundancy threshold. The file then contains a section for each node,
   e.g. each family. Each section contains a line with the record '//'
   immediately followed by the name of the node (family in this case),
   and two lines containing 'Retained' and 'Rejected' respectively.
   Domain identifier codes of domains that appear in the output file are
   listed under 'Retained', while redundant domains are listed under
   'Rejected'. The rejected domains are saved in a second output file if
   the user has selected the -retain option. The text 'ERROR filename
   file read error' is given when an error is encountered during a file
   read.

   Figure 1 Excerpt from DOMAINNR log file

Families are non-redundant
95% redundancy threshold
// Homeodomain
Retained
D2HDDA_
D1AKHA_
D1MNMC_
Rejected
D2HDDB_
D1ENH__
D3HDDA_
WARN  d3hdda_.pxyz not found
// Di-haem cytohrome c peroxidase
WARN  ds005__.pxyz not found
WARN  Empty family
// Nuclear receptor coactivator Src-1
Retained
D2PRGC_
Rejected

13.0 AUTHORS

   Ranjeeva Ranasinghe (rranasin@hgmp.mrc.ac.uk)
   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics,
   16(6):276-277.
   
   See also http://emboss.sourceforge.net/

  14.1 Other useful references
