
                          DOMAINSEQS documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Adds sequence records to a DCF file

2.0 INPUTS & OUTPUTS

   DOMAINSEQS parses a DCF file (domain classification file) and writes a
   file containing the same data, except that domain sequence information
   derived from structural and, optionally, sequence databases is added.
   Domain sequences are taken from domain CCF files (clean coordinate
   files) and, optionally, the swissprot database. If the swissprot
   sequence is used, DOMAINSEQS requires a swissprot:PDB equivalence file
   that gives the swissprot accession number for each PDB file
   corresponding to the domains in the DCF file. The path for the CCF
   files (input) and the names of the DCF files (input and output) are
   specified by the user. A log file is also written.

3.0 INPUT FILE FORMAT

   The format of the DCF file is described in the SCOPPARSE
   documentation.

  Input files for usage example

  File: ../scopparse-keep/all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
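
   The EMBL-like layout shown above (a two-letter record code, XX spacer
   lines and a // terminator) is straightforward to read
   programmatically. The following is a minimal sketch in Python and is
   not part of EMBOSS. The function name parse_embl_like, and the choice
   to treat indented lines as continuations of the previous record, are
   assumptions made for illustration.

```python
def parse_embl_like(text):
    """Yield one dict per entry, mapping record code -> list of content lines."""
    entry = {}
    code = None
    for line in text.splitlines():
        if line.startswith("//"):
            # End of entry: emit what has been collected so far.
            if entry:
                yield entry
            entry, code = {}, None
        elif line.startswith("XX") or not line.strip():
            # Spacer or blank line: carries no data.
            continue
        elif line[:2].strip():
            # A new record starts with a two-letter code in columns 1-2.
            code = line[:2]
            entry.setdefault(code, []).append(line[2:].strip())
        elif code is not None:
            # Indented line: assumed to continue the previous record.
            entry[code].append(line.strip())
    if entry:
        yield entry


SAMPLE = """\
ID   D1CS4A_
XX
EN   1CS4
XX
NC   1
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
NC   1
XX
CH   A CHAIN; . START; . END;
//
"""

entries = list(parse_embl_like(SAMPLE))
for e in entries:
    print(e["ID"][0], e["EN"][0], e["NC"][0])
```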

  File: ../pdbtosp-keep/Epdbtosp.dat

EN   101M
XX
NE   1
XX
IN   MYG_PHYCA ID; P02185 ACC;
XX
//
EN   102L
XX
NE   1
XX
IN   LYCV_BPT4 ID; P00720 ACC;
XX
//
EN   102M
XX
NE   1
XX
IN   MYG_PHYCA ID; P02185 ACC;
XX
//
EN   103L
XX
NE   1
XX
IN   LYCV_BPT4 ID; P00720 ACC;
XX
//
EN   103M
XX
NE   1
XX
IN   MYG_PHYCA ID; P02185 ACC;
XX
//
EN   9XIA
XX
NE   1
XX
IN   XYLA_STRRU ID; P24300 ACC;
XX
//
EN   9XIM
XX
NE   1
XX
IN   XYLA_ACTMI ID; P12851 ACC;
XX
//
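
   The equivalence file above maps each PDB code (EN record) to one or
   more swissprot identifier and accession pairs (IN records). As a
   hedged illustration only, the sketch below loads such a file into a
   Python dictionary. The name load_pdbtosp and the regular expression
   are assumptions based on the excerpt shown, not on a format
   specification.

```python
import re

# Matches the body of an IN record such as "MYG_PHYCA ID; P02185 ACC;"
IN_RE = re.compile(r"(\S+) ID; (\S+) ACC;")


def load_pdbtosp(text):
    """Map PDB code -> list of (swissprot_id, accession) tuples."""
    table = {}
    pdb = None
    for line in text.splitlines():
        if line.startswith("EN"):
            pdb = line[2:].strip()
            table[pdb] = []
        elif line.startswith("IN") and pdb is not None:
            m = IN_RE.search(line)
            if m:
                table[pdb].append((m.group(1), m.group(2)))
    return table


SAMPLE = """\
EN   101M
XX
NE   1
XX
IN   MYG_PHYCA ID; P02185 ACC;
XX
//
EN   9XIM
XX
NE   1
XX
IN   XYLA_ACTMI ID; P12851 ACC;
XX
//
"""

table = load_pdbtosp(SAMPLE)
print(table.get("101M"))
print(table.get("1CS4"))  # None: no entry for this PDB code in the sample
```

   A PDB code that is absent from the table, such as 1CS4 in this sample,
   has no accession number available for lookup.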

4.0 OUTPUT FILE FORMAT

   The format of the DCF file is described in the SCOPPARSE
   documentation.
   DOMAINSEQS may add the following records:
     * (1) AC - Accession number of the domain sequence. This record will
       only be present if the DCF file has been processed using
       DOMAINSEQS and if an accession number for the PDB file
       corresponding to the domain is given in the swissprot:PDB
       equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use
       of.
     * (2) SP - Swissprot code of the domain sequence. This record will
       only be present if the domain classification file has been
       processed using DOMAINSEQS and if a swissprot code for the PDB
       file corresponding to the domain is given in the swissprot:PDB
       equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use
       of.
     * (3) RA - Position of domain in swissprot sequence. The integers
       preceding START and END give the start and end points
       respectively of the domain sequence relative to the full-length
       swissprot sequence.
     * (4) SQ - Sequence of the domain according to swissprot. This
       sequence is taken from the swissprot database. The SQ record will
       only be present if the SCOP classification file has been processed
       using DOMAINSEQS and if an accession number for the PDB file
       corresponding to the domain is given in the swissprot:PDB
       equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use
       of.

  Output files for usage example

  File: all_s.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
DS   SEQUENCE    52 AA;   5817 MW;  47362A43 CRC32;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
DS   SEQUENCE    65 AA;   7396 MW;  0CFB92A3 CRC32;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
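
   In the example output above, the added sequence appears in a DS record
   whose header line declares the sequence length (the number before
   AA;). A simple consistency check is to compare that declared length
   with the residues actually listed. The sketch below is illustrative
   only; the function name read_ds and the parsing rules are assumptions
   inferred from the listing.

```python
import re


def read_ds(entry_text):
    """Return (declared_length, sequence) from the DS record, or None."""
    lines = entry_text.splitlines()
    for k, line in enumerate(lines):
        if line.startswith("DS"):
            m = re.search(r"SEQUENCE\s+(\d+)\s+AA;", line)
            if m is None:
                return None
            residues = []
            # Sequence lines are assumed to be the indented lines that follow.
            for cont in lines[k + 1:]:
                if not cont.startswith(" "):
                    break
                residues.append(cont.replace(" ", ""))
            return int(m.group(1)), "".join(residues)
    return None


ENTRY = """\
ID   D1II7A_
XX
DS   SEQUENCE    65 AA;   7396 MW;  0CFB92A3 CRC32;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   1
//
"""

declared, sequence = read_ds(ENTRY)
print(declared, len(sequence))
```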

  File: domainseqs.log

//
D1CS4A_
NO_ACCESSION_NUMBER
//
D1II7A_
NO_ACCESSION_NUMBER
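
   The log shown above lists each domain identifier after a // line,
   followed by any messages for that domain. A minimal sketch for
   collecting those messages is given below. It assumes that the first
   non-blank line after // is always the domain identifier; that
   assumption, and the name parse_log, come from this one example and
   not from a format specification.

```python
def parse_log(log_text):
    """Map each domain identifier to the list of messages logged under it."""
    messages = {}
    current = None
    expect_id = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            expect_id = True          # next non-blank line names the domain
        elif expect_id:
            current = line
            messages[current] = []
            expect_id = False
        elif current is not None:
            messages[current].append(line)
    return messages


LOG = """\
//
D1CS4A_
NO_ACCESSION_NUMBER
//
D1II7A_
NO_ACCESSION_NUMBER
"""

log = parse_log(LOG)
missing = [d for d, msgs in log.items() if "NO_ACCESSION_NUMBER" in msgs]
print(missing)
```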

5.0 DATA FILES

   If the user requests retrieval of sequences from a sequence database,
   DOMAINSEQS uses a swissprot:PDB equivalence file, which is generated
   by using PDBTOSP.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-dcfinfile]         infile     This option specifies the name of DCF file
                                  (domain classification file) (input). A
                                  'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
  [-dpdbdir]           directory  This option specifies the location of domain
                                  CCF file (clean coordinate files) (input).
                                  A 'clean coordinate file' contains coordinate
                                  and other data for a single PDB file or a
                                  single domain from SCOP or CATH, in CCF
                                  format (EMBL-like). The files, generated by
                                  using PDBPARSE (PDB files) or DOMAINER
                                  (domains), contain 'cleaned-up' data that is
                                  self-consistent and error-corrected.
                                  Records for residue solvent accessibility
                                  and secondary structure are added to the
                                  file by using PDBPLUS.
   -getswiss           toggle     Retrieve swissprot sequence.
*  -pdbtospfile        infile     This option specifies the name of the
                                  pdbcodes to swissprot indexing file. The
                                  swissprot:PDB equivalence file is generated
                                  by PDBTOSP
  [-dcfoutfile]        outfile    This option specifies the name of DCF file
                                  (domain classification file) (output). A
                                  'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -logfile            outfile    This option specifies the name of log file
                                  for the build. The log file contains
                                  messages about any errors arising while
                                  domainseqs ran.

   Additional (Optional) qualifiers (* if not always prompted):
*  -datafile           matrixf    This option specifies the residue
                                  substitution matrix, which is used for
                                  sequence comparison.
*  -gapopen            float      This option specifies the gap insertion
                                  penalty. This is the score taken away when a
                                  gap is created. The best value depends on
                                  the choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
*  -gapextend          float      This option specifies the gap extension
                                  penalty. This is added to the standard gap
                                  penalty for each base or residue in the gap.
                                  This is how long gaps are penalized.
                                  Usually you will expect a few long gaps
                                  rather than many short gaps, so the gap
                                  extension penalty should be lower than the
                                  gap penalty.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-dcfoutfile" associated qualifiers
   -odirectory3        string     Output directory

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Standard (Mandatory) qualifiers

   [-dcfinfile] (Parameter 1)
       This option specifies the name of DCF file (domain classification
       file) (input). A 'domain classification file' contains
       classification and other data for domains from SCOP or CATH, in
       DCF format (EMBL-like). The files are generated by using SCOPPARSE
       and CATHPARSE. Domain sequence information can be added to the
       file by using DOMAINSEQS.
       Allowed values: Input file
       Default:        Required

   [-dpdbdir] (Parameter 2)
       This option specifies the location of domain CCF file (clean
       coordinate files) (input). A 'clean coordinate file' contains
       coordinate and other data for a single PDB file or a single domain
       from SCOP or CATH, in CCF format (EMBL-like). The files, generated
       by using PDBPARSE (PDB files) or DOMAINER (domains), contain
       'cleaned-up' data that is self-consistent and error-corrected.
       Records for residue solvent accessibility and secondary structure
       are added to the file by using PDBPLUS.
       Allowed values: Directory
       Default:        ./

   -getswiss
       Retrieve swissprot sequence.
       Allowed values: Toggle value Yes/No
       Default:        No

   -pdbtospfile
       This option specifies the name of the pdbcodes to swissprot
       indexing file. The swissprot:PDB equivalence file is generated by
       PDBTOSP.
       Allowed values: Input file
       Default:        Required

   [-dcfoutfile] (Parameter 3)
       This option specifies the name of DCF file (domain classification
       file) (output). A 'domain classification file' contains
       classification and other data for domains from SCOP or CATH, in
       DCF format (EMBL-like). The files are generated by using SCOPPARSE
       and CATHPARSE. Domain sequence information can be added to the
       file by using DOMAINSEQS.
       Allowed values: Output file
       Default:        domainseqs.out

   -logfile
       This option specifies the name of log file for the build. The log
       file contains messages about any errors arising while domainseqs
       ran.
       Allowed values: Output file
       Default:        domainseqs.log

   Additional (Optional) qualifiers

   -datafile
       This option specifies the residue substitution matrix, which is
       used for sequence comparison.
       Allowed values: Comparison matrix file in EMBOSS data path
       Default:        EBLOSUM62

   -gapopen
       This option specifies the gap insertion penalty. This is the score
       taken away when a gap is created. The best value depends on the
       choice of comparison matrix. The default value assumes you are
       using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL
       matrix for nucleotide sequences.
       Allowed values: Floating point number from 1.0 to 100.0
       Default:        10.0 for any sequence

   -gapextend
       This option specifies the gap extension penalty. This is added to
       the standard gap penalty for each base or residue in the gap. This
       is how long gaps are penalized. Usually you will expect a few long
       gaps rather than many short gaps, so the gap extension penalty
       should be lower than the gap penalty.
       Allowed values: Floating point number from 0.0 to 10.0
       Default:        0.5 for any sequence

   Advanced (Unprompted) qualifiers

   (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of DOMAINSEQS is shown below.


% domainseqs 
Adds sequence records to a DCF file.
Name of DCF file (domain classification file) (input).: ../scopparse-keep/all.scop
Location of domain CCF file (clean coordinate files) (input). [./]: ../domainer-keep
Retrieve swissprot sequence. [N]: Y
Name of the pdbcodes to swissprot indexing file.: ../pdbtosp-keep/Epdbtosp.dat
Name of DCF file (domain classification file) (output). [domainseqs.out]: all_s.scop
Name of log file for the build. [domainseqs.log]: 

//
D1CS4A_
//
D1II7A_


7.0 KNOWN BUGS & WARNINGS

   The swissprot:PDB equivalence file available as part of the EMBOSS
   distribution does not contain accession numbers for all PDB files and
   covers only a relatively small proportion of domains in SCOP.

8.0 NOTES

   DOMAINSEQS will not attempt to retrieve a swissprot sequence for any
   domains comprised of more than a single segment, i.e. for domains
   whose NC record in the DCF file has a value other than 1 (see
   SCOPPARSE documentation).
   The region of sequence to retrieve is identified by alignment of the
   sequence from the CCF file (structural database) to the full length
   sequences in swissprot. If this were to be done for segmented domains,
   the start and end point for the retrieved sequence (relative to the
   full length sequence) might actually include a completely different
   domain.
   Consider the following, where:
   (A) represents a sequence from the coordinate file for a segmented
   domain D1.
   (B) represents the full-length swissprot sequence. It includes D1, but
   D1 is split by a second domain D2.

                 D1
  (A)  XXXXXXXXXXXXXXXXXXXXX

          D1         D2        D1
  (B) XXXXXXXXXX-----------XXXXXXXXXXX

   Aligning (A) to (B) would give start and end points that span the
   whole of (B), so the retrieved sequence would include D2 and the
   reported start and end points would be misleading.
   The user should be aware that sequences from the domain CCF file for
   domains comprised of more than a single segment are not biologically
   significant, as the sequences are derived from different segments of
   one or more chains. However, such sequences might be acceptable for
   redundancy calculations (e.g. by using DOMAINNR) because two redundant
   domains made of similar fragments will have similar sequences, so the
   redundancy should be detectable.
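
   The problem with segmented domains can be made concrete with a small
   sketch based on diagram (B) above, in which X marks residues of D1 and
   - marks residues of D2. The code is illustrative only and is not how
   DOMAINSEQS is implemented; the variable names are hypothetical.

```python
import re

# Diagram (B): D1 residues are 'X', residues of the inserted domain D2 are '-'.
B = "XXXXXXXXXX-----------XXXXXXXXXXX"

# 1-based (start, end) of each D1 segment within the full-length sequence.
segments = [(m.start() + 1, m.end()) for m in re.finditer(r"X+", B)]

# A single start/end pair for D1 must run from the first segment to the last.
start, end = segments[0][0], segments[-1][1]

# Residues inside that range that do not belong to D1 at all.
foreign = (end - start + 1) - sum(e - s + 1 for s, e in segments)

print(segments, (start, end), foreign)
```

   Because the single range covers every residue of D2 as well, a check
   on the NC record (retrieve only when NC is 1) avoids reporting such a
   range.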

  8.1 GLOSSARY OF FILE TYPES

   Domain classification file (for SCOP)
       Format:      DCF format (EMBL-like format for domain
                    classification data).
       Description: Classification and other data for domains from SCOP.
       Created by:  SCOPPARSE
       See also:    Domain sequence information can be added to the file
                    by using DOMAINSEQS.

   Domain classification file (for CATH)
       Format:      DCF format (EMBL-like format for domain
                    classification data).
       Description: Classification and other data for domains from CATH.
       Created by:  CATHPARSE
       See also:    Domain sequence information can be added to the file
                    by using DOMAINSEQS.

   Clean coordinate file (for protein)
       Format:      CCF format (EMBL-like format for protein coordinate
                    and derived data).
       Description: Coordinate and other data for a single PDB file. The
                    data are 'cleaned-up': self-consistent and
                    error-corrected.
       Created by:  PDBPARSE
       See also:    Records for residue solvent accessibility and
                    secondary structure are added to the file by using
                    PDBPLUS.

   Clean coordinate file (for domain)
       Format:      CCF format (EMBL-like format for protein coordinate
                    and derived data).
       Description: Coordinate and other data for a single domain from
                    SCOP or CATH. The data are 'cleaned-up':
                    self-consistent and error-corrected.
       Created by:  DOMAINER
       See also:    Records for residue solvent accessibility and
                    secondary structure are added to the file by using
                    PDBPLUS.

   swissprot:PDB equivalence file
       Format:      EMBL-like format.
       Description: A file containing swissprot identifiers for PDB
                    codes.
       Created by:  Included in the EMBOSS distribution
       See also:    N.A.


9.0 DESCRIPTION

   Domain sequences are not given in the raw SCOP or CATH parsable files,
   but are required for many analyses and for convenience should,
   ideally, be provided along with the classification itself. DOMAINSEQS
   reads a DCF file (domain classification file) that lacks sequence
   information and writes one containing sequence information.

10.0 ALGORITHM

   In order to find the start and end of a domain in a sequence from
   swissprot, the domain sequence from the domain CCF file is aligned to
   the full length protein sequence from swissprot. Alignment is
   performed first by string handling and if that fails, by using the
   EMBOSS implementation of the Needleman and Wunsch global alignment
   algorithm. Gap insertion and extension penalties used in the
   alignments are user-specified.
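
   The two-stage strategy described above (string handling first,
   alignment as a fallback) can be sketched as follows. This is a
   simplified stand-in and not the EMBOSS implementation: the fallback
   uses simple match/mismatch scores and a single linear gap penalty
   instead of a substitution matrix with separate gap insertion and
   extension penalties, and it leaves unaligned ends of the full-length
   sequence unpenalised. The function name locate_domain and its default
   scores are assumptions chosen for illustration.

```python
def locate_domain(domain, full, match=1, mismatch=-1, gap=-2):
    """Return the 1-based (start, end) of `domain` within `full`."""
    # Stage 1: plain string handling (exact substring search).
    idx = full.find(domain)
    if idx >= 0:
        return idx + 1, idx + len(domain)

    # Stage 2: dynamic-programming alignment of the whole domain against
    # any stretch of the full-length sequence (end gaps in `full` are free).
    n, m = len(domain), len(full)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if domain[i - 1] == full[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in full
                              score[i][j - 1] + gap)     # gap in domain

    # The best-scoring column in the last row marks the end of the domain.
    end = max(range(m + 1), key=lambda j: score[n][j])

    # Trace back to the column where the first domain residue was placed.
    i, j = n, end
    while i > 0:
        s = match if domain[i - 1] == full[j - 1] else mismatch
        if j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return j + 1, end


FULL = "MKTAYIAKQRQISFVKSHFSRQ"       # hypothetical full-length sequence
exact = locate_domain("AKQRQISF", FULL)   # found by string search
fuzzy = locate_domain("AKQRAISF", FULL)   # one substitution: needs alignment
print(exact, fuzzy)
```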

11.0 RELATED APPLICATIONS

See also

    Program name                        Description
   aaindexextract Extract data from AAINDEX
   allversusall   Sequence similarity data from all-versus-all comparison
   cathparse      Generates DCF file from raw CATH files
   cutgextract    Extract data from CUTG
   domainer       Generates domain CCF files from protein CCF files
   domainnr       Removes redundant domains from a DCF file
   domainsse      Add secondary structure records to a DCF file
   hetparse       Converts heterogen group dictionary to EMBL-like format
   pdbparse       Parses PDB files and writes protein CCF files
   pdbplus        Add accessibility & secondary structure to a CCF file
   pdbtosp        Convert swissprot:PDB codes file to EMBL-like format
   printsextract  Extract data from PRINTS
   prosextract    Build the PROSITE motif database for use by patmatmotifs
   rebaseextract  Extract data from REBASE
   scopparse      Generate DCF file from raw SCOP files
   seqnr          Removes redundancy from DHF files
   sites          Generate residue-ligand CON files from CCF files
   ssematch       Search a DCF file for secondary structure matches
   tfextract      Extract data from TRANSFAC

12.0 DIAGNOSTIC ERROR MESSAGES

   None.

13.0 AUTHORS

   Ranjeeva Ranasinghe (rranasin@hgmp.mrc.ac.uk)
   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.
   
   See also http://emboss.sourceforge.net/

