
                          CATHPARSE documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Generates DCF file from raw CATH files

2.0 INPUTS & OUTPUTS

   CATHPARSE parses the CATH classification files, e.g. caths.list.v2.4,
   domlist.v2.4 and CAT.names.all.v2.4. These files are available by
   anonymous ftp from ftp.biochem.ucl.ac.uk (e.g. /pub/cathdata/v2.4). The
   format of these files is explained in the README file available there.
   CATHPARSE writes the CATH classification to a DCF file (EMBL-like
   format). No changes are made to the data other than changing the
   format in which it is held. The input and output files are specified
   by the user.

3.0 INPUT FILE FORMAT

   Excerpts from the raw CATH classification files caths.list.vX.X
   (Figure 1), domlist.vX.X (Figure 2) and CAT.names.all.vX.X (Figure 3)
   are shown below. The format of these files is explained in the CATH
   README file available by anonymous ftp from ftp.biochem.ucl.ac.uk
   (e.g. /pub/cathdata/v2.4).

  Input files for usage example

  File: caths.list.small

1cuk03    1  10   8  10   1   1   1  48 1.900
1hjp03    1  10   8  10   1   1   2  44 2.500
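
   The columns of a caths.list line can be read with a few lines of
   Python. This is a minimal sketch, not the CATHPARSE implementation:
   the names CathListEntry and parse_caths_list_line are hypothetical,
   and the column meanings are inferred from the CI and NR records of the
   example output (the final column is assumed to be the structure
   resolution). The CATH README remains the authoritative description.

```python
from typing import NamedTuple


class CathListEntry(NamedTuple):
    """One line of caths.list (field names are illustrative assumptions)."""
    domain: str
    numbers: tuple      # CL, AR, TP, SF, FA, NI, IF classification numbers
    residues: int       # written to the NR record in the example output
    resolution: float   # assumed meaning of the final column


def parse_caths_list_line(line: str) -> CathListEntry:
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return CathListEntry(
        domain=fields[0],
        numbers=tuple(int(f) for f in fields[1:8]),
        residues=int(fields[8]),
        resolution=float(fields[9]),
    )


entry = parse_caths_list_line("1cuk03    1  10   8  10   1   1   1  48 1.900")
print(entry.domain, entry.numbers, entry.residues)
```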

  File: domlist.small

1cuk00  D03   F00    1  0   1 - 0  66 -    1  0  67 - 0 142 -    1  0 156 - 0 203 -
1hjp00 D03 F01  1  0    1 - 0   66 -  1  0   67 - 0  158 -  1  0  159 - 0  202 -  0  203 - 0  203 - (1)
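
   The segment boundaries in a domlist line can be extracted as sketched
   below. The layout assumed here (chain name, Dnn domain count, Fnn
   fragment count, then for each domain a segment count followed by six
   tokens per segment) is inferred from the example and agrees with the
   CH records of the example output; it should be checked against the
   CATH README. The function name parse_domlist_line is hypothetical, and
   insertion codes and trailing fragment data are ignored.

```python
def parse_domlist_line(line: str):
    """Return (chain_name, domains).

    Each domain is a list of (chain, start, end) segments. Positions are
    kept as strings because they follow the PDB residue numbering.
    """
    tokens = line.split()
    chain_name = tokens[0]
    n_domains = int(tokens[1][1:])      # "D03" -> 3
    pos = 3                             # skip chain name, Dnn and Fnn
    domains = []
    for _ in range(n_domains):
        n_segments = int(tokens[pos])
        pos += 1
        segments = []
        for _ in range(n_segments):
            # chain, start, insert code, chain, end, insert code
            chain, start, _ins1, _chain2, end, _ins2 = tokens[pos:pos + 6]
            segments.append((chain, start, end))
            pos += 6
        domains.append(segments)
    return chain_name, domains          # fragment tokens are not parsed


line = ("1hjp00 D03 F01  1  0    1 - 0   66 -  1  0   67 - 0  158 -"
        "  1  0  159 - 0  202 -  0  203 - 0  203 - (1)")
name, domains = parse_domlist_line(line)
print(name, domains[2])
```

   The third domain of 1hjp00 comes out as chain 0, residues 159 to 202,
   which matches the CH record of entry 1HJP03 in the example output.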

  File: CAT.names.all.small

1.10.8           1cuk03            :Helicase, Ruva Protein, domain 3
1.10.8.10         1cuk03    :DNA helicase RuvA subunit, C-terminal domain
0001             2ccyA0        :Mainly Alpha
0001.0010        1eca00          :Orthogonal Bundle
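
   Each line of the names file pairs a node identifier with a
   representative domain and, after the colon, the descriptive text. A
   minimal sketch of loading these lines into a lookup table follows; the
   function names are hypothetical and the line layout is assumed from
   the example above.

```python
def parse_names_line(line: str):
    """Split 'node  representative  :text' into its three parts."""
    head, _, text = line.partition(":")
    node, representative = head.split()
    return node, representative, text.strip()


def load_names(lines):
    """Map node identifier -> descriptive text, skipping blank lines."""
    names = {}
    for line in lines:
        if line.strip():
            node, _representative, text = parse_names_line(line)
            names[node] = text
    return names


example = [
    "1.10.8           1cuk03            :Helicase, Ruva Protein, domain 3",
    "1.10.8.10         1cuk03    :DNA helicase RuvA subunit, C-terminal domain",
    "0001             2ccyA0        :Mainly Alpha",
    "0001.0010        1eca00          :Orthogonal Bundle",
]
names = load_names(example)
print(names["0001.0010"])
```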

4.0 OUTPUT FILE FORMAT

   An example of the DCF output file is shown in Figure 4. The records
   used to describe an entry are as follows. Records (5) to (8) are used
   to describe the position of the domain in the CATH hierarchy.
     * (1) ID - Domain identifier code. This is a 6-character code that
       uniquely identifies the domain in CATH. The first four characters
       are the PDB identifier code, the fifth character is the PDB chain
       identifier to which the domain belongs and the final character is
       the number of the domain in the chain (for chains comprising more
       than one domain). This character is '0' if the chain comprises a
       single domain only.
     * (2) EN - PDB identifier code. This is the 4-character PDB
       identifier code of the PDB entry containing the domain.
     * (3) TY - domain type. "CATH" or "SCOP" is given ("CATH" for DCF
       files generated by using CATHPARSE).
     * (4) CI - CATH Classification Numbers. The integers preceding the
       codes CL, AR, TP, SF, FA, NI, IF are the CATH classification
       numbers for CLass, ARchitecture, ToPology, Homologous SuperFamily,
       FAmily, Near Identical family and Identical Family respectively.
       These numbers uniquely identify the appropriate node in the CATH
       parsable files.
     * (5) CL - Class. It is the identical text taken from
       CAT.names.all.vX.X.
     * (6) AR - Architecture. It is the identical text taken from
       CAT.names.all.vX.X.
     * (7) TP - Topology. It is the identical text taken from
       CAT.names.all.vX.X.
     * (8) SF - Homologous Superfamily. It is the identical text taken
       from CAT.names.all.vX.X.
     * (9) DS - Sequence of the domain according to the PDB file. This
       sequence is taken from the domain CCF file (clean coordinate file)
       generated by DOMAINER. The DS record will only be present if the
       DCF file has been processed using DOMAINSEQS.
     * (10) NR - Number of residues in domain
     * (11) NC - Number of segments comprising the domain. All domains in
       CATH are from single chains. If the number of segments is greater
       than 1, then the domain entry will have a section containing a CN
       and a CH record (see below) for each segment.
     * (12) CN - Segment number. The number given in brackets after this
       record indicates the start of the data for the relevant segment.
     * (13) CH - Domain definition. The character given before CHAIN is
       the PDB chain identifier, the strings before START and END give
       the start and end positions respectively of the domain in the PDB
       file (a '.' is given in cases where a position was not specified).
       Note that the start and end positions refer to residue numbering
       given in the original PDB file and therefore must be treated as
       strings.
     * (14) XX - used for spacing.
     * (15) // - used to delimit records for a domain.
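
   The record layout described above lends itself to a simple line-based
   reader. The sketch below is an illustration only and is not part of
   EMBOSS: parse_dcf is a hypothetical name, the sample text is an
   abridged copy of the example output, and the reader assumes that every
   record line starts with a two-letter code.

```python
def parse_dcf(text: str):
    """Split DCF text into entries.

    Each entry maps a two-letter record code to the list of values seen
    for that code, so repeated records (e.g. CN and CH) are preserved.
    """
    entries = []
    current = {}
    for line in text.splitlines():
        if line.startswith("//"):           # end of the records for a domain
            if current:
                entries.append(current)
            current = {}
            continue
        code, value = line[:2], line[2:].strip()
        if not code.strip() or code == "XX":  # blank line or spacer record
            continue
        current.setdefault(code, []).append(value)
    if current:                              # tolerate a missing final //
        entries.append(current)
    return entries


SAMPLE = """\
ID   1CUK03
XX
EN   1CUK
XX
NC   1
XX
CN   [1]
XX
CH   0 CHAIN; 156 START; 203 END;
//
ID   1HJP03
XX
EN   1HJP
//
"""

for entry in parse_dcf(SAMPLE):
    print(entry["ID"][0], entry["EN"][0])
```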

  Output files for usage example

  File: Ecath.dat

ID   1CUK03
XX
EN   1CUK
XX
TY   CATH
XX
CI   1 CL; 10 AR; 8 TP; 10 SF; 1 FA; 1 NI;1 IF;
XX
CL   Mainly Alpha
XX
AR   Orthogonal Bundle
XX
TP   Helicase, Ruva Protein, domain 3
XX
SF   DNA helicase RuvA subunit, C-terminal domain
XX
NR   48
XX
NC   1
XX
CN   [1]
XX
CH   0 CHAIN; 156 START; 203 END;
//
ID   1HJP03
XX
EN   1HJP
XX
TY   CATH
XX
CI   1 CL; 10 AR; 8 TP; 10 SF; 1 FA; 1 NI;2 IF;
XX
CL   Mainly Alpha
XX
AR   Orthogonal Bundle
XX
TP   Helicase, Ruva Protein, domain 3
XX
SF   DNA helicase RuvA subunit, C-terminal domain
XX
NR   44
XX
NC   1
XX
CN   [1]
XX
CH   0 CHAIN; 159 START; 202 END;
//
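
   The values of the ID, CI and CH records shown above can be broken into
   their parts as sketched below. The helper names are hypothetical and
   the patterns are based only on the example entries; start and end
   positions are deliberately kept as strings, as required for PDB
   residue numbering.

```python
import re


def split_domain_id(domain_id: str):
    """'1CUK03' -> (PDB code, chain identifier, domain number)."""
    if len(domain_id) != 6:
        raise ValueError("a CATH domain identifier has 6 characters")
    return domain_id[:4], domain_id[4], domain_id[5]


def parse_ci(value: str) -> dict:
    """'1 CL; 10 AR; ...' -> {'CL': 1, 'AR': 10, ...}."""
    pairs = re.findall(r"(\d+)\s+([A-Z]{2});", value)
    return {code: int(number) for number, code in pairs}


def parse_ch(value: str) -> dict:
    """'0 CHAIN; 156 START; 203 END;' -> chain, start and end as strings."""
    pairs = re.findall(r"(\S+)\s+(CHAIN|START|END);", value)
    return {key.lower(): text for text, key in pairs}


print(split_domain_id("1CUK03"))
print(parse_ci("1 CL; 10 AR; 8 TP; 10 SF; 1 FA; 1 NI;1 IF;"))
print(parse_ch("0 CHAIN; 156 START; 203 END;"))
```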

  File: cathparse.log

1.10.8.10
1.10.8
0001.0010
0001
1.10.8.10
1.10.8
0001.0010
0001
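
   The log lines above appear to be the node identifiers looked up in the
   names file for each domain, from superfamily up to class. This is an
   inference from the example files, not documented behaviour. Under that
   assumption, and assuming the zero-padded form seen for class and
   architecture nodes, the identifiers for one domain could be built as
   follows (node_keys is a hypothetical helper).

```python
def node_keys(cl: int, ar: int, tp: int, sf: int):
    """Return names-file identifiers for one domain, most specific first."""
    return [
        f"{cl}.{ar}.{tp}.{sf}",   # homologous superfamily
        f"{cl}.{ar}.{tp}",        # topology
        f"{cl:04d}.{ar:04d}",     # architecture (zero-padded in the example)
        f"{cl:04d}",              # class (zero-padded in the example)
    ]


print("\n".join(node_keys(1, 10, 8, 10)))
```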

5.0 DATA FILES

   None.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers:
  [-listfile]          infile     This option specifies the name of the raw
                                  CATH classification file (caths.list.vX.X)
                                  (input). The raw CATH parsable files
                                  (classification and description files)
                                  are available from ftp.biochem.ucl.ac.uk
                                  (/pub/cathdata/v2.4).
  [-domfile]           infile     This option specifies the name of the raw
                                  CATH classification file (domlist.vX.X)
                                  (input). The raw CATH parsable files
                                  (classification and description files)
                                  are available from ftp.biochem.ucl.ac.uk
                                  (/pub/cathdata/v2.4).
  [-namesfile]         infile     This option specifies the name of the raw
                                  CATH classification file
                                  (CAT.names.all.vX.X) (input). The raw CATH
                                  parsable files (classification and
                                  description files) are available from
                                  ftp.biochem.ucl.ac.uk (/pub/cathdata/v2.4).
  [-outfile]           outfile    This option specifies the name of CATH DCF
                                  file (domain classification file) (output).
                                  A 'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -logfile            outfile    This option specifies the name of the
                                  CATHPARSE log file.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory4        string     Output directory

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Standard (Mandatory) qualifiers:

   [-listfile] (Parameter 1)
      Description:    This option specifies the name of the raw CATH
                      classification file (caths.list.vX.X) (input). The
                      raw CATH parsable files (classification and
                      description files) are available from
                      ftp.biochem.ucl.ac.uk (/pub/cathdata/v2.4).
      Allowed values: Input file
      Default:        caths.list.v2.4

   [-domfile] (Parameter 2)
      Description:    This option specifies the name of the raw CATH
                      classification file (domlist.vX.X) (input). The raw
                      CATH parsable files (classification and description
                      files) are available from ftp.biochem.ucl.ac.uk
                      (/pub/cathdata/v2.4).
      Allowed values: Input file
      Default:        domlist.v2.4

   [-namesfile] (Parameter 3)
      Description:    This option specifies the name of the raw CATH
                      classification file (CAT.names.all.vX.X) (input).
                      The raw CATH parsable files (classification and
                      description files) are available from
                      ftp.biochem.ucl.ac.uk (/pub/cathdata/v2.4).
      Allowed values: Input file
      Default:        CAT.names.all.v2.4

   [-outfile] (Parameter 4)
      Description:    This option specifies the name of the CATH DCF file
                      (domain classification file) (output). A 'domain
                      classification file' contains classification and
                      other data for domains from SCOP or CATH, in DCF
                      format (EMBL-like). The files are generated by using
                      SCOPPARSE and CATHPARSE. Domain sequence information
                      can be added to the file by using DOMAINSEQS.
      Allowed values: Output file
      Default:        Ecath.dat

   -logfile
      Description:    This option specifies the name of the CATHPARSE log
                      file.
      Allowed values: Output file
      Default:        cathparse.log

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of CATHPARSE is shown below.


% cathparse 
Generates DCF file from raw CATH files.
Name of raw CATH classification file (caths.list.vX.X) (input). [caths.list.v2.4]: caths.list.small
Name of raw CATH classification file (domlist.vX.X) (input). [domlist.v2.4]: domlist.small
Name of raw CATH classification file (CAT.names.all.vX.X) (input). [CAT.names.all.v2.4]: CAT.names.all.small
Name of CATH DCF file (domain classification file) (output). [Ecath.dat]: 
Name of CATHPARSE log file [cathparse.log]: 

   The input files for this example are shown in section 3.0 and the
   output files in section 4.0.
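
   The same run can be made without prompts by giving every qualifier on
   the command line together with -auto. The sketch below only assembles
   the argument list from the qualifiers documented in section 6.1; the
   helper names are hypothetical, and run_cathparse assumes that the
   cathparse executable is installed and on the PATH.

```python
import subprocess


def cathparse_command(listfile, domfile, namesfile, outfile, logfile=None):
    """Build a non-interactive cathparse command line as a list of strings."""
    cmd = [
        "cathparse",
        "-listfile", listfile,
        "-domfile", domfile,
        "-namesfile", namesfile,
        "-outfile", outfile,
    ]
    if logfile is not None:
        cmd += ["-logfile", logfile]
    cmd.append("-auto")          # general qualifier: turn off prompts
    return cmd


def run_cathparse(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


cmd = cathparse_command("caths.list.small", "domlist.small",
                        "CAT.names.all.small", "Ecath.dat",
                        logfile="cathparse.log")
print(" ".join(cmd))
```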

7.0 KNOWN BUGS & WARNINGS

8.0 NOTES

   A future implementation will give the option of omitting from the
   output file domains that consist of more than one segment.
   CATH includes several minor classes which are not appropriate for some
   analyses. A future implementation will give the option to omit domains
   from minor classes.
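
   Until such an option exists, multi-segment domains can be removed from
   a DCF file after it has been written. The following is a minimal
   sketch of one way to do this, not a feature of CATHPARSE: the function
   name is hypothetical, the first entry in the sample is made up for
   illustration, and entries are assumed to end with a // line and to
   carry a single NC record.

```python
def drop_multisegment(dcf_text: str) -> str:
    """Return DCF text containing only entries whose NC record is 1."""
    kept, entry = [], []
    for line in dcf_text.splitlines():
        entry.append(line)
        if line.startswith("//"):            # end of one domain entry
            nc = [l[2:].strip() for l in entry if l.startswith("NC")]
            if nc == ["1"]:
                kept.extend(entry)
            entry = []
    return "\n".join(kept) + ("\n" if kept else "")


# The first entry is a hypothetical two-segment domain; the second is
# abridged from the example output.
SAMPLE = (
    "ID   XXXX01\nXX\nNC   2\nXX\nCN   [1]\n//\n"
    "ID   1CUK03\nXX\nNC   1\n//\n"
)
print(drop_multisegment(SAMPLE), end="")
```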

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE:    CATH parsable files
   FORMAT:       CATH format.
   DESCRIPTION:  Raw CATH classification data.
   CREATED BY:   Available from ftp.biochem.ucl.ac.uk (e.g.
                 /pub/cathdata/v2.4)
   SEE ALSO:     N.A.

   FILE TYPE:    Domain classification file (for CATH)
   FORMAT:       DCF format (EMBL-like).
   DESCRIPTION:  Classification and other data for domains from CATH.
   CREATED BY:   CATHPARSE
   SEE ALSO:     Domain sequence information can be added to the file by
                 using DOMAINSEQS.

9.0 DESCRIPTION

   The raw CATH classification files are inconvenient for some uses
   because the text describing the domain classification is given in a
   different file to the classification itself, the file formats are not
   easily extended and differ from other related classifications such as
   SCOP. CATHPARSE reads the raw CATH classification files and writes a
   single file in DCF (EMBL-like) format, which is an easier format to
   work with, is more human-readable and is more extensible than the
   native CATH database format.

10.0 ALGORITHM

   None.

11.0 RELATED APPLICATIONS

See also

    Program name                        Description
   aaindexextract Extract data from AAINDEX
   allversusall   Sequence similarity data from all-versus-all comparison
   cutgextract    Extract data from CUTG
   domainer       Generates domain CCF files from protein CCF files
   domainnr       Removes redundant domains from a DCF file
   domainseqs     Adds sequence records to a DCF file
   domainsse      Add secondary structure records to a DCF file
   hetparse       Converts heterogen group dictionary to EMBL-like format
   pdbparse       Parses PDB files and writes protein CCF files
   pdbplus        Add accessibility & secondary structure to a CCF file
   pdbtosp        Convert swissprot:PDB codes file to EMBL-like format
   printsextract  Extract data from PRINTS
   prosextract    Build the PROSITE motif database for use by patmatmotifs
   rebaseextract  Extract data from REBASE
   scopparse      Generate DCF file from raw SCOP files
   seqnr          Removes redundancy from DHF files
   sites          Generate residue-ligand CON files from CCF files
   ssematch       Search a DCF file for secondary structure matches
   tfextract      Extract data from TRANSFAC

12.0 DIAGNOSTIC ERROR MESSAGES

   None.

13.0 AUTHORS

   Mike Hurley (mhurley@rfcgr.mrc.ac.uk)
   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.
   
   See also http://emboss.sourceforge.net/

  14.1 Other useful references

   1. Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G.
   and Chothia, C. (2000) SCOP: a structural classification of proteins
   database. Nucleic Acids Res. 28, 257-259.
