
                          SCOPPARSE documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Generate DCF file from raw SCOP files

2.0 INPUTS & OUTPUTS

   SCOPPARSE parses the dir.cla.scop.txt and dir.des.scop.txt SCOP
   classification files, which are available, for example, at the URLs:
   http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57
   http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57
   The format of these files is explained at URL:
   http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html
   SCOPPARSE writes the classification to a DCF file (EMBL-like format).
   No changes are made to the data other than changing the format in
   which it is held. The file does not include domain sequence
   information. The input and output files are specified by the user.

3.0 INPUT FILE FORMAT

   Excerpts from the dir.cla.scop.txt (Figure 1) and dir.des.scop.txt
   (Figure 2) SCOP input files are shown below. The format of these files
   is explained on the SCOP website:
   http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html

  Input files for usage example

  File: scop.cla.raw

# dir.cla.scop.txt
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
d1cs4a_ 1cs4    A:      d.58.29.1       39418   cl=53931,cf=54861,sf=55073,fa=55074,dm=55077,sp=55078,px=39418
d1ii7a_ 1ii7    A:      d.159.1.4       62415   cl=53931,cf=56299,sf=56300,fa=64427,dm=64428,sp=64429,px=62415

  File: scop.des.raw

# dir.des.scop.txt
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
53931   cl      d       -       Alpha and beta proteins (a+b)
54861   cf      d.58    -       Ferredoxin-like
55073   sf      d.58.29 -       Adenylyl and guanylyl cyclase catalytic domain
55074   fa      d.58.29.1       -       Adenylyl and guanylyl cyclase catalytic domain
55077   dm      d.58.29.1       -       Adenylyl cyclase VC1, domain C1a
55078   sp      d.58.29.1       -       Dog (Canis familiaris)
39418   px      d.58.29.1       d1cs4a_ 1cs4 A:
56299   cf      d.159   -       Metallo-dependent phosphatases
56300   sf      d.159.1 -       Metallo-dependent phosphatases
64427   fa      d.159.1.4       -       DNA double-strand break repair nuclease
64428   dm      d.159.1.4       -       Mre11
64429   sp      d.159.1.4       -       Archaeon Pyrococcus furiosus
62415   px      d.159.1.4       d1ii7a_ 1ii7 A:
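
   As a minimal sketch of how one data line of the classification file
   might be read, the Python below splits the six whitespace-separated
   columns shown in the excerpt and expands the final column of
   key=sunid pairs. The function name parse_cla_line and the layout of
   the returned dictionary are illustrative assumptions; this is not the
   code used by SCOPPARSE.

```python
def parse_cla_line(line):
    """Parse one data line of dir.cla.scop.txt into a dictionary.

    Returns None for blank lines and '#' comment lines.
    (Hypothetical helper, for illustration only.)
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Columns: domain id, PDB code, chain/region, sccs, px sunid, node sunids
    sid, pdb, region, sccs, sunid, nodes = line.split()
    # The last column is a comma-separated list such as "cl=53931,cf=54861"
    sunids = {key: int(value)
              for key, value in (pair.split("=") for pair in nodes.split(","))}
    return {"sid": sid, "pdb": pdb, "region": region,
            "sccs": sccs, "sunid": int(sunid), "sunids": sunids}


record = parse_cla_line(
    "d1cs4a_ 1cs4    A:      d.58.29.1       39418   "
    "cl=53931,cf=54861,sf=55073,fa=55074,dm=55077,sp=55078,px=39418"
)
print(record["sid"], record["sccs"], record["sunids"]["fa"])
```

   The sunids in the last column are the keys used to look up the
   descriptive text in dir.des.scop.txt.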

4.0 OUTPUT FILE FORMAT

   An example of the DCF output file is shown in Figure 3. The records
   used to describe an entry are as follows. Records (4) to (9) are used
   to describe the position of the domain in the SCOP hierarchy. Various
   other ADDITIONAL RECORDS may be present if the file is processed by
   other programs, e.g. DOMAINSEQS or DOMAINSSE.
     * (1) ID - Domain identifier code. This is a 7-character code that
       uniquely identifies the domain in SCOP. It corresponds to the
       first 7 characters of a line in the SCOP classification file,
       converted to upper case. The first character is always 'D', the
       next four characters are the PDB identifier code, the sixth
       character is the PDB chain identifier to which the domain belongs
       (a '.' is given in cases where the domain is composed of multiple
       chains, a '_' is given where a chain identifier was not specified
       in the PDB file) and the final character is the number of the
       domain in the chain (for chains comprising more than one domain)
       or '_' (the chain comprises a single domain only).
     * (2) EN - PDB identifier code. This is the 4-character PDB
       identifier code of the PDB entry containing the domain.
     * (3) TY - domain type. "CATH" or "SCOP" is given ("SCOP" for DCF
       files generated by using SCOPPARSE).
     * (4) SI - SCOP sunids. The integers preceding the codes CL, FO,
       SF, FA, DO, SO and DD are the SCOP sunids for Class, Fold,
       Superfamily, Family, Domain, Source and domain data respectively.
       These numbers uniquely identify the appropriate node in the SCOP
       parsable files.
     * (5) CL - Domain class. It is identical to the text given after
       'Class' in the SCOP classification file.
     * (6) FO - Domain fold. It is identical to the text given after
       'Fold' in the SCOP classification file.
     * (7) SF - Domain superfamily. It is identical to the text given
       after 'Superfamily' in the SCOP classification file.
     * (8) FA - Domain family. It is identical to the text given after
       'Family' in the SCOP classification file.
     * (9) DO - Domain name. It is identical to the text given after
       'Protein' in the SCOP classification file.
     * (10) OS - Source of the protein. It is identical to the text given
       after 'Species' in the SCOP classification file.
     * (11) DS - Sequence of the domain according to the PDB file. This
       sequence is taken from the domain clean coordinate file generated
       by DOMAINER. The DS record will only be present if the DCF file
       has been processed using DOMAINSEQS.
     * (12) NC - Number of chains comprising the domain, or number of
       segments from the same chain that the domain is comprised of. NC
       is usually 1. If the number of chains is greater than 1, then the
       domain entry will have a section containing a CN and a CH record
       (see below) for each chain.
     * (13) CN - Chain number. The number given in brackets after this
       record indicates the start of the data for the relevant chain.
     * (14) CH - Domain definition. The character given before CHAIN is
       the PDB chain identifier (a '.' is given in cases where a chain
       identifier was not specified in the DCF file), the strings before
       START and END give the start and end positions respectively of the
       domain in the PDB file (a '.' is given in cases where a position
       was not specified). Note that the start and end positions refer to
       residue numbering given in the original PDB file and therefore
       must be treated as strings.
     * (15) XX - used for spacing.
     * (16) // - used to delimit records for a domain.
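
   The layout of the ID record can be illustrated with a short Python
   sketch that splits a 7-character identifier into the PDB code, the
   chain identifier that follows it and the trailing domain number. The
   function name decode_domain_id and the keys of the returned
   dictionary are assumptions made for this example only.

```python
def decode_domain_id(domain_id):
    """Split a 7-character SCOP domain identifier into its parts.

    (Hypothetical helper, for illustration only.)
    """
    if len(domain_id) != 7 or domain_id[0].upper() != "D":
        raise ValueError(f"not a SCOP domain identifier: {domain_id!r}")
    chain, number = domain_id[5], domain_id[6]
    return {
        # Characters 2 to 5 hold the PDB identifier code
        "pdb": domain_id[1:5],
        # '.' means multiple chains, '_' means no chain identifier given
        "chain": None if chain in "._" else chain,
        "multiple_chains": chain == ".",
        # '_' means the chain comprises a single domain only
        "domain_number": None if number == "_" else number,
    }


parts = decode_domain_id("D1CS4A_")
print(parts)
```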

  Output files for usage example

  File: all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
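
   Because every record starts with a two-character code and entries are
   delimited by '//', a DCF file is straightforward to read back. The
   Python below is a minimal sketch of such a reader; the names read_dcf
   and parse_si are assumptions for this example, and continuation lines
   that carry no record code (such as the sequence lines of an SQ record)
   are simply skipped.

```python
def read_dcf(text):
    """Return a list of entries; each maps a record code to its values.

    (Hypothetical helper, for illustration only.)
    """
    entries, current = [], {}
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "//":
            # End of the records for one domain
            if current:
                entries.append(current)
            current = {}
        elif code.strip() and code != "XX":
            # Repeated codes accumulate in a list; XX spacers are ignored
            current.setdefault(code, []).append(value)
    return entries


def parse_si(value):
    """Turn '53931 CL; 54861 FO; ...' into {'CL': 53931, 'FO': 54861, ...}."""
    pairs = (item.split() for item in value.split(";") if item.strip())
    return {code: int(sunid) for sunid, code in pairs}


SAMPLE = """\
ID   D1CS4A_
XX
EN   1CS4
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
CH   A CHAIN; . START; . END;
//
"""

entry = read_dcf(SAMPLE)[0]
print(entry["ID"][0], parse_si(entry["SI"][0])["FA"])
```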

5.0 DATA FILES

   No data files are used.

6.0 USAGE

   Standard (Mandatory) qualifiers:
  [-classfile]         infile     This option specifies the name of raw SCOP
                                  classification file dir.cla.scop.txt_X.XX
                                  (input). This is the raw SCOP classification
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57.
  [-desinfile]         infile     This option specifies the name of raw SCOP
                                  description file dir.des.scop.txt_X.XX
                                  (input). This is the raw SCOP description
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57.
   -nosegments         boolean    This option specifies whether to omit
                                  domains comprising of more than one segment.
                                  This is necessary if a continuous residue
                                  sequence is required.
   -nomultichain       boolean    This option specifies whether to omit
                                  domains comprising segments from more than
                                  one chain. This is necessary if a continuous
                                  residue sequence is required.
  [-dcffile]           outfile    This option specifies the name of SCOP DCF
                                  file (domain classification file) (output).
                                  A 'domain classification file' contains
                                  classification and other data for domains
                                  from the SCOP or CATH databases. The file is
                                  generated by using DOMAINER and is in DCF
                                  format (EMBL-like). Domain sequence
                                  information can be added to the file by
                                  using DOMAINSEQS.

   Additional (Optional) qualifiers:
   -nominor            boolean    This option specifies whether to omit
                                  domains from minor classes (defined as
                                  anything not in class 'All alpha proteins',
                                  'All beta proteins', 'Alpha and beta
                                  proteins (a/b)' or 'Alpha and beta proteins
                                  (a+b)'). This is necessary or appropriate
                                  for many analyses.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-dcffile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers

   [-classfile] (Parameter 1)
       This option specifies the name of raw SCOP classification file
       dir.cla.scop.txt_X.XX (input). This is the raw SCOP classification
       file available at
       http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57.
       Allowed values: Input file
       Default:        Required

   [-desinfile] (Parameter 2)
       This option specifies the name of raw SCOP description file
       dir.des.scop.txt_X.XX (input). This is the raw SCOP description
       file available at
       http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57.
       Allowed values: Input file
       Default:        Required

   -nosegments
       This option specifies whether to omit domains comprising of more
       than one segment. This is necessary if a continuous residue
       sequence is required.
       Allowed values: Boolean value Yes/No
       Default:        No

   -nomultichain
       This option specifies whether to omit domains comprising segments
       from more than one chain. This is necessary if a continuous
       residue sequence is required.
       Allowed values: Boolean value Yes/No
       Default:        No

   [-dcffile] (Parameter 3)
       This option specifies the name of SCOP DCF file (domain
       classification file) (output). A 'domain classification file'
       contains classification and other data for domains from the SCOP
       or CATH databases. The file is generated by using DOMAINER and is
       in DCF format (EMBL-like). Domain sequence information can be
       added to the file by using DOMAINSEQS.
       Allowed values: Output file
       Default:        test.scop

   Additional (Optional) qualifiers

   -nominor
       This option specifies whether to omit domains from minor classes
       (defined as anything not in class 'All alpha proteins', 'All beta
       proteins', 'Alpha and beta proteins (a/b)' or 'Alpha and beta
       proteins (a+b)'). This is necessary or appropriate for many
       analyses.
       Allowed values: Boolean value Yes/No
       Default:        No

   Advanced (Unprompted) qualifiers

   (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of SCOPPARSE is shown below.


% scopparse 
Generate DCF file from raw SCOP files.
Name of raw SCOP classification file dir.cla.scop.txt_X.XX (input).: scop.cla.raw
Name of raw SCOP description file dir.des.scop.txt_X.XX (input).: scop.des.raw
Omit domains comprising of more than one segment. [N]: Y
Omit domains comprising segments from more than one chain. [N]: N
Name of SCOP DCF file (domain classification file) (output). [test.scop]: all.scop

   The input files for this example are shown in section 3.0 and the
   output file in section 4.0.
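
   The same run can in principle be given as a single non-interactive
   command built from the qualifiers listed in section 6.0. The Python
   below is a sketch that only assembles and prints such a command; it
   does not run SCOPPARSE. The function name scopparse_command is
   hypothetical, and the sketch assumes that a boolean qualifier is
   switched on by giving its bare name, which should be checked against
   the installed EMBOSS version.

```python
import shlex


def scopparse_command(classfile, desinfile, dcffile,
                      nosegments=False, nomultichain=False, nominor=False):
    """Build an argument list for a non-interactive scopparse run.

    (Hypothetical helper; assumes bare boolean qualifiers mean 'yes'.)
    """
    cmd = ["scopparse",
           "-classfile", classfile,
           "-desinfile", desinfile,
           "-dcffile", dcffile]
    if nosegments:
        cmd.append("-nosegments")
    if nomultichain:
        cmd.append("-nomultichain")
    if nominor:
        cmd.append("-nominor")
    # -auto turns off prompts (see the general qualifiers)
    cmd.append("-auto")
    return cmd


command = scopparse_command("scop.cla.raw", "scop.des.raw", "all.scop",
                            nosegments=True)
print(shlex.join(command))
```

   The resulting list could be passed to subprocess.run on a system
   where EMBOSS is installed.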

7.0 KNOWN BUGS & WARNINGS

   None.

8.0 NOTES

   Some SCOP domains comprise more than one segment of polypeptide
   chain, and these segments may belong to one or to several polypeptide
   chains. It is debatable whether a domain (using the widely accepted
   definition) can truly consist of regions from more than one
   polypeptide. Accordingly, SCOPPARSE gives the option of omitting from
   the output file domains that consist of more than one segment
   (-nosegments) and domains whose segments come from more than one
   chain (-nomultichain).
   SCOP includes several minor classes which are not appropriate for some
   analyses. Accordingly, SCOPPARSE gives the option to omit domains from
   minor classes (-nominor). A minor class is defined as anything not in
   class 'All alpha proteins', 'All beta proteins', 'Alpha and beta
   proteins (a/b)' or 'Alpha and beta proteins (a+b)'.
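
   The three omission rules can be summarised in a few lines of Python.
   This is a sketch of the behaviour described in this section, not the
   SCOPPARSE source; the function name keep_domain and the
   representation of a domain as a list of (chain, start, end) segments
   are assumptions made for the example.

```python
MAJOR_CLASSES = {
    "All alpha proteins",
    "All beta proteins",
    "Alpha and beta proteins (a/b)",
    "Alpha and beta proteins (a+b)",
}


def keep_domain(class_name, segments,
                nosegments=False, nomultichain=False, nominor=False):
    """Decide whether a domain would be written to the output file.

    segments is a list of (chain, start, end) tuples, one per segment.
    (Hypothetical helper, for illustration only.)
    """
    chains = {chain for chain, _start, _end in segments}
    if nosegments and len(segments) > 1:
        return False          # more than one segment
    if nomultichain and len(chains) > 1:
        return False          # segments from more than one chain
    if nominor and class_name not in MAJOR_CLASSES:
        return False          # domain belongs to a minor class
    return True


two_chains = [("A", "1", "50"), ("B", "10", "60")]
print(keep_domain("All alpha proteins", two_chains, nomultichain=True))
```

   Note that a domain with segments from two chains is rejected by
   either of the first two options, whereas a domain with two segments
   from the same chain is rejected only by the first.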

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE:    SCOP parsable files
   FORMAT:       SCOP format.
   DESCRIPTION:  Raw SCOP classification data.
   CREATED BY:   Available from http://scop.mrc-lmb.cam.ac.uk/scop/parse/
   SEE ALSO:     N.A.

   FILE TYPE:    Domain classification file (for SCOP)
   FORMAT:       DCF format (EMBL-like format for domain classification
                 data).
   DESCRIPTION:  Classification and other data for domains from SCOP.
                 The file is in DCF format (EMBL-like).
   CREATED BY:   SCOPPARSE
   SEE ALSO:     Domain sequence information can be added to the file by
                 using DOMAINSEQS.

  8.2 ADDITIONAL RECORDS

   The following records for database sequence and secondary structure
   may be present in a DCF file that has been processed by using
   DOMAINSEQS or DOMAINSSE.
     * (1) AC - Accession number of the domain sequence. This record will
       only be present if the DCF file has been processed using
       DOMAINSEQS and if an accession number for the PDB file
       corresponding to the domain is given in the swissprot:PDB
       equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use
       of.
     * (2) SP - Swissprot code of the domain sequence. This record will
       only be present if the domain classification file has been
       processed using DOMAINSEQS and if a swissprot code for the PDB
       file corresponding to the domain is given in the swissprot:PDB
       equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use
       of.
     * (3) RA - Position of domain in swissprot sequence. The integers
       preceding START and END give the start and end points
       respectively of the domain sequence relative to the full-length
       swissprot sequence.
     * (4) SQ - Sequence of the domain according to swissprot. This
       sequence is taken from the swissprot database. The SQ record will
       only be present if the DCF file has been processed using
       DOMAINSEQS and if an accession number for the PDB file
       corresponding to the domain is given in the swissprot:PDB
       equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use
       of.
     * (5) SS - Secondary structure assignment (per residue). A
       simplified 3-state assignment: H (helix), E (extended, beta
       strand) and L (loop, open coil) is used.
     * (6) SE - Secondary structure element assignment (per secondary
       structure element). The same 3-state assignment as for the SS
       record is used: H (helix), E (extended, beta strand) and L (loop,
       open coil).

XX
AC   P02213
XX
SP   GLB1_SCAIN
XX
RA   1 START; 146 END;
XX
SQ   SEQUENCE   146 AA;  15947 MW;  5868B4E5 CRC32;
     PSVYDAAAQL TADVKKDLRD SWKVIGSDKK GNGVALMTTL FADNQETIGY FKRLGDVSQG
     MANDKLRGHS ITLMYALQNF IDQLDNPDDL VCVVEKFAVN HITRKISAAE FGKINGPIKK
     VLASKNFGDK YANAWAKLVA VVQAAL


9.0 DESCRIPTION

   The raw SCOP classification files are inconvenient for some uses
   because the text describing the domain classification is given in a
   different file to the classification itself, the file formats are not
   easily extended and differ from other related classifications such as
   CATH. SCOPPARSE reads the raw SCOP classification files and writes a
   single file in DCF (EMBL-like) format, which is an easier format to
   work with, is more human-readable and is more extensible than the
   native SCOP database format.
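
   The conversion described here amounts to looking up each sunid from
   the classification file in the description file and writing the
   result as tagged records. The Python below is a simplified sketch of
   that join for the classification records only (ID to OS); the NC, CN
   and CH records are left out. The names parse_des and dcf_records are
   assumptions for this example and do not reflect the SCOPPARSE source.

```python
SI_CODES = [("cl", "CL"), ("cf", "FO"), ("sf", "SF"), ("fa", "FA"),
            ("dm", "DO"), ("sp", "SO"), ("px", "DD")]
TEXT_CODES = [("cl", "CL"), ("cf", "FO"), ("sf", "SF"), ("fa", "FA"),
              ("dm", "DO"), ("sp", "OS")]


def parse_des(lines):
    """Map each sunid in dir.des.scop.txt to its description text."""
    descriptions = {}
    for line in lines:
        if line.strip() and not line.startswith("#"):
            # Columns: sunid, level, sccs, domain id (or '-'), description
            sunid, _level, _sccs, _sid, text = line.split(None, 4)
            descriptions[int(sunid)] = text.strip()
    return descriptions


def dcf_records(sid, pdb, sunids, descriptions):
    """Format the classification records of one DCF entry (ID to OS)."""
    records = [
        "ID   " + sid.upper(),
        "EN   " + pdb.upper(),
        "TY   SCOP",
        "SI   " + " ".join(f"{sunids[key]} {code};" for key, code in SI_CODES),
    ]
    records += [f"{code}   {descriptions[sunids[key]]}"
                for key, code in TEXT_CODES]
    # XX lines are used for spacing between records
    return "\nXX\n".join(records)


DES = """\
53931   cl      d       -       Alpha and beta proteins (a+b)
54861   cf      d.58    -       Ferredoxin-like
55073   sf      d.58.29 -       Adenylyl and guanylyl cyclase catalytic domain
55074   fa      d.58.29.1       -       Adenylyl and guanylyl cyclase catalytic domain
55077   dm      d.58.29.1       -       Adenylyl cyclase VC1, domain C1a
55078   sp      d.58.29.1       -       Dog (Canis familiaris)
39418   px      d.58.29.1       d1cs4a_ 1cs4 A:
"""
SUNIDS = {"cl": 53931, "cf": 54861, "sf": 55073, "fa": 55074,
          "dm": 55077, "sp": 55078, "px": 39418}

output = dcf_records("d1cs4a_", "1cs4", SUNIDS, parse_des(DES.splitlines()))
print(output)
```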

10.0 ALGORITHM

   None.

11.0 RELATED APPLICATIONS

See also

    Program name                        Description
   aaindexextract Extract data from AAINDEX
   allversusall   Sequence similarity data from all-versus-all comparison
   cathparse      Generates DCF file from raw CATH files
   cutgextract    Extract data from CUTG
   domainer       Generates domain CCF files from protein CCF files
   domainnr       Removes redundant domains from a DCF file
   domainseqs     Adds sequence records to a DCF file
   domainsse      Add secondary structure records to a DCF file
   hetparse       Converts heterogen group dictionary to EMBL-like format
   pdbparse       Parses PDB files and writes protein CCF files
   pdbplus        Add accessibility & secondary structure to a CCF file
   pdbtosp        Convert swissprot:PDB codes file to EMBL-like format
   printsextract  Extract data from PRINTS
   prosextract    Build the PROSITE motif database for use by patmatmotifs
   rebaseextract  Extract data from REBASE
   seqnr          Removes redundancy from DHF files
   sites          Generate residue-ligand CON files from CCF files
   ssematch       Search a DCF file for secondary structure matches
   tfextract      Extract data from TRANSFAC

12.0 DIAGNOSTIC ERROR MESSAGES

   None.

13.0 AUTHORS

   Alan Bleasby (ableasby@hgmp.mrc.ac.uk)
   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.
   
   See also http://emboss.sourceforge.net/

  14.1 Other useful references

   1. Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G.
   and Chothia, C. (2000) SCOP: a structural classification of proteins
   database. Nucleic Acids Res. 28, 257-259.
