
                            SITES documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Generate residue-ligand CON files from CCF files

2.0 INPUTS & OUTPUTS

   SITES reads CCF files (clean coordinate files) and writes a CON file
   (contacts file) of residue-ligand contact data for domains in a DCF
   file (domain classification file). The CON file contains contact data
   for all ligand-domain pairs (using domain definitions from the DCF
   file) found in the CCF files. The user specifies the input and output
   files (file extensions are set in the ACD file). A log file is also
   written.

3.0 INPUT FILE FORMAT

   The format of the protein CCF file is described in the PDBPARSE
   documentation.

  Input files for usage example

  File: ../scopparse-keep/all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//

4.0 OUTPUT FILE FORMAT

   The CON format used for the contact files (Figure 1) is similar to
   EMBL format and is described in the CONTACTS documentation. A few of
   the records differ in the SITES output compared to the CONTACTS
   output, however, so for the sake of clarity all records are described
   below.
     * XX - used for spacing and comments. The first line is
       bibliographic information and contains the text "Residue-ligand
       contact data (for domains)".
     * TY - type of contact. For CON files generated by SITES, 'LIGAND'
       is always given.
     * EX - experimental information. The value of the threshold contact
       distance is given as a floating point number after 'THRESH'. For
       CON files generated by SITES, a '.' is given after 'IGNORE',
       'NMOD' and 'NCHA' (these records are used by the CONTACTS and
       INTERFACE applications and can be disregarded).
     * NE - number of entries in the file. For CON files generated by
       SITES, this is the number of unique ligand:domain pairs.
       Following the NE record, the file has a section for each entry
       containing records for entry number (EN), identifier codes (ID),
       ligand description (DE), polypeptide chain-specific data (CN),
       chain sequence information (S1) and number of contacts (NC), that
       together define the ligand:domain pair and its contacts.
     * EN - entry number. The number in brackets indicates the start of
       an entry (ligand:domain pair).
     * ID - identifier codes: (1) PDB: 4-character PDB identifier code.
       (2) DOM: 7-character domain identifier code from SCOP or CATH. (3)
       LIG: Ligand identifier (an abbreviation of its full name).
     * DE - Full name of the ligand, see HETPARSE documentation.
     * CN - polypeptide chain-specific data. Tokens delimiting data items
       are as follows. (1) MO: The model number (from the PDB file). '1'
       is always given for CON files generated by using SITES (contacts
       were calculated from the coordinates for a single model from a
       domain CCF file). (2) CN1: Chain number. '1' is always given
       (domains from a domain CCF file are always listed as from a single
       chain only). (3) CN2: Not used by SITES, a '.' is given. (4) ID1:
       PDB chain identifier (a '.' is given in cases where a chain
       identifier was not specified in the original PDB file or, for
       domain CCF files, the domain from SCOP or CATH is comprised of
       more than one chain). (5) ID2: Not used by SITES, a '.' is given.
       (6) NRES1: number of residues in chain. (7) NRES2: Not used by
       SITES, a '.' is given.
     * S1 - polypeptide chain sequence for domain. The number of residues
       is given before AA on the first line. The sequence is given on
       subsequent lines.
     * NC - number of contacts: (1) SM: Not used by SITES, a '.' is
       given. (2) LI: Number of residue-ligand contacts; between
       side-chain or main-chain atoms of an amino acid residue and a
       ligand.
     * LI - Line of residue-ligand contact data. The amino acid
       identifier and residue number are given. Residue numbers are taken
       from the CCF file and give a correct index into the sequence (i.e.
       they are not necessarily the same as the original PDB file). This
       sequence is given in the CON file itself (S1 record).
     * // - delimiter for individual entries in the file and also given
       on the last line of the file.
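
   As an illustration of the record layout described above, the following
   Python sketch reads CON-format text and collects, for each entry, the
   ID tokens and the LI contact lines. It is a minimal sketch and not part
   of SITES: the function name parse_con and the dictionary keys are
   hypothetical, and it assumes that every record starts with a
   two-character code followed by whitespace and that each entry is closed
   by a '//' line.

```python
def parse_con(text):
    """Split CON-format text into entries (a sketch; names are hypothetical).

    Returns a list of dicts with the keys 'pdb', 'dom', 'lig' and
    'contacts', where 'contacts' is a list of (residue, number) tuples.
    """
    entries = []
    current = None
    for line in text.splitlines():
        # The record code is everything before the first space; sequence
        # continuation lines start with spaces and so yield an empty code.
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "EN":
            # An EN record marks the start of a ligand:domain entry.
            current = {"pdb": None, "dom": None, "lig": None, "contacts": []}
        elif current is None:
            # Header records (XX, TY, EX, NE) before the first entry.
            continue
        elif code == "ID":
            # e.g. "PDB 1cs4; DOM d1cs4a_; LIG 101;"
            for item in rest.split(";"):
                parts = item.split()
                if len(parts) == 2:
                    current[parts[0].lower()] = parts[1]
        elif code == "LI":
            # e.g. "ASP 2": amino acid identifier and residue number.
            residue, number = rest.split()
            current["contacts"].append((residue, int(number)))
        elif code == "//":
            # End of the entry: store it and wait for the next EN record.
            entries.append(current)
            current = None
    return entries
```

   Calling parse_con on the contents of a CON file returns one dictionary
   per complete entry. Records that the sketch does not need (DE, SI, CN,
   S1, NC) are skipped rather than validated.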

  Output files for usage example

  File: SITES.con

XX   Residue-ligand contact data (for domains).
XX
TY   LIGAND
XX
EX   THRESH 1.0; IGNORE .; NMOD .; NCHA .;
XX
NE   11
XX
EN   [1]
XX
ID   PDB 1cs4; DOM d1cs4a_; LIG 101;
XX
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
XX
SI   SN 1; NS 2
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  47362A43 CRC32;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM .; LI 6
XX
LI   ASP 2
LI   PHE 6
LI   THR 7
LI   LEU 44
LI   GLY 45
LI   ASP 46
XX
//
EN   [2]
XX
ID   PDB 1ii7; DOM d1ii7a_; LIG 101;
XX
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
XX
SI   SN 2; NS 2
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 65; NRES2 .
XX
S1   SEQUENCE    65 AA;   7396 MW;  0CFB92A3 CRC32;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   SM .; LI 2
XX
LI   HIS 10
LI   ASP 49
XX


  [Part of this file has been deleted for brevity]

NC   SM .; LI 3
XX
LI   ASP 8
LI   HIS 10
LI   ASP 49
XX
//
EN   [10]
XX
ID   PDB 2hhb; DOM .; LIG PO4;
XX
DE   PHOSPHATE ION
XX
SI   SN 1; NS 1
XX
CN   MO .; CN1 1; CN2 .; ID1 D; ID2 .; NRES1 146; NRES2 .
XX
S1   SEQUENCE   146 AA;  15868 MW;  EC9744C9 CRC32;
     VHLTPEEKSA VTALWGKVNV DEVGGEALGR LLVVYPWTQR FFESFGDLST PDAVMGNPKV
     KAHGKKVLGA FSDGLAHLDN LKGTFATLSE LHCDKLHVDP ENFRLLGNVL VCVLAHHFGK
     EFTPPVQAAY QKVVAGVANA LAHKYH
XX
NC   SM .; LI 2
XX
LI   VAL 1
LI   LEU 81
XX
//
EN   [11]
XX
ID   PDB 1cs4; DOM d1cs4a_; LIG POP;
XX
DE   PYROPHOSPHATE 2-
XX
SI   SN 1; NS 1
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  47362A43 CRC32;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM .; LI 6
XX
LI   ASP 2
LI   ILE 3
LI   GLU 4
LI   GLY 5
LI   PHE 6
LI   THR 7
XX
//

  File: sites.log

CCF: /ebi/services/idata/pmr/hgmp/test/qa/pdbplus-keep/2hhb.ccf HETS:YES NHETS:5 SCOP:NO NCHN:4
CCF: /ebi/services/idata/pmr/hgmp/test/qa/pdbplus-keep/1cs4.ccf HETS:YES NHETS:7 SCOP:YES NDOMS: 1
CCF: /ebi/services/idata/pmr/hgmp/test/qa/pdbplus-keep/1ii7.ccf HETS:YES NHETS:5 SCOP:YES NDOMS: 1

5.0 DATA FILES

   SITES uses a data file containing van der Waals radii for atoms in
   proteins (see the CONTACTS documentation). The file Evdw.dat is such a
   data file and is part of the EMBOSS distribution.
   SITES also uses a data file containing a dictionary of heterogen
   groups in PDB. The file Ehet.dat is such a data file; it may be
   generated by using HETPARSE and is part of the EMBOSS distribution.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers:
  [-protpath]          dirlist    This option specifies the location of the
                                  protein CCF files (clean coordinate files)
                                  (input). A 'clean coordinate file' contains
                                  protein coordinate and derived data for a
                                  single PDB file ('protein clean coordinate
                                  file') or a single domain from SCOP or CATH
                                  ('domain clean coordinate file'), in CCF
                                  format (EMBL-like). The files, generated by
                                  using PDBPARSE (PDB files) or DOMAINER
                                  (domains), contain 'cleaned-up' data that is
                                  self-consistent and error-corrected.
                                  Records for residue solvent accessibility
                                  and secondary structure are added to the
                                  file by using PDBPLUS.
  [-domaindir]         directory  This option specifies the location of the
                                  domain CCF files (clean coordinate files)
                                  (input). A 'clean coordinate file' contains
                                  protein coordinate and derived data for a
                                  single PDB file ('protein clean coordinate
                                  file') or a single domain from SCOP or CATH
                                  ('domain clean coordinate file'), in CCF
                                  format (EMBL-like). The files, generated by
                                  using PDBPARSE (PDB files) or DOMAINER
                                  (domains), contain 'cleaned-up' data that is
                                  self-consistent and error-corrected.
                                  Records for residue solvent accessibility
                                  and secondary structure are added to the
                                  file by using PDBPLUS.
  [-dcffile]           infile     This option specifies the name of the DCF
                                  file (domain classification file) (input). A
                                  'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -threshold          float      This option specifies the threshold contact
                                  distance.
  [-outfile]           outfile    This option specifies the name of the output
                                  file.
   -logfile            outfile    This option specifies the name of the log
                                  file.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -dicfile            datafile   This option specifies the dictionary of
                                  heterogen groups in PDB. This file is
                                  generated by using HETPARSE and is part of
                                  the EMBOSS distribution.
   -vdwfile            datafile   This option specifies the name of the data
                                  file with van der Waals radii for atoms in
                                  amino acid residues. This file is part of
                                  the EMBOSS distribution.

   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory4        string     Output directory

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Standard (Mandatory) qualifiers

   [-protpath] (Parameter 1)
      Description:    This option specifies the location of the protein
                      CCF files (clean coordinate files) (input). A
                      'clean coordinate file' contains protein coordinate
                      and derived data for a single PDB file ('protein
                      clean coordinate file') or a single domain from
                      SCOP or CATH ('domain clean coordinate file'), in
                      CCF format (EMBL-like). The files, generated by
                      using PDBPARSE (PDB files) or DOMAINER (domains),
                      contain 'cleaned-up' data that is self-consistent
                      and error-corrected. Records for residue solvent
                      accessibility and secondary structure are added to
                      the file by using PDBPLUS.
      Allowed values: Directory with files
      Default:        ./

   [-domaindir] (Parameter 2)
      Description:    This option specifies the location of the domain
                      CCF files (clean coordinate files) (input). A
                      'clean coordinate file' contains protein coordinate
                      and derived data for a single PDB file ('protein
                      clean coordinate file') or a single domain from
                      SCOP or CATH ('domain clean coordinate file'), in
                      CCF format (EMBL-like). The files, generated by
                      using PDBPARSE (PDB files) or DOMAINER (domains),
                      contain 'cleaned-up' data that is self-consistent
                      and error-corrected. Records for residue solvent
                      accessibility and secondary structure are added to
                      the file by using PDBPLUS.
      Allowed values: Directory
      Default:        ./

   [-dcffile] (Parameter 3)
      Description:    This option specifies the name of the DCF file
                      (domain classification file) (input). A 'domain
                      classification file' contains classification and
                      other data for domains from SCOP or CATH, in DCF
                      format (EMBL-like). The files are generated by
                      using SCOPPARSE and CATHPARSE. Domain sequence
                      information can be added to the file by using
                      DOMAINSEQS.
      Allowed values: Input file
      Default:        Required

   -threshold
      Description:    This option specifies the threshold contact
                      distance.
      Allowed values: Any numeric value
      Default:        1.0

   [-outfile] (Parameter 4)
      Description:    This option specifies the name of the output file.
      Allowed values: Output file
      Default:        SITES.con

   -logfile
      Description:    This option specifies the name of the log file.
      Allowed values: Output file
      Default:        sites.log

   Additional (Optional) qualifiers

   (none)

   Advanced (Unprompted) qualifiers

   -dicfile
      Description:    This option specifies the dictionary of heterogen
                      groups in PDB. This file is generated by using
                      HETPARSE and is part of the EMBOSS distribution.
      Allowed values: Data file
      Default:        Ehet.dat

   -vdwfile
      Description:    This option specifies the name of the data file
                      with van der Waals radii for atoms in amino acid
                      residues. This file is part of the EMBOSS
                      distribution.
      Allowed values: Data file
      Default:        Evdw.dat

  6.2 EXAMPLE SESSION

   An example of interactive use of SITES is shown below.


% sites 
Generate residue-ligand CON files from CCF files.
Protein CCF files (clean coordinate files) (input) [./]: ../pdbplus-keep
Domain CCF files (clean coordinate files) (input) [./]: ../domainer-keep
Name of DCF file (domain classification file) (input): ../scopparse-keep/all.scop
Threshold contact distance [1.0]: 1
Name of output file [SITES.con]: 
Name of log file [sites.log]: 

Entries in HetDic 4306
Entries in Dbase 4306
CCF FILE: /ebi/services/idata/pmr/hgmp/test/qa/pdbplus-keep/2hhb.ccf (1/3)
CCF FILE: /ebi/services/idata/pmr/hgmp/test/qa/pdbplus-keep/1cs4.ccf (2/3)
CCF FILE: /ebi/services/idata/pmr/hgmp/test/qa/pdbplus-keep/1ii7.ccf (3/3)
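
   The same run can be started without prompts by passing every value as
   a qualifier and adding the general qualifier -auto. The Python sketch
   below builds such a command line from the qualifiers listed in section
   6.1. It is only an illustration: the helper names build_sites_command
   and run_sites are hypothetical, and it assumes that the sites
   executable is on the search path.

```python
import subprocess


def build_sites_command(protpath, domaindir, dcffile,
                        outfile="SITES.con", logfile="sites.log",
                        threshold=1.0):
    """Return the argument list for a non-interactive SITES run.

    Only qualifiers documented in section 6.1 are used; the defaults
    mirror those shown in the example session.
    """
    return [
        "sites",
        "-protpath", protpath,
        "-domaindir", domaindir,
        "-dcffile", dcffile,
        "-threshold", str(threshold),
        "-outfile", outfile,
        "-logfile", logfile,
        "-auto",  # general qualifier: turn off prompts
    ]


def run_sites(**kwargs):
    """Run SITES (assumed to be installed); raises on a non-zero exit."""
    subprocess.run(build_sites_command(**kwargs), check=True)
```

   For the session above, the call would be
   run_sites(protpath="../pdbplus-keep", domaindir="../domainer-keep",
   dcffile="../scopparse-keep/all.scop"). Building the argument list
   separately from running it keeps the command easy to inspect or log
   before anything is executed.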


7.0 KNOWN BUGS & WARNINGS

   None.

8.0 NOTES

   Types of contact 
   LI records are used for contacts to ligands (as defined above). In
   CONTACTS and INTERFACE output, SM records are used for contacts
   between either side-chain or main-chain atoms. In a future
   implementation, SS will be used for side-chain only contacts, MM will
   be used for main-chain only contacts, and there will probably be
   several other forms of contact too.

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE:    Clean coordinate file (for protein)
   FORMAT:       CCF format (EMBL-like).
   DESCRIPTION:  Protein coordinate and derived data for a single PDB
                 file. The data are 'cleaned-up': self-consistent and
                 error-corrected.
   CREATED BY:   PDBPARSE
   SEE ALSO:     Records for residue solvent accessibility and secondary
                 structure are added to the file by using PDBPLUS.

   FILE TYPE:    Clean coordinate file (for domain)
   FORMAT:       CCF format (EMBL-like).
   DESCRIPTION:  Protein coordinate and derived data for a single domain
                 from SCOP or CATH. The data are 'cleaned-up':
                 self-consistent and error-corrected.
   CREATED BY:   DOMAINER
   SEE ALSO:     Records for residue solvent accessibility and secondary
                 structure are added to the file by using PDBPLUS.

   FILE TYPE:    Contact file (intra-chain residue-residue contacts)
   FORMAT:       CON format (EMBL-like).
   DESCRIPTION:  Intra-chain residue-residue contact data for a protein
                 or a domain from SCOP or CATH.
   CREATED BY:   CONTACTS
   SEE ALSO:     N.A.

   FILE TYPE:    Contact file (inter-chain residue-residue contacts)
   FORMAT:       CON format (EMBL-like).
   DESCRIPTION:  Inter-chain residue-residue contact data for a protein
                 or a domain from SCOP or CATH.
   CREATED BY:   INTERFACE
   SEE ALSO:     N.A.

   FILE TYPE:    Contact file (residue-ligand contacts)
   FORMAT:       CON format (EMBL-like).
   DESCRIPTION:  Residue-ligand contact data for a protein or a domain
                 from SCOP or CATH.
   CREATED BY:   SITES
   SEE ALSO:     N.A.

   FILE TYPE:    van der Waals radii
   DESCRIPTION:  A file of van der Waals radii for atoms in amino acid
                 residues. Part of the EMBOSS distribution.
   CREATED BY:   N.A.
   SEE ALSO:     N.A.

   FILE TYPE:    Dictionary of heterogen groups
   DESCRIPTION:  A file of the dictionary of heterogen groups in PDB.
   CREATED BY:   HETPARSE
   SEE ALSO:     N.A.

9.0 DESCRIPTION

   Knowledge of the physical contacts that amino acid residues make with
   protein ligands is required for several different analyses. SITES
   calculates residue-ligand contact data from protein CCF files (clean
   coordinate files) and organises the data according to domains taken
   from a DCF file (domain classification file).

10.0 ALGORITHM

   Contact between two residues is defined as when the van der Waals
   surface of any atom of the first residue comes within the threshold
   contact distance of the van der Waals surface of any atom of the
   second residue. The threshold contact distance is a user-defined
   distance with a default value of 1 Angstrom.
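
   The definition above can be sketched in Python as follows. This is an
   illustration of the stated criterion only, not the SITES source code:
   the function names are hypothetical, each atom is assumed to be given
   as a coordinate triple plus a van der Waals radius in Angstroms, and
   the choice to count a gap exactly equal to the threshold as a contact
   is an assumption. The sketch also assumes that the same test is
   applied between the atoms of a residue and the atoms of a ligand.

```python
import math


def atoms_in_contact(xyz_a, radius_a, xyz_b, radius_b, threshold=1.0):
    """True if the van der Waals surfaces of two atoms lie within
    `threshold` Angstroms of each other."""
    # Gap between the surfaces = centre-to-centre distance minus both radii.
    surface_gap = math.dist(xyz_a, xyz_b) - radius_a - radius_b
    return surface_gap <= threshold


def groups_in_contact(atoms_a, atoms_b, threshold=1.0):
    """True if any atom of the first group contacts any atom of the second.

    Each group is a list of (xyz, radius) pairs, for example the atoms of
    a residue and the atoms of a ligand.
    """
    return any(
        atoms_in_contact(xyz_a, r_a, xyz_b, r_b, threshold)
        for xyz_a, r_a in atoms_a
        for xyz_b, r_b in atoms_b
    )
```

   For example, two atoms of radius 1.5 whose centres are 4.0 Angstroms
   apart have a surface gap of 1.0 Angstrom and so are in contact at the
   default threshold; at 4.5 Angstroms apart they are not. The radii used
   here are illustrative values and are not taken from Evdw.dat.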

11.0 RELATED APPLICATIONS

See also

    Program name                        Description
   aaindexextract Extract data from AAINDEX
   allversusall   Sequence similarity data from all-versus-all comparison
   cathparse      Generates DCF file from raw CATH files
   cutgextract    Extract data from CUTG
   domainer       Generates domain CCF files from protein CCF files
   domainnr       Removes redundant domains from a DCF file
   domainseqs     Adds sequence records to a DCF file
   domainsse      Add secondary structure records to a DCF file
   hetparse       Converts heterogen group dictionary to EMBL-like format
   pdbparse       Parses PDB files and writes protein CCF files
   pdbplus        Add accessibility & secondary structure to a CCF file
   pdbtosp        Convert swissprot:PDB codes file to EMBL-like format
   printsextract  Extract data from PRINTS
   prosextract    Build the PROSITE motif database for use by patmatmotifs
   rebaseextract  Extract data from REBASE
   scopparse      Generate DCF file from raw SCOP files
   seqnr          Removes redundancy from DHF files
   ssematch       Search a DCF file for secondary structure matches
   tfextract      Extract data from TRANSFAC

12.0 DIAGNOSTIC ERROR MESSAGES

   SITES generates a log file, an excerpt of which is shown below
   (Figure 2). The file contains one line of diagnostic information for
   each protein CCF file that was read (in case of difficulty email Jon
   Ison, jison@hgmp.mrc.ac.uk).
   Figure 2 Excerpt from a SITES log file 

Excerpt of log file
CCF: 000_testdata_new/sites/in/1cs4.ccf HETS:YES        NHETS:7 SCOP:YES NDOMS: 1
CCF: 000_testdata_new/sites/in/1ii7.ccf HETS:YES        NHETS:5 SCOP:YES NDOMS: 1
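
   When many CCF files are processed it can help to read the log with a
   script. The Python sketch below extracts the fields seen in the log
   excerpts in this document. The field layout is inferred from those
   excerpts alone and is an assumption, as are the names LOG_RECORD and
   parse_sites_log; lines that do not match the pattern are silently
   skipped. Whitespace between fields, including a line break, is
   tolerated.

```python
import re

# Pattern inferred from the log excerpts; not an official format definition.
LOG_RECORD = re.compile(
    r"CCF:\s*(?P<path>\S+)"
    r"\s+HETS:\s*(?P<hets>YES|NO)"
    r"\s+NHETS:\s*(?P<nhets>\d+)"
    r"\s+SCOP:\s*(?P<scop>YES|NO)"
    r"\s+(?P<count_key>NDOMS|NCHN):\s*(?P<count>\d+)"
)


def parse_sites_log(text):
    """Return one dict per log record found in `text`."""
    records = []
    for match in LOG_RECORD.finditer(text):
        records.append({
            "ccf": match.group("path"),
            "hets": match.group("hets") == "YES",
            "nhets": int(match.group("nhets")),
            "scop": match.group("scop") == "YES",
            # The last field is NDOMS or NCHN; it is stored under the
            # lower-case key 'ndoms' or 'nchn' accordingly.
            match.group("count_key").lower(): int(match.group("count")),
        })
    return records
```

   Each returned dictionary has the keys 'ccf', 'hets', 'nhets' and
   'scop', plus either 'ndoms' or 'nchn' depending on which field the log
   line carries.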

13.0 AUTHORS

   Waqas Awan (wawan@rfcgr.mrc.ac.uk)
   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.
   
   See also http://emboss.sourceforge.net/

  14.1 Other useful references
