
                           PDBPARSE documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Parses PDB files and writes CCF files (clean coordinate files) for
   proteins.

2.0 INPUTS & OUTPUTS

   PDBPARSE parses every PDB file in a directory and writes a protein CCF
   file (clean coordinate file) for each one. The paths and extensions
   for the PDB files (input) and protein CCF files (output) are
   specified by the user. The user specifies whether the output files
   have the same names as the input files or whether the PDB identifier
   codes (from the PDB files) are used to name the files.
   The parser generates a log file containing diagnostic messages for
   various types of inconsistency, error and other features of a PDB file
   that justify manual inspection of the file to verify its contents (see
   Section 12.0 below).
   PDBPARSE implements the parsing methodology described under 'ALGORITHM'
   below. The output includes a single file for each PDB file parsed,
   excluding entries that lack any chains with at least the
   user-specified minimum number (typically 5) of known amino acids or
   which lack any SEQRES or ATOM records. The data (described in Section
   4.0 and Figure 2) include the amino acid sequence for each chain
   (given in the SQ record of a CCF file) and coordinate and derived data
   for each residue and atom (RE and AT records). Optionally the parser
   can be configured to mask (disregard) atoms in protein chains as
   follows: (1) Mask non-amino acid groups that do not contain a C-alpha
   atom. Masked groups will not appear in any of the RE, AT or SQ
   records. (2) Mask amino acids that do not contain a C-alpha atom. (3)
   Mask amino acids with a single atom only. For (2) and (3) the residue
   will not appear in the RE or AT records but will be present in the SQ
   record.
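
   The three masking options above can be sketched as follows. This is
   an illustrative Python sketch only, not the PDBPARSE implementation:
   the Group class, the function name and the option names are
   hypothetical, and the sketch assumes that any group which is not
   masked appears in all three record types.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for one residue or group in a protein chain."""
    name: str                  # 3-character residue identifier, e.g. "ALA"
    is_amino_acid: bool
    atoms: list = field(default_factory=list)  # atom names, e.g. ["N", "CA"]


def mask_decision(group, mask_non_aa=True, mask_no_ca=True,
                  mask_single_atom=True):
    """Return (in_sq, in_re_at): whether the group would appear in the SQ
    record and in the RE/AT records under the three masking rules."""
    has_ca = "CA" in group.atoms
    # Rule 1: non-amino acid groups without a C-alpha atom are dropped
    # from the RE, AT and SQ records.
    if mask_non_aa and not group.is_amino_acid and not has_ca:
        return (False, False)
    # Rule 2: amino acids without a C-alpha atom stay in SQ only.
    if mask_no_ca and group.is_amino_acid and not has_ca:
        return (True, False)
    # Rule 3: amino acids with a single atom stay in SQ only.
    if mask_single_atom and group.is_amino_acid and len(group.atoms) == 1:
        return (True, False)
    # Assumption: anything not masked is written to all records.
    return (True, True)
```

   For example, an alanine with only N, C and O atoms would be kept in
   the SQ record but omitted from the RE and AT records when option (2)
   is switched on.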

3.0 INPUT FILE FORMAT

   An excerpt of a PDB file is shown below (Figure 1). A detailed
   explanation of the PDB file format is available on the PDB web site:
   http://www.rcsb.org/pdb/info.html#File_Formats_and_Standards

4.0 OUTPUT FILE FORMAT

   An excerpt from a protein CCF file is shown in Figure 2. The data are
   as follows (record names are given in parentheses):

  4.1 Bibliographic data

   These include the 4-character PDB identifier code or the 7-character
   domain identifier code taken from SCOP (ID), text from the COMPND (DE)
   and SOURCE (OS) records of the PDB file and experimental data (EX).
   Tokens delimiting items of experimental data are as follows. (1)
   METHOD: The text 'nmr_or_model' for structures determined by nuclear
   magnetic resonance or modelling, or 'xray' for X-ray crystallography.
   (2) RESO: The resolution of X-ray structures, or '0' otherwise. (3, 4)
   NMOD and NCHN: The number of models or polypeptide chains: for domain
   coordinate files a 1 is always given. NCHN is the number of chains
   that have at least the user-specified minimum number (5) of known
   amino acids. (5) NGRP: Number of non-covalently associated groups
   ('heterogens') that could not be assigned to a specific chain. Spacing
   lines (XX) are used for improving clarity of the file and the end of
   file (//) is clearly indicated.
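
   The token/value layout of the EX record lends itself to simple
   parsing. The following Python sketch is not part of PDBPARSE; the
   function name parse_tokens is hypothetical, and it assumes that each
   item has the form 'TOKEN value;' as in the sample output in this
   section.

```python
def parse_tokens(record_line):
    """Split a CCF line of 'TOKEN value;' items (e.g. an EX record)
    into a dictionary mapping each token to its value as a string."""
    body = record_line[2:].strip()   # drop the two-letter record name
    pairs = {}
    for item in body.split(";"):
        item = item.strip()
        if not item:
            continue                 # skip the empty piece after the last ';'
        token, value = item.split(None, 1)
        pairs[token] = value
    return pairs


ex = parse_tokens("EX   METHOD xray; RESO 2.50; NMOD 1; NCHN 1; NGRP 0;")
resolution = float(ex["RESO"])       # '0' is given for non-X-ray structures
```

   Values are left as strings so that the caller decides how to convert
   them; RESO, for instance, is a decimal number while NMOD, NCHN and
   NGRP are integers.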

  4.2 Chain-specific data

   Following the EX record the file has a section for each chain (with at
   least the user-specified minimum number (5) of known amino acids),
   containing the chain number (CN), chain-specific data (IN) and the
   chain sequence (SQ). Tokens delimiting items of chain-specific data
   are as follows. (1) ID: The PDB chain identifier or a '.' if one was
   not specified in the PDB file or if a domain is comprised of segments
   from more than one chain. (2) NR: The number of residues in the chain
   or domain. (3) NL: The number of heterogens that are associated with
   the chain. Domain coordinate files do not include coordinates for
   these groups so a value of 0 is always given. (4, 5) NH and NE: The
   number of helices and beta-strands in the chain or domain (see Section
   11.2). Values for NH and NE are added by using PDBPLUS and a 0 will be
   given if PDBPLUS is not used.
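
   As an illustration, an IN record could be read into a typed
   structure as sketched below in Python. The ChainInfo type, its field
   names and the regular expression are assumptions made for this
   example; they follow the token order ID, NR, NL, NH, NE described
   above.

```python
import re
from typing import NamedTuple


class ChainInfo(NamedTuple):
    """Hypothetical container for the tokens of one IN record."""
    chain_id: str      # ID: PDB chain identifier, or '.'
    n_residues: int    # NR
    n_heterogens: int  # NL
    n_helices: int     # NH (0 unless PDBPLUS has been run)
    n_strands: int     # NE (0 unless PDBPLUS has been run)


IN_PATTERN = re.compile(
    r"^IN\s+ID (\S+); NR (\d+); NL (\d+); NH (\d+); NE (\d+);\s*$"
)


def parse_in_record(line):
    """Parse one IN record; raise ValueError if the line does not match."""
    match = IN_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not an IN record: {line!r}")
    chain_id, nr, nl, nh, ne = match.groups()
    return ChainInfo(chain_id, int(nr), int(nl), int(nh), int(ne))


info = parse_in_record("IN   ID A; NR 52; NL 7; NH 0; NE 0;")
```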

  4.3 Residue data

   Each RE record contains data for a single residue. The data are in 26
   columns in the RE record (column numbers are given in parentheses):
   (1) RE is always given. (2 - 3) Model and chain number (always 1 for
   domains). (4) Residue number: the position of the residue in the
   sequence given in the SQ record (for protein atoms) or '.' (for
   heterogens and water). (5) Original PDB residue number. (6) SSE type
   from the PDB file: either 'C' (coil), 'H' (helix), 'E' (beta-strand)
   or 'T' (turn). (7) SSE serial number from columns 8 - 10 in a HELIX,
   SHEET or TURN record of a PDB file. A '.' is given for atoms not in a
   helix or sheet. (8) SSE identifier code from columns 12 - 14 in a
   HELIX, SHEET or TURN record, or '.' for atoms not in a helix or sheet.
   (9) The class of helix, which is an integer from 1-10; 1 -
   right-handed alpha, 2 - right-handed omega, 3 - right-handed pi, 4 -
   right-handed gamma, 5 - right-handed 3-10, 6 - left-handed alpha, 7 -
   left-handed omega, 8 - left-handed gamma, 9 - 27 ribbon/helix, 10 -
   polyproline; see
   http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html.
   (10) Secondary structure assignment according to STRIDE (see Section
   11.2). (11) SSE number: the position of the SSE (see Section 11.2)
   from the N-terminus. A '.' is given if the atom is not in an element.
   (12) Single character amino acid code or a '.' (for heterogens and
   water). (13) 3-character residue identifier code. (14-16) Phi and Psi
   angle and solvent accessible surface area of residue as calculated by
   STRIDE. (17-26) Accessible surface area according to NACCESS. Absolute
   and relative measures of accessibility: (17-18) for all atoms, (19-20)
   all side-chain atoms, (21-22) all main-chain atoms, (23-24) all
   non-polar side-chain atoms, (25-26) all polar side-chain atoms. Values
   for columns 10-11 and 17-26 are added by using PDBPLUS and a '.' will
   be given if a value is not available.
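
   The helix class codes listed for column 9 can be kept in a lookup
   table. The Python sketch below is only an illustration; the names
   HELIX_CLASSES and helix_class_name are hypothetical, and the class
   labels are copied from the list above.

```python
# Helix class codes as listed for column 9 of the RE record.
HELIX_CLASSES = {
    1: "right-handed alpha",
    2: "right-handed omega",
    3: "right-handed pi",
    4: "right-handed gamma",
    5: "right-handed 3-10",
    6: "left-handed alpha",
    7: "left-handed omega",
    8: "left-handed gamma",
    9: "27 ribbon/helix",
    10: "polyproline",
}


def helix_class_name(value):
    """Translate a helix class field; '.' (no helix) gives None."""
    if value == ".":
        return None
    return HELIX_CLASSES[int(value)]
```

   A field value of '5', for instance, translates to a right-handed
   3-10 helix, while '.' means that the residue is not in a helix.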

  4.4 Atom data

   Each AT record contains data for a single atom. The data are in 14
   columns in the AT record (column numbers are given in parentheses):
   (1) AT is always given. (2 - 3) Model and chain number (always 1 for
   domains). (4) Group number of heterogens or '.'. (5) Either 'P' (a
   protein atom), 'H' (heterogen) or 'W' (water). (6) Residue number: the
   position of the residue in the sequence given in the SQ record (for
   protein atoms) or '.' (for heterogens and water). (7) Single character
   amino acid code or a '.' (for heterogens and water). (8) 3-character
   residue identifier code. (9) Atom type. (10-12) The x, y and z
   orthogonal coordinates. (13) Occupancy. (14) Temperature factor.
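
   A minimal Python sketch for reading the numeric part of an AT record
   is given below. It is not part of PDBPARSE. It assumes that the
   fields are separated by whitespace and that the last six fields are
   the atom type, the x, y and z coordinates, the occupancy and the
   temperature factor (columns 9 - 14 above); the function name is
   hypothetical.

```python
def atom_numeric_fields(at_line):
    """Return the atom type, coordinates, occupancy and temperature
    factor from one AT record, counting fields from the end of the line."""
    fields = at_line.split()
    if not fields or fields[0] != "AT":
        raise ValueError(f"not an AT record: {at_line!r}")
    x, y, z, occupancy, temperature = (float(v) for v in fields[-5:])
    return {
        "atom": fields[-6],
        "xyz": (x, y, z),
        "occupancy": occupancy,
        "temperature_factor": temperature,
    }


# One heterogen atom taken from the 1cs4.ccf example output.
line = ("AT   1    1    5    .    1002  . FOK   H C9       "
        "42.200  -11.309   50.489    1.00   41.39")
atom = atom_numeric_fields(line)
```

   Counting from the end of the line keeps the sketch independent of
   the exact layout of the leading identifier columns.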

  Output files for usage example

  File: pdbparse.log

/ebi/services/idata/pmr/hgmp/test/data/structure/2hhb.ent
ATOMCOL12      1277
//
/ebi/services/idata/pmr/hgmp/test/data/structure/1cs4.ent
SEQRESLENDIF   1 (A)
ATOMCOL12      429
BADINDEX       1 (A)
GAPPEDOK       1 (A)
SECSTART       1 1 ILE 384
SECSTART       1 1 ILE 384
//
/ebi/services/idata/pmr/hgmp/test/data/structure/1ii7.ent
SEQRESLENDIF   1 (A)
ATOMCOL12      390
SECBOTH        1 1 SER 57 GLU 73
SECBOTH        1 1 VAL 78 ILE 81
SECBOTH        1 1 LYS 2 LEU 6
//
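
   The log file layout shown above (a file path, then one diagnostic
   per line, then a '//' terminator) can be read with a short script.
   The following Python sketch is an illustration under that
   assumption; the function name parse_log is hypothetical, and the
   meaning of each diagnostic code is described in Section 12.0.

```python
def parse_log(text):
    """Group the diagnostics in a pdbparse.log file by input file path.

    Returns a dict mapping each path to a list of (code, detail) pairs.
    """
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            current = None            # end of the entry for this file
        elif current is None:
            current = line            # first line of an entry is the path
            entries[current] = []
        else:
            code, _, detail = line.partition(" ")
            entries[current].append((code, detail.strip()))
    return entries
```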

  File: 1cs4.ccf

ID   1cs4
XX
DE   MOL_ID: 1; MOLECULE: TYPE V ADENYLATE CYCLASE;
XX
OS   MOL_ID: 1; ORGANISM_SCIENTIFIC: CANIS FAMILIARIS;
XX
EX   METHOD xray; RESO 2.50; NMOD 1; NCHN 1; NGRP 0;
XX
CN   [1]
XX
IN   ID A; NR 52; NL 7; NH 0; NE 0;
XX
SQ   SEQUENCE    52 AA;   5817 MW;  47362A43 CRC32;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
RE   1    1    2    396   D ASP   .    .    .    .    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    3    397   I ILE   .    .    .    .    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    4    398   E GLU   .    .    .    .    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    5    399   G GLY   1    1    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    6    400   F PHE   1    1    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    7    401   T THR   1    1    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    8    402   S SER   1    1    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    9    403   L LEU   1    1    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    10   404   A ALA   1    1    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    11   405   S SER   1    1    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    12   406   Q GLN   .    .    .    .    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    13   407   C CYS   .    .    .    .    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    14   408   T THR   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    15   409   A ALA   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    16   410   Q GLN   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    17   411   E GLU   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    18   412   L LEU   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    19   413   V VAL   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    20   414   M MET   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    21   415   T THR   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    22   416   L LEU   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    23   417   N ASN   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    24   418   E GLU   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    25   419   L LEU   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    26   420   F PHE   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    27   421   A ALA   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    28   422   R ARG   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    29   423   F PHE   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    30   424   D ASP   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    31   425   K LYS   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    32   426   L LEU   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    33   427   A ALA   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    34   428   A ALA   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    35   429   E GLU   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    36   430   N ASN   2    2    H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00


  [Part of this file has been deleted for brevity]

AT   1    1    5    .    1002  . FOK   H C9       42.200  -11.309   50.489    1.00   41.39
AT   1    1    5    .    1002  . FOK   H O6       42.275  -12.455   49.593    1.00   43.23
AT   1    1    5    .    1002  . FOK   H C10      43.008  -11.601   51.811    1.00   39.11
AT   1    1    5    .    1002  . FOK   H C11      40.680  -11.078   50.616    1.00   44.36
AT   1    1    5    .    1002  . FOK   H O7       40.106  -10.945   51.688    1.00   48.77
AT   1    1    5    .    1002  . FOK   H C12      39.943  -11.046   49.301    1.00   40.67
AT   1    1    5    .    1002  . FOK   H C13      40.595  -10.085   48.292    1.00   41.47
AT   1    1    5    .    1002  . FOK   H C14      40.276  -10.620   46.930    1.00   46.69
AT   1    1    5    .    1002  . FOK   H C15      39.971  -11.751   46.590    1.00   53.22
AT   1    1    5    .    1002  . FOK   H C16      40.047   -8.685   48.426    1.00   42.42
AT   1    1    5    .    1002  . FOK   H C17      42.671   -8.737   50.253    1.00   39.67
AT   1    1    5    .    1002  . FOK   H C18      46.732  -13.026   51.827    1.00   35.74
AT   1    1    5    .    1002  . FOK   H C19      45.859  -11.483   53.586    1.00   34.48
AT   1    1    5    .    1002  . FOK   H C20      42.913  -10.426   52.807    1.00   39.44
AT   1    1    5    .    1002  . FOK   H C21      45.883   -9.553   47.821    1.00   42.15
AT   1    1    5    .    1002  . FOK   H O5       46.157  -10.520   47.166    1.00   40.91
AT   1    1    5    .    1002  . FOK   H C22      46.769   -8.315   48.006    1.00   37.08
AT   1    1    6    .    1003  . MES   H O1       45.676    7.326   49.092    1.00   77.86
AT   1    1    6    .    1003  . MES   H C2       44.367    6.816   48.900    1.00   75.17
AT   1    1    6    .    1003  . MES   H C3       44.349    5.317   48.923    1.00   74.42
AT   1    1    6    .    1003  . MES   H N4       44.832    4.804   50.196    1.00   72.45
AT   1    1    6    .    1003  . MES   H C5       46.234    5.425   50.473    1.00   73.23
AT   1    1    6    .    1003  . MES   H C6       46.176    6.914   50.355    1.00   75.06
AT   1    1    6    .    1003  . MES   H C7       44.806    3.336   50.302    1.00   73.39
AT   1    1    6    .    1003  . MES   H C8       44.672    2.791   51.713    1.00   76.85
AT   1    1    6    .    1003  . MES   H S        45.724    1.379   51.967    1.00   78.26
AT   1    1    6    .    1003  . MES   H O1S      47.062    1.828   51.737    1.00   79.39
AT   1    1    6    .    1003  . MES   H O2S      45.303    0.380   51.016    1.00   81.58
AT   1    1    6    .    1003  . MES   H O3S      45.523    0.961   53.326    1.00   80.59
AT   1    1    6    .    1004  . MES   H O1       59.246   -5.152   27.381    1.00   99.99
AT   1    1    6    .    1004  . MES   H C2       60.067   -4.021   27.127    1.00   99.99
AT   1    1    6    .    1004  . MES   H C3       60.447   -3.301   28.378    1.00   99.78
AT   1    1    6    .    1004  . MES   H N4       61.180   -4.156   29.270    1.00   96.33
AT   1    1    6    .    1004  . MES   H C5       60.358   -5.461   29.506    1.00   97.90
AT   1    1    6    .    1004  . MES   H C6       59.965   -6.072   28.203    1.00   99.68
AT   1    1    6    .    1004  . MES   H C7       61.596   -3.484   30.507    1.00   93.33
AT   1    1    6    .    1004  . MES   H C8       61.931   -2.010   30.442    1.00   90.74
AT   1    1    6    .    1004  . MES   H S        60.763   -0.978   31.301    0.50   90.72
AT   1    1    6    .    1004  . MES   H O1S      59.476   -1.170   30.680    0.50   91.60
AT   1    1    6    .    1004  . MES   H O2S      61.249    0.383   31.164    0.50   91.20
AT   1    1    6    .    1004  . MES   H O3S      60.776   -1.430   32.647    0.50   90.05
AT   1    1    7    .    1005  . POP   H P1       58.812   -7.766   57.091    1.00   57.40
AT   1    1    7    .    1005  . POP   H O1       60.254   -7.589   56.745    1.00   54.93
AT   1    1    7    .    1005  . POP   H O2       58.618   -8.839   58.095    1.00   55.36
AT   1    1    7    .    1005  . POP   H O3       57.949   -8.024   55.908    1.00   55.10
AT   1    1    7    .    1005  . POP   H O        58.295   -6.370   57.759    1.00   57.30
AT   1    1    7    .    1005  . POP   H P2       56.998   -5.955   58.661    1.00   59.66
AT   1    1    7    .    1005  . POP   H O4       57.491   -5.746   60.070    1.00   54.95
AT   1    1    7    .    1005  . POP   H O5       56.004   -7.075   58.550    1.00   56.24
AT   1    1    7    .    1005  . POP   H O6       56.427   -4.710   58.044    1.00   56.50
//
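
   The chain sequences can be recovered from the SQ records of a CCF
   file such as the one above. The Python sketch below is an
   illustration only; the function name read_sequences is hypothetical,
   and it assumes that the sequence lines following an SQ header begin
   with a space and are grouped in blocks separated by spaces, as in
   the example output.

```python
def read_sequences(ccf_text):
    """Return the sequences of all SQ records in a CCF file, in file order."""
    sequences = []
    in_sq = False
    for line in ccf_text.splitlines():
        if line.startswith("SQ"):
            in_sq = True              # header line: start a new sequence
            sequences.append("")
        elif in_sq and line.startswith(" "):
            sequences[-1] += line.replace(" ", "")
        else:
            in_sq = False             # any other record ends the sequence
    return sequences
```

   The length of each returned sequence can be checked against the
   residue count given in the SQ header line (52 AA for 1cs4).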

  File: 1ii7.ccf

ID   1ii7
XX
DE   MOL_ID: 1; MOLECULE: MRE11 NUCLEASE;
XX
OS   MOL_ID: 1; ORGANISM_SCIENTIFIC: PYROCOCCUS FURIOSUS;
XX
EX   METHOD xray; RESO 2.20; NMOD 1; NCHN 1; NGRP 0;
XX
CN   [1]
XX
IN   ID A; NR 65; NL 6; NH 0; NE 0;
XX
SQ   SEQUENCE    65 AA;   7396 MW;  0CFB92A3 CRC32;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
RE   1    1    8    8     D ASP   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    9    9     I ILE   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    10   10    H HIS   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    11   11    L LEU   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    12   12    G GLY   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    13   13    Y TYR   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    14   14    E GLU   1    1    H    5    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    15   15    Q GLN   1    1    H    5    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    16   16    F PHE   1    1    H    5    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    17   17    H HIS   1    1    H    5    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    18   18    K LYS   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    19   19    P PRO   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    20   20    Q GLN   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    21   21    R ARG   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    22   22    E GLU   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    23   23    E GLU   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    24   24    E GLU   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    25   25    F PHE   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    26   26    A ALA   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    27   27    E GLU   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    28   28    A ALA   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    29   29    F PHE   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    30   30    K LYS   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    31   31    N ASN   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    32   32    A ALA   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    33   33    L LEU   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    34   34    E GLU   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    35   35    I ILE   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    36   36    A ALA   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    37   37    V VAL   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    38   38    Q GLN   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    39   39    E GLU   2    A    E    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    40   40    N ASN   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00
RE   1    1    41   41    V VAL   .    .    .    .    .    .        0.00    0.0
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.
00    0.00


  [Part of this file has been deleted for brevity]

AT   1    1    .    50   50    L LEU   P CD2      12.425   39.035   22.798    1.00   23.77
AT   1    1    1    .    402   . PO4   H P        34.178   32.996   46.387    1.00   60.84
AT   1    1    1    .    402   . PO4   H O1       35.146   33.243   45.291    1.00   57.95
AT   1    1    1    .    402   . PO4   H O2       34.912   32.751   47.670    1.00   59.15
AT   1    1    1    .    402   . PO4   H O3       33.291   34.184   46.538    1.00   58.92
AT   1    1    1    .    402   . PO4   H O4       33.352   31.796   46.060    1.00   61.86
AT   1    1    2    .    403   . MN    H MN        8.130   27.788   21.899    1.00   36.09
AT   1    1    2    .    404   . MN    H MN        5.801   27.935   24.271    1.00   39.57
AT   1    1    3    .    405   . MN    H MN       36.023   34.916   44.253    1.00   39.52
AT   1    1    3    .    406   . MN    H MN       33.658   36.365   46.296    1.00   33.69
AT   1    1    5    .    501   . SO4   H S        17.175   28.112   32.476    1.00  100.80
AT   1    1    5    .    501   . SO4   H O1       18.136   28.230   31.357    1.00  100.18
AT   1    1    5    .    501   . SO4   H O2       17.097   26.692   32.887    1.00  100.80
AT   1    1    5    .    501   . SO4   H O3       17.633   28.926   33.626    1.00  100.14
AT   1    1    5    .    501   . SO4   H O4       15.834   28.575   32.045    1.00  100.56
AT   1    1    5    .    502   . SO4   H S         0.566   29.512   36.007    1.00   86.73
AT   1    1    5    .    502   . SO4   H O1        1.690   28.556   35.971    1.00   87.27
AT   1    1    5    .    502   . SO4   H O2       -0.620   28.803   36.523    1.00   87.87
AT   1    1    5    .    502   . SO4   H O3        0.896   30.642   36.905    1.00   86.58
AT   1    1    5    .    502   . SO4   H O4        0.287   30.037   34.658    1.00   86.51
AT   1    1    5    .    503   . SO4   H S       -13.586   39.644   36.031    1.00  100.28
AT   1    1    5    .    503   . SO4   H O1      -12.340   39.512   35.250    1.00  100.72
AT   1    1    5    .    503   . SO4   H O2      -14.638   38.811   35.421    1.00  100.46
AT   1    1    5    .    503   . SO4   H O3      -13.347   39.201   37.420    1.00   99.66
AT   1    1    5    .    503   . SO4   H O4      -14.020   41.056   36.015    1.00   99.97
AT   1    1    6    .    401   . 101   H P         7.599   25.305   23.994    1.00   56.33
AT   1    1    6    .    401   . 101   H O1P       8.249   24.467   25.030    1.00   56.70
AT   1    1    6    .    401   . 101   H O2P       6.700   26.285   24.649    1.00   54.49
AT   1    1    6    .    401   . 101   H O3P       8.637   26.026   23.216    1.00   53.97
AT   1    1    6    .    401   . 101   H O5*       7.095   23.970   23.128    1.00   59.20
AT   1    1    6    .    401   . 101   H C5*       7.073   23.961   21.762    1.00   66.74
AT   1    1    6    .    401   . 101   H C4*       6.041   23.013   21.296    1.00   71.22
AT   1    1    6    .    401   . 101   H O4*       6.029   21.855   22.189    1.00   73.78
AT   1    1    6    .    401   . 101   H C3*       4.736   23.676   21.350    1.00   73.80
AT   1    1    6    .    401   . 101   H O3*       4.355   23.874   19.995    1.00   76.51
AT   1    1    6    .    401   . 101   H C2*       3.864   22.749   22.165    1.00   74.04
AT   1    1    6    .    401   . 101   H C1*       4.682   21.474   22.506    1.00   74.70
AT   1    1    6    .    401   . 101   H N9        4.578   21.123   23.969    1.00   76.71
AT   1    1    6    .    401   . 101   H C8        3.630   21.533   24.876    1.00   76.87
AT   1    1    6    .    401   . 101   H N7        3.758   21.069   26.081    1.00   77.50
AT   1    1    6    .    401   . 101   H C5        4.896   20.300   25.989    1.00   77.78
AT   1    1    6    .    401   . 101   H C6        5.570   19.479   26.941    1.00   78.16
AT   1    1    6    .    401   . 101   H N6        5.155   19.409   28.200    1.00   78.77
AT   1    1    6    .    401   . 101   H N1        6.682   18.805   26.554    1.00   78.32
AT   1    1    6    .    401   . 101   H C2        7.090   18.888   25.277    1.00   78.14
AT   1    1    6    .    401   . 101   H N3        6.541   19.611   24.271    1.00   78.05
AT   1    1    6    .    401   . 101   H C4        5.403   20.288   24.700    1.00   78.10
AT   1    .    .    .    407   . HOH   W O         5.997   27.242   22.189    1.00   38.84
AT   1    .    .    .    408   . HOH   W O        35.697   35.756   46.350    1.00   41.39
AT   1    .    .    .    600   . HOH   W O        20.825   31.690   27.031    1.00   20.90
//

  File: 2hhb.ccf

ID   2hhb
XX
DE   HEMOGLOBIN (DEOXY)
XX
OS   HUMAN (HOMO SAPIENS)
XX
EX   METHOD xray; RESO 1.74; NMOD 1; NCHN 4; NGRP 0;
XX
CN   [1]
XX
IN   ID A; NR 141; NL 1; NH 0; NE 0;
XX
SQ   SEQUENCE   141 AA;  15127 MW;  5EC7DB1E CRC32;
     VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH GSAQVKGHGK
     KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL LSHCLLVTLA AHLPAEFTPA
     VHASLDKFLA SVSTVLTSKY R
XX
CN   [2]
XX
IN   ID B; NR 146; NL 1; NH 0; NE 0;
XX
SQ   SEQUENCE   146 AA;  15868 MW;  EC9744C9 CRC32;
     VHLTPEEKSA VTALWGKVNV DEVGGEALGR LLVVYPWTQR FFESFGDLST PDAVMGNPKV
     KAHGKKVLGA FSDGLAHLDN LKGTFATLSE LHCDKLHVDP ENFRLLGNVL VCVLAHHFGK
     EFTPPVQAAY QKVVAGVANA LAHKYH
XX
CN   [3]
XX
IN   ID C; NR 141; NL 1; NH 0; NE 0;
XX
SQ   SEQUENCE   141 AA;  15127 MW;  5EC7DB1E CRC32;
     VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH GSAQVKGHGK
     KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL LSHCLLVTLA AHLPAEFTPA
     VHASLDKFLA SVSTVLTSKY R
XX
CN   [4]
XX
IN   ID D; NR 146; NL 2; NH 0; NE 0;
XX
SQ   SEQUENCE   146 AA;  15868 MW;  EC9744C9 CRC32;
     VHLTPEEKSA VTALWGKVNV DEVGGEALGR LLVVYPWTQR FFESFGDLST PDAVMGNPKV
     KAHGKKVLGA FSDGLAHLDN LKGTFATLSE LHCDKLHVDP ENFRLLGNVL VCVLAHHFGK
     EFTPPVQAAY QKVVAGVANA LAHKYH
XX
RE   1    1    1    1     V VAL   .    .    .    .    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    2    2     L LEU   .    .    .    .    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    3    3     S SER   1    AA   H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    4    4     P PRO   1    AA   H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    5    5     A ALA   1    AA   H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
RE   1    1    6    6     D ASP   1    AA   H    1    .    .        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00


  [Part of this file has been deleted for brevity]

AT   1    .    .    .    174   . HOH   W O        -4.764   -6.228    5.515    8.00   40.89
AT   1    .    .    .    175   . HOH   W O        23.809   19.925    1.758    8.00   39.37
AT   1    .    .    .    176   . HOH   W O        -7.871   -9.078    2.406    8.00   43.37
AT   1    .    .    .    177   . HOH   W O         4.693   12.083    7.558    8.00   40.24
AT   1    .    .    .    178   . HOH   W O         8.775  -23.438   16.055    8.00   42.33
AT   1    .    .    .    179   . HOH   W O        -7.480  -10.898   17.998    8.00   38.06
AT   1    .    .    .    180   . HOH   W O        -4.731   16.453    2.295    8.00   36.37
AT   1    .    .    .    181   . HOH   W O        -1.055   11.866   -0.448    8.00   43.19
AT   1    .    .    .    182   . HOH   W O       -27.610  -10.991    5.353    8.00   43.46
AT   1    .    .    .    183   . HOH   W O        26.015   11.766    5.159    8.00   40.95
AT   1    .    .    .    184   . HOH   W O       -18.517   -8.355   15.267    8.00   35.55
AT   1    .    .    .    185   . HOH   W O       -14.034    2.806  -30.367    8.00   41.77
AT   1    .    .    .    186   . HOH   W O       -32.905   -9.033    0.480    8.00   43.68
AT   1    .    .    .    187   . HOH   W O       -28.749  -13.315    1.938    8.00   45.36
AT   1    .    .    .    188   . HOH   W O         0.516   -8.074  -26.354    8.00   41.53
AT   1    .    .    .    189   . HOH   W O       -20.080   -9.873  -22.862    8.00   36.25
AT   1    .    .    .    190   . HOH   W O       -13.442    9.778  -13.572    8.00   39.70
AT   1    .    .    .    191   . HOH   W O       -24.804   -2.608  -15.488    8.00   37.79
AT   1    .    .    .    192   . HOH   W O         6.547    9.706   16.296    8.00   41.86
AT   1    .    .    .    193   . HOH   W O         0.029   22.606   14.164    8.00   43.02
AT   1    .    .    .    194   . HOH   W O       -11.367    0.306   28.463    8.00   44.30
AT   1    .    .    .    195   . HOH   W O       -19.950  -10.635   14.301    8.00   40.17
AT   1    .    .    .    196   . HOH   W O        -7.047   -6.324   20.098    8.00   36.98
AT   1    .    .    .    197   . HOH   W O       -23.876    1.108   14.102    8.00   33.31
AT   1    .    .    .    198   . HOH   W O       -34.199    8.033   11.037    8.00   40.72
AT   1    .    .    .    199   . HOH   W O       -14.173   13.393   -8.778    8.00   43.21
AT   1    .    .    .    200   . HOH   W O        11.388  -11.044   24.763    8.00   39.34
AT   1    .    .    .    201   . HOH   W O         3.735   -3.643    2.734    8.00   42.17
AT   1    .    .    .    202   . HOH   W O         3.149   -0.692    2.083    8.00   41.40
AT   1    .    .    .    203   . HOH   W O         4.511  -25.886   13.006    8.00   39.83
AT   1    .    .    .    204   . HOH   W O         8.712  -21.655    3.577    8.00   43.08
AT   1    .    .    .    205   . HOH   W O        22.926   -4.304   24.079    8.00   38.10
AT   1    .    .    .    206   . HOH   W O        11.435    9.654   20.618    8.00   40.23
AT   1    .    .    .    207   . HOH   W O        18.099    5.542   27.744    8.00   39.03
AT   1    .    .    .    208   . HOH   W O        12.174    9.951    9.804    8.00   44.34
AT   1    .    .    .    209   . HOH   W O        24.745   -2.501   15.270    8.00   39.78
AT   1    .    .    .    210   . HOH   W O        24.231    0.100   14.764    8.00   42.94
AT   1    .    .    .    211   . HOH   W O        23.324  -18.136   10.981    8.00   53.60
AT   1    .    .    .    212   . HOH   W O        25.576  -22.211    6.309    8.00   45.18
AT   1    .    .    .    213   . HOH   W O        14.639   24.823   -4.300    8.00   41.35
AT   1    .    .    .    214   . HOH   W O        14.903    5.393  -23.047    8.00   37.45
AT   1    .    .    .    215   . HOH   W O        16.650   -5.137  -16.717    8.00   39.12
AT   1    .    .    .    216   . HOH   W O         7.424   -6.700  -20.085    8.00   38.62
AT   1    .    .    .    217   . HOH   W O        -1.263   -2.837  -21.251    8.00   45.10
AT   1    .    .    .    218   . HOH   W O        23.120   -3.118  -12.992    8.00   37.05
AT   1    .    .    .    219   . HOH   W O        23.664    0.968  -14.389    8.00   36.25
AT   1    .    .    .    220   . HOH   W O        25.698    7.981  -15.362    8.00   35.85
AT   1    .    .    .    221   . HOH   W O        30.009   16.347   -6.794    8.00   37.62
AT   1    .    .    .    222   . HOH   W O        27.728   16.677   -1.376    8.00   42.54
AT   1    .    .    .    223   . HOH   W O         8.142   18.836    1.041    8.00   39.90
//

5.0 DATA FILES

   PDBPARSE does not use a data file.

6.0 USAGE

   Standard (Mandatory) qualifiers:
  [-pdbpath]           dirlist    This option specifies the location of PDB
                                  files (input). A PDB file contains protein
                                  coordinate and other data. A detailed
                                  explanation of the PDB file format is
                                  available on the PDB web site
                                  http://www.rcsb.org/pdb/info.html.
   -camask             boolean    This option specifies whether to mask
                                  non-amino acid groups in protein chains that
                                  do not contain a C-alpha atom. If masked,
                                  the group will not appear in either the CO
                                  or SQ records of the clean coordinate file.
   -camaska            boolean    This option specifies whether to mask amino
                                  acids in protein chains that do not contain
                                  a C-alpha atom. If masked, the amino acid
                                  will not appear in the CO record but will
                                  still be present in the SQ record of the
                                  clean coordinate file.
   -atommask           boolean    This option specifies whether to mask amino
                                  acid residues in protein chains with a
                                  single atom only. If masked, the amino acid
                                  will not appear in the CO record but
                                  will still be present in the SQ record of
                                  the clean coordinate file.
  [-ccfoutdir]         outdir     This option specifies the location of CCF
                                  files (clean coordinate files) (output). A
                                  'protein clean coordinate file' contains
                                  protein coordinate and other data for a
                                  single PDB file. The files, generated by
                                  using PDBPARSE, are in CCF format
                                  (EMBL-like) and contain 'cleaned-up' data
                                  that is self-consistent and error-corrected.
                                  Records for residue solvent accessibility
                                  and secondary structure are added to the
                                  file by using PDBPLUS.
   -logfile            outfile    This option specifies the name of the log
                                  file for the build. The log file may contain
                                  messages about inconsistencies or errors in
                                  the PDB files that were parsed.

   Additional (Optional) qualifiers:
   -[no]ccfnaming      boolean    This option specifies whether to use pdbid
                                  code to name the output files. If set, the
                                  PDB identifier code (from the PDB file) is
                                  used to name the file. Otherwise, the output
                                  files have the same names as the input
                                  files.
   -chnsiz             integer    Minimum number of amino acid residues in a
                                  chain for it to be parsed.
   -maxmis             integer    Maximum number of permissible mismatches
                                  between the ATOM and SEQRES sequences.
   -maxtrim            integer    Maximum number of residues to trim when
                                  checking for missing C-terminal SEQRES
                                  sequences.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers

   [-pdbpath] (Parameter 1)
      Description:    This option specifies the location of PDB files
                      (input). A PDB file contains protein coordinate and
                      other data. A detailed explanation of the PDB file
                      format is available on the PDB web site
                      http://www.rcsb.org/pdb/info.html.
      Allowed values: Directory with files
      Default:        ./

   -camask
      Description:    This option specifies whether to mask non-amino
                      acid groups in protein chains that do not contain a
                      C-alpha atom. If masked, the group will not appear
                      in either the CO or SQ records of the clean
                      coordinate file.
      Allowed values: Boolean value Yes/No
      Default:        No

   -camaska
      Description:    This option specifies whether to mask amino acids
                      in protein chains that do not contain a C-alpha
                      atom. If masked, the amino acid will not appear in
                      the CO record but will still be present in the SQ
                      record of the clean coordinate file.
      Allowed values: Boolean value Yes/No
      Default:        No

   -atommask
      Description:    This option specifies whether to mask amino acid
                      residues in protein chains with a single atom only.
                      If masked, the amino acid will not appear in the CO
                      record but will still be present in the SQ record
                      of the clean coordinate file.
      Allowed values: Boolean value Yes/No
      Default:        No

   [-ccfoutdir] (Parameter 2)
      Description:    This option specifies the location of CCF files
                      (clean coordinate files) (output). A 'protein clean
                      coordinate file' contains protein coordinate and
                      other data for a single PDB file. The files,
                      generated by using PDBPARSE, are in CCF format
                      (EMBL-like) and contain 'cleaned-up' data that is
                      self-consistent and error-corrected. Records for
                      residue solvent accessibility and secondary
                      structure are added to the file by using PDBPLUS.
      Allowed values: Output directory
      Default:        ./

   -logfile
      Description:    This option specifies the name of the log file for
                      the build. The log file may contain messages about
                      inconsistencies or errors in the PDB files that
                      were parsed.
      Allowed values: Output file
      Default:        pdbparse.log

   Additional (Optional) qualifiers

   -[no]ccfnaming
      Description:    This option specifies whether to use the pdbid code
                      to name the output files. If set, the PDB
                      identifier code (from the PDB file) is used to name
                      the file. Otherwise, the output files have the same
                      names as the input files.
      Allowed values: Boolean value Yes/No
      Default:        Yes

   -chnsiz
      Description:    Minimum number of amino acid residues in a chain
                      for it to be parsed.
      Allowed values: Any integer value
      Default:        5

   -maxmis
      Description:    Maximum number of permissible mismatches between
                      the ATOM and SEQRES sequences.
      Allowed values: Any integer value
      Default:        3

   -maxtrim
      Description:    Maximum number of residues to trim when checking
                      for missing C-terminal SEQRES sequences.
      Allowed values: Any integer value
      Default:        10

   Advanced (Unprompted) qualifiers

   (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of PDBPARSE is shown below.


% pdbparse 
Parses PDB files and writes protein CCF files.
Location of PDB files (input) [./]: structure
Mask non-amino acid groups in protein chains that do not contain a C-alpha atom
. [N]: 
Mask amino acids in protein chains that do not contain a C-alpha atom. [N]: Y
Mask amino acid residues in protein chains with a single atom only. [N]: 
Location of CCF files (clean coordinate files) (output) [./]: 
Name of log file for the build. [pdbparse.log]: 

Processing /ebi/services/idata/pmr/hgmp/test/data/structure/2hhb.ent
Processing /ebi/services/idata/pmr/hgmp/test/data/structure/1cs4.ent
Processing /ebi/services/idata/pmr/hgmp/test/data/structure/1ii7.ent
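
   The same run could also be scripted without prompts. The sketch below
   only assembles a command line from the qualifiers listed in Section
   6.0; it does not run PDBPARSE. The helper name build_pdbparse_command
   is hypothetical, and the sketch assumes that a boolean qualifier is
   switched on by giving the bare flag and that -auto suppresses the
   prompts, as is usual for EMBOSS applications.

```python
import shlex


def build_pdbparse_command(pdbpath, ccfoutdir, logfile="pdbparse.log",
                           camask=False, camaska=False, atommask=False):
    """Assemble a non-interactive pdbparse command line (hypothetical helper)."""
    cmd = ["pdbparse",
           "-pdbpath", pdbpath,      # directory of PDB files (input)
           "-ccfoutdir", ccfoutdir,  # directory for CCF files (output)
           "-logfile", logfile]
    # Assumption: a bare boolean flag means "yes"; omitted flags keep
    # their documented default of "No".
    for flag, enabled in (("-camask", camask),
                          ("-camaska", camaska),
                          ("-atommask", atommask)):
        if enabled:
            cmd.append(flag)
    cmd.append("-auto")  # general qualifier: turn off prompts
    return cmd


# Mirrors the answers given in the interactive session above.
cmd = build_pdbparse_command("structure", "./", camaska=True)
print(shlex.join(cmd))
```

   The resulting list can be passed to subprocess.run if the command is
   to be executed from a script.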


7.0 KNOWN BUGS & WARNINGS

   Although our parsing methodology was validated by the manual
   comparison of many "clean" files to their respective PDB files, it was
   clearly not possible to check every file. Any errors should be
   reported to the authors.
   PDBPARSE is not guaranteed to work correctly (or even at all) for
   files where an NMR structure contains multiple models but the models
   have different sequences of residues owing to errors.
   PDBPARSE will not work in cases where a residue number is duplicated
   AND an alternative residue numbering system is used somewhere else in
   the same chain. If such cases exist, they could be parsed by adding a
   variable analogous to oddnum (see pdbparse.c) that is used only for
   duplicate residue positions. The new variable would be written in the
   same place as oddnum is written.
   The author does not know whether either of the above cases occurs in
   PDB.
   PDBPARSE must hold the entire PDB file and some derived data in
   memory. If an error of the type 'Uncaught exception: Allocation
   failed, insufficient memory available' is raised, this is probably
   because the memory requirements exceed the per-user memory limits
   (which are usually set quite low by default). The limits can be
   raised or removed at login. If tcsh is used, simply type 'unlimit'
   before running PDBPARSE.

8.0 NOTES

   Clean coordinate files are available from
   ftp://ftp.uk.embnet.org/pub/databases/structure/cleancoord.
   A list of problematic features in individual PDB files is available at
   ftp://ftp.uk.embnet.org/pub/databases/structure/cleancoord/pdbparse.log.
   Values in the CCF file for the number of helices (NH) or beta-strands
   (NE) in a chain and columns 12-13 and 22-34 of the coordinate line
   record (CO) are given null values ('.' or 0) by PDBPARSE (see Section
   4.0). The EMBOSS program PDBPLUS can be used to assign values to
   these records.

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE:   PDB file
   FORMAT:      PDB format.
   DESCRIPTION: Protein coordinate data in PDB format.
   CREATED BY:  N.A.
   SEE ALSO:    N.A.

   FILE TYPE:   Clean coordinate file (for protein)
   FORMAT:      CCF format (EMBL-like format for protein coordinate and
                derived data).
   DESCRIPTION: Coordinate and other data for a single PDB file. The
                data are 'cleaned-up': self-consistent and
                error-corrected.
   CREATED BY:  PDBPARSE
   SEE ALSO:    Records for residue solvent accessibility and secondary
                structure are added to the file by using PDBPLUS.


9.0 DESCRIPTION

   The parsing of protein coordinate data from the Protein Data Bank
   (PDB) is a common task but is difficult in practice because of an
   awkward file format, errors in individual PDB files and
   inconsistencies, particularly in residue numbering. The PDB format is
   inconvenient for domain-based work or approaches using derived data
   because PDB files are not annotated with domain definitions and are
   not easily extended. We required a source of coordinate data that
   provided fast and convenient access, used an easily parsed,
   self-consistent format, was specific to proteins, contained minimal
   bibliographic data, was easily extendible and which could incorporate
   information on known structural domains as described in SCOP.
   We wrote PDBPARSE to parse PDB files reliably and generate 'clean'
   files of coordinate and derived data, for whole PDB files and, by
   using the EMBOSS applications SCOPPARSE and DOMAINER, individual
   structural domains from SCOP. These files fulfil the requirements
   above and in addition, by using the EMBOSS application PDBPLUS, can
   include derived data such as residue solvent accessibility and
   secondary structure. The files correct several inconsistencies in PDB
   and employ a consistent residue numbering scheme whilst preserving the
   numbering from the original PDB file. PDBPARSE identifies over 40
   different types of inconsistency, formatting error or other feature of
   a PDB file that warrants manual verification of its contents.
   The Protein Data Bank [1, 2] was established some 30 years ago at the
   Brookhaven National Laboratories as a repository for protein X-ray
   crystallographic data. The original design used an ASCII file format
   based on punched cards. Today, PDB uses a relational database
   management system and is managed by the Research Collaboratory for
   Structural Bioinformatics (RCSB) in collaboration with the European
   Macromolecular Structure Database (EMSD). Query tools such as the
   web-based PDBbrowse [3], SearchFields
   (http://www.rcsb.org/pdb/queryForm.cgi), MSDLite
   (http://www.ebi.ac.uk/msd-srv/msdlite/index.html) and MSDPro
   (http://www.ebi.ac.uk/msd-srv/msdpro/index.html) are useful for the
   analysis of a single or a few protein structures, but are an
   inconvenient source of coordinate data and are inappropriate for
   global, automated analyses. A researcher commonly needs direct access
   to the coordinate data; however, the text files provided are
   notoriously difficult to parse reliably. The problems arise from
   errors in individual PDB files and an awkward and inconsistent file
   format, which has evolved in a seemingly ad hoc manner to cope with
   increasing amounts of bibliographic and macromolecular coordinate data
   from a variety of experimental techniques. Difficulties in parsing can
   only compound problems arising from anomalies with the coordinate data
   themselves: over a million "outliers" in PDB have been identified [4],
   reflecting discrepancies with conventions, statistical outliers and
   probable errors.
   A particularly difficult aspect of parsing is determining the sequence
   of residues and ensuring that the atomic coordinates are assigned to
   the correct sequence position in the relevant data structure. The
   biological amino acid sequence (given in the SEQRES records of a PDB
   file) frequently differs from the sequence of residues (given in the
   ATOM records) for which coordinates are available. PDB does not
   consistently use a sequential residue numbering scheme and residue
   numbers must be treated as strings. Although mappings between the
   ATOM and SEQRES records can be obtained automatically by using the
   pdb2cif program [5], these contain errors owing to mistakes and
   inconsistencies in PDB. The authors of the ASTRAL compendium [6]
   identified some of the types of error and provide manual corrections
   to the pdb2cif mappings in their Rapid Access Format (RAF) database
   [7].
   Although extensive validation is now performed on deposited data,
   including comparisons of PDB SEQRES and ATOM records, there is a
   legacy of PDB files that predate these quality control measures.
   Extensive efforts are being made by the RCSB and the EMSD to clean up
   the legacy files. For example, database constraints are used by MSD to
   maintain data integrity in their archive database so that
   inconsistencies (primarily in bibliographic, chemical and coordinate
   data) do not appear in the search database. Further difficulties in
   processing PDB data arise in cases unrelated to file formatting, for
   example where multiple sets of coordinates are given for an individual
   atom or whole residue, where coordinates for only a single atom of a
   residue are given, where C-alpha atoms are missing or where
   coordinates for non-amino acid groups are given in polypeptide chains.
   Further, if a method uses protein domains such as those described in
   the SCOP database [8], coordinates for the domain have to be extracted
   from the PDB file. Fortunately, the SCOP domain definitions use the
   original PDB residue numbers taken from the ATOM records. Nonetheless,
   this presents an extra layer of complexity in parsing the data. For
   example, a SCOP domain may span more than one PDB chain or be composed
   of fragments from the same or different chains.
   We required a source of coordinate data that provided fast and
   convenient access, used an easily parsed, self-consistent format, was
   specific to proteins, contained minimal bibliographic data, correctly
   employed a consistent residue numbering scheme whilst preserving the
   original numbering, was easily extendible and which incorporated
   information on known structural domains as described in SCOP. We have
   written software to parse PDB files reliably and generate "cleaned up"
   flat text files of protein coordinate and derived data, for whole PDB
   files and individual SCOP domains. These files fulfil the requirements
   above and in addition include derived data such as residue solvent
   accessibility and secondary structure. Flexible masking of coordinate
   data for problematic residues, for example those lacking coordinates
   for a C-alpha atom, is also provided by the software. This document
   describes our software, the parsing methodology, and the content and
   format of our CCF files. This work complements that of other groups
   who have worked towards handling the PDB data, for example [9].
   The parsing of PDB files is highly inconvenient, time-consuming and
   potentially a major source of error in important fields such as
   molecular modelling, drug design, protein docking experiments, protein
   folding studies, protein structure comparison and threading. In fact,
   problems manifest in any method that uses PDB data and relies on an
   accurate mapping of the biological sequence to the available
   coordinates. Only a single error, for example in assigning coordinate
   or derived data to the correct position in a data structure, may be
   required to poison an entire analysis. Further, different
   interpretations of the PDB files by different groups might lead to
   inconsistencies between these analyses. PDBPARSE is used to generate
   files of coordinate data for protein chains and domains in an easily
   parsed, self-consistent format in which many inconsistencies and
   problematic features of the original PDB files, particularly in
   residue numbering, have been corrected. The original residue numbers
   are preserved, however, so that it is unnecessary to learn a new set
   of residue numbering conventions, and comparisons to the original
   files and to the approaches of other groups are easy. The files can be
   annotated with useful derived data, for example, by using the PDBPLUS
   application. This further increases their usefulness.
   Options to mask data for problematic residues and the capacity to
   generate derived data from the coordinates for whole proteins or
   individual domains add further flexibility to our software. The fast
   and reliable parsing of CCF files is a trivial matter: appropriate
   software is available in EMBOSS as part of the AJAX C programming
   library. We hope CCF files will be useful, for example, in the
   construction of secondary databases.

10.0 ALGORITHM

   Some of the tasks and difficulties involved in parsing a PDB file are
   summarized below. The numbered tasks correspond to the numbered steps
   of the parsing methodology described in Section 10.2.

  10.1 Summary of difficulties in parsing a PDB file

  10.1.1 Count models.

   The number of models is normally equal to the number of MODEL records,
   but NMR structures with only a single model sometimes lack a MODEL
   record.
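
   A minimal sketch of this rule is given below. It is illustrative only
   and is not taken from pdbparse.c; the function name count_models is
   hypothetical. It counts MODEL records and falls back to a single model
   when coordinates are present without any MODEL record.

```python
def count_models(pdb_lines):
    """Number of models in a PDB file given as a list of lines."""
    n_model = sum(1 for line in pdb_lines if line.startswith("MODEL "))
    if n_model:
        return n_model
    # No MODEL record: a file with coordinates still holds one model.
    has_coords = any(line.startswith(("ATOM  ", "HETATM"))
                     for line in pdb_lines)
    return 1 if has_coords else 0


two_models = ["MODEL        1", "ATOM      1  N   VAL A   1",
              "ENDMDL", "MODEL        2", "ATOM      1  N   VAL A   1",
              "ENDMDL"]
print(count_models(two_models))                      # 2
print(count_models(["ATOM      1  N   VAL A   1"]))  # 1
```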

  10.1.2. Count chains and assign chain identifier, sequence and length
  (residues).

   SEQRES records do not consistently include heterogens, or non-amino
   acid groups in polypeptide chains. The type of molecule (protein,
   nucleic acid or polysaccharide) is not clearly indicated. The
   indicated and actual number of SEQRES residues can differ. Rarely,
   chains are given in the SEQRES records but are missing from the ATOM
   records.
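
   The check for a mismatch between the indicated and actual number of
   SEQRES residues can be sketched as follows. This is an illustration
   under stated assumptions, not the PDBPARSE implementation: the
   function names are hypothetical, the column positions are those of the
   standard PDB SEQRES record (chain identifier in column 12, residue
   count in columns 14-17, residue names from column 20), and
   seqres_line exists only to build example records.

```python
def parse_seqres(pdb_lines):
    """Return {chain_id: (declared_length, [three-letter residue names])}."""
    chains = {}
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]            # column 12
        declared = int(line[13:17])    # columns 14-17
        names = line[19:70].split()    # residue names from column 20
        if chain_id not in chains:
            chains[chain_id] = (declared, [])
        chains[chain_id][1].extend(names)
    return chains


def length_discrepancies(chains):
    """Chains whose declared and actual SEQRES residue counts differ."""
    return {cid: (declared, len(names))
            for cid, (declared, names) in chains.items()
            if declared != len(names)}


def seqres_line(serial, chain_id, declared, names):
    """Format one SEQRES record (fixture for this example only)."""
    return f"SEQRES {serial:>3} {chain_id} {declared:>4}  {' '.join(names)}"


records = [
    seqres_line(1, "A", 5, ["VAL", "LEU", "SER", "PRO", "ALA"]),
    seqres_line(1, "B", 4, ["VAL", "HIS", "LEU"]),  # declares 4, lists 3
]
chains = parse_seqres(records)
print(length_discrepancies(chains))  # {'B': (4, 3)}
```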

  10.1.3. Parse coordinates for individual chains.

   Some files do not contain any TER records or have multiple chains that
   are not delimited by TER records. Multiple TER records are given for a
   single chain where the coordinates are for fragments of a protein
   digest. Occasionally, the TER record does not delimit the protein and
   heterogen atoms, but is given after the final heterogen atom instead.
   The order of chains in the SEQRES records may not agree with that in
   the ATOM records. Errors may occur in the use of chain identifiers,
   especially for N and C-terminal residues.

  10.1.4. Parse coordinates for non-covalently associated groups (heterogens)
  and assign them to chains.

   Chain identifiers are not consistently used and might differ from that
   of the chain to which the group is associated. Occasionally all the
   heterogen atoms are listed together after the last chain in the
   structure rather than being associated with individual chains.

  10.1.5. Identify heterogeneous residue positions.

   Residue numbering for heterogeneous positions is not handled
   consistently. For example, both insertion codes and non-sequential
   numbers are used.

  10.1.6. Process non-sequential and character-based numbering systems

   Coordinates might be given for a fragment and residue numbering is
   relative to the full length protein. Residue numbering might be
   relative to a sequence or topological alignment. For example,
   insertion codes (characters) are used in cases where numbering is
   given relative to a reference protein and the homologue possesses
   certain residues that the reference protein lacks. Insertion codes
   might also be used to indicate insertion mutations.
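
   Because of insertion codes, a residue identifier has to be read as a
   string rather than an integer. The sketch below shows one way to do
   this; it assumes the standard PDB ATOM record layout (chain identifier
   in column 22, residue number in columns 23-26, insertion code in
   column 27). The function names are hypothetical and atom_line exists
   only to build example records.

```python
def residue_id(line):
    """Return (chain_id, residue id as a string, e.g. '52A')."""
    chain_id = line[21]               # column 22
    res_seq = line[22:26].strip()     # columns 23-26
    i_code = line[26:27].strip()      # column 27, may be blank
    return chain_id, res_seq + i_code


def residue_order(pdb_lines):
    """Residue identifiers in the order they occur in the file."""
    order = []
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            rid = residue_id(line)
            if not order or order[-1] != rid:
                order.append(rid)
    return order


def atom_line(serial, name, res_name, chain_id, res_seq, i_code=" "):
    """Format the first 27 columns of an ATOM record (fixture only)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} "
            f"{chain_id}{res_seq:>4}{i_code}")


lines = [
    atom_line(1, " N", "GLY", "A", 52),
    atom_line(2, " CA", "GLY", "A", 52),
    atom_line(3, " CA", "SER", "A", 52, "A"),  # inserted residue 52A
    atom_line(4, " CA", "ALA", "A", 53),
]
print(residue_order(lines))  # [('A', '52'), ('A', '52A'), ('A', '53')]
```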

  10.1.7. Process jumps in residue numbering.

   Jumps in residue numbering may arise systematically, for example where
   parts of the structure could not be refined or where residue numbering
   is given relative to a reference protein and the homologue lacks
   certain residue(s) in the reference protein. Other jumps are the
   result of errors.

  10.1.8. Process residue numbering at the N-terminus.

   N-terminal MET residues and blocking groups are often numbered zero
   (rather than 1) but this also occurs for other N-terminal residues. In
   some files, for reasons of alignment, the N-terminal residue is
   assigned a negative number and the residue C-terminal to residue -1
   can be numbered +1 or 0. Sometimes the indicated starting residue
   number is either higher or lower than suggested by the SEQRES records.

  10.2 Parsing methodology

   The following text is numbered to correspond to the text above. The
   PDB file is read into memory and the number of models (sets of
   coordinates) is determined (1). The SEQRES records are parsed to
   determine the number of unique polypeptide chains and the chain
   identifier, amino acid sequence ('SEQRES sequence') and length
   (residues) of each chain (2). Coordinates for individual chains and
   heterogens are parsed for every model. Coordinates for a chain are
   normally indicated by the presence of an ATOM or HETATM record before
   a TER record and containing the relevant chain identifier, with the
   coordinates for heterogens appearing after the TER record. There are,
   however, many inconsistencies (3 and 4) that the parser manages. Each
   heterogen is assigned to a chain where possible and given a group
   number relative to either the chain with which it is associated or the
   whole protein. Thus a heterogen is uniquely identified by its chain
   number (if available) and group number. Where multiple coordinates are
   given for a single atom or residue, the first set of coordinates is
   used and the others discarded. Such cases are distinguished from
   residue heterogeneity (5), which may arise naturally or if a residue
   has been partly chemically modified. There are many difficulties in
   assigning the correct residue sequence and numbering for a chain.
   Non-sequential and character-based numbering systems are used (6),
   producing many examples where the residue numbering in the ATOM
   records does not agree at all with the SEQRES records. Jumps in
   residue numbering occur (7), and incorrect residue identifiers result
   in mismatches between the ATOM and SEQRES records. N-terminal MET
   residues and blocking groups listed in the ATOM records are frequently
   missing from the SEQRES records and there are serious inconsistencies
   in residue numbering at the N-terminus (8). Other N and C-terminal
   residues are occasionally omitted from the SEQRES records. The correct
   residue sequence and numbering for a chain is determined by an
   alignment of the 'SEQRES sequence' and an 'ATOM sequence' that is
   extracted from the ATOM records of the PDB file. The alignment
   procedure is summarised in Figure 3 and described in 6 steps below.

  10.2.1. (Step 1) Mark up ambiguous positions

   - The character position used in PDB files to indicate heterogeneity
   is also used in the character-based numbering schemes. Such ambiguous
   residue positions are recorded.

  10.2.2. (Step 2) Check residue numbering (presuming a character-based
  numbering system)

   - Each residue in the ATOM sequence is assigned a positive and
   incremental residue number based on the original PDB residue number;
   non-sequential or character-based numbering schemes are replaced.
   However, any jumps in the residue numbering are preserved. A copy of
   the original PDB residue numbering is also preserved. The SEQRES
   sequence is corrected for any missing N- or C-terminal groups. At this
   stage, ATOM records are presumed to use a character-based numbering
   scheme rather than contain heterogeneity and no errors (residue
   mismatches between the ATOM and SEQRES sequences) are allowed. If the
   ATOM sequence is a sub-string of the SEQRES sequence or the residue
   numbering agrees with the SEQRES sequence, the SEQRES sequence is
   taken as the biological sequence, residue numbers for the chain are
   assigned and no further steps are necessary.
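
   The renumbering in this step can be illustrated with the sketch below.
   It is one possible reading of the description above, not the code in
   pdbparse.c: the function name renumber is hypothetical, numbering is
   assumed to start at 1, and a jump is assumed to mean an increase of
   more than one in the integer part of the original residue number.

```python
import re


def renumber(original_ids):
    """Assign positive, incremental numbers to original residue ids.

    original_ids are strings such as '-1', '52' or '52A'. Forward jumps
    in the original numbering are kept; each original id is preserved
    alongside its new number.
    """
    result = []
    new_number = 0
    previous = None
    for rid in original_ids:
        match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", rid)
        if match is None:
            raise ValueError(f"unrecognised residue id: {rid!r}")
        current = int(match.group(1))
        if previous is not None and current - previous > 1:
            step = current - previous   # preserve the jump
        else:
            step = 1                    # insertion codes, repeats, etc.
        new_number += step
        result.append((new_number, rid))
        previous = current
    return result


print(renumber(["-1", "0", "1", "2"]))
# [(1, '-1'), (2, '0'), (3, '1'), (4, '2')]
print(renumber(["52", "52A", "52B", "53", "57"]))
# [(1, '52'), (2, '52A'), (3, '52B'), (4, '53'), (8, '57')]
```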

  10.2.3. (Step 3) Align ATOM and SEQRES sequences (presuming a character-based
  numbering system)

   - An alignment of the ATOM and SEQRES sequences is sought by
   identifying exact string matches between sub-strings of the ATOM
   sequence and the full-length SEQRES sequence. Consider the alignment
   of an ATOM sequence (A) of 100 residues (A1, A2, ..., A100) to a SEQRES
   sequence of 120 residues (S1, S2, ..., S120). First, the parser checks for
   an exact match of A to S, and if none is found sub-strings of A of
   progressively smaller size are tested; A1-A99 first, then A1-A98 and
   so on until an exact match is found or an exact match for a single
   residue only (A1) cannot be found. A sub-string of A can be matched to
   any region in the SEQRES sequence, but exact matches are discarded if
   they would not leave sufficient space in the SEQRES sequence
   (C-terminal to the matched region) for the alignment of the remainder
   of the ATOM sequence.
   Imagine an example where A1-A50 matches exactly S11-S60, but A51 does
   not match S61. To continue the alignment, the parser searches for an
   exact match for the remainder of the ATOM sequence, A51-A100, then
   A51-A99, A51-A98 and so on, as before. The sub-string can be matched
   to any region in the SEQRES sequence from position 61 onwards, so long
   as it would leave space for the remainder of the ATOM sequence.
   Gaps between successive matching regions are allowed and in this
   manner SEQRES residues missing from the ATOM records are detected and
   residue numbers assigned. If, for example, A51-A100 matched exactly
   S62-S111, then the ATOM sequence is missing a single residue (SEQRES
   residue 61) relative to the SEQRES sequence.
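
   The string-matching procedure described above can be sketched in Python
   as follows. This is a minimal illustration, not the PDBPARSE source: the
   function name align_atom_to_seqres, the use of one-letter sequence
   strings and the returned list of 0-based SEQRES indices are all
   assumptions made for the example.

```python
def align_atom_to_seqres(atom, seqres):
    """Greedy exact-match alignment of an ATOM sequence to a SEQRES sequence.

    Returns a list giving, for each ATOM residue, its 0-based index in
    seqres, or None if some residue cannot be placed.
    """
    index = []   # SEQRES index assigned to each ATOM residue so far
    a_pos = 0    # next unaligned position in atom
    s_pos = 0    # earliest position in seqres where the next match may start
    while a_pos < len(atom):
        placed = False
        # Try the longest remaining sub-string first, then shorter ones.
        for length in range(len(atom) - a_pos, 0, -1):
            piece = atom[a_pos:a_pos + length]
            remainder = len(atom) - (a_pos + length)
            hit = seqres.find(piece, s_pos)
            # Only the leftmost hit needs testing: any later hit would
            # leave even less room C-terminal to the matched region.
            if hit != -1 and hit + length + remainder <= len(seqres):
                index.extend(range(hit, hit + length))
                a_pos += length
                s_pos = hit + length
                placed = True
                break
        if not placed:
            return None  # not even a single residue could be matched
    return index
```

   SEQRES residues that are missing from the ATOM records show up as jumps
   in the returned indices. For instance, aligning "TAYIAKQRISFVKSH" to
   "MKTAYIAKQRQISFVKSHFSRQ" assigns indices 2-9 and 11-17, so the SEQRES
   residue at index 10 has no coordinates.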

  10.2.4. (Step 4) Check alignment (presuming heterogeneity)

   - If, after string alignment, an exact match is not found for all
   positions in the ATOM sequence, steps 2. and 3. are repeated but
   heterogeneity is presumed rather than an alternative numbering scheme;
   redundant sets of coordinates for heterogeneous positions are masked
   from the ATOM sequence.
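
   One simple way to mask heterogeneous positions is sketched below. It
   assumes (this is not stated in the description above) that heterogeneity
   appears as a repeated residue number in the ATOM records and that the
   first residue seen at each number is the one kept. The name
   mask_heterogeneity and the (number, name) tuples are hypothetical and
   used for illustration only.

```python
def mask_heterogeneity(residues):
    """Keep the first residue seen at each residue number.

    residues is a list of (residue_number, residue_name) tuples in file
    order.  Later residues that reuse a number are treated as redundant
    and dropped (an assumption of this sketch).
    """
    seen = set()
    kept = []
    for number, name in residues:
        if number in seen:
            continue  # redundant coordinates for a heterogeneous position
        seen.add(number)
        kept.append((number, name))
    return kept
```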

  10.2.5. (Step 5) Check alignment (allowing mismatches)

   - If an exact match still cannot be found for all positions in the
   ATOM sequence, steps 2 - 4. are repeated, but this time a user-defined
   number (typically 3) of mismatches between the SEQRES and ATOM
   positions are allowed. The "true" residue (i.e. the one that will be
   given in the protein sequence in the output file) is that from the
   ATOM records. For example, in cases where the lengths of the two
   sequences are the same but the sequences differ, the sequence from the
   ATOM records is taken to be the true sequence.
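
   The mismatch rule can be illustrated with a short Python sketch for the
   equal-length case mentioned above. The function name
   resolve_with_mismatches and its return value are assumptions made for
   the example; only the limit on mismatches and the preference for the
   ATOM residue come from the description.

```python
def resolve_with_mismatches(atom, seqres_region, max_mismatches=3):
    """Compare equal-length ATOM and SEQRES stretches position by position.

    Returns (true_sequence, mismatches), or None if the mismatch limit is
    exceeded.  Each mismatch is (position, atom_residue, seqres_residue).
    """
    if len(atom) != len(seqres_region):
        raise ValueError("sequences must be the same length")
    mismatches = [
        (i, a, s)
        for i, (a, s) in enumerate(zip(atom, seqres_region))
        if a != s
    ]
    if len(mismatches) > max_mismatches:
        return None
    # The ATOM residue is taken as the "true" residue at every position.
    return atom, mismatches
```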

  10.2.6. (Step 6) Default assignment

   - If, after step 5., the ATOM and SEQRES sequences could not be
   matched (allowing up to the maximum number of mismatches), the raw
   ATOM sequence is taken to be the true sequence in which case the
   assignment of residue numbers is trivial. Thus the alignment procedure
   uses the following priority when finding the correct alignment of the
   ATOM and SEQRES sequences: (1) No gap insertion or mismatches
   required. (2) Gap insertion with no mismatches. (3) Mismatches but no
   gaps. (4) Gap insertion and mismatches. (5) Default of raw ATOM
   sequence is used and SEQRES sequence is discarded.
   Figure 3 Schematic diagram of alignment procedure 
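
   The priority order can be summarised as a small decision function. The
   pairing of each priority level with a diagnostic code is a reading of
   the messages listed in Section 12.0, not confirmed program logic, and
   the function name classify_alignment is hypothetical.

```python
def classify_alignment(matched, has_gaps, n_mismatches):
    """Map an alignment outcome to (priority, diagnostic_code).

    priority follows the list (1)-(5) above; diagnostic_code is the
    Section 12.0 message that appears to correspond, or None where no
    message seems to apply.
    """
    if not matched:
        return 5, "NOMATCH"      # raw ATOM sequence used, SEQRES discarded
    if not has_gaps and n_mismatches == 0:
        return 1, None           # exact match, nothing to report
    if has_gaps and n_mismatches == 0:
        return 2, "GAPPEDOK"     # gap insertion with no mismatches
    if not has_gaps:
        return 3, "MISMATCH"     # mismatches but no gaps
    return 4, "GAPPED"           # gap insertion and mismatches
```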


11.0 RELATED APPLICATIONS

See also

    Program name                        Description
   aaindexextract Extract data from AAINDEX
   allversusall   Sequence similarity data from all-versus-all comparison
   cathparse      Generates DCF file from raw CATH files
   cutgextract    Extract data from CUTG
   domainer       Generates domain CCF files from protein CCF files
   domainnr       Removes redundant domains from a DCF file
   domainseqs     Adds sequence records to a DCF file
   domainsse      Add secondary structure records to a DCF file
   hetparse       Converts heterogen group dictionary to EMBL-like format
   pdbplus        Add accessibility & secondary structure to a CCF file
   pdbtosp        Convert swissprot:PDB codes file to EMBL-like format
   printsextract  Extract data from PRINTS
   prosextract    Build the PROSITE motif database for use by patmatmotifs
   rebaseextract  Extract data from REBASE
   scopparse      Generate DCF file from raw SCOP files
   seqnr          Removes redundancy from DHF files
   sites          Generate residue-ligand CON files from CCF files
   ssematch       Search a DCF file for secondary structure matches
   tfextract      Extract data from TRANSFAC

  11.1 Domain coordinate data

   We wrote the DOMAINER application to read protein CCF files and
   generate files of coordinates for single SCOP domains in the CCF
   format (domain CCF files, Figure 1) and the PDB format. DOMAINER reads
   a file of SCOP classification data that is prepared by using the
   application SCOPPARSE, and generates an output file in each format for
   each domain listed. Where coordinates for multiple models were
   determined, data in the output files are given for the first model
   only. In cases where a domain consists of sections from more than one
   polypeptide chain, the data are presented as belonging to a single
   chain only (a single sequence with a chain identifier of is given).
   SCOPPARSE was written to parse the raw SCOP classification files
   (dir.cla.scop.txt and dir.des.scop.txt) available at URL
   (http://scop.mrc-lmb.cam.ac.uk/scop/parse/) and generate an output
   file suitable for use with DOMAINER.

  11.2 Derived data

   We wrote the PDBPLUS application, as a wrapper to the STRIDE [10] and
   NACCESS [11] programs, to add the derived data (i - iv below) to a CCF
   file. PDBPLUS runs on both protein and domain CCF files and requires
   corresponding files in PDB format, either the original PDB file (for
   proteins) or one generated by using DOMAINER (for domains). The values
   of the derived data for a given residue will of course depend on
   whether they were calculated from the coordinates for the entire
   structure or just the domain. Thus PDBPLUS provides useful flexibility
   in generating the derived data.
   i. Absolute (Abs.) and relative (Rel.) accessible surface area of
   residues according to NACCESS. Abs. is the summed accessible surface
   area of the atoms, whereas Rel. is expressed as a percentage relative
   to the accessibility of the atoms in an extended ALA-x-ALA tri-peptide
   conformation. The NACCESS authors treat alpha carbons as side-chain
   atoms so that glycine can have a value for side-chain accessible
   surface area (see 'Database format' below). Alpha carbons are therefore
   not included in the main-chain.
   ii. Phi and Psi angle and solvent accessible surface area of residues
   as calculated by using STRIDE.
   iii. Secondary structure assignment according to STRIDE, one of 'H'
   (alpha helix), 'G' (3-10 helix), 'I' (Pi-helix), 'E' (extended
   conformation), 'B' or 'b' (isolated bridge), 'T' (turn) or 'C' (coil,
   i.e. none of the above).
   iv. The number of helical or beta-strand secondary structure elements
   (SSEs) in the chain or domain. An 'element' is defined as a run of a
   user-defined number (typically 4) of residues in the 'H', 'G' or 'I'
   conformation (helices) or the 'E' conformation (beta-strands).
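
   The element count in (iv) can be sketched in Python as below. The sketch
   assumes that a run must contain at least the user-defined number of
   residues and that consecutive 'H', 'G' and 'I' residues form a single
   helical run; neither point is stated explicitly above, and the function
   name count_sses is hypothetical.

```python
from itertools import groupby

HELIX_CODES = set("HGI")   # alpha, 3-10 and Pi helix
STRAND_CODES = set("E")    # extended conformation


def count_sses(ss_string, min_len=4):
    """Count helical and beta-strand elements in a STRIDE-style string."""

    def kind(code):
        if code in HELIX_CODES:
            return "helix"
        if code in STRAND_CODES:
            return "strand"
        return None  # turn, bridge, coil and anything else

    counts = {"helix": 0, "strand": 0}
    # groupby yields one group per run of consecutive residues of a kind.
    for k, run in groupby(ss_string, key=kind):
        if k is not None and sum(1 for _ in run) >= min_len:
            counts[k] += 1
    return counts
```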

12.0 DIAGNOSTIC ERROR MESSAGES

  12.1 Features of a PDB file diagnosed during parsing

   Listed below are various types of inconsistencies, errors or other
   dubious features of a PDB file identified by PDBPARSE. The first line
   of each block is an example of a diagnostic message that may appear in
   the log file generated by PDBPARSE. The second line is a description of
   the problem and the action taken, and the third line is the frequency
   with which the message is reported, e.g. "Chain" means the message is
   reported for each chain as appropriate.


FILE_OPEN      my.file
my.file could not be opened for reading or writing.  The file is ignored.
File

FILE_READ      my.file
my.file could not be read.  The file is ignored.
File

NO_OUTPUT      my.file
No clean coordinate file was generated for my.file.  This will happen if there
was a FILE_READ error on the raw PDB file, or a NOSEQRES, NOATOM or NOPROTEINS
error when reading the file.
File

FILE_WRITE     my.file
my.file could not be written. The file is ignored.
File

BADINDEX 1 (A)
Raw residue numbering from ATOM records does not give the correct index into
the SEQRES sequence for chain 1 ('A'). The correct alignment of the ATOM and
SEQRES sequences is found by string handling (see 'Parsing methodology' in the
text).
Chain

NEGNUM 1 (A) 123
Negative residue number found for chain 1 ('A') on line 123.
Chain

ZERNUM 1 (A) 123
Residue number of zero found for chain 1 ('A') on line 123.
Chain

ODDNUM 1 (A) 123
Possible residue heterogeneity or alternative residue numbering scheme for
chain 1 ('A') on line 123.
Chain

NONSQNTL 1 (A) 123
Possible case of non-sequential numbering error for chain 1 ('A') on line 123.
Chain

HETEROK 1 (A)
Correct alignment of ATOM and SEQRES sequences of chain 1 ('A') found by
presuming one or more instances of heterogeneity.
Chain

ALTERNOK 1 (A)
Correct alignment of ATOM and SEQRES sequences of chain 1 ('A') found by
presuming an alternative residue numbering scheme.
Chain

MISSNTERM 1 (A) 3
SEQRES records appeared to be missing 3 N-terminal residues relative to ATOM
sequence for chain 1 ('A'). The missing residues are added to the sequence.
Chain

MISSCTERM 1 (A) 3
SEQRES records appeared to be missing 3 C-terminal residues relative to ATOM
sequence for chain 1 ('A'). The missing residues are added to the sequence.
Chain

GAPPEDOK 1 (A)
Correct alignment of ATOM and SEQRES sequences of chain 1 ('A') found by gap
insertion with no mismatches.
Chain

MISMATCH 1 (A) 2 ALA 2 ARG 6;    ALA 12 TYR 16
Correct alignment of ATOM and SEQRES sequences of chain 1 ('A') found without
gap insertion but contained 2 mismatches (ALA 2 versus ARG 6 and ALA 12 versus
TYR 16).
Chain

GAPPED 1 (A) 2 ALA 2 ARG 6;    ALA 12 TYR 16
Correct alignment of ATOM and SEQRES sequences of chain 1 ('A') found by gap
insertion but contained 2 mismatches (ALA 2 versus ARG 6 and ALA 12 versus
TYR 16).
Chain

NOMATCH 1 (A)
Correct alignment of ATOM and SEQRES sequences of chain 1 ('A') could not be
found by string handling (see 'Parsing methodology' in the text). The raw
sequence from the ATOM records is taken to be the true sequence and the SEQRES
sequence is discarded.
Chain

DUPATOMRES 3
Multiple sets of coordinates were given for an individual atom or whole
residues, first instance on line 3. The first set of coordinates is used and
the others are discarded.
Residue

NOATOMRESID 123
No atom or residue identifier specified, first instance on line 123. All such
lines are discarded.
Residue

SEQRESLENDIF 1(A)
Indicated and actual lengths of SEQRES sequence differ for chain 1 (chain
identifier 'A'). The actual length of the sequence is used.
Chain

CHAINIDS 1 (A) 2 (A)
Chain identifiers of chains 1 and 2 are not unique ('A' in both cases). Both
chains are discarded.
File

CHAINIDSPC
Space (' ') and non-space characters are both used for chain identifiers in a
single file. Chains in ATOM records are identified by reference to the TER
records as well as chain identifiers.
File

CHAINORDER 123
The order of the chains in the ATOM records is inconsistent with that in the
SEQRES records, first instance on line 123. Coordinates are assigned to the
correct chain by reference to the chain identifier.
File

TERNONE
No TER records were found. The chains in the ATOM records are identified by
reference to the chain identifiers.
File

TERTOOMANY
Number of TER records is greater than the number of chains; possible digest.
File

TERTOOFEW
Number of TER records is less than the number of chains.
File

TERMISSHET 123 124
A chain is not separated from its heterogen group by a TER record between lines
123 and 124. Coordinates for the chain and heterogen are distinguished by
reference to the chain identifier and residue numbers.
Chain

TERMISSCHN 123 124
Two chains are not separated by TER records between lines 123 and 124.
Chain

SEQRESNOAA 1 (A)
No known amino acids found in the SEQRES records for chain 1 ('A'). The chain
is discarded.
Chain

SEQRESFEWAA 1 (A)
Fewer than the user-specified minimum number (5) of known amino acids were
found in the SEQRES records for chain 1 ('A'). The chain is discarded.
Chain

NOPROTEINS
No chains were found with at least the user-specified minimum number (5) of
known amino acids. The file is not parsed and no output file is generated.
File

ATOMFEWAA 1 (A) 3
Fewer than the user-specified minimum number of known amino acids found in the
ATOM records for chain 1 ('A'), model 3. The chain is discarded.
Chain

SECMISS 123
One or more standard records (e.g. for residue identity) were missing for an
SSE on line 123. The element(s) are discarded.
Line

SECBOTH 1 2 ALA 2 ARG 6
The start and end residues (ALA 2 ARG 6) of an element given in the HELIX,
SHEET or TURN records were not found in the ATOM records of chain 1, model 2.
The element is discarded.
Element

SECSTART 1 2 ALA 2
The start residue (ALA 2) of an element was not found in the ATOM records of
chain 1, model 2. The element is discarded.
Element

SECEND 1 2 ARG 6
The end residue (ARG 6) of an element was not found in the ATOM records of
chain 1, model 2. The element is discarded.
Element

SECCHAIN A
Chain identifier ('A') specified for an element not found in PDB file. The
element is discarded.
Element

SECTWOCHN A   B
Two chain identifiers ('A' and 'B') specified for an element. The element is
discarded.
Element

NOSEQRES
No SEQRES records. The file is not parsed and no output file is generated.
File

NOATOM
No ATOM records. The file is not parsed and no output file is generated.
File

RESOLMOD
A value for the RESOLUTION record is given but MODEL records are also found.
The file is presumed to contain an NMR structure or model.
File

NORESOLUTION
RESOLUTION record not found. The file is presumed to contain an NMR structure
or model.
File

NOMODEL
NMR structure with no MODEL records. The number of models is determined by
reference to the TER records.
File

MODELDUP 123
Duplicate MODEL records on line 123. The duplicate record is disregarded.
File
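
   When many PDB files are parsed, it can help to tally how often each
   diagnostic code occurs in the log file. The Python sketch below assumes
   that every diagnostic line starts in the first column with its
   upper-case code followed by whitespace, as in the examples above; the
   real log layout may differ, and the function name tally_codes is
   hypothetical.

```python
import re
from collections import Counter

# A code is a run of capitals/underscores at the very start of the line.
CODE_PATTERN = re.compile(r"^([A-Z_]+)(?:\s|$)")


def tally_codes(lines):
    """Count diagnostic codes in an iterable of log-file lines."""
    counts = Counter()
    for line in lines:
        match = CODE_PATTERN.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

   The function accepts any iterable of strings, so an open log file can
   be passed to it directly.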

13.0 AUTHORS

   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.
   
   See also http://emboss.sourceforge.net/

  14.1 Other useful references

   1. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N.,
   Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data
   Bank. Nucleic Acids Res., 28, 235-242.
   2. Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F.Jr,
   Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. and Tasumi, M.
   (1977) The Protein Data Bank, a Computer-based Archival File for
   Macromolecular Structures. J. Mol. Biol., 112, 535-542.
   3. Stampf, D.R., Felder, C.E. and Sussman, J.L. (1995) PDBbrowse - a
   graphics interface to the Brookhaven Protein Data Bank. Nature, 374,
   572-574.
   4. Hooft, R.W.W, Vriend, G., Sander, C. and Abola, E.E. (1996) Errors
   in protein structures. Nature, 381, 272.
   5. Bernstein, H, Bernstein, F and Bourne, P.E. pdb2cif: translating
   PDB entries into mmCIF format. J. Appl. Crystallog., 31, 282-295
   6. Brenner, S.E., Koehl, P. and Levitt, M. (2000) The ASTRAL
   compendium for protein structure and sequence analysis. Nucleic Acids
   Res., 28, 254-256.
   7. Chandonia, J-M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M.
   and Brenner, S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids
   Res., 30, 260-263.
   8. Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G.
   and Chothia, C. (2000) SCOP: a structural classification of proteins
   database. Nucleic Acids Res. 28, 257-259.
   9. Hamelryck, T. and Manderick, B. (2003) PDB file parser and structure
   class implemented in Python. Bioinformatics, 19, 2308-2310.
   10. Frishman, D. and Argos, P. (1996) 75% accuracy in protein
   secondary structure prediction. Proteins, 27, 329-335
   11. Hubbard, S.J. and Thornton, J.M. (1993) 'NACCESS', Computer
   Program, Department of Biochemistry and Molecular Biology, University
   College London.
   12. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells,
   M.B. and Thornton, J.M. (1997) CATH - A hierarchic classification of
   protein domain structures. Structure, 5, 1093-1108
