
                           HETPARSE documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Converts a raw dictionary of heterogen groups to EMBL-like format.

2.0 INPUTS & OUTPUTS

   HETPARSE parses the dictionary of heterogen groups available at
   http://pdb.rutgers.edu/het_dictionary.txt and writes a file containing
   the group names, synonyms and 3-letter codes in EMBL-like format.
   Optionally, HETPARSE will search a directory of PDB files and count
   the number of files in which each heterogen appears. The path of the
   PDB files and the names of the input and output files are specified
   by the user; the extension of the PDB files is set in the ACD file.
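
   The optional file count can be pictured with a minimal Python sketch.
   It is not HETPARSE's implementation: it assumes that a heterogen
   "appears in" a PDB file when that file contains a HET record naming
   it (identifier in columns 8-10), whereas this document says only that
   the directory is searched "with keywords". The function names and the
   ".ent" extension are hypothetical.

```python
from pathlib import Path


def het_ids_in_lines(lines):
    """Return the set of heterogen IDs named in PDB HET records."""
    ids = set()
    for line in lines:
        # Assumption: a HET record has "HET" plus blanks in columns 1-6
        # and the 3-character identifier in columns 8-10.
        if line.startswith("HET   "):
            ids.add(line[7:10].strip())
    return ids


def count_files_per_heterogen(directory, extension=".ent"):
    """Count, for each heterogen ID, how many files mention it.

    A heterogen that occurs several times in one file is counted once
    for that file. The default extension is a placeholder.
    """
    counts = {}
    for path in sorted(Path(directory).glob("*" + extension)):
        with open(path, encoding="ascii", errors="replace") as handle:
            for het_id in het_ids_in_lines(handle):
                counts[het_id] = counts.get(het_id, 0) + 1
    return counts
```

   Under these assumptions, count_files_per_heterogen("./") returns a
   dictionary whose values correspond to the NN field described in
   section 4.0.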

3.0 INPUT FILE FORMAT

   An excerpt from the raw dictionary of heterogen groups is shown below
   (Figure 1).

  Input files for usage example

  File: het.txt

RESIDUE   061     58
CONECT      N1     2 N2   C5
CONECT      N2     2 N1   N3
CONECT      N3     2 N2   N4
CONECT      N4     3 N3   C5   HN4
CONECT      C5     3 N1   N4   C6
CONECT      C6     3 C5   C7   C11
CONECT      C7     3 C6   C8   C12
CONECT      C8     3 C7   C9   H8
CONECT      C9     3 C8   C10  H9
CONECT      C10    3 C9   C11  H10
CONECT      C11    3 C6   C10  H11
CONECT      C12    3 C7   C13  C17
CONECT      C13    3 C12  C14  H13
CONECT      C14    3 C13  C15  H14
CONECT      C15    3 C14  C16  C18
CONECT      C16    3 C15  C17  H16
CONECT      C17    3 C12  C16  H17
CONECT      C18    4 C15  N19 1H18 2H18
CONECT      N19    3 C18  C20  C33
CONECT      C20    3 N19  C21  N25
CONECT      C21    4 C20  C22 1H21 2H21
CONECT      C22    4 C21  C23 1H22 2H22
CONECT      C23    4 C22  C24 1H23 2H23
CONECT      C24    4 C23 1H24 2H24 3H24
CONECT      N25    2 C20  C26
CONECT      C26    3 N25  C27  C32
CONECT      C27    3 C26  C28  H27
CONECT      C28    3 C27  C29  H28
CONECT      C29    3 C28  O30  C31
CONECT      O30    2 C29  HOU
CONECT      C31    3 C29  C32  H31
CONECT      C32    3 C26  C31  C33
CONECT      C33    3 N19  C32  O34
CONECT      O34    1 C33
CONECT      HN4    1 N4
CONECT      H8     1 C8
CONECT      H9     1 C9
CONECT      H10    1 C10
CONECT      H11    1 C11
CONECT      H13    1 C13
CONECT      H14    1 C14
CONECT      H16    1 C16
CONECT      H17    1 C17
CONECT     1H18    1 C18
CONECT     2H18    1 C18
CONECT     1H21    1 C21
CONECT     2H21    1 C21
CONECT     1H22    1 C22
CONECT     2H22    1 C22


  [Part of this file has been deleted for brevity]

CONECT     2H6     1 C6
CONECT     1H8     1 C8
CONECT     2H8     1 C8
CONECT     1H9     1 C9
CONECT     2H9     1 C9
END
HET    104             28
HETSYN     104 TRIENTINE
HETNAM     104 N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE
FORMUL      104    C6 H18 N4

RESIDUE   105     32
CONECT      B      3 O1   O2   C3
CONECT      O1     2 B    H1
CONECT      O2     2 B    H2
CONECT      C3     4 B    N4  1H3  2H3
CONECT      N4     3 C3   C5   H4
CONECT      C5     3 N4   O6   C7
CONECT      O6     1 C5
CONECT      C7     3 C5   C8   C12
CONECT      N11    2 O10  C12
CONECT      O10    2 N11  C8
CONECT      C8     3 C7   O10  C9
CONECT      C12    3 C7   N11  C13
CONECT      C9     4 C8  1H9  2H9  3H9
CONECT      C13    3 C12  C14  C18
CONECT      C14    3 C13  C15 CL1
CONECT     CL1     1 C14
CONECT      C15    3 C14  C16  H15
CONECT      C16    3 C15  C17  H16
CONECT      C17    3 C16  C18  H17
CONECT      C18    3 C13  C17  H18
CONECT      H1     1 O1
CONECT      H2     1 O2
CONECT     1H3     1 C3
CONECT     2H3     1 C3
CONECT      H4     1 N4
CONECT     1H9     1 C9
CONECT     2H9     1 C9
CONECT     3H9     1 C9
CONECT      H15    1 C15
CONECT      H16    1 C16
CONECT      H17    1 C17
CONECT      H18    1 C18
END
HET    105             32
HETSYN     105 CLOXACILLIN DERIVATIVE
HETNAM     105 N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACID
HETNAM   2 105 AMIDE] BORONIC ACID
FORMUL      105    C12 H12 N2 O4 B1 CL1

4.0 OUTPUT FILE FORMAT

   The records used in the output file (Figure 2) are as follows:
     * ID - 3-character abbreviation of heterogen
     * DE - full description
     * SY - synonym
     * NN - number of files in which this heterogen appears

  Output files for usage example

  File: Ehet.dat

ID   105
DE   N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACIDAMIDE] BORONIC ACID
SY   CLOXACILLIN DERIVATIVE
NN   0
//
ID   104
DE   N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE
SY   TRIENTINE
NN   0
//
ID   103
DE   2',5'-DIDEOXY-ADENOSINE 3'-MONOPHOSPHATE
SY   .
NN   0
//
ID   102
DE   GAMMA-DEOXY-GAMMA-SULFO-GUANOSINE-5'-TRIPHOSPHATE
SY   .
NN   0
//
ID   101
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
SY   .
NN   0
//
ID   100
DE   1-(5-CHLOROINDOL-3-YL)-3-HYDROXY-3-(2H-TETRAZOL-5-YL)-PROPENONE
SY   .
NN   0
//
ID   074
DE   [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-PROLINE
SY   CA-074;
SY   [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-CARBONYL)-L-ISOLEUCYL-L-PROLINE]
NN   0
//
ID   072
DE
DE   (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-OXO-5-THIAZOLIDINE
SY   THIAZOLIDINONE; GW0072
NN   0
//
ID   061
DE
DE   2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-YLMETHYL]-3H-QUINAZO
LIN-4-ONE
SY   L-159,061
NN   0
//

5.0 DATA FILES

   HETPARSE does not use a data file.

6.0 USAGE

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-infile]            infile     This option specifies the name of input file
                                  (raw dictionary of heterogen groups) to
                                  parse, which should be of the format
                                  specified at
                                  http://pdb.rutgers.edu/het_dictionary.txt
   -dogrep             toggle     This option specifies whether to search a
                                  directory of files (typically PDB files)
                                  with keywords. If set, HETPARSE will search
                                  the directory and will count the number of
                                  files that each heterogen appears in.
*  -dirlistpath        dirlist    This option specifies the directory to
                                  search with keywords.
  [-outfile]           outfile    This option specifies the name of EMBL-like
                                  format dictionary of heterogen groups.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers:

   [-infile] (Parameter 1)
      Description:     This option specifies the name of input file (raw
                       dictionary of heterogen groups) to parse, which
                       should be of the format specified at
                       http://pdb.rutgers.edu/het_dictionary.txt
      Allowed values:  Input file
      Default:         Required

   -dogrep
      Description:     This option specifies whether to search a
                       directory of files (typically PDB files) with
                       keywords. If set, HETPARSE will search the
                       directory and will count the number of files that
                       each heterogen appears in.
      Allowed values:  Toggle value Yes/No
      Default:         No

   -dirlistpath
      Description:     This option specifies the directory to search with
                       keywords.
      Allowed values:  Directory with files
      Default:         ./

   [-outfile] (Parameter 2)
      Description:     This option specifies the name of EMBL-like format
                       dictionary of heterogen groups.
      Allowed values:  Output file
      Default:         Ehet.dat

   Additional (Optional) qualifiers: (none)

   Advanced (Unprompted) qualifiers: (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of HETPARSE is shown below.


% hetparse 
Converts heterogen group dictionary to EMBL-like format.
Name of input file (raw dictionary of heterogen groups): het.txt
Search a directory of PDB files with keywords? [N]: Y
Directory to search with keywords [./]: 
Name of EMBL-like format dictionary of heterogen groups. [Ehet.dat]: Ehet.dat


7.0 KNOWN BUGS & WARNINGS

   None.

8.0 NOTES

   HETPARSE is used to create the EMBOSS data file Ehet.dat that is
   included in the EMBOSS distribution.

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE FORMAT DESCRIPTION CREATED BY SEE ALSO
   Dictionary of heterogen groups A file of the dictionary of heterogen
   groups in PDB. HETPARSE N.A.

   None

9.0 DESCRIPTION

   Some research applications require knowledge of the types of small
   molecules or 'heterogens' (non-protein groups) that are represented in
   PDB files. A dictionary of such groups containing various data for all
   of the heterogens found in PDB is available, but is not in a
   convenient format. HETPARSE parses the dictionary in its raw format
   and converts it to an EMBL-like format.
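
   The conversion can be illustrated with a short Python sketch. It is
   not the EMBOSS source. It assumes the fixed-column layout seen in the
   excerpt in section 3.0 (identifier in columns 8-10 of HET lines and
   in columns 12-14 of HETNAM and HETSYN lines, text from column 16),
   joins continuation lines without a separator as the sample output in
   section 4.0 suggests, writes "." for a missing synonym, and always
   writes NN as 0. It does not reproduce the ordering or line wrapping of
   the real output. The function names are hypothetical.

```python
def parse_het_dictionary(lines):
    """Collect name (DE) and synonym (SY) text for each heterogen ID."""
    entries = {}
    for line in lines:
        tag = line[:6].strip()
        if tag == "HET":
            het_id = line[7:10].strip()
        elif tag in ("HETNAM", "HETSYN"):
            het_id = line[11:14].strip()
        else:
            continue  # RESIDUE, CONECT, END, FORMUL and blank lines
        entry = entries.setdefault(het_id, {"DE": "", "SY": ""})
        if tag == "HETNAM":
            # Continuation lines are appended directly to earlier text.
            entry["DE"] += line[15:70].rstrip()
        elif tag == "HETSYN":
            entry["SY"] += line[15:70].rstrip()
    return entries


def format_embl_like(entries):
    """Render the entries as ID/DE/SY/NN records ended by '//'."""
    blocks = []
    for het_id, entry in entries.items():
        blocks.append(
            f"ID   {het_id}\n"
            f"DE   {entry['DE']}\n"
            f"SY   {entry['SY'] or '.'}\n"
            "NN   0\n"
            "//\n"
        )
    return "".join(blocks)


SAMPLE = [
    "HET    105             32",
    "HETSYN     105 CLOXACILLIN DERIVATIVE",
    "HETNAM     105 N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACID",
    "HETNAM   2 105 AMIDE] BORONIC ACID",
    "FORMUL      105    C12 H12 N2 O4 B1 CL1",
]
print(format_embl_like(parse_het_dictionary(SAMPLE)), end="")
```

   Run on the 105 entry from section 3.0, the sketch yields the same ID,
   DE, SY and NN lines as the first record of the sample output in
   section 4.0.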

10.0 ALGORITHM

   None.

11.0 RELATED APPLICATIONS

See also

    Program name                        Description
   aaindexextract Extract data from AAINDEX
   allversusall   Sequence similarity data from all-versus-all comparison
   cathparse      Generates DCF file from raw CATH files
   cutgextract    Extract data from CUTG
   domainer       Generates domain CCF files from protein CCF files
   domainnr       Removes redundant domains from a DCF file
   domainseqs     Adds sequence records to a DCF file
   domainsse      Add secondary structure records to a DCF file
   pdbparse       Parses PDB files and writes protein CCF files
   pdbplus        Add accessibility & secondary structure to a CCF file
   pdbtosp        Convert swissprot:PDB codes file to EMBL-like format
   printsextract  Extract data from PRINTS
   prosextract    Build the PROSITE motif database for use by patmatmotifs
   rebaseextract  Extract data from REBASE
   scopparse      Generate DCF file from raw SCOP files
   seqnr          Removes redundancy from DHF files
   sites          Generate residue-ligand CON files from CCF files
   ssematch       Search a DCF file for secondary structure matches
   tfextract      Extract data from TRANSFAC

12.0 DIAGNOSTIC ERROR MESSAGES

   None.

13.0 AUTHORS

   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research,
   Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS: The European
   Molecular Biology Open Software Suite" Trends in Genetics,
   16(6):276-277.
   
   See also http://emboss.sourceforge.net/

