
                           SEQWORDS documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Generates DHF files (domain hits files) of database hits (sequences)
   from Swissprot matching keywords from a keywords file. Generates DHF
   files from keyword search of UniProt

2.0 INPUTS & OUTPUTS

   SEQWORDS searches a swissprot-format sequence database with keywords
   taken from a keywords file, and writes a DHF file (domain hits file)
   of sequences whose swissprot entries contains at least one of the
   keywords. If an entry contains a keyword in a domain record of the
   feature table, then the sequence of the domain is written to the
   output file, otherwise the entire sequence is written. The user
   specifies the name of the swissprot-format sequence database (input),
   keywords file (input) and DHF file (output).

3.0 INPUT FILE FORMAT

   The keywords file (below) contains lists of keywords specific to a
   number of SCOP or CATH nodes, e.g. families and superfamilies. Each
   list of keywords is given after a block of SCOP or CATH classification
   records; for family-specific search terms, the block must contain a
   CL, FO, SF and an FA record (see below). For superfamily-specific
   terms, clearly only the CL, FO and SF should be specified. A single
   keyword must be given per line after the record 'TE'. Each block of
   SCOP or CATH classification records and search terms must be delimited
   by the record '//' (the file should also end with this record).
   It is possible to provide search terms above the level of superfamily,
   for example, fold and class-specific search terms for SCOP by using
   the CL and FO records only as appropriate. However, text searches of
   swissprot for members of scop folds and classes are unlikely to
   produce specific or meaningful results.

  Input files for usage example

  File: seqwords.terms

TY   SCOP
XX
CL   Alpha and beta proteins (a/b)
XX
FO   NAD(P)-binding Rossmann-fold domains
XX
SF   NAD(P)-binding Rossmann-fold domains
XX
FA   Lactate & malate dehydrogenases, N-terminal domain
XX
TE   NAD(P)-binding Rossmann-fold
TE   Lactate & malate dehydrogenases
TE   Lactate dehydrogenase
TE   Malate dehydrogenase
//

  File: seqwords.seq

ID   ACEA_ECOLI     STANDARD;      PRT;   434 AA.
AC   P05313;
DT   01-NOV-1988 (Rel. 09, Created)
DT   01-NOV-1988 (Rel. 09, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   ISOCITRATE LYASE (EC 4.1.3.1) (ISOCITRASE) (ISOCITRATASE) (ICL).
GN   ACEA OR ICL.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Escherichia.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12;
RX   MEDLINE; 89083515.
RA   Byrne C.R., Stokes H.W., Ward K.A.;
RT   "Nucleotide sequence of the aceB gene encoding malate synthase A in
RT   Escherichia coli.";
RL   Nucleic Acids Res. 16:10924-10924(1988).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12;
RX   MEDLINE; 88262573.
RA   Rieul C., Bleicher F., Duclos B., Cortay J.-C., Cozzone A.J.;
RT   "Nucleotide sequence of the aceA gene coding for isocitrate lyase in
RT   Escherichia coli.";
RL   Nucleic Acids Res. 16:5689-5689(1988).
RN   [3]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 89008064.
RA   Matsuoka M., McFadden B.A.;
RT   "Isolation, hyperexpression, and sequencing of the aceA gene encoding
RT   isocitrate lyase in Escherichia coli.";
RL   J. Bacteriol. 170:4528-4536(1988).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12 / MG1655;
RX   MEDLINE; 94089392.
RA   Blattner F.R., Burland V.D., Plunkett G. III, Sofia H.J.,
RA   Daniels D.L.;
RT   "Analysis of the Escherichia coli genome. IV. DNA sequence of the
RT   region from 89.2 to 92.8 minutes.";
RL   Nucleic Acids Res. 21:5408-5417(1993).
RN   [5]
RP   SEQUENCE OF 293-434 FROM N.A.
RX   MEDLINE; 88227861.
RA   Klumpp D.J., Plank D.W., Bowdin L.J., Stueland C.S., Chung T.,
RA   Laporte D.C.;
RT   "Nucleotide sequence of aceK, the gene encoding isocitrate
RT   dehydrogenase kinase/phosphatase.";
RL   J. Bacteriol. 170:2763-2769(1988).


  [Part of this file has been deleted for brevity]

FT   CONFLICT     70     70       A -> R (IN REF. 2).
FT   CONFLICT     80     80       A -> R (IN REF. 1 AND 2).
FT   CONFLICT    116    116       I -> N (IN REF. 2).
FT   CONFLICT    144    144       F -> L (IN REF. 1).
FT   CONFLICT    305    312       LGEEFVNK -> WAKSSLISN (IN REF. 2).
FT   CONFLICT    307    307       E -> Q (IN REF. 1).
FT   STRAND        2      6
FT   TURN          7      9
FT   HELIX        11     23
FT   TURN         26     27
FT   STRAND       28     33
FT   TURN         37     38
FT   HELIX        39     47
FT   TURN         48     48
FT   STRAND       53     58
FT   HELIX        64     67
FT   TURN         68     69
FT   STRAND       72     75
FT   TURN         83     84
FT   HELIX        87    108
FT   TURN        110    111
FT   STRAND      113    116
FT   HELIX       121    134
FT   TURN        135    136
FT   TURN        140    141
FT   STRAND      143    145
FT   HELIX       148    162
FT   TURN        163    163
FT   HELIX       166    168
FT   STRAND      173    175
FT   TURN        179    181
FT   STRAND      182    184
FT   HELIX       186    188
FT   TURN        190    191
FT   HELIX       196    217
FT   TURN        218    219
FT   HELIX       225    242
FT   TURN        243    244
FT   STRAND      248    255
FT   STRAND      263    271
FT   TURN        272    273
FT   STRAND      274    278
FT   HELIX       286    311
SQ   SEQUENCE   312 AA;  32337 MW;  17741A3B5AD068BA CRC64;
     MKVAVLGAAG GIGQALALLL KTQLPSGSEL SLYDIAPVTP GVAVDLSHIP TAVKIKGFSG
     EDATPALEGA DVVLISAGVA RKPGMDRSDL FNVNAGIVKN LVQQVAKTCP KACIGIITNP
     VNTTVAIAAE VLKKAGVYDK NKLFGVTTLD IIRSNTFVAE LKGKQPGEVE VPVIGGHSGV
     TILPLLSQVP GVSFTEQEVA DLTKRIQNAG TEVVEAKAGG GSATLSMGQA AARFGLSLVR
     ALQGEQGVVE CAYVEGDGQY ARFFSQPLLL GKNGVEERKS IGTLSAFEQN ALEGMLDTLK
     KDIALGEEFV NK
//

4.0 OUTPUT FILE FORMAT

   DHF file (domain hits file)
   The format of the DHF file (domain hits file) of hit sequences
   generated by SEQWORDS (Figure 1) is described fully in SEQSEARCH
   documentation and only summarised here. The file contains two lines
   per hit, the first is a description of the hit in 16 text tokens
   delimited by '^'. The second line contains the protein sequence. The
   first 4 tokens refer to the hit (sequence) itself, the tokens are as
   follows:
     * (i) Accession number
     * (ii) Database code,
     * (iii - iv) Start and end positions of the hit relative to the full
       length sequence.

   The next 9 tokens refer to the domain family, superfamily etc for
   which the terms were defined (in the keywords file) and are as
   follows:
     * (v) Type of domain (one of 'SCOP' or 'CATH'),
     * (vi) SCOP or CATH domain identifier.
     * (vii) SCOP or CATH node unique identifier, e.g. SCOP Family Sunid.
     * (viii) Domain class. Textual description of the 'Class' (SCOP and
       CATH domains).
     * (ix) Domain architecture. Textual description of the
       'Architecture' (CATH only).
     * (x) Domain topology. Textual description of the 'Topology' (CATH
       only).
     * (xi) Domain fold. Textual description of the 'Fold' (SCOP domains
       only).
     * (xii) Domain superfamily. Textual description of the 'Superfamily'
       (SCOP and CATH domains).
     * (xiii) Domain family. Textual description of the 'Fold' (SCOP
       only).

   The next 4 tokens refer to the hit, specifically, information about
   the search result as follows:
     * (xiv) Model type. The type of model that was used to generate the
       hit. For DHF files generated by using SEQWORDS a value of KEYWORD
       is given. Several other values are possible, however, see
       SEQSEARCH documentation.
     * (xv) SC - Score of hit from search algoritm (not written by
       SEQWORDS).
     * (xvi) P-value of hit (not written by SEQWORDS).
     * (xvii) E-value of hit (not written by SEQWORDS).

  Output files for usage example

  File: seqwords.dhf

> Q60150^.^1^312^SCOP^.^0^Alpha and beta proteins (a/b)^.^.^NAD(P)-binding Ross
mann-fold domains^NAD(P)-binding Rossmann-fold domains^Lactate & malate dehydro
genases, N-terminal domain^KEYWORD^0.00^0.000e+00^0.000e+00
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGV
ARKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIIRSNTFV
AELKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLS
LVRALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGEEFVNK

5.0 DATA FILES

   SEQWORDS does not use a data file.

6.0 USAGE

   Standard (Mandatory) qualifiers:
  [-keyfile]           infile     This option specifies the name of keywords
                                  file (input). This contains a list of
                                  keywords specific to a number of SCOP or
                                  CATH families and superfamilies used by
                                  SEQWORDS to search a sequence database.
  [-spfile]            infile     This option specifies the name of the
                                  sequence database (input) to search.
  [-outfile]           outfile    This option specifies the name of the DHF
                                  file (domain hits file) (output). A 'domain
                                  hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA-like).
                                  The hits are relatives to a SCOP or CATH
                                  family (or other node in the structural
                                  hierarchies) and are found from a search of
                                  a sequence database. Files containing hits
                                  retrieved by PSIBLAST are generated by using
                                  SEQSEARCH, hits retrieved by a sparse
                                  protein signatare by using SIGSCAN or
                                  various types of HMM and profile by using
                                  LIBSCAN.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers Allowed values Default
   [-keyfile]
   (Parameter 1) This option specifies the name of keywords file (input).
   This contains a list of keywords specific to a number of SCOP or CATH
   families and superfamilies used by SEQWORDS to search a sequence
   database. Input file Required
   [-spfile]
   (Parameter 2) This option specifies the name of the sequence database
   (input) to search. Input file Required
   [-outfile]
   (Parameter 3) This option specifies the name of the DHF file (domain
   hits file) (output). A 'domain hits file' contains database hits
   (sequences) with domain classification information, in the DHF format
   (FASTA-like). The hits are relatives to a SCOP or CATH family (or
   other node in the structural hierarchies) and are found from a search
   of a sequence database. Files containing hits retrieved by PSIBLAST
   are generated by using SEQSEARCH, hits retrieved by a sparse protein
   signatare by using SIGSCAN or various types of HMM and profile by
   using LIBSCAN. Output file test.hits
   Additional (Optional) qualifiers Allowed values Default
   (none)
   Advanced (Unprompted) qualifiers Allowed values Default
   (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of SEQWORDS is shown below. Here is a
   sample session with seqwords


% seqwords 
Generates DHF files from keyword search of UniProt.
Name of keywords file (input): seqwords.terms
Name of sequence database (input): seqwords.seq
Name of DHF file (domain hits file) (output) [test.hits]: seqwords.dhf

   Go to the input files for this example
   Go to the output files for this example

7.0 KNOWN BUGS & WARNINGS

   SEQWORDS is slow - swissprot is read multiple times (once for each
   list of terms). Changing it to do a single file read would require
   modifying the program to take an array of hitlist and terms
   structures. An easy-ish change.

8.0 NOTES

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE FORMAT DESCRIPTION CREATED BY SEE ALSO
   Domain hits file DHF format (FASTA-like). Database hits (sequences)
   with domain classification information. The hits are relatives to a
   SCOP or CATH family (or other node in the structural hierarchies) and
   are found from a search of a discriminating element (e.g. a protein
   signature, hidden Markov model, simple frequency matrix, Gribskov
   profile or Hennikoff profile) against a sequence database. SEQSEARCH
   (hits retrieved by PSIBLAST). SIGSCAN (hits retrieved by sparse
   protein signature). LIBSCAN (hits retrieved by various types of HMM
   and profile). N.A.
   Keywords file Text Contains a list of keywords specific to a number of
   SCOP families and superfamilies used by SEQWORDS to search a sequence
   database. N.A. N.A.

   None

9.0 DESCRIPTION

   Retrieval of domain sequence information from swissprot by mutliple
   keywords can be a time-consuming process. SEQWORDS parses a file of
   keywords and the swissprot database and writes a file of sequences
   whose swissprot entries contains at least one of the keywords. The
   domain sequence is taken if the keyword appears in the domain record
   of the feature table, otherwise the full-length sequence is taken.

10.0 ALGORITHM

   None.

11.0 RELATED APPLICATIONS

See also

   Program name                       Description
   contactcount Count specific versus non-specific contacts
   contacts     Generate intra-chain CON files from CCF files
   domainalign  Generate alignments (DAF file) for nodes in a DCF file
   domainrep    Reorder DCF file to identify representative structures
   domainreso   Remove low resolution domains from a DCF file
   interface    Generate inter-chain CON files from CCF files
   libgen       Generate discriminating elements from alignments
   matgen3d     Generate a 3D-1D scoring matrix from CCF files
   psiphi       Phi and psi torsion angles from protein coordinates
   rocon        Generates a hits file from comparing two DHF files
   rocplot      Performs ROC analysis on hits files
   scorecmapdir Contact scores for cleaned protein chain contact files
   seqalign     Extend alignments (DAF file) with sequences (DHF file)
   seqfraggle   Removes fragment sequences from DHF files
   seqsearch    Generate PSI-BLAST hits (DHF file) from a DAF file
   seqsort      Remove ambiguous classified sequences from DHF files
   siggen       Generates a sparse protein signature from an alignment
   siggenlig    Generate ligand-binding signatures from a CON file
   sigscan      Generate hits (DHF file) from a signature search
   sigscanlig   Search ligand-signature library & write hits (LHF file)

12.0 DIAGNOSTIC ERROR MESSAGES

   None.

13.0 AUTHORS

   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 15:276-278.
   
   See also http://emboss.sourceforge.net/

  14.1 Other useful references
