
                               patmatmotifs 



Function

   Search a PROSITE motif database with a protein sequence

Description

   patmatmotifs takes a protein sequence and compares it to the PROSITE
   database of motifs.

   For a description of PROSITE, we can do no better than to quote the
   PROSITE user's documentation: 

   PROSITE is a method of determining what is the function of
   uncharacterized proteins translated from genomic or cDNA sequences. It
   consists of a database of biologically significant sites and patterns
   formulated in such a way that with appropriate computational tools it
   can rapidly and reliably identify to which known family of protein (if
   any) the new sequence belongs.

   In some cases the sequence of an unknown protein is too distantly
   related to any protein of known structure to detect its resemblance by
   overall sequence alignment, but it can be identified by the occurrence
   in its sequence of a particular cluster of residue types which is
   variously known as a pattern, motif, signature, or fingerprint. These
   motifs arise because of particular requirements on the structure of
   specific region(s) of a protein which may be important, for example,
   for their binding properties or for their enzymatic activity. These
   requirements impose very tight constraints on the evolution of those
   limited (in size) but important portion(s) of a protein sequence. To
   paraphrase Orwell, in Animal Farm, we can say that "some regions of a
   protein sequence are more equal than others" !

   The use of protein sequence patterns (or motifs) to determine the
   function(s) of proteins is becoming very rapidly one of the essential
   tools of sequence analysis. This reality has been recognized by many
   authors, as it can be illustrated from the following citations from
   two of the most well known experts of protein sequence analysis, R.F.
   Doolittle and A.M. Lesk:
      "There are many short sequences that are often (but not always)
      diagnostics of certain binding properties or active sites. These can be
      set into a small subcollection and searched against your sequence (4)".

      "In some cases, the structure and function of an unknown protein which
      is too distantly related to any protein of known structure to detect
      its affinity by overall sequence alignment may be identified by its
      possession of a particular cluster of residues types classified as a
      motifs. The motifs, or templates, or fingerprints, arise because of
      particular requirements of binding sites that impose very tight
      constraint on the evolution of portions of a protein sequence (5)."

   The home web page of PROSITE is: http://www.expasy.ch/prosite/

   It is common to find that a search of the PROSITE database against a
   protein sequence reports many matches to the short motifs that
   indicate post-translational modification sites, such as
   glycosylation, myristylation and phosphorylation sites. These matches
   are often unwanted, so by default they are not reported. You can turn
   reporting of these short motifs on by giving the '-noprune' option on
   the command line.
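
   The effect of pruning can be pictured as a simple filter over the
   names of the matched motifs. The following is a minimal sketch only,
   not the program's own implementation: the function name prune_hits
   and the example hit list are hypothetical, the case-insensitive
   comparison is an assumption, and the motif names are those listed in
   the description of the -[no]prune qualifier.

```python
# Names of the simple post-translational modification motifs that are
# dropped when pruning is on (taken from the -[no]prune description).
PRUNED_MOTIFS = {
    "myristyl",
    "asn_glycosylation",
    "camp_phospho_site",
    "pkc_phospho_site",
    "ck2_phospho_site",
    "tyr_phospho_site",
}


def prune_hits(motif_names, prune=True):
    """Return the motif names, dropping the simple motifs when prune is True."""
    if not prune:
        return list(motif_names)
    # Comparing names case-insensitively is an assumption of this sketch.
    return [name for name in motif_names if name.lower() not in PRUNED_MOTIFS]


# Hypothetical list of matched motif names, for illustration only.
hits = ["ASN_GLYCOSYLATION", "G_PROTEIN_RECEP_F1_1", "PKC_PHOSPHO_SITE", "OPSIN"]

print(prune_hits(hits))               # default behaviour (-prune)
print(prune_hits(hits, prune=False))  # behaviour with -noprune
```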

   Your EMBOSS administrator must have set up the local EMBOSS PROSITE
   database using the utility 'prosextract' before this program will run.

Usage

   Here is a sample session with patmatmotifs:


% patmatmotifs -full 
Search a PROSITE motif database with a protein sequence
Input sequence: tsw:opsd_human
Output report [opsd_human.patmatmotifs]: 
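
   The same session can be driven from a script. The sketch below builds
   a non-interactive command line from qualifiers documented on this
   page (-sequence, -outfile, -full, -noprune and -auto). The helper
   name build_command is hypothetical, and the sketch assumes that
   patmatmotifs is on the PATH and that the 'tsw' example database is
   configured locally.

```python
import shutil
import subprocess


def build_command(sequence, outfile, full=False, prune=True):
    """Build an argument list for patmatmotifs from documented qualifiers."""
    # -auto turns off the interactive prompts.
    cmd = ["patmatmotifs", "-sequence", sequence, "-outfile", outfile, "-auto"]
    if full:
        cmd.append("-full")
    if not prune:
        cmd.append("-noprune")
    return cmd


def run_if_installed(cmd):
    """Run the command only when the program can be found on the PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    # The exit status is documented as always 0, so it is not checked here;
    # inspect the report file instead.
    return subprocess.run(cmd, check=False)


cmd = build_command("tsw:opsd_human", "opsd_human.patmatmotifs", full=True)
print(" ".join(cmd))
```

   Calling run_if_installed(cmd) would then execute the command; it
   returns None when the program is not installed.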


Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          sequence   Sequence USA
  [-outfile]           report     Output report file name

   Additional (Optional) qualifiers:
   -full               boolean    Provide full documentation for matching
                                  patterns
   -[no]prune          boolean    Ignore simple patterns. If this is true then
                                  these simple post-translational
                                  modification sites are not reported:
                                  myristyl, asn_glycosylation,
                                  camp_phospho_site, pkc_phospho_site,
                                  ck2_phospho_site, and tyr_phospho_site.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1             integer    Start of the sequence to be used
   -send1               integer    End of the sequence to be used
   -sreverse1           boolean    Reverse (if DNA)
   -sask1               boolean    Ask for begin/end/reverse
   -snucleotide1        boolean    Sequence is nucleotide
   -sprotein1           boolean    Sequence is protein
   -slower1             boolean    Make lower case
   -supper1             boolean    Make upper case
   -sformat1            string     Input sequence format
   -sdbname1            string     Database name
   -sid1                string     Entryname
   -ufo1                string     UFO features
   -fformat1            string     Features format
   -fopenfile1          string     Features file name

   "-outfile" associated qualifiers
   -rformat2            string     Report format
   -rname2              string     Base file name
   -rextension2         string     File name extension
   -rdirectory2         string     Output directory
   -raccshow2           boolean    Show accession number in the report
   -rdesshow2           boolean    Show description in the report
   -rscoreshow2         boolean    Show the score in the report
   -rusashow2           boolean    Show the full USA in the report

   General qualifiers:
   -auto                boolean    Turn off prompts
   -stdout              boolean    Write standard output
   -filter              boolean    Read standard input, write standard output
   -options             boolean    Prompt for standard and additional values
   -debug               boolean    Write debug output to program.dbg
   -verbose             boolean    Report some/full command line options
   -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning             boolean    Report warnings
   -error               boolean    Report errors
   -fatal               boolean    Report fatal errors
   -die                 boolean    Report deaths


   Standard (Mandatory) qualifiers
   Qualifier                  Description               Allowed values        Default
   [-sequence] (Parameter 1)  Sequence USA              Readable sequence     Required
   [-outfile] (Parameter 2)   Output report file name   Report output file

   Additional (Optional) qualifiers
   Qualifier                  Description               Allowed values        Default
   -full                      Provide full              Boolean value Yes/No  No
                              documentation for
                              matching patterns
   -[no]prune                 Ignore simple patterns.   Boolean value Yes/No  Yes
                              If this is true then
                              these simple
                              post-translational
                              modification sites are
                              not reported: myristyl,
                              asn_glycosylation,
                              camp_phospho_site,
                              pkc_phospho_site,
                              ck2_phospho_site, and
                              tyr_phospho_site.

   Advanced (Unprompted) qualifiers
   (none)

Input file format

   patmatmotifs reads a protein sequence USA.

  Input files for usage example

   'tsw:opsd_human' is a sequence entry in the example protein database
   'tsw'

  Database entry: tsw:opsd_human

ID   OPSD_HUMAN     STANDARD;      PRT;   348 AA.
AC   P08100; Q16414;
DT   01-AUG-1988 (Rel. 08, Created)
DT   01-AUG-1988 (Rel. 08, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
DE   RHODOPSIN.
GN   RHO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 84272729.
RA   NATHANS J., HOGNESS D.S.;
RT   "Isolation and nucleotide sequence of the gene encoding human
RT   rhodopsin.";
RL   Proc. Natl. Acad. Sci. U.S.A. 81:4851-4855(1984).
RN   [2]
RP   SEQUENCE OF 1-120 FROM N.A.
RA   BENNETT J., BELLER B., SUN D., KARIKO K.;
RL   Submitted (NOV-1994) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   REVIEW ON ADRP VARIANTS.
RX   MEDLINE; 94004905.
RA   AL-MAGHTHEH M., GREGORY C., INGLEHEARN C., HARDCASTLE A.,
RA   BHATTACHARYA S.;
RT   "Rhodopsin mutations in autosomal dominant retinitis pigmentosa.";
RL   Hum. Mutat. 2:249-255(1993).
RN   [4]
RP   VARIANT ADRP HIS-23.
RX   MEDLINE; 90136922.
RA   DRYJA T.P., MCGEE T.L., REICHEI E., HAHN L.B., COWLEY G.S.,
RA   YANDELL D.W., SANDBERG M.A., BERSON E.L.;
RT   "A point mutation of the rhodopsin gene in one form of retinitis
RT   pigmentosa.";
RL   Nature 343:364-366(1990).
RN   [5]
RP   VARIANTS ADRP.
RX   MEDLINE; 91051574.
RA   FARRAR G.J., KENNA P., REDMOND R., MCWILLIAM P., BRADLEY D.G.,
RA   HUMPHRIES M.M., SHARP E.M., INGLEHEARN C.F., BASHIR R., JAY M.,
RA   WATTY A., LUDWIG M., SCHINZEL A., SAMANNS C., GAL A.,
RA   BHATTACHARYA S.S., HUMPHRIES P.;
RT   "Autosomal dominant retinitis pigmentosa: absence of the rhodopsin
RT   proline-->histidine substitution (codon 23) in pedigrees from
RT   Europe.";
RL   Am. J. Hum. Genet. 47:941-945(1990).
RN   [6]
RP   VARIANTS ADRP HIS-23; ARG-58; LEU-347 AND SER-347.
RX   MEDLINE; 91015273.


  [Part of this file has been deleted for brevity]

FT                                /FTId=VAR_004816.
FT   VARIANT     209    209       V -> M (EFFECT NOT KNOWN).
FT                                /FTId=VAR_004817.
FT   VARIANT     211    211       H -> P (IN ADRP).
FT                                /FTId=VAR_004818.
FT   VARIANT     211    211       H -> R (IN ADRP).
FT                                /FTId=VAR_004819.
FT   VARIANT     216    216       M -> K (IN ADRP).
FT                                /FTId=VAR_004820.
FT   VARIANT     220    220       F -> C (IN ADRP).
FT                                /FTId=VAR_004821.
FT   VARIANT     222    222       C -> R (IN ADRP).
FT                                /FTId=VAR_004822.
FT   VARIANT     255    255       MISSING (IN ADRP).
FT                                /FTId=VAR_004823.
FT   VARIANT     264    264       MISSING (IN ADRP).
FT                                /FTId=VAR_004824.
FT   VARIANT     267    267       P -> L (IN ADRP).
FT                                /FTId=VAR_004825.
FT   VARIANT     267    267       P -> R (IN ADRP).
FT                                /FTId=VAR_004826.
FT   VARIANT     292    292       A -> E (IN CSNB4).
FT                                /FTId=VAR_004827.
FT   VARIANT     296    296       K -> E (IN ADRP).
FT                                /FTId=VAR_004828.
FT   VARIANT     297    297       S -> R (IN ADRP).
FT                                /FTId=VAR_004829.
FT   VARIANT     342    342       T -> M (IN ADRP).
FT                                /FTId=VAR_004830.
FT   VARIANT     345    345       V -> L (IN ADRP).
FT                                /FTId=VAR_004831.
FT   VARIANT     345    345       V -> M (IN ADRP).
FT                                /FTId=VAR_004832.
FT   VARIANT     347    347       P -> A (IN ADRP).
FT                                /FTId=VAR_004833.
FT   VARIANT     347    347       P -> L (IN ADRP; COMMON VARIANT).
FT                                /FTId=VAR_004834.
FT   VARIANT     347    347       P -> Q (IN ADRP).
FT                                /FTId=VAR_004835.
FT   VARIANT     347    347       P -> R (IN ADRP).
FT                                /FTId=VAR_004836.
FT   VARIANT     347    347       P -> S (IN ADRP).
FT                                /FTId=VAR_004837.
SQ   SEQUENCE   348 AA;  38892 MW;  07443BEA CRC32;
     MNGTEGPNFY VPFSNATGVV RSPFEYPQYY LAEPWQFSML AAYMFLLIVL GFPINFLTLY
     VTVQHKKLRT PLNYILLNLA VADLFMVLGG FTSTLYTSLH GYFVFGPTGC NLEGFFATLG
     GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLAGWSRYIP
     EGLQCSCGID YYTLKPEVNN ESFVIYMFVV HFTIPMIIIF FCYGQLVFTV KEAAAQQQES
     ATTQKAEKEV TRMVIIMVIA FLICWVPYAS VAFYIFTHQG SNFGPIFMTI PAFFAKSAAI
     YNPVIYIMMN KQFRNCMLTT ICCGKNPLGD DEASATVSKT ETSQVAPA
//

Output file format

   The output is a standard EMBOSS report file.

   The results can be output in one of several styles by using the
   command-line qualifier -rformat xxx, where 'xxx' is replaced by the
   name of the required format. The available format names are: embl,
   genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel,
   feattable, motif, regions, seqtable, simple, srs, table, tagseq.

   See: http://emboss.sf.net/docs/themes/ReportFormats.html for further
   information on report formats.

   By default patmatmotifs writes a 'dbmotif' report file.
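
   In a 'dbmotif' report each hit appears as a short block of 'Length',
   'Start', 'End' and 'Motif' lines, as in the example output for
   opsd_human on this page. The sketch below extracts those fields with
   a regular expression. It assumes the layout of that example; the
   names HIT_RE and parse_hits are hypothetical, and other report
   formats would need a different parser.

```python
import re

# Hit blocks copied from the example report for tsw:opsd_human.
REPORT = """\
Length = 17
Start = position 123 of sequence
End = position 139 of sequence

Motif = G_PROTEIN_RECEP_F1_1

Length = 17
Start = position 290 of sequence
End = position 306 of sequence

Motif = OPSIN
"""

# One hit: the four labelled lines, separated by any amount of whitespace.
HIT_RE = re.compile(
    r"Length = (?P<length>\d+)\s+"
    r"Start = position (?P<start>\d+) of sequence\s+"
    r"End = position (?P<end>\d+) of sequence\s+"
    r"Motif = (?P<motif>\S+)"
)


def parse_hits(text):
    """Return a list of (motif, start, end, length) tuples from report text."""
    return [
        (m["motif"], int(m["start"]), int(m["end"]), int(m["length"]))
        for m in HIT_RE.finditer(text)
    ]


for hit in parse_hits(REPORT):
    print(hit)
```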

  Output files for usage example

  File: opsd_human.patmatmotifs

########################################
# Program: patmatmotifs
# Rundate: Fri Jul 15 2005 12:00:00
# Report_format: dbmotif
# Report_file: opsd_human.patmatmotifs
########################################

#=======================================
#
# Sequence: OPSD_HUMAN     from: 1   to: 348
# HitCount: 2
#
# Full: Yes
# Prune: Yes
# Data_file: ../prosextract-keep/PROSITE/prosite.lines
#
#=======================================

Length = 17
Start = position 123 of sequence
End = position 139 of sequence

Motif = G_PROTEIN_RECEP_F1_1

TLGGEIALWSLVVLAIERYVVVCKPMS
     |               |
   123               139

Length = 17
Start = position 290 of sequence
End = position 306 of sequence

Motif = OPSIN

PIFMTIPAFFAKSAAIYNPVIYIMMNK
     |               |
   290               306


#---------------------------------------
#
# Motif: G_PROTEIN_RECEP_F1_1
# Count: 1
#
# *****************************************
# * G-protein coupled receptors signature *
# *****************************************
#
# G-protein coupled receptors [1 to 4,E1,E2] (also called R7G) are  an extensive
# group of  hormones,  neurotransmitters,  odorants  and  light  receptors which


  [Part of this file has been deleted for brevity]

# Count: 1
#
# *************************************************
# * Visual pigments (opsins) retinal binding site *
# *************************************************
#
# Visual pigments [1,2] are the light-absorbing  molecules that  mediate vision.
# They consist of  an apoprotein, opsin,  covalently  linked  to the chromophore
# cis-retinal.  Vision is  effected through  the absorption of a  photon by cis-
# retinal  which is isomerized to  trans-retinal.  This isomerization leads to a
# change  of conformation  of the protein. Opsins are integral membrane proteins
# with  seven transmembrane regions that belong to family 1 of G-protein coupled
# receptors (see <PDOC00210>).
#
# In vertebrates four different pigments are generally found.   Rod cells, which
# mediate vision in dim light, contain the pigment rhodopsin.  Cone cells, which
# function in bright light, are responsible  for  color vision and contain three
# or more color pigments (for example, in mammals: red, blue and green).
#
# In Drosophila, the  eye   is composed   of 800   facets  or   ommatidia.  Each
# ommatidium contains eight photoreceptor cells (R1-R8):  the R1 to R6 cells are
# outer cells,  R7  and R8 inner cells. Each of the three types of cells (R1-R6,
# R7 and R8) expresses a specific opsin.
#
# Proteins evolutionary related to opsins include squid retinochrome, also known
# as retinal  photoisomerase, which converts various isomers of retinal into 11-
# cis retinal and mammalian retinal pigment  epithelium (RPE) RGR [3], a protein
# that may also act in retinal isomerization.
#
# The attachment  site  for  retinal in the above proteins is a conserved lysine
# residue in  the  middle  of  the  seventh  transmembrane helix. The pattern we
# developed includes this residue.
#
# -Consensus pattern: [LIVMWAC]-[PGAC]-x(3)-[SAC]-K-[STALIMR]-[GSACPNV]-[STACP]-
#                     x(2)-[DENF]-[AP]-x(2)-[IY]
#                     [K is the retinal binding site]
# -Sequences known to belong to this class detected by the pattern: ALL.
# -Other sequence(s) detected in SWISS-PROT: NONE.
# -Last update: July 1998 / Pattern and text revised.
#
# [ 1] Applebury M.L., Hargrave P.A.
#      Vision Res. 26:1881-1895(1986).
# [ 2] Fryxell K.J., Meyerowitz E.M.
#      J. Mol. Evol. 33:367-378(1991).
# [ 3] Shen D., Jiang M., Hao W., Tao L., Salazar M., Fong H.K.W.
#      Biochemistry 33:13117-13125(1994).
#
# ***************
#
#
#---------------------------------------
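
   The consensus pattern quoted in the OPSIN documentation above can be
   checked by hand against the example sequence. The sketch below
   translates a simple PROSITE pattern into a Python regular expression
   and reports matches with 1-based positions. It illustrates the
   pattern syntax only and is not how patmatmotifs itself is
   implemented; the function names prosite_to_regex and find_matches
   are hypothetical, and anchors ('<' and '>') are not handled.

```python
import re

# Sequence lines copied from the tsw:opsd_human entry; whitespace is removed.
SEQ = "".join("""
     MNGTEGPNFY VPFSNATGVV RSPFEYPQYY LAEPWQFSML AAYMFLLIVL GFPINFLTLY
     VTVQHKKLRT PLNYILLNLA VADLFMVLGG FTSTLYTSLH GYFVFGPTGC NLEGFFATLG
     GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLAGWSRYIP
     EGLQCSCGID YYTLKPEVNN ESFVIYMFVV HFTIPMIIIF FCYGQLVFTV KEAAAQQQES
     ATTQKAEKEV TRMVIIMVIA FLICWVPYAS VAFYIFTHQG SNFGPIFMTI PAFFAKSAAI
     YNPVIYIMMN KQFRNCMLTT ICCGKNPLGD DEASATVSKT ETSQVAPA
""".split())

OPSIN_PATTERN = (
    "[LIVMWAC]-[PGAC]-x(3)-[SAC]-K-[STALIMR]-[GSACPNV]-[STACP]-"
    "x(2)-[DENF]-[AP]-x(2)-[IY]"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(3) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {ABC}: any residue except A, B, C
        # [ABC] classes and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (start, end, text) for every match, using 1-based positions."""
    # The lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [
        (m.start() + 1, m.start() + len(m.group(1)), m.group(1))
        for m in regex.finditer(sequence)
    ]


print(prosite_to_regex(OPSIN_PATTERN))
print(find_matches(OPSIN_PATTERN, SEQ))
```

   For this sequence the sketch finds a single match at positions 290 to
   306, which agrees with the OPSIN hit in the report; the lysine at
   position 296 is the retinal binding site named in the pattern.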

Data files

   Data and documentation from the PROSITE files are read automatically.
   These files must have been generated and formatted by running
   prosextract before patmatmotifs is run.

Notes

   patmatmotifs can only be used once prosextract has been run to set up
   the local PROSITE data.

References

   If you want to refer to PROSITE in a publication you can do so by
   citing:

   Bairoch A., Bucher P., Hofmann K. The PROSITE database, its status in
   1997. Nucleic Acids Res. 25:217-221(1997).

   Other references:

    1. Bairoch, A., Bucher P. (1994) PROSITE: recent developments.
       Nucleic Acids Research, Vol 22, No.17 3583-3589.
    2. Bairoch, A., (1992) PROSITE: a dictionary of sites and patterns in
       proteins. Nucleic Acids Research, Vol 20, Supplement, 2013-2018.
    3. Peek, J., O'Reilly, T., Loukides, M., (1997) Unix Power Tools, 2nd
       Edition.
    4. Doolittle R.F. (In) Of URFs and ORFs: a primer on how to analyze
       derived amino acid sequences., University Science Books, Mill
       Valley, California, (1986).
    5. Lesk A.M. (In) Computational Molecular Biology, Lesk A.M., Ed.,
       pp17-26, Oxford University Press, Oxford (1988).

Warnings

   Your EMBOSS administrator must have set up the local EMBOSS PROSITE
   database using the utility 'prosextract' before this program will run.

Diagnostic Error Messages

   The error message:

"Either EMBOSS_DATA undefined or PROSEXTRACT needs running"

   indicates that your local EMBOSS administrator has not yet correctly
   set up the local EMBOSS PROSITE database using the utility
   'prosextract'.
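
   A quick first check for this situation can be scripted. The sketch
   below looks for the extracted file under the directory named by the
   EMBOSS_DATA environment variable. The layout
   <EMBOSS_DATA>/PROSITE/prosite.lines is an assumption based on the
   Data_file path shown in the example report, the function name
   prosite_data_status is hypothetical, and EMBOSS may also locate its
   data directory by other means, so a negative result here is only a
   hint.

```python
import os
from pathlib import Path


def prosite_data_status(environ=os.environ):
    """Describe whether extracted PROSITE data is present under EMBOSS_DATA."""
    data_dir = environ.get("EMBOSS_DATA")
    if not data_dir:
        return "EMBOSS_DATA is not set"
    # Assumed layout: <EMBOSS_DATA>/PROSITE/prosite.lines
    lines_file = Path(data_dir) / "PROSITE" / "prosite.lines"
    if not lines_file.is_file():
        return f"{lines_file} not found: prosextract may need running"
    return f"found {lines_file}"


print(prosite_data_status())
```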

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

    Program name                         Description
   antigenic      Finds antigenic sites in proteins
   digest         Protein proteolytic enzyme or reagent cleavage digest
   epestfind      Finds PEST motifs as potential proteolytic cleavage sites
   fuzzpro        Protein pattern search
   fuzztran       Protein pattern search after translation
   helixturnhelix Report nucleic acid binding motifs
   oddcomp        Find protein sequence regions with a biased composition
   patmatdb       Search a protein sequence with a motif
   pepcoil        Predicts coiled coil regions
   preg           Regular expression search of a protein sequence
   pscan          Scans proteins using PRINTS
   sigcleave      Reports protein signal cleavage sites

Author(s)

   Sinead O'Leary (current e-mail address unknown)
   while she was at:
   HGMP-RC, Genome Campus, Hinxton, Cambridge CB10 1SB, UK

History

   Completed May 13 1999.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None
