SIGGEN documentation
TY   SCOP
XX
TS   1D
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
FA   Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
SI   54894
XX
NP   15
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   H ; 2
XX
GA   12 ; 2
XX
NN   [2]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   1 ; 2
XX
NN   [3]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   26 ; 2
XX
NN   [4]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   F ; 2
XX
GA   16 ; 2
XX
NN   [5]
XX
[Part of this file has been deleted for brevity]
XX
GA   4 ; 2
XX
NN   [10]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 2
XX
GA   2 ; 2
XX
NN   [11]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   N ; 2
XX
GA   0 ; 2
XX
NN   [12]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   Y ; 2
XX
GA   0 ; 2
XX
NN   [13]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   G ; 2
XX
GA   3 ; 2
XX
NN   [14]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   3 ; 2
XX
NN   [15]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   2 ; 2
//
TY   SCOP
XX
TS   1D
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
SI   55074
XX
NP   38
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 2
XX
GA   13 ; 2
XX
NN   [2]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   V ; 2
XX
GA   1 ; 2
XX
NN   [3]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   F ; 2
XX
GA   3 ; 2
XX
NN   [4]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 2
XX
GA   1 ; 2
XX
NN   [5]
XX
[Part of this file has been deleted for brevity]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   L ; 2
XX
GA   4 ; 2
XX
NN   [34]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   E ; 1
AA   D ; 1
XX
GA   8 ; 2
XX
NN   [35]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   V ; 2
XX
GA   0 ; 2
XX
NN   [36]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   I ; 1
AA   V ; 1
XX
GA   17 ; 2
XX
NN   [37]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   F ; 1
AA   Y ; 1
XX
GA   2 ; 2
XX
NN   [38]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   I ; 1
AA   L ; 1
XX
GA   1 ; 2
//
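The record structure of the signature files shown above can be read with a few lines of Python. The following is a minimal sketch, not part of EMBOSS. It assumes that each record sits on its own line, that 'XX' lines are separators, and that '//' ends the entry. Reading the second number of the AA and GA records as an observation count is an interpretation of the example, not something stated here. The names `SigPosition` and `parse_signature` are hypothetical, and `SAMPLE` is a truncated excerpt of the first example file (only positions 1 and 2).

```python
from dataclasses import dataclass, field


@dataclass
class SigPosition:
    """One signature position (an NN block). Hypothetical helper type."""
    number: int
    window: int = 0
    residues: dict = field(default_factory=dict)  # residue letter -> count (assumed meaning)
    gaps: dict = field(default_factory=dict)      # gap length -> count (assumed meaning)


def parse_signature(text):
    """Parse one signature entry into (header, positions)."""
    header = {}
    positions = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "XX":
            continue                      # blank lines and separators
        if line == "//":
            break                         # end of entry
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "NN":
            current = SigPosition(number=int(value.strip("[]")))
            positions.append(current)
        elif code == "IN" and current is not None:
            # e.g. "NRES 1 ; NGAP 1 ; WSIZ 0"
            for item in value.split(";"):
                key, num = item.split()
                if key == "WSIZ":
                    current.window = int(num)
        elif code in ("AA", "GA") and current is not None:
            key, count = (part.strip() for part in value.split(";"))
            if code == "AA":
                current.residues[key] = int(count)
            else:
                current.gaps[int(key)] = int(count)
        elif current is None:
            header[code] = value          # TY, TS, CL, FO, SF, FA, SI, NP
    return header, positions


# Truncated excerpt of the first example file above (positions 1 and 2 only).
SAMPLE = """\
TY   SCOP
XX
SI   54894
XX
NP   15
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   H ; 2
XX
GA   12 ; 2
XX
NN   [2]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   1 ; 2
//
"""

header, positions = parse_signature(SAMPLE)
```

Lines that carry no recognised record code inside a position block are ignored, so the sketch tolerates annotations but does not validate the file.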
Standard (Mandatory) qualifiers (* if not always prompted):
[-algpath] dirlist This option specifies the location of DAF
files (domain alignment files) (input). A
'domain alignment file' contains a sequence
alignment of domains belonging to the same
SCOP or CATH family (or other node in the
structural hierarchies). The file is in DAF
format (CLUSTAL-like) and is annotated with
domain family classification information.
The files generated by using SCOPALIGN will
contain a structure-based sequence alignment
of domains of known structure only. Such
alignments can be extended with sequence
relatives (of unknown structure) by using
SEQALIGN.
-mode menu This option specifies the mode of signature
generation. There are 3 modes of signature
generation: (1) Use positions specified
in the alignment file. The alignment file must
contain a line beginning with the text
'Positions' for each line of the alignment.
A '1' in the 'Positions' line indicates that
the signature should include data from the
corresponding alignment site. The signature
will only include the positions that are
marked with a '1'. (2) Use a scoring method.
The alignment is scored (see 'Algorithm')
and a signature of the specified sparsity is
sampled from high-scoring positions. (3)
Generate a randomised signature. A signature
of the specified sparsity is sampled at
random from the alignment.
* -conoption menu This option specifies the structure-based
scoring scheme. SIGGEN provides 2
structure-based scoring schemes (plus a
combination method) that are used to score
the input alignment.
* -conpath directory This option specifies the location of CON
files (contact files) (input). A 'contact
file' contains contact data for a protein or
a domain from SCOP or CATH, in the CON
format (EMBL-like). The contacts may be
intra-chain residue-residue, inter-chain
residue-residue or residue-ligand. The files
are generated by using CONTACTS, INTERFACE
and SITES.
* -cpdbpath directory This option specifies the location of domain
CCF files (clean coordinate files) (input).
A 'clean coordinate file' contains protein
coordinate and derived data for a single PDB
file ('protein clean coordinate file') or a
single domain from SCOP or CATH ('domain
clean coordinate file'), in CCF format
(EMBL-like). The files, generated by using
PDBPARSE (PDB files) or DOMAINER (domains),
contain 'cleaned-up' data that is
self-consistent and error-corrected. Records
for residue solvent accessibility and
secondary structure are added to the file by
using PDBPLUS.
* -seqoption menu This option specifies the sequence-based
scoring scheme. SIGGEN provides 2
sequence-based scoring schemes that are used
to score the input alignment.
* -datafile matrixf This option specifies the substitution
matrix. The substitution matrix is used by
the sequence-based scoring schemes.
* -sparsity integer This option specifies the % sparsity of
the signature. The signature sparsity is a
user-defined parameter that determines how
many residues the final signature will
contain. For example, if the average
sequence length of the proteins in the
alignment is 250 residues, then a signature
of sparsity 10% (default value) will contain
25 key residues or signature positions,
corresponding to the 25 highest-scoring
alignment positions.
-wsiz integer This option specifies the window size. When
a signature is aligned to a protein
sequence, the permissible gaps between two
signature positions are determined by the
empirical gaps and the window size. The user
is prompted for a single window size that is
used for every position in the signature,
which is unlikely to be optimal. A future
implementation will provide a range of
methods for generating values of window size
depending upon the alignment (window size is
identified by the WSIZ record in the
signature output file).
* -filtercon toggle This option specifies whether to disregard
positions that form only a few contacts
during the selection of signature positions.
* -conthresh integer This option specifies the threshold contact
number. This controls the selection of key
positions for the structure-based scoring
scheme (number of contacts).
* -[no]filterpsim boolean This option specifies whether to disregard
alignment sites that were not aligned
satisfactorily (STAMP alignments only).
[-sigoutdir] outdir This option specifies the location of
signature files (output). A 'signature file'
contains a sparse sequence signature
suitable for use with the SIGSCAN and SIGGEN
programs. The files are generated by using
SIGGEN & SIGGENLIG.
Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers: (none)
Associated qualifiers: (none)
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report deaths
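The arithmetic behind -sparsity, and the difference between modes 2 and 3, can be illustrated with a short Python sketch. This is not SIGGEN's code. The function names are hypothetical, the rounding rule is an assumption (this page does not say how a fractional position count is handled), and the scores in the usage note are invented purely for illustration.

```python
import random


def signature_size(avg_length, sparsity_percent=10):
    """Number of signature positions for a given average sequence length.

    Assumption: a fractional result is rounded to the nearest integer.
    """
    return round(avg_length * sparsity_percent / 100)


def top_positions(scores, n):
    """Mode 2 idea: keep the n highest-scoring alignment positions.

    Returns the chosen column indices in alignment order.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


def random_positions(length, n, seed=None):
    """Mode 3 idea: sample n alignment positions at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(length), n))
```

With the figures used in the -sparsity description, `signature_size(250, 10)` gives 25. For a toy score list such as `[0.1, 0.9, 0.5, 0.7]`, `top_positions(scores, 2)` keeps columns 1 and 3.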
| Standard (Mandatory) qualifiers | Description | Allowed values | Default |
|---|---|---|---|
| [-algpath] (Parameter 1) | This option specifies the location of DAF files (domain alignment files) (input). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is in DAF format (CLUSTAL-like) and is annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN. | Directory with files | ./ |
| -mode | This option specifies the mode of signature generation. There are 3 modes of signature generation: (1) Use positions specified in the alignment file. The alignment file must contain a line beginning with the text 'Positions' for each line of the alignment. A '1' in the 'Positions' line indicates that the signature should include data from the corresponding alignment site. The signature will only include the positions that are marked with a '1'. (2) Use a scoring method. The alignment is scored (see 'Algorithm') and a signature of the specified sparsity is sampled from high-scoring positions. (3) Generate a randomised signature. A signature of the specified sparsity is sampled at random from the alignment. | | 1 |
| -conoption | This option specifies the structure-based scoring scheme. SIGGEN provides 2 structure-based scoring schemes (plus a combination method) that are used to score the input alignment. | | 5 |
| -conpath | This option specifies the location of CON files (contact files) (input). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and SITES. | Directory | ./ |
| -cpdbpath | This option specifies the location of domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. | Directory | ./ |
| -seqoption | This option specifies the sequence-based scoring scheme. SIGGEN provides 2 sequence-based scoring schemes that are used to score the input alignment. | | 3 |
| -datafile | This option specifies the substitution matrix. The substitution matrix is used by the sequence-based scoring schemes. | Comparison matrix file in EMBOSS data path | EBLOSUM62 |
| -sparsity | This option specifies the % sparsity of the signature. The signature sparsity is a user-defined parameter that determines how many residues the final signature will contain. For example, if the average sequence length of the proteins in the alignment is 250 residues, then a signature of sparsity 10% (default value) will contain 25 key residues or signature positions, corresponding to the 25 highest-scoring alignment positions. | Any integer value | 10 |
| -wsiz | This option specifies the window size. When a signature is aligned to a protein sequence, the permissible gaps between two signature positions are determined by the empirical gaps and the window size. The user is prompted for a single window size that is used for every position in the signature, which is unlikely to be optimal. A future implementation will provide a range of methods for generating values of window size depending upon the alignment (window size is identified by the WSIZ record in the signature output file). | Any integer value | 0 |
| -filtercon | This option specifies whether to disregard positions that form only a few contacts during the selection of signature positions. | Toggle value Yes/No | No |
| -conthresh | This option specifies the threshold contact number. This controls the selection of key positions for the structure-based scoring scheme (number of contacts). | Any integer value | 10 |
| -[no]filterpsim | This option specifies whether to disregard alignment sites that were not aligned satisfactorily (STAMP alignments only). | Boolean value Yes/No | Yes |
| [-sigoutdir] (Parameter 2) | This option specifies the location of signature files (output). A 'signature file' contains a sparse sequence signature suitable for use with the SIGSCAN and SIGGEN programs. The files are generated by using SIGGEN & SIGGENLIG. | Output directory | ./ |
| Additional (Optional) qualifiers | | Allowed values | Default |
| (none) | | | |
| Advanced (Unprompted) qualifiers | | Allowed values | Default |
| (none) | | | |
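Mode 1 of -mode takes its signature positions from a 'Positions' line in the alignment, where a '1' marks a site to include. The idea can be sketched in Python as below. The exact layout of a DAF file is not reproduced on this page, so the sketch assumes the 'Positions' mask and the aligned sequences have already been extracted as plain strings of equal length. The function names and the toy alignment are hypothetical.

```python
def marked_columns(mask):
    """Indices of alignment sites flagged with '1' in a 'Positions' mask."""
    return [i for i, flag in enumerate(mask) if flag == "1"]


def residues_at_marked_sites(rows, mask):
    """For each marked site, collect the residue from every aligned sequence.

    rows: aligned sequences as equal-length strings (gaps as '-').
    mask: string of the same length; '1' marks a signature position.
    """
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("mask and alignment rows must have the same length")
    return [[row[col] for row in rows] for col in marked_columns(mask)]


# Toy data, invented for illustration only.
rows = ["HPAF-D", "HPGFKD"]
mask = "100101"
columns = residues_at_marked_sites(rows, mask)
```

Here sites 0, 3 and 5 are marked, so `columns` holds the residues observed at those three sites, one list per signature position.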
% siggen
Generates a sparse protein signature from an alignment.
Location of DAF files (domain alignment files) (input) [./]: ../domainalign-keep/daf
Specify mode of signature generation
1 : Use positions specified in alignment file
2 : Use a scoring method
3 : Generate a randomised signature
Select number [1]: 2
Residue contacts scoring method
1 : Number
2 : Conservation
3 : Number and conservation
4 : None (structural data available)
5 : None (no structural data available)
Select number [5]: 5
Sequence variability scoring method
1 : Substitution matrix
2 : Residue class
3 : None
Select number [3]: 1
Substitution matrix to be used [EBLOSUM62]: EBLOSUM62
The % sparsity of signature [10]: 15
Window size [0]: 0
Ignore alignment postitions with post_similar value of 0 [Y]: Y
Location of signature files (output) [./]:
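The answers given in the interactive session above can also be supplied as command-line qualifiers. The Python sketch below only assembles such a command line from the qualifiers documented on this page. It has not been run against an EMBOSS installation, the function name `build_siggen_command` is hypothetical, and passing menu choices as their numbers mirrors the session rather than a verified rule.

```python
import shlex


def build_siggen_command(algpath, sigoutdir, mode=2, conoption=5, seqoption=1,
                         datafile="EBLOSUM62", sparsity=15, wsiz=0,
                         filterpsim=True):
    """Assemble a non-interactive siggen command as an argument list.

    Defaults reproduce the answers typed in the example session.
    """
    args = [
        "siggen",
        "-algpath", algpath,
        "-mode", str(mode),
        "-conoption", str(conoption),
        "-seqoption", str(seqoption),
        "-datafile", datafile,
        "-sparsity", str(sparsity),
        "-wsiz", str(wsiz),
    ]
    # Boolean qualifier: the documented negated form is -nofilterpsim.
    args.append("-filterpsim" if filterpsim else "-nofilterpsim")
    args += ["-sigoutdir", sigoutdir, "-auto"]  # -auto turns off prompts
    return args


command = build_siggen_command("../domainalign-keep/daf", "./")
command_line = shlex.join(command)  # shell-quoted string for display
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where siggen is installed.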
| FILE TYPE | FORMAT | DESCRIPTION | CREATED BY | SEE ALSO |
|---|---|---|---|---|
| Clean coordinate file (for domain) | CCF format (EMBL-like). | Protein coordinate and derived data for a single domain from SCOP or CATH. The data are 'cleaned-up': self-consistent and error-corrected. | DOMAINER | Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. |
| Contact file (intra-chain residue-residue contacts) | CON format (EMBL-like). | Intra-chain residue-residue contact data for a protein or a domain from SCOP or CATH. | CONTACTS | N.A. |
| Domain alignment file | DAF format (CLUSTAL-like). | Sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is annotated with domain family classification information. | DOMAINALIGN (structure-based sequence alignment of domains of known structure). | DOMAINALIGN alignments can be extended with sequence relatives (of unknown structure) to the family in question by using SEQALIGN. |
| Signature file | SIG format | Contains a sparse sequence signature suitable for use with the SIGSCAN program. | SIGGENLIG, LIBGEN | The files are generated by using SIGGEN. |
| Program name | Description |
|---|---|
| contactcount | Count specific versus non-specific contacts |
| contacts | Generate intra-chain CON files from CCF files |
| domainalign | Generate alignments (DAF file) for nodes in a DCF file |
| domainrep | Reorder DCF file to identify representative structures |
| domainreso | Remove low resolution domains from a DCF file |
| interface | Generate inter-chain CON files from CCF files |
| libgen | Generate discriminating elements from alignments |
| matgen3d | Generate a 3D-1D scoring matrix from CCF files |
| psiphi | Phi and psi torsion angles from protein coordinates |
| rocon | Generates a hits file from comparing two DHF files |
| rocplot | Performs ROC analysis on hits files |
| scorecmapdir | Contact scores for cleaned protein chain contact files |
| seqalign | Extend alignments (DAF file) with sequences (DHF file) |
| seqfraggle | Removes fragment sequences from DHF files |
| seqsearch | Generate PSI-BLAST hits (DHF file) from a DAF file |
| seqsort | Remove ambiguous classified sequences from DHF files |
| seqwords | Generates DHF files from keyword search of UniProt |
| siggenlig | Generate ligand-binding signatures from a CON file |
| sigscan | Generate hits (DHF file) from a signature search |
| sigscanlig | Search ligand-signature library & write hits (LHF file) |
See also http://emboss.sourceforge.net/
Automatic generation and evaluation of sparse protein signatures for families of protein
structural domains. MJ Blades, JC Ison, R Ranasinghe, and JBC Findlay. Protein Science. 2005 (accepted)
A key residues approach to the definition of protein families and analysis
of sparse family signatures. JC Ison, AJ Bleasby, MJ Blades, SC Daniel,
JH Parish, JBC Findlay. PROTEINS: Structure, Function & Genetics. 2000,
40:330-341
Alignment of a sparse protein signature with protein sequences: application
to fold prediction for three small globulins. SC Daniel, JH Parish,
JC Ison, MJ Blades & JBC Findlay. FEBS Letters. 1999, 459:349-352.