
                           SEQSORT documentation
                                      
   

CONTENTS

   1.0 SUMMARY 
   2.0 INPUTS & OUTPUTS 
   3.0 INPUT FILE FORMAT 
   4.0 OUTPUT FILE FORMAT 
   5.0 DATA FILES 
   6.0 USAGE 
   7.0 KNOWN BUGS & WARNINGS 
   8.0 NOTES 
   9.0 DESCRIPTION 
   10.0 ALGORITHM 
   11.0 RELATED APPLICATIONS 
   12.0 DIAGNOSTIC ERROR MESSAGES 
   13.0 AUTHORS 
   14.0 REFERENCES 

1.0 SUMMARY

   Remove ambiguously classified sequences from DHF files

2.0 INPUTS & OUTPUTS

   SEQSORT reads a directory of DHF files (domain hits files), where each
   file contains hits to a single SCOP family. It compares, processes and
   collates the hits and writes a directory of DHF files that contain
   only those hits that could be uniquely assigned to a SCOP family.
   Optionally, two further files of hits are written: (i) a domain
   families file, of hits from ALL the input files that could be uniquely
   assigned to a SCOP family, and (ii) a domain ambiguities file, of hits
   from ALL the input files that are of ambiguous family assignment and
   are assigned as relatives to a SCOP superfamily or fold instead.
   The paths for the domain hits files (input and output) and the names of
   the output files are specified by the user. The file extension of the
   DHF files is set in the ACD file.

3.0 INPUT FILE FORMAT

   The format of the domain hits file is described in the SEQSEARCH
   documentation.

4.0 OUTPUT FILE FORMAT

   The format of the domain hits file is described in the SEQSEARCH
   documentation.
   The domain families file and domain ambiguities file also use the DHF
   format. Whereas a DHF file normally contains hits for a single node
   from SCOP or CATH, the families and ambiguities files may contain
   domains from multiple different families (domain families file), or
   superfamilies or folds (ambiguities file). Domains of the same node
   (e.g. family) are grouped together in blocks, i.e. all hits for
   node A, then all hits for node B and so on (see Figure 1).

  Output files for usage example

  File: fam.dhf

> Q9YBD5^.^1^95^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^55.30^0.000e+00^2.00
0e-11
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFN
VIEDYKVVEKVKLKLP
> Q97FS4^.^1^90^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^42.60^0.000e+00^1.00
0e-07
INSIKNGIVIDHIKAGHGIKIYNYLKLGEAEFPTALIMNAISKKNKAKDIIKIENVMDLDLAVLGFLDPNITVNIIEDE
KIRQKIQLKLP
> Q7MX57^.^1^92^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^72.70^0.000e+00^1.00
0e-16
VAAIRNGIVIDHIPPTKLFKVATLLQLDDLDKRITIGNNLRSRSHGSKGVIKIEDKTFEEEELNRIALIAPNVRLNIIR
DYEVVEKRQVEVP
> P96111^.^1^98^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^42.20^0.000e+00^1.00
0e-07
GIKPIENGTVIDHIAKGKTPEEIYSTILKIRKILRLYDVDSADGIFRSSDGSFKGYISLPDRYLSKKEIKKLSAISPNT
TVNIIKNSTVVEKYRIKLP
> Q08462^.^1^167^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^49.70^0.000e+00^3.000e-09
DCVCVMFASIPDFKEFYTESDVNKEGLECLRLLNEIIADFDDLLSKPKFSGVEKIKTIGSTYMAATGLSAVPSQEHSQE
PERQYMHIGTMVEFAFALVGKLDAINKHSFNDFKLRVGINHGPVIAGVIGAQKPQYDIWGNTVNVASRMDSTGVLDKIQ
VTEETSLVL
> Q03101^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^70.90^0.000e+00^1.000e-15
NNACVFFLDIAGFTRFSSIHSPEQVIQVLIKIFNSMDLLCAKHGIEKIKTIGDAYMATCGIFPKCDDIRHNTYKMLGFA
MDVLEFIPKEMSFHLGLQVRVGIHCGPVISGVISGYAKPHFDVWGDTVNVASRMESTGIAGQIHVSDRVY
> Q02153^.^1^165^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^75.90^0.000e+00^4.000e-17
HKRPVPAKRYDNVTILFSGIVGFNAFCSKHASGEGAMKIVNLLNDLYTRFDTLTDSRKNPFVYKVETVGDKYMTVSGLP
EPCIHHARSICHLALDMMEIAGQVQVDGESVQITIGIHTGEVVTGVIGQRMPRYCLFGNTVNLTSRTETTGEKGKINVS
EYTYRCL
> P46197^.^1^168^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^84.40^0.000e+00^1.000e-19
VQAEAFDSVTIYFSDIVGFTALSAESTPMQVVTLLNDLYTCFDAIIDNFDVYKVETIGDAYMVVSGLPGRNGQRHAPEI
ARMALALLDAVSSFRIRHRPHDQLRLRIGVHTGPVCAGVVGLKMPRYCLFGDTVNTASRMESNGQALKIHVSSTTKDAL
DELGCFQLEL
> P40137^.^1^139^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^50.90^0.000e+00^1.000e-09
VTLLFADIRDFTSLSERLRPEQVVTLLNEYYGRMVEVVFRHGGTLDKFIGDALMVYFGAPIADPAHARRGVQCALDMVQ
ELETVNALRSARGEPCLRIGVGVHTGPAVLGNIGSATRRLEYTAIGDTVNLASRIESLTK
> P23466^.^1^154^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^55.90^0.000e+00^4.000e-11
PTGNVAIVFTDIKNSTFLWELFPDAMRAAIKTHNDIMRRQLRIYGGYEVKTEGDAFMVAFPTPTSALVWCLSVQLKLLE
AEWPEEITSIQDGCLITDNSGTKVYLGLSVRMGVHWGCPVPEIDLVTQRMDYLGPVVNKAARVSGVADGGQITLS
> O30820^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^80.20^0.000e+00^2.000e-18
DEASVLFADIVGFTERASSTAPADLVRFLDRLYSAFDELVDQHGLEKIKVSGDSYMVVSGVPRPRPDHTQALADFALDM
TNVAAQLKDPRGNPVPLRVGLATGPVVAGVVGSRRFFYDVWGDAVNVASRMESTDSVGQIQVPDEVYERL

  File: oth.dhf

  File: 54894.dhf

> Q9YBD5^.^1^95^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^55.30^0.000e+00^2.00
0e-11
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFN
VIEDYKVVEKVKLKLP
> Q97FS4^.^1^90^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^42.60^0.000e+00^1.00
0e-07
INSIKNGIVIDHIKAGHGIKIYNYLKLGEAEFPTALIMNAISKKNKAKDIIKIENVMDLDLAVLGFLDPNITVNIIEDE
KIRQKIQLKLP
> Q7MX57^.^1^92^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^72.70^0.000e+00^1.00
0e-16
VAAIRNGIVIDHIPPTKLFKVATLLQLDDLDKRITIGNNLRSRSHGSKGVIKIEDKTFEEEELNRIALIAPNVRLNIIR
DYEVVEKRQVEVP
> P96111^.^1^98^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^
Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate c
arbamoyltransferase, Regulatory-chain, N-terminal domain^.^42.20^0.000e+00^1.00
0e-07
GIKPIENGTVIDHIAKGKTPEEIYSTILKIRKILRLYDVDSADGIFRSSDGSFKGYISLPDRYLSKKEIKKLSAISPNT
TVNIIKNSTVVEKYRIKLP

  File: 55074.dhf

> Q08462^.^1^167^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^49.70^0.000e+00^3.000e-09
DCVCVMFASIPDFKEFYTESDVNKEGLECLRLLNEIIADFDDLLSKPKFSGVEKIKTIGSTYMAATGLSAVPSQEHSQE
PERQYMHIGTMVEFAFALVGKLDAINKHSFNDFKLRVGINHGPVIAGVIGAQKPQYDIWGNTVNVASRMDSTGVLDKIQ
VTEETSLVL
> Q03101^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^70.90^0.000e+00^1.000e-15
NNACVFFLDIAGFTRFSSIHSPEQVIQVLIKIFNSMDLLCAKHGIEKIKTIGDAYMATCGIFPKCDDIRHNTYKMLGFA
MDVLEFIPKEMSFHLGLQVRVGIHCGPVISGVISGYAKPHFDVWGDTVNVASRMESTGIAGQIHVSDRVY
> Q02153^.^1^165^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^75.90^0.000e+00^4.000e-17
HKRPVPAKRYDNVTILFSGIVGFNAFCSKHASGEGAMKIVNLLNDLYTRFDTLTDSRKNPFVYKVETVGDKYMTVSGLP
EPCIHHARSICHLALDMMEIAGQVQVDGESVQITIGIHTGEVVTGVIGQRMPRYCLFGNTVNLTSRTETTGEKGKINVS
EYTYRCL
> P46197^.^1^168^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^84.40^0.000e+00^1.000e-19
VQAEAFDSVTIYFSDIVGFTALSAESTPMQVVTLLNDLYTCFDAIIDNFDVYKVETIGDAYMVVSGLPGRNGQRHAPEI
ARMALALLDAVSSFRIRHRPHDQLRLRIGVHTGPVCAGVVGLKMPRYCLFGDTVNTASRMESNGQALKIHVSSTTKDAL
DELGCFQLEL
> P40137^.^1^139^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^50.90^0.000e+00^1.000e-09
VTLLFADIRDFTSLSERLRPEQVVTLLNEYYGRMVEVVFRHGGTLDKFIGDALMVYFGAPIADPAHARRGVQCALDMVQ
ELETVNALRSARGEPCLRIGVGVHTGPAVLGNIGSATRRLEYTAIGDTVNLASRIESLTK
> P23466^.^1^154^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^55.90^0.000e+00^4.000e-11
PTGNVAIVFTDIKNSTFLWELFPDAMRAAIKTHNDIMRRQLRIYGGYEVKTEGDAFMVAFPTPTSALVWCLSVQLKLLE
AEWPEEITSIQDGCLITDNSGTKVYLGLSVRMGVHWGCPVPEIDLVTQRMDYLGPVVNKAARVSGVADGGQITLS
> O30820^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like
^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase c
atalytic domain^.^80.20^0.000e+00^2.000e-18
DEASVLFADIVGFTERASSTAPADLVRFLDRLYSAFDELVDQHGLEKIKVSGDSYMVVSGVPRPRPDHTQALADFALDM
TNVAAQLKDPRGNPVPLRVGLATGPVVAGVVGSRRFFYDVWGDAVNVASRMESTDSVGQIQVPDEVYERL
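
   The header lines in the listings above are '^'-separated. As a reading
   aid only, the Python sketch below splits one such header into fields.
   It assumes that the line wrapping seen in the listings is a display
   artifact and that each header occupies a single line in the file. The
   field positions used (0 for the accession number, 2 and 3 for the start
   and end, 6 for the numeric node identifier) are inferred from these
   examples, not taken from the DHF specification in the SEQSEARCH
   documentation. DhfHeader and parse_dhf_header are hypothetical names.

```python
from typing import List, NamedTuple


class DhfHeader(NamedTuple):
    accession: str
    start: int
    end: int
    node_id: str
    raw: List[str]  # every '^'-separated field, unmodified


def parse_dhf_header(header: str) -> DhfHeader:
    """Split one unwrapped DHF header line into its '^'-separated fields.

    Field positions are assumptions inferred from the example listings.
    """
    if not header.startswith(">"):
        raise ValueError("a DHF header line is expected to start with '>'")
    fields = header[1:].strip().split("^")
    return DhfHeader(fields[0], int(fields[2]), int(fields[3]), fields[6], fields)


# First header of fam.dhf, with the display wrapping removed.
example = (
    "> Q9YBD5^.^1^95^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^"
    "Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, "
    "N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, "
    "N-terminal domain^.^55.30^0.000e+00^2.000e-11"
)
hit = parse_dhf_header(example)
print(hit.accession, hit.start, hit.end, hit.node_id)
```

   The node identifier extracted this way matches the name of the
   per-family output file (54894.dhf) in the example.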

5.0 DATA FILES

   None.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-dhfindir]          directory  This option specifies the location of DHF
                                  files (domain hits files) (input). A 'domain
                                  hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA or
                                  EMBL-like). The hits are relatives to a SCOP
                                  or CATH family and are found from a search
                                  of a sequence database. Files containing
                                  hits retrieved by PSIBLAST are generated by
                                  using SEQSEARCH.
   -overlap            integer    This option specifies the number of
                                  overlapping residues required for merging of
                                  two hits. Each family is also processed so
                                  that overlapping hits (hits with identical
                                  accession number that overlap by at least a
                                  user-defined number of residues) are
                                  replaced by a hit that is produced from
                                  merging the two overlapping hits.
   -dofamilies         toggle     This option specifies whether to write a
                                  domain families file. If this option is
                                  set a domain families file is written.
   -doambiguities      toggle     This option specifies whether to write a
                                  domain ambiguities file. If this option is
                                  set a domain ambiguities file is written.
  [-dhfoutdir]         outdir     This option specifies the location of DHF
                                  files (domain hits files) (output). A
                                  'domain hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA or
                                  EMBL-like). The hits are relatives to a SCOP
                                  or CATH family and are found from a search
                                  of a sequence database. Files containing
                                  hits retrieved by PSIBLAST are generated by
                                  using SEQSEARCH.
*  -hitsfile           outfile    This option specifies the name of domain
                                  families file (output). A 'domain families
                                  file' contains sequence relatives (hits) for
                                  each of a number of different SCOP or CATH
                                  families found from searching a sequence
                                  database, e.g. by using SEQSEARCH
                                  (psiblast). The file contains the collated
                                  search results for the individual families;
                                  only those hits of unambiguous family
                                  assignment are included. Hits of ambiguous
                                  family assignment are assigned as relatives
                                  to a SCOP or CATH superfamily or fold
                                  instead and are collated into a 'domain
                                  ambiguities file'. The domain families and
                                  ambiguities files are generated by using
                                  SEQSORT and use the same format as a DHF
                                  file (domain hits file).
*  -ambigfile          outfile    This option specifies the name of domain
                                  ambiguities file (output). A 'domain
                                  families file' contains sequence relatives
                                  (hits) for each of a number of different
                                  SCOP or CATH families found from searching a
                                  sequence database, e.g. by using SEQSEARCH
                                  (psiblast). The file contains the collated
                                  search results for the individual families;
                                  only those hits of unambiguous family
                                  assignment are included. Hits of ambiguous
                                  family assignment are assigned as relatives
                                  to a SCOP or CATH superfamily or fold
                                  instead and are collated into a 'domain
                                  ambiguities file'. The domain families and
                                  ambiguities files are generated by using
                                  SEQSORT and use the same format as a DHF
                                  file (domain hits file).

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-hitsfile" associated qualifiers
   -odirectory         string     Output directory

   "-ambigfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Allowed values and defaults are summarised below; full descriptions of
   the qualifiers are given in the listing above.

   Qualifier        Description                         Allowed values  Default
   Standard (Mandatory) qualifiers
   [-dhfindir]      Location of DHF files (domain hits  Directory       ./
   (Parameter 1)    files) (input).
   -overlap         Number of overlapping residues      Any integer     10
                    required for merging of two hits.   value
   -dofamilies      Write domain families file.         Toggle value    No
                                                        Yes/No
   -doambiguities   Write domain ambiguities file.      Toggle value    No
                                                        Yes/No
   [-dhfoutdir]     Location of DHF files (domain hits  Output          ./
   (Parameter 2)    files) (output).                    directory
   -hitsfile        Name of domain families file        Output file     fam.dhf
                    (output).
   -ambigfile       Name of domain ambiguities file     Output file     oth.dhf
                    (output).
   Additional (Optional) qualifiers
   (none)
   Advanced (Unprompted) qualifiers
   (none)

  6.2 EXAMPLE SESSION

   An example of interactive use of SEQSORT is shown below.


% seqsort 
Remove ambiguous classified sequences from DHF files.
Location of DHF files (domain hits files) (input). [./]: ../seqnr-keep/hitsnr
Number of overlapping residues required for merging of two hits. [10]: 10
Write domain families file. [N]: Y
Write domain ambiguities file. [N]: Y
Location of DHF files (domain hits files) (output). [./]: 
Name of domain families file (output). [fam.dhf]: 
Name of domain ambiguities file (output). [oth.dhf]: 

   The output files for this example are listed in section 4.0.
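
   The prompts in the session correspond to the qualifiers listed in
   section 6.1, so the same run can in principle be scripted. The Python
   sketch below only builds the argument list. It assumes the usual EMBOSS
   convention that a toggle is switched on by naming it without a value,
   and it uses -auto to turn off prompts. The function name
   seqsort_command is hypothetical, and the input path is the one typed
   in the session above.

```python
import shlex


def seqsort_command(dhfindir, dhfoutdir="./", overlap=10,
                    hitsfile="fam.dhf", ambigfile="oth.dhf"):
    """Build an argument list for a non-interactive SEQSORT run."""
    return [
        "seqsort",
        "-dhfindir", dhfindir,
        "-overlap", str(overlap),
        "-dofamilies",       # write the domain families file
        "-doambiguities",    # write the domain ambiguities file
        "-dhfoutdir", dhfoutdir,
        "-hitsfile", hitsfile,
        "-ambigfile", ambigfile,
        "-auto",             # turn off prompts
    ]


cmd = seqsort_command("../seqnr-keep/hitsnr")
print(shlex.join(cmd))
# To run it (requires an installation that provides seqsort):
#   import subprocess; subprocess.run(cmd, check=True)
```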

7.0 KNOWN BUGS & WARNINGS

   None.

8.0 NOTES

   None.

  8.1 GLOSSARY OF FILE TYPES

   FILE TYPE:   Domain hits file
   FORMAT:      DHF format (FASTA-like).
   DESCRIPTION: Database hits (sequences) with domain classification
                information. The hits are relatives to a SCOP or CATH
                family (or other node in the structural hierarchies) and
                are found from a search of a discriminating element (e.g.
                a protein signature, hidden Markov model, simple frequency
                matrix, Gribskov profile or Henikoff profile) against a
                sequence database.
   CREATED BY:  SEQSEARCH (hits retrieved by PSIBLAST). SIGSCAN (hits
                retrieved by sparse protein signature). LIBSCAN (hits
                retrieved by various types of HMM and profile).
   SEE ALSO:    N.A.

   FILE TYPE:   Domain families & ambiguities file
   DESCRIPTION: Contains sequence relatives (hits) for each of a number of
                different SCOP or CATH families found from PSIBLAST
                searches of a sequence database. The file contains the
                collated search results for the individual families; only
                those hits of unambiguous family assignment are included.
                Hits of ambiguous family assignment are assigned as
                relatives to a SCOP or CATH superfamily or fold instead
                and are collated into a 'domain ambiguities file'. The
                domain families and ambiguities files are generated by
                using SEQSORT and use the same format as a DHF file
                (domain hits file).
   SEE ALSO:    N.A.

   FILE TYPE:   Domain validation file
   DESCRIPTION: Contains sequence relatives (hits) for each of a number of
                different SCOP or CATH families, superfamilies and folds.
                The file contains the collated results from PSIBLAST
                searches of a sequence database for the individual
                families; hits of unambiguous family assignment are listed
                under their respective family, otherwise a hit is assigned
                as a relative to a superfamily or fold instead. The domain
                validation file is generated by using SEQNR and uses the
                same format as a DHF file (domain hits file).
   SEE ALSO:    N.A.


9.0 DESCRIPTION

   The results of multiple searches of a sequence database using a
   homology search tool such as BLAST may contain overlapping or
   identical hits, especially where the query sequences are related, for
   instance, where they belong to different families but the same
   superfamily. For certain analyses it is desirable to assign a hit with
   confidence to a unique family, or otherwise assign it as a member of a
   larger superfamily or fold instead. SEQSORT reads a directory of DHF
   files (domain hits files), where each file contains hits to a
   different SCOP family. It compares, processes and collates the hits and
   writes a directory of DHF files that contain only those hits that
   could be uniquely assigned to a SCOP family. Optionally, two further
   files are written: (i) a domain families file, of hits (from ALL the
   input files) that could be uniquely assigned to a SCOP family, and (ii)
   a domain ambiguities file, of hits (from ALL the input files) of
   ambiguous family assignment, which are assigned as relatives to a SCOP
   superfamily or fold instead.
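
   The fallback from family to superfamily or fold can be illustrated with
   a small Python sketch. It is not the SEQSORT implementation: Node and
   shared_level are hypothetical names, and the case of two nodes in
   different folds, which this document does not describe, is simply
   reported as None. The fold and superfamily names are taken from the
   example output in section 4.0.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    fold: str
    superfamily: str
    family: str


def shared_level(a: Node, b: Node) -> Optional[str]:
    """Return the most specific classification level shared by two nodes."""
    if a.fold != b.fold:
        return None  # not covered by the description; left undecided here
    if a.superfamily != b.superfamily:
        return "fold"
    if a.family != b.family:
        return "superfamily"
    return "family"


act = "Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain"
cyc = "Adenylyl and guanylyl cyclase catalytic domain"
node_a = Node("Ferredoxin-like", act, act)
node_b = Node("Ferredoxin-like", cyc, cyc)

# A hit found by searches for both families could only be placed at the
# level the two nodes have in common.
level = shared_level(node_a, node_b)
print(level)
```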

10.0 ALGORITHM

   A rough outline of the algorithm follows; a better description will
   appear in a publication in preparation. Hits from searches for all
   domain families are collated into a single list and the list is sorted
   according to family name. The hits within each family are sorted
   by accession number, then hits within a family and with identical
   accession number are sorted by the start position of the hit relative
   to the full-length sequence in Swiss-Prot. In each family, identical
   hits (i.e. those with identical accession number and the same start
   and end points relative to the full-length sequence in Swiss-Prot) are
   removed, leaving only a single copy. Each family is also processed so
   that overlapping hits (hits with identical accession number that
   overlap by at least a user-defined number of residues) are replaced by
   a hit that is produced from merging the two overlapping hits. If two
   hits have the same accession number and overlap but are from searches
   for different families, the hits are merged and the merged hit is placed
   into a new list for hits to superfamilies (if the two families
   belong to the same superfamily) or for hits to folds (if the two
   families are in different superfamilies but the same fold). In this
   way hits that are unique to a particular family are identified, and
   hits of ambiguous family assignment are assigned as belonging to a
   superfamily or fold instead.
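
   The within-family part of this outline (sort, remove identical hits,
   merge overlapping hits) can be sketched in Python as follows. This is
   an illustration under stated assumptions, not the SEQSORT source:
   coordinates are taken to be inclusive, a merged hit is taken to span
   the union of the two ranges, and Hit, overlap_length and
   collapse_family_hits are hypothetical names. The cross-family step is
   not covered. The accession "P1" and family names are made up.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    family: str
    accession: str
    start: int
    end: int


def overlap_length(a: Hit, b: Hit) -> int:
    """Number of residues shared by two hits (inclusive coordinates)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def collapse_family_hits(hits: List[Hit], overlap: int = 10) -> List[Hit]:
    """Sort hits, drop exact duplicates and merge overlapping hits."""
    # set() removes identical hits; the sort key follows the outline:
    # family, then accession number, then start position.
    ordered = sorted(set(hits),
                     key=lambda h: (h.family, h.accession, h.start, h.end))
    result: List[Hit] = []
    for hit in ordered:
        prev = result[-1] if result else None
        if (prev is not None
                and prev.family == hit.family
                and prev.accession == hit.accession
                and overlap_length(prev, hit) >= overlap):
            # prev starts first because of the sort, so only the end moves.
            result[-1] = prev._replace(end=max(prev.end, hit.end))
        else:
            result.append(hit)
    return result


hits = [
    Hit("famA", "P1", 50, 120),
    Hit("famA", "P1", 1, 60),
    Hit("famA", "P1", 1, 60),     # identical copy, removed
    Hit("famA", "P1", 200, 250),  # no overlap, kept separate
    Hit("famB", "P1", 55, 110),   # other family, untouched by this step
]
collapsed = collapse_family_hits(hits, overlap=10)
for h in collapsed:
    print(h)
```

   In the example, the hits at 1-60 and 50-120 share 11 residues, which
   meets the threshold of 10, so they are replaced by a single hit
   spanning 1-120.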

11.0 RELATED APPLICATIONS

See also

   Program name                       Description
   contactcount Count specific versus non-specific contacts
   contacts     Generate intra-chain CON files from CCF files
   domainalign  Generate alignments (DAF file) for nodes in a DCF file
   domainrep    Reorder DCF file to identify representative structures
   domainreso   Remove low resolution domains from a DCF file
   interface    Generate inter-chain CON files from CCF files
   libgen       Generate discriminating elements from alignments
   matgen3d     Generate a 3D-1D scoring matrix from CCF files
   psiphi       Phi and psi torsion angles from protein coordinates
   rocon        Generates a hits file from comparing two DHF files
   rocplot      Performs ROC analysis on hits files
   scorecmapdir Contact scores for cleaned protein chain contact files
   seqalign     Extend alignments (DAF file) with sequences (DHF file)
   seqfraggle   Removes fragment sequences from DHF files
   seqsearch    Generate PSI-BLAST hits (DHF file) from a DAF file
   seqwords     Generates DHF files from keyword search of UniProt
   siggen       Generates a sparse protein signature from an alignment
   siggenlig    Generate ligand-binding signatures from a CON file
   sigscan      Generate hits (DHF file) from a signature search
   sigscanlig   Search ligand-signature library & write hits (LHF file)

12.0 DIAGNOSTIC ERROR MESSAGES

   None.

13.0 AUTHORS

   Ranjeeva Ranasinghe (rranasin@rfcgr.mrc.ac.uk)
   Jon Ison (jison@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

   Please cite the authors and EMBOSS.
   Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European
   Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.
   
   See also http://emboss.sourceforge.net/

  14.1 Other useful references
