
                                 fneighbor 



Function

   Phylogenies from distance matrix by N-J or UPGMA method

Description

   An implementation by Mary Kuhner and Jon Yamato of Saitou and Nei's
   "Neighbor Joining Method," and of the UPGMA (Average Linkage
   clustering) method. Neighbor Joining is a distance matrix method
   producing an unrooted tree without the assumption of a clock. UPGMA
   does assume a clock. The branch lengths are not optimized by the least
   squares criterion, but the methods are very fast and thus can handle
   much larger data sets than the least squares programs FITCH and
   KITSCH.

Algorithm

   This program implements the Neighbor-Joining method of Saitou and Nei
   (1987) and the UPGMA method of clustering. The program was written by
   Mary Kuhner and Jon Yamato, using some code from program FITCH. An
   important part of the code was translated from FORTRAN code from the
   neighbor-joining program written by Naruya Saitou and by Li Jin, and
   is used with the kind permission of Drs. Saitou and Jin.

   NEIGHBOR constructs a tree by successive clustering of lineages,
   setting branch lengths as the lineages join. The tree is not
   rearranged thereafter. The tree does not assume an evolutionary clock,
   so that it is in effect an unrooted tree. It should be somewhat
   similar to the tree obtained by FITCH. The program cannot evaluate a
   User tree, nor can it prevent branch lengths from becoming negative.
   However, the algorithm is far faster than FITCH or KITSCH. This
   makes it particularly effective in their place for large studies or
   for bootstrap or jackknife resampling studies, which require runs on
   multiple data sets.
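
   The successive clustering described above can be sketched as follows.
   This is a minimal illustration of the textbook formulation of
   neighbor joining, not the NEIGHBOR source code; the function name
   neighbor_joining, the "nodeN" labels, and the tie-breaking rule
   (first minimum found) are assumptions made for the example.

```python
def neighbor_joining(names, dist):
    """Return the edges (parent, child, length) of an unrooted NJ tree.

    names: list of at least three taxon names.
    dist:  symmetric square matrix (list of lists) of pairwise distances.
    """
    labels = list(names)
    # Distance lookup keyed by label so that new nodes can be added freely.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges = []
    next_node = 1

    while len(labels) > 3:
        n = len(labels)
        # Net divergence of each active lineage.
        r = {a: sum(d[a][b] for b in labels) for a in labels}
        # Pick the pair minimising Q = (n - 2) * d(a, b) - r(a) - r(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = labels[i], labels[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths are set as the lineages join (they may be negative).
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2.0 * (n - 2))
        len_b = d[a][b] - len_a
        u = "node%d" % next_node
        next_node += 1
        edges.append((u, a, len_a))
        edges.append((u, b, len_b))
        # Distances from the new node to every remaining lineage.
        d[u] = {u: 0.0}
        for c in labels:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        labels = [c for c in labels if c not in (a, b)] + [u]

    # Last cycle: the three remaining lineages join at one node.
    a, b, c = labels
    u = "node%d" % next_node
    edges.append((u, a, 0.5 * (d[a][b] + d[a][c] - d[b][c])))
    edges.append((u, b, 0.5 * (d[a][b] + d[b][c] - d[a][c])))
    edges.append((u, c, 0.5 * (d[a][c] + d[b][c] - d[a][b])))
    return edges
```

   Each pass through the loop corresponds to one "Cycle" line in the
   progress report shown under Usage, and the final three-way join to
   the "last cycle" line. Nothing in the sketch prevents a negative
   branch length, which matches the limitation stated above.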

   The UPGMA option constructs a tree by successive (agglomerative)
   clustering using an average-linkage method of clustering. It has some
   relationship to KITSCH, in that when the tree topology turns out the
   same, the branch lengths with UPGMA will turn out to be the same as
   with the P = 0 option of KITSCH.
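
   A minimal sketch of average-linkage (UPGMA) clustering is given
   below for illustration. It is not the NEIGHBOR implementation: the
   function name upgma and the nested-tuple tree representation
   (left, right, height) are assumptions made for the example. The
   height of a node is half the average distance between the two
   clusters it joins, so a branch length is the difference between the
   heights at its two ends.

```python
def upgma(names, dist):
    """Return a rooted tree as nested tuples (left, right, height).

    names: list of taxon names; dist: symmetric square distance matrix.
    Tips are represented by their names and have height 0.
    """
    trees = {i: names[i] for i in range(len(names))}
    sizes = {i: 1 for i in trees}
    # Pairwise distances between active clusters, keyed by (low_id, high_id).
    d = {(i, j): dist[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)

    while len(trees) > 1:
        # Join the two closest clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        merged = (trees[i], trees[j], dij / 2.0)
        new = next_id
        next_id += 1
        # Average linkage: size-weighted mean of the old distances.
        for k in trees:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            d[(k, new)] = (sizes[i] * dik + sizes[j] * djk) / (sizes[i] + sizes[j])
        # Drop every distance that involves a cluster that no longer exists.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        sizes[new] = sizes[i] + sizes[j]
        trees[new] = merged
        del trees[i], trees[j]

    return next(iter(trees.values()))
```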

   The programs FITCH, KITSCH, and NEIGHBOR are for dealing with data
   which comes in the form of a matrix of pairwise distances between all
   pairs of taxa, such as distances based on molecular sequence data,
   gene frequency genetic distances, amounts of DNA hybridization, or
   immunological distances. In analyzing these data, distance matrix
   programs implicitly assume that:
     * Each distance is measured independently from the others: no item
       of data contributes to more than one distance.
     * The distance between each pair of taxa is drawn from a
       distribution with an expectation which is the sum of values (in
       effect amounts of evolution) along the tree from one tip to the
       other. The variance of the distribution is proportional to a power
       p of the expectation.

   These assumptions can be traced in the least squares methods of
   programs FITCH and KITSCH, but it is not quite so easy to see them in
   operation in the Neighbor-Joining method of NEIGHBOR, where the
   independence assumption is less obvious.

   THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not
   be expected to be true in most kinds of data, such as genetic
   distances from gene frequency data. For genetic distance data in which
   pure genetic drift without mutation can be assumed to be the mechanism
   of change CONTML may be more appropriate. However, FITCH, KITSCH, and
   NEIGHBOR will not give positively misleading results (they will not
   make a statistically inconsistent estimate) provided that additivity
   holds, which it will if the distance is computed from the original
   data by a method which corrects for reversals and parallelisms in
   evolution. If additivity is not expected to hold, problems are more
   severe. A short discussion of these matters will be found in a review
   article of mine (1984a). For detailed, if sometimes irrelevant,
   controversy see the papers by Farris (1981, 1985, 1986) and myself
   (1986, 1988b).
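
   Whether a given matrix is additive can be checked with the
   four-point condition: for every four taxa, the two largest of the
   three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l) and d(i,l)+d(j,k)
   must be equal. The sketch below is an illustration only; the
   function name is_additive and the tolerance are assumptions, and
   real distances will usually satisfy the condition only approximately.

```python
from itertools import combinations


def is_additive(dist, tol=1e-9):
    """Return True if the square matrix dist meets the four-point condition."""
    for i, j, k, l in combinations(range(len(dist)), 4):
        sums = sorted([
            dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k],
        ])
        # On a tree, the two largest sums both cross the internal branch.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```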

   For genetic distances from gene frequencies, FITCH, KITSCH, and
   NEIGHBOR may be appropriate if a neutral mutation model can be assumed
   and Nei's genetic distance is used, or if pure drift can be assumed
   and either Cavalli-Sforza's chord measure or Reynolds, Weir, and
   Cockerham's (1983) genetic distance is used. However, in the latter
   case (pure drift) CONTML should be better.

   Restriction site and restriction fragment data can be treated by
   distance matrix methods if a distance such as that of Nei and Li
   (1979) is used. Distances of this sort can be computed in PHYLIP by
   the program RESTDIST.

   For nucleic acid sequences, the distances computed in DNADIST allow
   correction for multiple hits (in different ways) and should allow one
   to analyse the data under the presumption of additivity. In all of
   these cases independence will not be expected to hold. DNA
   hybridization and immunological distances may be additive and
   independent if transformed properly and if (and only if) the standards
   against which each value is measured are independent. (This is rarely
   exactly true).

   FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has
   the branch lengths unconstrained. KITSCH and the UPGMA option of
   NEIGHBOR, by contrast, assume that an "evolutionary clock" is valid,
   according to which the true branch lengths from the root of the tree
   to each tip are the same: the expected amount of evolution in any
   lineage is proportional to elapsed time.
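
   Distances that fit a clock exactly are ultrametric: for every three
   taxa, the two largest of the three pairwise distances are equal. A
   minimal check is sketched below for illustration; the function name
   is_ultrametric and the tolerance are assumptions, and measured
   distances will rarely pass an exact test.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Return True if the square matrix dist meets the three-point condition."""
    for i, j, k in combinations(range(len(dist)), 3):
        _, middle, largest = sorted([dist[i][j], dist[i][k], dist[j][k]])
        # Under a clock the two most distant pairs share the same ancestor.
        if abs(largest - middle) > tol:
            return False
    return True
```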

Usage

   Here is a sample session with fneighbor


% fneighbor 
Phylogenies from distance matrix by N-J or UPGMA method
Input file: neighbor.dat
Output file [neighbor.fneighbor]: 



Cycle   4: species 1 (   0.91769) joins species 2 (   0.76891)
Cycle   3: node 1 (   0.42027) joins species 3 (   0.35793)
Cycle   2: species 6 (   0.15168) joins species 7 (   0.11752)
Cycle   1: node 1 (   0.04648) joins species 4 (   0.28469)
last cycle:
 node 1  (   0.02696) joins species 5  (   0.15393) joins node 6  (   0.03982)
coordinates lengthsum: 0.000000
Calling coordinates q->v 0.768910
coordinates lengthsum: 0.768910
Calling coordinates q->v 0.420269
coordinates lengthsum: 0.420269
Calling coordinates q->v 0.357931
coordinates lengthsum: 0.778200
Calling coordinates q->v 0.046481
coordinates lengthsum: 0.466750
Calling coordinates q->v 0.284694
coordinates lengthsum: 0.751444
Calling coordinates q->v 0.026956
coordinates lengthsum: 0.493706
Calling coordinates q->v 0.153931
coordinates lengthsum: 0.647637
Calling coordinates q->v 0.039819
coordinates lengthsum: 0.533525
Calling coordinates q->v 0.151675
coordinates lengthsum: 0.685200
Calling coordinates q->v 0.117525
coordinates lengthsum: 0.651050
Calling coordinates q->v 0.917690
coordinates lengthsum: 0.917690

Output written on file "neighbor.fneighbor"

Tree written on file "neighbor.treefile"

Done.
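
   The same run can be started without prompts from a script. The
   sketch below only uses qualifiers listed under "Command line
   arguments" (-datafile, -outfile, -outtreefile, -treetype, -auto);
   the helper names fneighbor_command and run_fneighbor are
   hypothetical, and it assumes fneighbor is installed and on the PATH.

```python
import subprocess


def fneighbor_command(datafile, outfile, treefile, treetype="n"):
    """Build the argument list for a non-interactive fneighbor run.

    treetype is "n" (Neighbor-joining) or "u" (UPGMA).
    """
    if treetype not in ("n", "u"):
        raise ValueError("treetype must be 'n' or 'u'")
    return [
        "fneighbor",
        "-datafile", datafile,
        "-outfile", outfile,
        "-outtreefile", treefile,
        "-treetype", treetype,
        "-auto",  # turn off prompts
    ]


def run_fneighbor(datafile, outfile, treefile, treetype="n"):
    """Run fneighbor and raise CalledProcessError if it fails."""
    subprocess.run(fneighbor_command(datafile, outfile, treefile, treetype),
                   check=True)
```

   For the session above, the call would be
   run_fneighbor("neighbor.dat", "neighbor.fneighbor", "neighbor.treefile").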



Command line arguments

   Standard (Mandatory) qualifiers:
  [-datafile]          distances  (no help text) distances value
  [-outfile]           outfile    Output file name

   Additional (Optional) qualifiers (* if not always prompted):
   -matrixtype         menu       Type of data matrix
   -treetype           menu       Neighbor-joining or UPGMA tree
*  -outgrno            integer    Species number to use as outgroup
   -jumble             toggle     Randomise input order of species
*  -seed               integer    Random number seed between 1 and 32767 (must
                                  be odd)
   -replicates         boolean    Subreplicates
   -[no]trout          toggle     Write out trees to tree file
*  -outtreefile        outfile    Tree file name
   -printdata          boolean    Print data at start of run
   -[no]progress       boolean    Print indications of progress of run
   -[no]treeprint      boolean    Print out tree

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   "-outtreefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Qualifier        Description                         Allowed values            Default
   ---------        -----------                         --------------            -------
   Standard (Mandatory) qualifiers
   [-datafile]      (no help text) distances value      Distance matrix
   (Parameter 1)
   [-outfile]       Output file name                    Output file               <sequence>.fneighbor
   (Parameter 2)

   Additional (Optional) qualifiers
   -matrixtype      Type of data matrix                 s (Square)                s
                                                        u (Upper triangular)
                                                        l (Lower triangular)
   -treetype        Neighbor-joining or UPGMA tree      n (Neighbor-joining)      n
                                                        u (UPGMA)
   -outgrno         Species number to use as outgroup   Integer 0 or more         0
   -jumble          Randomise input order of species    Toggle value Yes/No       No
   -seed            Random number seed between 1 and    Integer from 1 to 32767   1
                    32767 (must be odd)
   -replicates      Subreplicates                       Boolean value Yes/No      No
   -[no]trout       Write out trees to tree file        Toggle value Yes/No       Yes
   -outtreefile     Tree file name                      Output file
   -printdata       Print data at start of run          Boolean value Yes/No      No
   -[no]progress    Print indications of progress of    Boolean value Yes/No      Yes
                    run
   -[no]treeprint   Print out tree                      Boolean value Yes/No      Yes

   Advanced (Unprompted) qualifiers
   (none)

Input file format

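   fneighbor reads a distance matrix file: the number of taxa on the
   first line, followed by one row per taxon giving its name and its
   distances to the other taxa. The matrix may be square, upper
   triangular, or lower triangular (see -matrixtype).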

  Input files for usage example

  File: neighbor.dat

    7
Bovine      0.0000  1.6866  1.7198  1.6606  1.5243  1.6043  1.5905
Mouse       1.6866  0.0000  1.5232  1.4841  1.4465  1.4389  1.4629
Gibbon      1.7198  1.5232  0.0000  0.7115  0.5958  0.6179  0.5583
Orang       1.6606  1.4841  0.7115  0.0000  0.4631  0.5061  0.4710
Gorilla     1.5243  1.4465  0.5958  0.4631  0.0000  0.3484  0.3083
Chimp       1.6043  1.4389  0.6179  0.5061  0.3484  0.0000  0.2692
Human       1.5905  1.4629  0.5583  0.4710  0.3083  0.2692  0.0000
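
   A square matrix laid out like the one above can be read with a few
   lines of code. The sketch below is illustrative only: the function
   name read_square_matrix is hypothetical, and it assumes that names
   contain no spaces and that each row sits on a single line, which
   holds for this example but is not guaranteed for every input file.

```python
def read_square_matrix(text):
    """Parse a square distance matrix; return (names, matrix)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])  # first line: number of taxa
    names, matrix = [], []
    for line in lines[1:n + 1]:
        fields = line.split()
        if len(fields) != n + 1:
            raise ValueError("expected a name and %d distances: %r" % (n, line))
        names.append(fields[0])
        matrix.append([float(x) for x in fields[1:]])
    if len(names) != n:
        raise ValueError("expected %d rows, found %d" % (n, len(names)))
    return names, matrix
```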

Output file format

   fneighbor output consists of a tree (rooted if UPGMA, unrooted if
   Neighbor-Joining) and the lengths of the interior segments. The
   Average Percent Standard Deviation is not computed or printed out. If
   the tree found by NEIGHBOR is fed into FITCH as a User Tree, FITCH
   will compute this quantity, provided one also selects the N option of
   FITCH to ensure that none of the branch lengths is re-estimated.

   As NEIGHBOR runs it prints out an account of the successive clustering
   levels, if you allow it to. This is mostly for reassurance and can be
   suppressed using menu option 2 of NEIGHBOR (in fneighbor, the
   -noprogress qualifier). In this printout of cluster levels the word
   "species" refers to a tip species, and the word "node" to an interior
   node of the resulting tree.

  Output files for usage example

  File: neighbor.fneighbor


Neighbor-Joining/UPGMA method version 3.6b


   7 Populations

 Neighbor-joining method

 Negative branch lengths allowed


  +---------------------------------------------Mouse
  !
  !                        +---------------------Gibbon
  1------------------------2
  !                        !  +----------------Orang
  !                        +--4
  !                           ! +--------Gorilla
  !                           +-5
  !                             ! +--------Chimp
  !                             +-3
  !                               +------Human
  !
  +------------------------------------------------------Bovine


remember: this is an unrooted tree!

Between        And            Length
-------        ---            ------
   1          Mouse           0.76891
   1             2            0.42027
   2          Gibbon          0.35793
   2             4            0.04648
   4          Orang           0.28469
   4             5            0.02696
   5          Gorilla         0.15393
   5             3            0.03982
   3          Chimp           0.15168
   3          Human           0.11752
   1          Bovine          0.91769


  File: neighbor.treefile

(Mouse:0.76891,(Gibbon:0.35793,(Orang:0.28469,(Gorilla:0.15393,
(Chimp:0.15168,Human:0.11752):0.03982):0.02696):0.04648):0.42027,Bovine:0.91769
);
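
   The tip branch lengths can be pulled out of a tree file like the one
   above for comparison with the "Between / And / Length" table. The
   sketch below is a simple pattern match rather than a full Newick
   parser; the function name tip_lengths is hypothetical, and it
   assumes names that start with a letter and lengths written in plain
   decimal notation, as in this example.

```python
import re


def tip_lengths(newick):
    """Return {tip name: branch length} for the tips of a Newick string."""
    # A tip is "name:length"; interior branches follow ")" and have no name.
    pattern = r"([A-Za-z_]\w*):(-?\d+(?:\.\d+)?)"
    return {name: float(length) for name, length in re.findall(pattern, newick)}
```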

Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

   Program name                       Description
   distmat      Creates a distance matrix from multiple alignments
   efitch       Fitch-Margoliash and Least-Squares Distance Methods
   ekitsch      Fitch-Margoliash method with contemporary tips
   eneighbor    Phylogenies from distance matrix by N-J or UPGMA method
   ffitch       Fitch-Margoliash and Least-Squares Distance Methods
   fkitsch      Fitch-Margoliash method with contemporary tips

Author(s)

   This program is an EMBOSS conversion of a program written by Joe
   Felsenstein as part of his PHYLIP package.

   Although we take every care to ensure that the results of the EMBOSS
   version are identical to those from the original package, we recommend
   that you check your inputs give the same results in both versions
   before publication.

   Please report all bugs in the EMBOSS version to the EMBOSS bug team,
   not to the original author.

History

   Written (2004) - Joe Felsenstein, University of Washington.

   Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
