
                                  fkitsch 



Function

   Fitch-Margoliash method with contemporary tips

Description

   Estimates phylogenies from distance matrix data under the
   "ultrametric" model which is the same as the additive tree model
   except that an evolutionary clock is assumed. The Fitch-Margoliash
   criterion and other least squares criteria, or the Minimum Evolution
   criterion are possible. This program will be useful with distances
   computed from molecular sequences, restriction sites or fragments
   distances, with distances from DNA hybridization measurements, and
   with genetic distances computed from gene frequencies.

Algorithm

   This program carries out the Fitch-Margoliash and Least Squares
   methods, plus a variety of others of the same family, with the
   assumption that all tip species are contemporaneous, and that there is
   an evolutionary clock (in effect, a molecular clock). This means that
   branches of the tree cannot be of arbitrary length, but are
   constrained so that the total length from the root of the tree to any
   species is the same. The quantity minimized is the same weighted sum
   of squares described in the Distance Matrix Methods documentation
   file.

   The programs FITCH, KITSCH, and NEIGHBOR are for dealing with data
   which comes in the form of a matrix of pairwise distances between all
   pairs of taxa, such as distances based on molecular sequence data,
   gene frequency genetic distances, amounts of DNA hybridization, or
   immunological distances. In analyzing these data, distance matrix
   programs implicitly assume that:
     * Each distance is measured independently from the others: no item
       of data contributes to more than one distance.
     * The distance between each pair of taxa is drawn from a
       distribution with an expectation which is the sum of values (in
       effect amounts of evolution) along the tree from one tip to the
       other. The variance of the distribution is proportional to a power
       p of the expectation.

   These assumptions can be traced in the least squares methods of
   programs FITCH and KITSCH but it is not quite so easy to see them in
   operation in the Neighbor-Joining method of NEIGHBOR, where the
   independence assumptions is less obvious.

   THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not
   be expected to be true in most kinds of data, such as genetic
   distances from gene frequency data. For genetic distance data in which
   pure genetic drift without mutation can be assumed to be the mechanism
   of change CONTML may be more appropriate. However, FITCH, KITSCH, and
   NEIGHBOR will not give positively misleading results (they will not
   make a statistically inconsistent estimate) provided that additivity
   holds, which it will if the distance is computed from the original
   data by a method which corrects for reversals and parallelisms in
   evolution. If additivity is not expected to hold, problems are more
   severe. A short discussion of these matters will be found in a review
   article of mine (1984a). For detailed, if sometimes irrelevant,
   controversy see the papers by Farris (1981, 1985, 1986) and myself
   (1986, 1988b).

   For genetic distances from gene frequencies, FITCH, KITSCH, and
   NEIGHBOR may be appropriate if a neutral mutation model can be assumed
   and Nei's genetic distance is used, or if pure drift can be assumed
   and either Cavalli-Sforza's chord measure or Reynolds, Weir, and
   Cockerham's (1983) genetic distance is used. However, in the latter
   case (pure drift) CONTML should be better.

   Restriction site and restriction fragment data can be treated by
   distance matrix methods if a distance such as that of Nei and Li
   (1979) is used. Distances of this sort can be computed in PHYLIp by
   the program RESTDIST.

   For nucleic acid sequences, the distances computed in DNADIST allow
   correction for multiple hits (in different ways) and should allow one
   to analyse the data under the presumption of additivity. In all of
   these cases independence will not be expected to hold. DNA
   hybridization and immunological distances may be additive and
   independent if transformed properly and if (and only if) the standards
   against which each value is measured are independent. (This is rarely
   exactly true).

   FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has
   the branch lengths unconstrained. KITSCH and the UPGMA option of
   NEIGHBOR, by contrast, assume that an "evolutionary clock" is valid,
   according to which the true branch lengths from the root of the tree
   to each tip are the same: the expected amount of evolution in any
   lineage is proportional to elapsed time.

   The method may be considered as providing an estimate of the
   phylogeny. Alternatively, it can be considered as a phenetic
   clustering of the tip species. This method minimizes an objective
   function, the sum of squares, not only setting the levels of the
   clusters so as to do so, but rearranging the hierarchy of clusters to
   try to find alternative clusterings that give a lower overall sum of
   squares. When the power option P is set to a value of P = 0.0, so that
   we are minimizing a simple sum of squares of the differences between
   the observed distance matrix and the expected one, the method is very
   close in spirit to Unweighted Pair Group Arithmetic Average Clustering
   (UPGMA), also called Average-Linkage Clustering. If the topology of
   the tree is fixed and there turn out to be no branches of negative
   length, its result should be the same as UPGMA in that case. But since
   it tries alternative topologies and (unless the N option is set) it
   combines nodes that otherwise could result in a reversal of levels, it
   is possible for it to give a different, and better, result than simple
   sequential clustering. Of course UPGMA itself is available as an
   option in program NEIGHBOR.

   An important use of this method will be to do a formal statistical
   test of the evolutionary clock hypothesis. This can be done by
   comparing the sums of squares achieved by FITCH and by KITSCH, BUT
   SOME CAVEATS ARE NECESSARY. First, the assumption is that the observed
   distances are truly independent, that no original data item
   contributes to more than one of them (not counting the two reciprocal
   distances from i to j and from j to i). THIS WILL NOT HOLD IF THE
   DISTANCES ARE OBTAINED FROM GENE FREQUENCIES, FROM MORPHOLOGICAL
   CHARACTERS, OR FROM MOLECULAR SEQUENCES. It may be invalid even for
   immunological distances and levels of DNA hybridization, provided that
   the use of common standard for all members of a row or column allows
   an error in the measurement of the standard to affect all these
   distances simultaneously. It will also be invalid if the numbers have
   been collected in experimental groups, each measured by taking
   differences from a common standard which itself is measured with
   error. Only if the numbers in different cells are measured from
   independent standards can we depend on the statistical model. The
   details of the test and the assumptions are discussed in my review
   paper on distance methods (Felsenstein, 1984a). For further and
   sometimes irrelevant controversy on these matters see the papers by
   Farris (1981, 1985, 1986) and myself (Felsenstein, 1986, 1988b).

   A second caveat is that the distances must be expected to rise
   linearly with time, not according to any other curve. Thus it may be
   necessary to transform the distances to achieve an expected linearity.
   If the distances have an upper limit beyond which they could not go,
   this is a signal that linearity may not hold. It is also VERY
   important to choose the power P at a value that results in the
   standard deviation of the variation of the observed from the expected
   distances being the P/2-th power of the expected distance.

   To carry out the test, fit the same data with both FITCH and KITSCH,
   and record the two sums of squares. If the topology has turned out the
   same, we have N = n(n-1)/2 distances which have been fit with 2n-3
   parameters in FITCH, and with n-1 parameters in KITSCH. Then the
   difference between S(K) and S(F) has d1 = n-2 degrees of freedom. It
   is statistically independent of the value of S(F), which has d2 =
   N-(2n-3) degrees of freedom. The ratio of mean squares

      [S(K)-S(F)]/d1
     ----------------
          S(F)/d2

   should, under the evolutionary clock, have an F distribution with n-2
   and N-(2n-3) degrees of freedom respectively. The test desired is that
   the F ratio is in the upper tail (say the upper 5%) of its
   distribution. If the S (subreplication) option is in effect, the above
   degrees of freedom must be modified by noting that N is not n(n-1)/2
   but is the sum of the numbers of replicates of all cells in the
   distance matrix read in, which may be either square or triangular. A
   further explanation of the statistical test of the clock is given in a
   paper of mine (Felsenstein, 1986).

   The program uses a similar tree construction method to the other
   programs in the package and, like them, is not guaranteed to give the
   best-fitting tree. The assignment of the branch lengths for a given
   topology is a least squares fit, subject to the constraints against
   negative branch lengths, and should not be able to be improved upon.
   KITSCH runs more quickly than FITCH.

Usage

   Here is a sample session with fkitsch


% fkitsch 
Fitch-Margoliash method with contemporary tips
Input file: kitsch.dat
Input tree file: 
Output file [kitsch.fkitsch]: 


 inseed: 0
 jumble: false
 njumble: 1
 lengths: false
 lower: false
 negallowed: false
 power: 2.000000
 replicates: false
 trout: true
 upper: false
 usertree: false
 printdata: false
 progress: true
 treeprint: true
 mulsets: false
 datasets: true
 nummatrices: 0Adding species:
   1. Bovine
   2. Mouse
   3. Gibbon
   4. Orang
   5. Gorilla
   6. Chimp
   7. Human

Doing global rearrangements
  !-------------!
   .............
coordinates lengthsum: 0.000000
Calling coordinates q->v 0.077904
coordinates lengthsum: 0.077904
Calling coordinates q->v 0.429227
coordinates lengthsum: 0.507131
Calling coordinates q->v 0.066390
coordinates lengthsum: 0.573520
Calling coordinates q->v 0.076377
coordinates lengthsum: 0.649897
Calling coordinates q->v 0.028355
coordinates lengthsum: 0.678252
Calling coordinates q->v 0.134600
coordinates lengthsum: 0.812852
Calling coordinates q->v 0.134600
coordinates lengthsum: 0.812852
Calling coordinates q->v 0.162955
coordinates lengthsum: 0.812852
Calling coordinates q->v 0.239332
coordinates lengthsum: 0.812852
Calling coordinates q->v 0.305722
coordinates lengthsum: 0.812852
Calling coordinates q->v 0.734948
coordinates lengthsum: 0.812852
Calling coordinates q->v 0.812852
coordinates lengthsum: 0.812852

Output written to file "kitsch.fkitsch"

Tree also written onto file "kitsch.treefile"

Done.


   Go to the input files for this example
   Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-datafile]          distances  File containing one or more distance
                                  matrices
  [-intreefile]        tree       (no help text) tree value
  [-outfile]           outfile    Output file name

   Additional (Optional) qualifiers (* if not always prompted):
   -matrixtype         menu       Type of data matrix
   -minev              boolean    Minimum evolution
*  -njumble            integer    Number of times to randomise
*  -seed               integer    Random number seed between 1 and 32767 (must
                                  be odd)
   -power              float      Power
   -negallowed         boolean    Negative branch lengths allowed
   -replicates         boolean    Subreplicates
   -[no]trout          toggle     Write out trees to tree file
*  -outtreefile        outfile    Tree file name
   -printdata          boolean    Print data at start of run
   -[no]progress       boolean    Print indications of progress of run
   -[no]treeprint      boolean    Print out tree

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   "-outtreefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


   Standard (Mandatory) qualifiers Allowed values Default
   [-datafile]
   (Parameter 1) File containing one or more distance matrices Distance
   matrix
   [-intreefile]
   (Parameter 2) (no help text) tree value Phylogenetic tree
   [-outfile]
   (Parameter 3) Output file name Output file <sequence>.fkitsch
   Additional (Optional) qualifiers Allowed values Default
   -matrixtype Type of data matrix
   s (Square)
   u (Upper triangular)
   l (Lower triangular)
   s
   -minev Minimum evolution Boolean value Yes/No No
   -njumble Number of times to randomise Integer 0 or more 0
   -seed Random number seed between 1 and 32767 (must be odd) Integer
   from 1 to 32767 1
   -power Power Any numeric value 2.0
   -negallowed Negative branch lengths allowed Boolean value Yes/No No
   -replicates Subreplicates Boolean value Yes/No No
   -[no]trout Write out trees to tree file Toggle value Yes/No Yes
   -outtreefile Tree file name Output file
   -printdata Print data at start of run Boolean value Yes/No No
   -[no]progress Print indications of progress of run Boolean value
   Yes/No Yes
   -[no]treeprint Print out tree Boolean value Yes/No Yes
   Advanced (Unprompted) qualifiers Allowed values Default
   (none)

Input file format

   fkitsch requires a bifurcating tree, unlike FITCH, which requires an
   unrooted tree with a trifurcation at its base. Thus the tree shown
   below would be written:

     ((D,E),(C,(A,B)));

   If a tree with a trifurcation at the base is by mistake fed into the U
   option of KITSCH then some of its species (the entire rightmost furc,
   in fact) will be ignored and too small a tree read in. This should
   result in an error message and the program should stop. It is
   important to understand the difference between the User Tree formats
   for KITSCH and FITCH. You may want to use RETREE to convert a user
   tree that is suitable for FITCH into one suitable for KITSCH or vice
   versa.

  Input files for usage example

  File: kitsch.dat

    7
Bovine      0.0000  1.6866  1.7198  1.6606  1.5243  1.6043  1.5905
Mouse       1.6866  0.0000  1.5232  1.4841  1.4465  1.4389  1.4629
Gibbon      1.7198  1.5232  0.0000  0.7115  0.5958  0.6179  0.5583
Orang       1.6606  1.4841  0.7115  0.0000  0.4631  0.5061  0.4710
Gorilla     1.5243  1.4465  0.5958  0.4631  0.0000  0.3484  0.3083
Chimp       1.6043  1.4389  0.6179  0.5061  0.3484  0.0000  0.2692
Human       1.5905  1.4629  0.5583  0.4710  0.3083  0.2692  0.0000

Output file format

   fkitsch output is a rooted tree, together with the sum of squares, the
   number of tree topologies searched, and, if the power P is at its
   default value of 2.0, the Average Percent Standard Deviation is also
   supplied. The lengths of the branches of the tree are given in a
   table, that also shows for each branch the time at the upper end of
   the branch. "Time" here really means cumulative branch length from the
   root, going upwards (on the printed diagram, rightwards). For each
   branch, the "time" given is for the node at the right (upper) end of
   the branch. It is important to realize that the branch lengths are not
   exactly proportional to the lengths drawn on the printed tree diagram!
   In particular, short branches are exaggerated in the length on that
   diagram so that they are more visible.

  Output files for usage example

  File: kitsch.fkitsch


   7 Populations

Fitch-Margoliash method with contemporary tips, version 3.6b

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  ------------
                                2
                   i  j      Obs

negative branch lengths not allowed


                                           +-------Human
                                         +-6
                                    +----5 +-------Chimp
                                    !    !
                                +---4    +---------Gorilla
                                !   !
       +------------------------3   +--------------Orang
       !                        !
  +----2                        +------------------Gibbon
  !    !
--1    +-------------------------------------------Mouse
  !
  +------------------------------------------------Bovine


Sum of squares =      0.107

Average percent standard deviation =   5.16213

From     To            Length          Height
----     --            ------          ------

   6   Human           0.13460         0.81285
   5      6            0.02836         0.67825
   6   Chimp           0.13460         0.81285
   4      5            0.07638         0.64990
   5   Gorilla         0.16296         0.81285
   3      4            0.06639         0.57352
   4   Orang           0.23933         0.81285
   2      3            0.42923         0.50713
   3   Gibbon          0.30572         0.81285
   1      2            0.07790         0.07790
   2   Mouse           0.73495         0.81285
   1   Bovine          0.81285         0.81285

  File: kitsch.treefile

((((((Human:0.13460,Chimp:0.13460):0.02836,Gorilla:0.16296):0.07638,
Orang:0.23933):0.06639,Gibbon:0.30572):0.42923,Mouse:0.73495):0.07790,
Bovine:0.81285);

Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

   Program name                       Description
   distmat      Creates a distance matrix from multiple alignments
   efitch       Fitch-Margoliash and Least-Squares Distance Methods
   ekitsch      Fitch-Margoliash method with contemporary tips
   eneighbor    Phylogenies from distance matrix by N-J or UPGMA method
   ffitch       Fitch-Margoliash and Least-Squares Distance Methods
   fneighbor    Phylogenies from distance matrix by N-J or UPGMA method

Author(s)

   This program is an EMBOSS conversion of a program written by Joe
   Felsenstein as part of his PHYLIP package.

   Although we take every care to ensure that the results of the EMBOSS
   version are identical to those from the original package, we recommend
   that you check your inputs give the same results in both versions
   before publication.

   Please report all bugs in the EMBOSS version to the EMBOSS bug team,
   not to the original author.

History

   Written (2004) - Joe Felsenstein, University of Washington.

   Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
