|
|
esim4 |
% esim4 Align an mRNA to a genomic DNA sequence Input sequence: tembl:hs989235 Genomic sequence: tembl:hsnfg9 Output file [hs989235.esim4]: |
Go to the input files for this example
Go to the output files for this example
Standard (Mandatory) qualifiers:
[-asequence] sequence Sequence USA
[-bsequence] seqall Genomic sequence
[-outfile] outfile Output file name
Additional (Optional) qualifiers (* if not always prompted):
-word integer Sets the word size (W) for blast hits.
Default value: 12.
-extend integer Sets the word extension termination limit
(X) for the blast-like stage of the
algorithm. Default value: 12.
-cutoff integer Sets the cutoff (E) in range [3,10].
-usermspa toggle False: esim4 calculates mspA, True: value
from mspA command line argument.
* -mspa integer MSP score threshold (K) for the first stage
of the algorithm. (If this option is not
specified, the threshold is computed from
the lengths of the sequences, using
statistical criteria.) For example, a good
value for genomic sequences in the range of
a few hundred Kb is 16. To avoid spurious
matches, however, a larger value may be
needed for longer sequences.
-usermspb toggle False: esim4 calculates mspB, True: value
from mspB command line argument.
* -mspb integer Sets the threshold for the MSP scores (C)
when aligning the as-yet-unmatched
fragments, during the second stage of the
algorithm. By default, the smaller of the
constant 12 and a statistics-based threshold
is chosen.
-weight integer Weight value (H) (undocumented). 0 uses a
default, >0 is a value
-diagonal integer Bound (K) for the diagonal distance within
consecutive MSPs in an exon.
-strand menu This determines the strand of the genome (R)
with which the mRNA will be aligned. The
default value is 'both', in which case both
strands of the genome are attempted. The
other allowed modes are 'forward' and
'reverse'.
-format integer Sets the output format (A). Exon endpoints
only (format=0), exon endpoints and
boundaries of the coding region (CDS) in the
genomic sequence, when specified for the
input mRNA (-format=5), alignment text
(-format=1), alignment in lav-block format
(-format=2), or both exon endpoints and
alignment text (-format=3 or -format=4). If
a reverse complement match is found,
-format=0,1,2,3,5 will give its position in
the plus strand of the longer sequence and
the minus strand of the shorter sequence.
-format=4 will give its position in the plus
strand of the first sequence (mRNA) and the
minus strand of the second sequence
(genome), regardless of which sequence is
longer. The -format=5 option can be used
with the S command line option to specify
the endpoints of the CDS in the mRNA, and
produces output in the exons file format
required by PipMaker.
-cliptails boolean Trim poly-A tails (P). Specifies whether or
not the program should report the fragment
of the alignment containing the poly-A tail
(if found). By default (-nocliptails) the
alignment is displayed as computed. When
this feature is enabled (-cliptails), sim4
will remove the poly-A tails and all format
options will produce additional lav
alignment headers.
-smallexons boolean Requests an additional search for small
marginal exons (N) (N=1) guided by the
splice-site recognition signals. This option
can be used when a high accuracy match is
expected. The default value is N=0,
specifying no additional search.
-[no]ambiguity boolean Controls the set of characters allowed in
the input sequences (B). By default
(-ambiguity), ambiguity characters
(ABCDGHKMNRSTVWXY) are allowed. By
specifying -noambiguity, the set of
acceptable characters is restricted to
A,C,G,T,N and X only.
-cdsregion string Allows the user to specify the endpoints of
the CDS in the input mRNA (S), with the
syntax: -cdsregion=n1..n2. This option is
only available with the -format=5 flag,
which produces output in the format required
by PipMaker. Alternatively, the CDS
coordinates could appear in a construct
CDS=n1..n2 in the FastA header of the mRNA
sequence. When the second file is an mRNA
database, the command line specification for
the CDS will apply to the first sequence in
the file only.
-offseta integer Undocumented (f) - some sort of offset in
first sequence.
-offsetb integer Undocumented (F) - some sort of offset in
second sequence.
-toa integer Undocumented (t)- offset end of first
sequence?
-tob integer Undocumented (T) - offset end of second
sequence?
Advanced (Unprompted) qualifiers: (none)
Associated qualifiers:
"-asequence" associated qualifiers
-sbegin1 integer Start of the sequence to be used
-send1 integer End of the sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-bsequence" associated qualifiers
-sbegin2 integer Start of each sequence to be used
-send2 integer End of each sequence to be used
-sreverse2 boolean Reverse (if DNA)
-sask2 boolean Ask for begin/end/reverse
-snucleotide2 boolean Sequence is nucleotide
-sprotein2 boolean Sequence is protein
-slower2 boolean Make lower case
-supper2 boolean Make upper case
-sformat2 string Input sequence format
-sdbname2 string Database name
-sid2 string Entryname
-ufo2 string UFO features
-fformat2 string Features format
-fopenfile2 string Features file name
"-outfile" associated qualifiers
-odirectory3 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report deaths
|
| Standard (Mandatory) qualifiers | Allowed values | Default | |||||||
|---|---|---|---|---|---|---|---|---|---|
| [-asequence] (Parameter 1) |
Sequence USA | Readable sequence | Required | ||||||
| [-bsequence] (Parameter 2) |
Genomic sequence | Readable sequence(s) | Required | ||||||
| [-outfile] (Parameter 3) |
Output file name | Output file | <sequence>.esim4 | ||||||
| Additional (Optional) qualifiers | Allowed values | Default | |||||||
| -word | Sets the word size (W) for blast hits. Default value: 12. | Any integer value | 12 | ||||||
| -extend | Sets the word extension termination limit (X) for the blast-like stage of the algorithm. Default value: 12. | Any integer value | 12 | ||||||
| -cutoff | Sets the cutoff (E) in range [3,10]. | Integer up to 10 | 3 | ||||||
| -usermspa | False: esim4 calculates mspA, True: value from mspA command line argument. | Toggle value Yes/No | No | ||||||
| -mspa | MSP score threshold (K) for the first stage of the algorithm. (If this option is not specified, the threshold is computed from the lengths of the sequences, using statistical criteria.) For example, a good value for genomic sequences in the range of a few hundred Kb is 16. To avoid spurious matches, however, a larger value may be needed for longer sequences. | Any integer value | 16 | ||||||
| -usermspb | False: esim4 calculates mspB, True: value from mspB command line argument. | Toggle value Yes/No | No | ||||||
| -mspb | Sets the threshold for the MSP scores (C) when aligning the as-yet-unmatched fragments, during the second stage of the algorithm. By default, the smaller of the constant 12 and a statistics-based threshold is chosen. | Any integer value | 12 | ||||||
| -weight | Weight value (H) (undocumented). 0 uses a default, >0 is a value | Any integer value | 0 | ||||||
| -diagonal | Bound (K) for the diagonal distance within consecutive MSPs in an exon. | Any integer value | 10 | ||||||
| -strand | This determines the strand of the genome (R) with which the mRNA will be aligned. The default value is 'both', in which case both strands of the genome are attempted. The other allowed modes are 'forward' and 'reverse'. |
|
both | ||||||
| -format | Sets the output format (A). Exon endpoints only (format=0), exon endpoints and boundaries of the coding region (CDS) in the genomic sequence, when specified for the input mRNA (-format=5), alignment text (-format=1), alignment in lav-block format (-format=2), or both exon endpoints and alignment text (-format=3 or -format=4). If a reverse complement match is found, -format=0,1,2,3,5 will give its position in the plus strand of the longer sequence and the minus strand of the shorter sequence. -format=4 will give its position in the plus strand of the first sequence (mRNA) and the minus strand of the second sequence (genome), regardless of which sequence is longer. The -format=5 option can be used with the S command line option to specify the endpoints of the CDS in the mRNA, and produces output in the exons file format required by PipMaker. | Integer from 0 to 5 | 0 | ||||||
| -cliptails | Trim poly-A tails (P). Specifies whether or not the program should report the fragment of the alignment containing the poly-A tail (if found). By default (-nocliptails) the alignment is displayed as computed. When this feature is enabled (-cliptails), sim4 will remove the poly-A tails and all format options will produce additional lav alignment headers. | Boolean value Yes/No | No | ||||||
| -smallexons | Requests an additional search for small marginal exons (N) (N=1) guided by the splice-site recognition signals. This option can be used when a high accuracy match is expected. The default value is N=0, specifying no additional search. | Boolean value Yes/No | No | ||||||
| -[no]ambiguity | Controls the set of characters allowed in the input sequences (B). By default (-ambiguity), ambiguity characters (ABCDGHKMNRSTVWXY) are allowed. By specifying -noambiguity, the set of acceptable characters is restricted to A,C,G,T,N and X only. | Boolean value Yes/No | Yes | ||||||
| -cdsregion | Allows the user to specify the endpoints of the CDS in the input mRNA (S), with the syntax: -cdsregion=n1..n2. This option is only available with the -format=5 flag, which produces output in the format required by PipMaker. Alternatively, the CDS coordinates could appear in a construct CDS=n1..n2 in the FastA header of the mRNA sequence. When the second file is an mRNA database, the command line specification for the CDS will apply to the first sequence in the file only. | Any string is accepted | An empty string is accepted | ||||||
| -offseta | Undocumented (f) - some sort of offset in first sequence. | Any integer value | 0 | ||||||
| -offsetb | Undocumented (F) - some sort of offset in second sequence. | Any integer value | 0 | ||||||
| -toa | Undocumented (t)- offset end of first sequence? | Any integer value | 0 | ||||||
| -tob | Undocumented (T) - offset end of second sequence? | Any integer value | 0 | ||||||
| Advanced (Unprompted) qualifiers | Allowed values | Default | |||||||
| (none) | |||||||||
ID HS989235 standard; RNA; EST; 495 BP.
XX
AC H45989;
XX
SV H45989.1
XX
DT 18-NOV-1995 (Rel. 45, Created)
DT 04-MAR-2000 (Rel. 63, Last updated, Version 2)
XX
DE yo13c02.s1 Soares adult brain N2b5HB55Y Homo sapiens cDNA clone
DE IMAGE:177794 3', mRNA sequence.
XX
KW EST.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN [1]
RP 1-495
RA Hillier L., Clark N., Dubuque T., Elliston K., Hawkins M., Holman M.,
RA Hultman M., Kucaba T., Le M., Lennon G., Marra M., Parsons J., Rifkin L.,
RA Rohlfing T., Soares M., Tan F., Trevaskis E., Waterston R., Williamson A.,
RA Wohldmann P., Wilson R.;
RT "The WashU-Merck EST Project";
RL Unpublished.
XX
DR RZPD; IMAGp998F03326; IMAGp998F03326.
XX
CC On May 8, 1995 this sequence version replaced gi:800819.
CC Contact: Wilson RK
CC Washington University School of Medicine
CC 4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
CC Tel: 314 286 1800
CC Fax: 314 286 1810
CC Email: est@watson.wustl.edu
CC Insert Size: 544
CC High quality sequence stops: 265
CC Source: IMAGE Consortium, LLNL
CC This clone is available royalty-free through LLNL ; contact the
CC IMAGE Consortium (info@image.llnl.gov) for further information.
CC Possible reversed clone: polyT not found
CC Insert Length: 544 Std Error: 0.00
CC Seq primer: SP6
CC High quality sequence stop: 265.
XX
FH Key Location/Qualifiers
FH
FT source 1..495
FT /db_xref="taxon:9606"
FT /db_xref="ESTLIB:300"
FT /db_xref="RZPD:IMAGp998F03326"
FT /note="Organ: brain; Vector: pT7T3D (Pharmacia) with a
FT modified polylinker; Site_1: Not I; Site_2: Eco RI; 1st
FT strand cDNA was primed with a Not I - oligo(dT) primer [5'
FT TGTTACCAATCTGAAGTGGGAGCGGCCGCGCTTTTTTTTTTTTTTTTTTT 3'],
FT double-stranded cDNA was size selected, ligated to Eco RI
FT adapters (Pharmacia), digested with Not I and cloned into
FT the Not I and Eco RI sites of a modified pT7T3 vector
FT (Pharmacia). Library went through one round of
FT normalization to a Cot = 53. Library constructed by Bento
FT Soares and M.Fatima Bonaldo. The adult brain RNA was
FT provided by Dr. Donald H. Gilden. Tissue was acquired 17-18
FT hours after death which occurred in consequence of a
FT ruptured aortic aneurysm. RNA was prepared from a pool of
FT tissues representing the following areas of the brain:
FT frontal, parietal, temporal and occipital cortex from the
FT left and right hemispheres, subcortical white matter, basal
FT ganglia, thalamus, cerebellum, midbrain, pons and medulla."
FT /sex="Male"
FT /organism="Homo sapiens"
FT /clone="IMAGE:177794"
FT /clone_lib="Soares adult brain N2b5HB55Y"
FT /dev_stage="55-year old"
FT /lab_host="DH10B (ampicillin resistant)"
XX
SQ Sequence 495 BP; 73 A; 135 C; 169 G; 104 T; 14 other;
ccggnaagct cancttggac caccgactct cgantgnntc gccgcgggag ccggntggan 60
aacctgagcg ggactggnag aaggagcaga gggaggcagc acccggcgtg acggnagtgt 120
gtggggcact caggccttcc gcagtgtcat ctgccacacg gaaggcacgg ccacgggcag 180
gggggtctat gatcttctgc atgcccagct ggcatggccc cacgtagagt ggnntggcgt 240
ctcggtgctg gtcagcgaca cgttgtcctg gctgggcagg tccagctccc ggaggacctg 300
gggcttcagc ttcccgtagc gctggctgca gtgacggatg ctcttgcgct gccatttctg 360
ggtgctgtca ctgtccttgc tcactccaaa ccagttcggc ggtccccctg cggatggtct 420
gtgttgatgg acgtttgggc tttgcagcac cggccgccga gttcatggtn gggtnaagag 480
atttgggttt tttcn 495
//
|
ID HSNFG9 standard; DNA; HUM; 33760 BP.
XX
AC Z69719;
XX
SV Z69719.1
XX
DT 26-FEB-1996 (Rel. 46, Created)
DT 22-NOV-1999 (Rel. 61, Last updated, Version 3)
XX
DE Human DNA sequence from cosmid NFG9 from a contig from the tip of the short
DE arm of chromosome 16, spanning 2Mb of 16p13.3. Contains Interleukin 9
DE Receptor Pseudogene, repeat polymorphism, ESTs, CpG islands and endogenous
DE retroviral DNA.
XX
KW 16p13.3; CpG island; Interleukin 9 Receptor Pseudogene;
KW repeat polymorphism.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN [1]
RP 1-33760
RA Kershaw J.;
RT ;
RL Submitted (22-FEB-1996) to the EMBL/GenBank/DDBJ databases.
RL Sanger Centre, Hinxton, Cambridgeshire, CB10 1RQ, England. E-mail enquires:
RL humquery@sanger.ac.uk
XX
CC IMPORTANT: This sequence is not the entire insert of clone
CC NFG9. It may be shorter because we only sequence overlapping
CC sections once, or longer because we arrange for a small
CC overlap between neighbouring submissions.
XX
CC The true left end of clone NFG9 is at 1 in this sequence.
CC The true left end of clone RA36 is at 25872.
XX
CC NFG9 is from a 280kb clone contig extending from the telomere of 16p.
CC Higgs D.R., Flint J. unpublished. MRC Molecular Haematology Unit,
CC Institute of Molecular Medicine, Oxford.
CC NFG9 is from the library CV007K. Choo et al.,(1986) Gene 46. 277-286.
XX
FH Key Location/Qualifiers
FH
FT source 1..33760
FT /chromosome="16"
FT /db_xref="taxon:9606"
FT /organism="Homo sapiens"
FT /map="16p13.3"
FT /clone_lib="CV007K"
[Part of this file has been deleted for brevity]
gagacagcag agtgctcagc tcatgaagga ggcaccagcc gccatgcctc tacatccagg 30840
tctcctgggg ttcccacctc cacaaaaacc cccactgcta ggagtgcagg caggagggga 30900
cctgagaacc gacagttata ggtcctgcgg gtgggcagtg ctgggtgttc tggtctgccc 30960
cacccctgtg tgcctagatc cccatctggg cctcaagtgg gtgggattcc aaaggaagag 31020
ccggagtagg cgtggggagg ggcaggccca ggctggacaa agagtctggc cagggagcgg 31080
cacattgccc tcccagagac agtggctcag tgtccaggcc ttccccaggc gcacagtggg 31140
ctcttgttcc cagaaagccc ctcgggggga tccaaacagt gtctccccca ccccgctgac 31200
ccctcagtgt atggggaaac cgtggcccac ggaaggcctc actgcctggg gtcacacagc 31260
atctgagtca ctgcagcagc ctcacagctg ccagcccagg cccagcccca tcaggagaca 31320
cccaaagcca cagtgcatcc caggaccagc tgggggggct gcgggcagga ctctcgatga 31380
ggctgaggga cgaggagggt caagggagcc actggcgcca tgcatgctga cgtcccctct 31440
ggctgcctgc agagcctggt gtggaagggc tgagtggggg atggtggaga gtcctgttaa 31500
ctcaggtttc tgctctgggg atgtctgggc acccatcaag ctggccgcgt gcacaggtgc 31560
agggagagcc agaaagcagg agccgatgca gggaggccac tggggacagc ccaggctgat 31620
gcttgggccc catgtgtctc caccacctac aaccctaagc aagcctcagc tttcccatct 31680
ggaaatcagg ggtcacagca gtgcctggca cagtagcagc ggctgactcc atcacagggt 31740
ggtgtagcct gtgggtactt ggcactctct gaggggcagg agctgggggg tgaaaggacc 31800
ctagagcata tgcaacaaga gggcagccct ggggacacct ggggacagaa ccctccaaag 31860
gtgtcgagtt tgggaagaga ctagagagaa gctctggcca gtccaggcat agacagtggc 31920
cacagccagt ggagagctgc atcctcaggt gtgagcagca accacctctg tactcaggcc 31980
tgccctgcac actcacagga ccatgctggc agggacaact ggcggcggag ttgactgcca 32040
accccggggc cagaaccatc aagcctgggc tctgctccgc ccaaggaact gcctgctgcc 32100
gaggtcagct ggagcaaggg gcctcacccc gggacacctt cccagacgtg tcctcagctc 32160
acatgagcct catcccaggg ggatgtggct cctccagcat ccccacccac acgctgctct 32220
ctgaccctca gtcttctgtt tgactcctaa tctgaagctc aatcctagat ctcccttgag 32280
aagggggtca ccagctgtct ggcagcccag cctccaggtc ttctggatta atgaagggaa 32340
agtcacctgg cctctctgcc ttgtctatta atggcatcat gctgagaatg atatttgcta 32400
ggccctttgc aaaccccaaa gtgctcttca accctcccag tgaagcctct tcttttctgt 32460
ggaagaaatg aggttcaggg tggagcaggg caggcctgag acctttgcag ggttctctcc 32520
aggtccccag caggacagac tggcaccctg cctcccctca tcaccctaga caaggagaca 32580
gaacaagagg ttccctgcta caggccatct gtgagggaag ccgccctagg gcctgtagac 32640
acaggaatcc ctgaggacct gacctgtgag ggtagtgcac aaaggggcca gcacttggca 32700
ggaggggggg gggcactgcc ccaaggctca gctagcaaat gtggcacagg ggtcaccaga 32760
gctaaacccc tgactcagtt gggtctgaca ggggctgaca tggcagacac acccaggaat 32820
caggggacac caagtgcagc tcagggcacc tgtccaggcc acacagtcag aaaggggatg 32880
gcagcaagga cttagctaca ctagattctg ggggtaaact gcctggtatg ctggtcactg 32940
ctagtcccca gtctggagtc tagctgggtc tcaggagtta ggcgaaaaca ccctccccag 33000
gctgcaggtg ggagaggccc acatcccctg cacacgtctg gccagaggac agatgggcag 33060
cccagtcacc agtcagagcc ctccagaggt gtccctgact gaccctacac acatgcaccc 33120
aggtgcccag gcacccttgg gctcagcaac cctgcaaccc cctcccagga cccaccagaa 33180
gcaggatagg actagagagg ccacaggagg gaaaccaagt cagagcagaa atggcttcgg 33240
tcctcagcag cctggctcag cttcctcaaa ccagatcctg actgatcaca ctggtctgtc 33300
taacccctgg gaggggtcct ctgtatccat cttacagata aggaaactga ggctcagaga 33360
agcccatcac tgcctaaggt cccagggcct ataagggagc tcaaagcctt gggccaggtc 33420
tgcccaggag ctgcagtgga agggaccctg tctgcagacc cccagaagac aaggcagacc 33480
acctgggttc ttcagccttg tggctgtgga cggctgtcag acccttctaa gaccccttgc 33540
cacctgctcc atcaggggca tctcagttga agaaggaagg actcaccccc aaaatcgtcc 33600
aactcagaaa aaaaggcaga agccaaggaa tccaatcact gggcaaaatg tgatcctggc 33660
acagacactg aggtggggga actggagccg gtgtggcgga ggccctcaca gccaagagca 33720
actgggggtg ccctgggcag ggactgtagc tgggaagatc 33760
//
|
seq1 = HS989235, 495 bp seq2 = HSNFG9 (HSNFG9), 33760 bp 1-193 (25686-25874) 92% <- 194-407 (26279-26492) 98% <- 408-487 (27391-27469) 88% |
| Program name | Description |
|---|---|
| alignrunner | Align sequence pairs in a directory |
| comparator | Compare contact scores of two sequence alignments |
| contactalign | Damian Counsell's experimental 2.5-D alignment algorithm |
| est2genome | Align EST and genomic DNA sequences |
| nawalign | Damian Counsell's NW implementation |
| nawalignrunner | Nawalign all sequence pairs in a directory |
| needle | Needleman-Wunsch global alignment |
| needlerunner | Needle all sequence pairs in a directory |
| scorer | Score alignments using structural alignments |
| scorerrunner | SCORER for ordered pairs of substituted seqs |
| stretcher | Finds the best global alignment between two sequences |
| substitute | Substitute matches into a template |
| substituterunner | Run SUBSTITUTE on a directory of traces |
Although we take every care to ensure that the results of the EMBOSS version are identical to those from the original package, we recommend that you check your inputs give the same results in both versions before publication.
Please report all bugs in the EMBOSS version to the EMBOSS bug team, not to the original author.