Restrict search region
When this option is clicked off, PickAl will look anywhere in the rectangle between two anchors for
additional anchors; when it is on, PickAl will not look close to the non-anchor corners and some edges.
This avoids long gaps in both sequences, which, though possible, violate the
assumption of colinearity to some extent.
The plot of BLAST hits
PickAl now generates the parsed BLAST output file as it reads in the file of hits;
watch the lower left corner of the window for updates as to progress. When it is
done, it will show you a plot of BLAST hits. The horizontal axis shows the sequence
position of the first sequence, and the vertical axis shows the sequence position
of the second sequence. BLAST hits are indicated by colored dots on the plot, and the
E-value of a hit is indicated by its color: red hits are of high quality,
green ones of moderate quality, and blue ones of questionable quality.
Choosing colinear segments
Colinear regions will be long diagonals that consist primarily of high-quality (red, orange, and
to a lesser extent, green) BLAST hits. Note that they can run like so /, or like so \.
The former represents colinear sequences that should be aligned 5' to 3' in one sequence
and 5' to 3' in the other; the latter represents colinear sequences that should be aligned
5' to 3' in one sequence and 3' to 5' in the other. Note that sometimes whole
pairs of sequences are colinear and other times only portions are. Sometimes
colinear regions are interrupted by long gaps in one sequence (or less often
both). Occasionally, one will see inversions, segments where one part of the sequence
has been flipped around, and these will look like a series of mixed diagonals like this.
/ \
\ or /
/ \
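The two diagonal directions can be told apart from any two hits on the same diagonal. The following is a minimal sketch, not PickAl's own code; it assumes each hit is reduced to a hypothetical (x, y) pair of positions in the first and second sequence, and that the vertical axis increases upward.

```python
def diagonal_direction(hit_a, hit_b):
    """Classify the diagonal through two hits given as (x, y) positions.

    "forward" corresponds to a / diagonal (both positions increase together),
    "reverse" to a \\ diagonal (one increases while the other decreases).
    """
    (x1, y1), (x2, y2) = hit_a, hit_b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 or dy == 0:
        # Two hits at the same position in either sequence define no diagonal.
        return "undefined"
    return "forward" if (dx > 0) == (dy > 0) else "reverse"
```

An inversion would then show up as a run of "forward" pairs interrupted by a run of "reverse" pairs, or the other way around.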
To choose a region to be aligned by COMGAP, click one hit to be at one end of the
alignment, then click another hit to be at the other end. If these two hits
agree in direction with the direction of the alignment, you will be asked if they
are correct. If you answer no, it will ask you to choose the first hit again.
If you answer yes, PickAl will create a string of hits between
these two that COMGAP will use to span the alignment.
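How PickAl builds its string of hits is not described here. As an illustration only, one plausible way to do it is to keep the hits that lie strictly inside the rectangle between the two chosen hits and then take a longest chain that moves steadily from one to the other. The function name and the (x, y) representation of hits below are assumptions.

```python
def chain_between(hits, first, last):
    """Return a longest monotone chain of hits from one chosen hit to the other.

    Hits are (x, y) positions. This is an assumed stand-in for PickAl's
    procedure, using a simple quadratic-time longest-chain search.
    """
    (x1, y1), (x2, y2) = first, last
    if x1 > x2:
        (x1, y1), (x2, y2) = (x2, y2), (x1, y1)
    sign = 1 if y2 > y1 else -1  # -1 turns a \ diagonal into a rising one
    inside = sorted(
        (x, sign * y) for x, y in hits
        if x1 < x < x2 and sign * y1 < sign * y < sign * y2
    )
    # best[i] is the length of the longest chain ending at inside[i];
    # prev[i] is the index of the hit before it in that chain.
    best = [1] * len(inside)
    prev = [-1] * len(inside)
    for i, (xi, yi) in enumerate(inside):
        for j in range(i):
            xj, yj = inside[j]
            if xj < xi and yj < yi and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    chain = []
    i = max(range(len(inside)), key=best.__getitem__) if inside else -1
    while i != -1:
        chain.append(inside[i])
        i = prev[i]
    chain.reverse()
    return [(x1, y1)] + [(x, sign * y) for x, y in chain] + [(x2, y2)]
```

Hits outside the rectangle, or hits that would force the chain to step backward in either sequence, are left out of the result.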
If COMGAP will not be able to span the alignment, PickAl will ask you if you would
like it to try harder. If you answer yes, a series of black windows will open
and then disappear as it runs subsequences of your sequences through the
BLAST 2 sequences program again. If COMGAP still cannot span the gaps, it will ask
you if you would like PickAl to try even harder. It will continue to do so until
(1) all gaps are spanned, (2) it is futile to try any longer, or (3) you tell it
to stop trying. At this point, PickAl will write the file of anchors, and ask you if
you would like to quit or select another segment to be aligned.
MERGE is conceptually a much simpler program than PickAl. Let's say that you'd like to
align 3 or more (n) sequences together. (If you're only interested in pair-wise
alignment, you do not need MERGE.) Choose one of those sequences to be your guide,
and run BLAST 2 sequences n-1 times with that sequence as the -j sequence and each of the
other sequences as the -i sequence. Now, run each of the resulting files of hits through
PickAl and choose the regions to be aligned. MERGE takes the resulting files of anchors
and MERGES them together so that COMGAP can now make a global genomic alignment.
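The n-1 BLAST 2 sequences runs can be written out mechanically. The sketch below only builds the argument lists; it uses the bl2seq flags shown in the BLAST example (-i, -j, -o, -p blastn), while the function name, the file names, and the rule for naming the output files are hypothetical.

```python
def bl2seq_commands(guide, others):
    """Build one bl2seq argument list per non-guide sequence.

    The guide is always passed as the -j sequence and each other
    sequence as the -i sequence.
    """
    commands = []
    for seq in others:
        # Assumed naming rule: replace the extension with .blt
        out = seq.rsplit(".", 1)[0] + ".blt"
        commands.append(
            ["bl2seq", "-i", seq, "-j", guide, "-o", out, "-p", "blastn"]
        )
    return commands
```

Each list could be passed to subprocess.run, or joined with spaces and typed at the command prompt. Each resulting file of hits then goes through PickAl before MERGE is run.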
Like BLAST, MERGE works exclusively from a command or DOS prompt. To run MERGE, first place
the executable and the anchors files you wish to merge in the same folder. Using the "cd" command
in the command prompt, navigate your way to this folder. (If this confuses you, see
the previous discussion of the cd command.) At the prompt, type:
merge firstfile.anc+secondfile.anc newfile.anc
Here firstfile.anc and secondfile.anc represent the names of the old files to merge, and
newfile.anc represents the name of the new merged file of anchors. You may merge
together more than one file at a time, and there is practically no limit to the number you may merge.
COMGAP is also a much simpler program than PickAl. By concatenating together a series of
Needleman-Wunsch generated alignments, COMGAP creates alignments of genomic segments.
You can run the program by double-clicking
on it, or by typing COMGAP in the proper directory of a command prompt. If you do the former, the
program will ask you for one thing, the name of the file of anchors, which should contain the names
of the files containing the sequences. Aligning things can, at times, be quite time consuming
so please be patient while it runs.
If you run the program from the command prompt, you may type the name of the anchors file immediately
after the program call (see "COMGAP, example 1" in the Examples section).
The command-prompt can also allow you to change some default alignment parameters. COMGAP uses a
transition-transversion matrix to calculate its alignments that gives a score of 6 for identities,
2 for transitions, and 0 for transversions. You can change the gap parameters, which are a penalty
of 20 for the first gap, 1 for each subsequent gap, and a maximum penalized length of 30 gaps.
These are expressed as the second, third, and fourth arguments, respectively, of the program call
(see "COMGAP, example 2" in the Examples section).
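The default scoring can be written down as two small functions. This is a sketch under stated assumptions rather than COMGAP's code: the text gives the scores and penalties, but how the three gap parameters combine is assumed here to be the first-gap penalty plus the subsequent-gap penalty for each further position, with positions beyond the maximum penalized length costing nothing more.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_score(a, b):
    """Score a pair of nucleotides: 6 identity, 2 transition, 0 transversion."""
    a, b = a.upper(), b.upper()
    if a == b:
        return 6
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return 2  # A<->G or C<->T
    return 0

def gap_penalty(length, first=20, subsequent=1, max_len=30):
    """Penalty for one run of `length` gaps (assumed combination rule)."""
    if length <= 0:
        return 0
    counted = min(length, max_len)  # gaps past max_len are not penalized further
    return first + subsequent * (counted - 1)
```

Under this reading, a run of 5 gaps costs 24 with the defaults, and any run of 30 or more gaps costs 49.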
First, we'll talk about the alignment quality and then the format of the file.
The quality is good. We did an assessment of 17 colinear pairs of sequences with 467 exons.
95% of the exons aligned perfectly on both ends, and 99% matched on one end. This compares
to BLASTN aligning 58% perfectly, 85% matching at least one end, and 94% with overlap;
consider that was with no BLASTN cutoff, which yields 8080 hits for 467 exons. That's a lot
of false positives! Sure, some exons will have more than one BLASTN hit, and there are more
conserved features than just exons, but still that's well over 17 hits for every exon.
Besides, BLASTN hits with E-values of 1.0 are very unreliable.
So let's lower the E-value cutoff to 10E-10, which yields 810 hits. Now all those percentages
drop to 52%, 76%, and 81% respectively. No matter how you slice it, you gain by the
extra steps of PickAl and COMGAP.
As for the format of an n-sequence alignment, it comes in pieces of n+1 rows.
The very first row has the number of sequences aligned, the date the alignment was run, and
n numbers, representing the starting positions of each sequence in their respective order.
The next n rows each hold 50 characters of the alignment, one row per sequence.
The next line has n numbers, representing the ending positions of each sequence
in the previous block. Then there are n more lines of alignment. This repeats until
the very end.
3 12/04/02 98416 18896 44024
JAR12.M1 TGGTGAAATG CTTAGCCCTA AGTTGGGCTT C------ACA CAGCAGTACA
JAR12.M2 TGATGAAATG CTCAGCCCCA AGTTGGGCTT C------ACA CAGCCGTACA
JAR12.H TGAGGCAAAG CTCAGACTTC ACCTCTACCC CTAAACAAGG CACCCAAACA
98459 18939 44073
JAR12.M1 TACACCACAC TGAACAAAGG ACA--CAGAA AGAAGTACAG GCACAAGTAT
JAR12.M2 TACACCACAC TGAACAAAGG ACA--CAGAA AGAAGTGCAG GCACGAGTAT
JAR12.H CACGTCACAG TAAATAAAGG ACATCCAGAA AATATCACAG GCAGAGGTAC
98507 18987 44123
JAR12.M1 TTTATTTGGC AATTTCAGCC TGACGTGAAG GGCAGAGTTT TCTACTC-CC
JAR12.M2 TTTATTTGGC AATTTCAGCC TGACGTGAAG GGCAGAGTTT TCTACTC-CC
JAR12.H TTTATTTGGC AATTTTAACA TGACACGTAG AGAAAAGAAC CCTGCCCTCC
98556 19036 44173
JAR12.M1 TGCTCCAGTG TCTCCAGCAA CCCCACCTTC TCATCCC--- ----------
JAR12.M2 TGCTCCAGTG TCTCTAGCAA TCCCACCTTC TCATCCC--- ----------
JAR12.H TTCACCAGCC TCCCCAGAAA TCCCACCTTC CTATTTCAAG ACAGAGTAAT
98593 19073 44223
JAR12.M1 ---------- ---------- ---------- -------CTT GGCACGGCTC
JAR12.M2 ---------- ---------- ---------- -------CTT GGCACGGCTC
JAR12.H AACAGCACCA TTTTACACGA AAGGGAACAG CCACAGCCTT GGCACCATTT
98606 19086 44273
JAR12.M1 CTGACTCCAC ACGCACAGAA --GCAAGAGC TGCAATGCCC ACAGCCCAGC
JAR12.M2 CTGACTCCAC ACGCACAGAA --GCAAGAGC TGCAATGCCC ACAGCCCGGC
JAR12.H CTGGTTCCAC TTTCCATGGA AGGGCAGAGA AGCATTGCTC AAACCCCACC
.
.
.
.
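The layout above can be read with a short parser. This is a minimal sketch based only on the description and sample shown here; the function name is hypothetical, and it assumes that sequence names contain no spaces and that blank lines can be ignored.

```python
def parse_alignment(text):
    """Parse COMGAP-style output into (n, date, starts, blocks).

    Each block is (rows, ends): rows is a list of (name, aligned_string)
    with the spaces between 10-character groups removed, and ends is the
    list of ending positions that follows the block (None if absent).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    n = int(header[0])
    date = header[1]
    starts = [int(x) for x in header[2:2 + n]]
    blocks = []
    i = 1
    while i + n <= len(lines):
        rows = []
        for ln in lines[i:i + n]:
            name, *chunks = ln.split()
            rows.append((name, "".join(chunks)))
        i += n
        ends = None
        if i < len(lines):
            ends = [int(x) for x in lines[i].split()]
            i += 1
        blocks.append((rows, ends))
    return n, date, starts, blocks
```

In the first block of the sample, each ending position equals the starting position plus the number of non-gap characters in that row, minus one; for example, 98416 + 44 - 1 = 98459 for JAR12.M1.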
The only thing I believe requires more explanation is what happens if the subsequences
between anchors cannot be aligned in the available memory. In that case, COMGAP includes the
last sequence aligned to O's in the other sequences in the alignment;
the O's serve as a placeholder that stands for any nucleotide or a gap symbol.
Any comments about this software or this page should be sent to
g e n o i n f o @ g e n o m i c s . u c l a . e d u
Examples section
BLAST
If you would like to work the example, just substitute the words HumanEx.seq for first_seq and
MouseEx.seq for second_seq like so at the Command-prompt; in other words, type
what is below as one line.
C:\Blast_PickAl>bl2seq -i HumanEx.seq -j MouseEx.seq
-o Example.blt -p blastn
and press Enter.
PickAl
Double-click the PickAl icon.
Click Browse beside the "file of BLAST hits."
Choose the file named Example.blt.
Many default names will appear in the other boxes. The ones for the first and second sequence should in this
case be right, but it is always worth checking them. If they are not right, you may either change
them in the text box, or click Browse beside them and find the file.
You may choose at times to change the name of the two output files, but for now, do not.
For this example, leave the lowest four fields alone. Click Ok.
For the rest of the example of PickAl, just play. Read the text in the PickAl section and
then follow the directions that appear at the bottom of the plot. Have fun.
COMGAP, example 1
C:\>COMGAP Example.anc
COMGAP, example 2
If you wished to change the gap parameters to, say, 15 for the first gap, 2 for each subsequent gap, and a maximum
penalized length of 100 gaps, then you would enter at the command prompt
C:\>COMGAP Example.anc 15 2 100
During the course of evolution, the mutation
process causes DNA sequences to change at random. This by and large causes sequences in
diverging species to gradually become more dissimilar. However, some DNA segments become
dissimilar more slowly than other DNA segments. This is because these segments code for
genes and regulatory elements, which when changed, often work less well than before;
these changed defective versions are usually "weeded out" by natural selection, leaving the
original sequence. Now consider the sequences which don't code for genes or regulatory elements,
which are often called "junk DNA"; these sequences have little purpose, perhaps only to hold space,
and can freely change. By comparing the amount of change shown by different regions
of related sequences, we have a simple measure for the detection of genes and regulatory
elements.
This approach, called phylogenetic footprinting, depends critically on how the sequences
are aligned; we really want to compare segments that are actually related to one
another. If we don't, it is like trying to compare the differences between siblings from different
families, but mixing up who belongs with whom. Any differences we see between Karin
Jorgeson and Pei-Shan Wang don't have anything to do with their siblinghood because they're
not siblings!
Therefore, aligning sequences properly is critical for such an analysis. Keep in mind this isn't
trivial, because the regions in which we're interested are around 100 nucleotides long,
and genomes like mouse and human contain about 3 billion nucleotides. (Nucleotides are the
building blocks of DNA.) That's like trying to match up 30,000,000 pairs of siblings! Imagine everyone
in California having a sibling in New York or Pennsylvania and trying to pair them up!
Now it is a bit simpler than this. DNA in fairly closely related species, like mouse and human,
comes in segments called syntenic blocks. These are regions in both species, 10,000 to 10 million
nucleotides long, that are related to one another. So now it's like breaking these U.S. regions down
into towns and small cities, and saying, "Ok, everyone in Albany, New York has a sibling in
Pasadena, California. Find them." And we need to do this about 1,000 times!
Now it also gets harder. Some segments clearly don't show any relationship to each other,
so it is a bit like having only half of the people in each city being related, and the others are
just there to distract you from your problem. This makes the number of pairs you're looking for
fewer, but now all the others get in the way.
Now it gets a bit easier still. These syntenic blocks have segments which are colinear, which means
that the order of segments in one sequence is the same as in the other. It is a bit like having siblings living
on the same street in different cities, and not just that, but living in the same order as you go down that
street! Now keep in mind that as you pass Reza Tehrani's place and then Maria Rodriguez's next door and then
Cliff Tailfeather's next door to her in Albany, you should find their siblings in Pasadena in the same order on a street
in Pasadena, but they may not be next door to one another.
Last is one way it gets harder. We said above that these regions are about a hundred
nucleotides long, but they vary in length from several bases to thousands. We have no problem
discerning where Jack Sprat stops and Dela Botchwey begins, but this is not the case with DNA.
We also want to know the borders of these segments.
With all that background, given syntenic regions (Albany and Pasadena in our examples), PickAl
and COMGAP allow you to find the colinear segments (streets) and align the DNA sequences (pairing
Jamshid and Reza Tehrani, and Jose and Maria Rodriguez). What they won't do is tell you exactly where
each region we're interested in (Jamshid and Jose) starts and ends, and they definitely won't tell you their
professions and skills and habits, which would also be helpful.