PickAl and COMGAP Manual

© 2003

Programs for the Identification & Alignment of
Colinear Genomic Sequences for Comparative Analysis

Table of Contents

What are these?      (for non-specialists)
For what purposes might I use these?
What are the hardware and operating system requirements?
How do I obtain these?
How do I install these?
How do I use these?
     Specifically, BLAST 2 sequences?
          Setting it up
          BLAST
          MEGABLAST, an alternative
     Specifically, PickAl?
     Specifically, MERGE?
     Specifically, COMGAP?
What sort of results should I expect?
How do I comment?

What are these?

PickAl and COMGAP are alignment programs to produce (1) a graphical display that allows the quick identification and selection of large colinear segments of genomic sequence, and (2) accurate large-scale multiple global alignments of such segments for further analysis. Large genomic rearrangements can be readily identified from the graphical display, and the alignment program easily takes these into account.
Want to see the non-specialist explanation and motivation?

For what purposes might I use these?

Two things. Let's say that you have large colinear pieces of DNA and you want to produce an alignment with which to produce a more detailed comparison. That's one thing you might do. Let's also say that you want to see where all the conserved sequences between two large DNA segments are. PickAl gives you a graphical display that shows these.

What are the hardware and operating system requirements?

These programs are written for Windows operating systems and, to our knowledge, will work under Windows 95 and any later version. They have not been run on any computer with less than 64 megabytes of RAM, but we are aware of no reason why they should not work with less.

How do I obtain these?

If you're not at http://www.genomics.ucla.edu/pickal/, go there now. Then, right click HERE, click "Save As," and save the file. This is a WinZipped compressed file.

How do I install these?

Uncompress the WinZipped compressed file with the WinZip application.
When this is done, you should find that you now have generated 9 new files.
  • blastz.exe, which is a self-extracting archive for stand-alone blast,
  • install.bat, which will install all the other programs,
  • Initialize.exe, an executable that will be called during installation and you should never need to call again,
  • PickAl.exe, the executable used to identify and choose for alignments colinear sequences,
  • Merge.exe, the executable used only if you wish to create a multiple (3 or more sequence) alignment,
  • COMGAP.exe, the executable that will finally create the long genomic alignments,
  • PickAl_COMGAP_Manual.html, a file of this web page,
  • and HumanEx.seq & MouseEx.seq, sequence files for running the example.

    Double click on blastz.exe and it will create a new folder with all its files in it.

    Double click on install.bat, which will rename the blast folder "Blast_PickAl", place all the PickAl & COMGAP associated files in it, and open Initialize.exe. Fill out the requested information about your computer and press Ok; this information is needed to maximize the performance of PickAl and COMGAP and will not leave your computer.

    How do I use these?

    Files of sequence can be analyzed to find colinear sequences and then, if desired, to align them. To find colinear sequence pairs, the user first needs to blast the sequences against each other in pairs with the BLAST 2 Sequences application. The output file from BLAST is then entered into PickAl, which shows a graphical display from which the user can pick out the colinear regions.

    Once a region is identified, PickAl can be used to choose anchors for the alignment of the colinear segments. This long genomic alignment is produced with the program COMGAP. Sets of anchors from pairs of sequences with one sequence in common can be used to create a multiple alignment with the programs MERGE and COMGAP.

    Specifically, BLAST 2 sequences?

    Setting it up  All the applications that come with the self-extracting executable, blastz.exe, are from the National Center for Biotechnology Information (NCBI) at the National Library of Medicine, which is part of the National Institutes of Health. Do not write us about the applications in here, because we did not write them. However, we will tell you how to use them.

    Ordinarily, you will use the BLAST 2 sequences application, bl2seq.exe. To run this program, you will need to open a command prompt, which on some machines is called the MS-DOS prompt. If you do not know how to open one, first click on the Start Menu, and then click "Search", and then "for Files or Folders". In the top box, which is labeled "Search for files or folders named", type, "prompt", and click "Search Now". After it searches your hard drive, it should show one or more icons labeled "command prompt" or "MS-DOS prompt". Right click on one of these and choose "Create Shortcut". Windows will tell you that it cannot put a shortcut there, and would you like to put one on the desktop; choose yes. Now close the search window. Open the Blast_PickAl folder, then drag the newly created shortcut on your desktop into that folder. Now that that icon is present in the working folder, you should never need to do the previous steps again.

    BLAST  Double-click on the new shortcut (MS-DOS prompt or command prompt); this will bring up a text-based window, probably with white lettering on a black background. There will be a flashing cursor right after a c:\> prompt. Type "cd Blast_PickAl" and press Enter. This will change the active directory to the Blast_PickAl directory. (Incidentally, typing "cd .." and pressing Enter will take you back up one level. This allows you to navigate around in an MS-DOS environment.)

    Now, to BLAST two sequences, type
    bl2seq -i first_seq -j second_seq -o fileofhits -p blastn
    and press enter. This could go very quickly, or could take many hours, depending on the exact sizes of your sequences and the speed of your computer. My advice is to let it run for a while to get some idea of how it will run on your machine.
    example
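The command above can also be assembled and launched from a script. What follows is a minimal Python sketch and not part of the PickAl distribution: the function names are hypothetical, only the -i, -j, -o and -p options shown above are used, and it assumes that bl2seq.exe can be found from the current working directory (for instance, Blast_PickAl).

```python
import subprocess


def build_bl2seq_command(first_seq, second_seq, hits_file, program="blastn"):
    """Return the bl2seq argument list described in this manual.

    Hypothetical helper: it only arranges the -i, -j, -o and -p
    options in the order shown in the manual.
    """
    return ["bl2seq", "-i", first_seq, "-j", second_seq,
            "-o", hits_file, "-p", program]


def run_bl2seq(first_seq, second_seq, hits_file):
    """Run bl2seq and wait for it to finish (this may take hours)."""
    cmd = build_bl2seq_command(first_seq, second_seq, hits_file)
    # check=True raises CalledProcessError if bl2seq reports an error.
    subprocess.run(cmd, check=True)
```

Calling run_bl2seq("HumanEx.seq", "MouseEx.seq", "Example.blt") would issue the same command as the worked example in the Examples section.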

    MEGABLAST, an alternative  If it takes too long, MEGABLAST is an alternative you might choose. The advantage of MEGABLAST is that it is much faster than BLASTN/ BLAST 2 sequences, but the disadvantages are
    (1) it is not as sensitive at picking out regions of homology (It actually is solving a different problem from BLASTN.), and
    (2) it takes two steps to run rather than the one of BLAST 2 sequences.
    These two steps are to create the database and to actually run MEGABLAST. First, we will explain how the results between MEGABLAST and BLASTN differ, then we'll explain the two steps.

    MEGABLAST uses a very fast search algorithm devised by Webb Miller. Instead of looking for regions of similarity, this algorithm looks for regions of identity. Integral to this is the concept of a word, which is a short DNA sequence. MEGABLAST will find all words n-long that are identical between two sequences, and might find identical words as short as n-3 long. n is a multiple of 4 (such as 16, 20, or 24) that the user enters into MEGABLAST.
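To make the idea of a word concrete, here is a small Python sketch that lists every word of length n that occurs in both of two sequences. It only illustrates exact word matching with a lookup table and is not MEGABLAST's actual algorithm; the function name shared_words is hypothetical, and the sketch ignores the reverse strand and the shorter (n-3) matches mentioned above.

```python
from collections import defaultdict


def shared_words(seq_a, seq_b, n):
    """Return {word: ([positions in seq_a], [positions in seq_b])}.

    Positions are 0-based start offsets. Only exact, same-strand
    matches of length n are reported.
    """
    # Index every word of length n in the first sequence.
    index_a = defaultdict(list)
    for i in range(len(seq_a) - n + 1):
        index_a[seq_a[i:i + n]].append(i)

    # Look up every word of the second sequence in that index.
    matches = {}
    for j in range(len(seq_b) - n + 1):
        word = seq_b[j:j + n]
        if word in index_a:
            matches.setdefault(word, (index_a[word], []))[1].append(j)
    return matches
```

With the toy inputs shared_words("ACGTACGTTT", "GGACGTAC", 4), the words ACGT, CGTA and GTAC are reported. Real runs would use the larger word lengths named above, such as 16, 20, or 24.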

    To run MEGABLAST, first open the command prompt and change to the Blast_PickAl directory as above. To create the database, type "formatdb -i second_seq -p F -o F" and press enter. This step might take a while.

    Now, type "megablast -d second_seq -i first_seq -W n -o fileofhits -D 2" and press enter. Here n represents the word length, and remember that it needs to be a multiple of 4.

    If you might use one of the two sequences in more than one search, you should choose this sequence to be the database, so that you will not need to create its database again.
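The two MEGABLAST steps can be sketched in Python as a pair of argument lists. This is only an illustration: the function name is hypothetical, the options are exactly the ones quoted above, and the check on the word length simply enforces the multiple-of-4 rule stated in this section.

```python
def build_megablast_commands(first_seq, second_seq, hits_file, word_length):
    """Return (formatdb_command, megablast_command) as argument lists.

    Hypothetical helper built only from the options shown in this manual.
    second_seq is the sequence turned into the database.
    """
    if word_length <= 0 or word_length % 4 != 0:
        raise ValueError("word length must be a positive multiple of 4")

    # Step 1: create the database from the second sequence.
    formatdb = ["formatdb", "-i", second_seq, "-p", "F", "-o", "F"]

    # Step 2: search the first sequence against that database.
    megablast = ["megablast", "-d", second_seq, "-i", first_seq,
                 "-W", str(word_length), "-o", hits_file, "-D", "2"]
    return formatdb, megablast
```

Each list can be passed to subprocess.run in turn; the formatdb step only needs to be repeated when the database sequence changes.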

    Specifically, PickAl?

    To start PickAl, double-click on the PickAl icon.

    PickAl has two screens, the initial user dialog, and the plot of blast hits. The first five boxes are for entering file names. The first one is for the BLAST or MEGABLAST file of hits (what you typed after the -o). You can either type the path name to that file or click on "Choose" and browse around to find it. The next box is for the file of anchors, which is created by PickAl and is needed if you're going to use COMGAP to create a genomic alignment. The third box is for the BLAST output Parsed file, which contains useful information about the BLAST hits in an easily accessed table. The fourth and fifth boxes are for the sequence files you used for the BLAST or MEGABLAST runs. Note that as soon as you choose one of these files, PickAl automatically chooses default file names for the other four files; these are just the path and root name of the first file you chose followed by the file extensions you used the last time you ran PickAl.
    example

    Additional Parameters The lower half of the dialog has four parameters that can be changed; if you do not wish to deal with these, you do not need to. The defaults work relatively well. The four are
  • How many bases to extend alignment beyond last hit
  • How many repeats will be allowed
  • Exponent of the E-value above which not to display
  • Restrict search region
    Each of these is explained below:
  • How many bases to extend alignment beyond last hit
    When n, the number in this field, is greater than 0, the alignment is extended n bases in each sequence beyond the terminal hits of the alignment. If the ends of a sequence are reached, obviously it will not extend the alignment beyond these, but it will try to align as much as possible. Though the first default will be the maximum length allowed, we would suggest not going higher than about 1000 without a reason particular to these sequences.
  • How many repeats will be allowed
    Complex repeats and hits from noise can cloud the display, and make the identification of colinear segments difficult. For this reason, this option was added to PickAl. When considering whether or not to display a particular hit, PickAl counts the number of hits which have lower E-values and whose local alignments share at least one nucleotide in common with the local alignment under consideration. If this number is greater than the number of repeats to be allowed, the hit is not shown. When set to zero, this feature greatly cleans up the noise, but could cause the user to miss genomic duplications. When set to a large number, like 1000, practically every hit will be shown, so the user might have some difficulty seeing everything through the haze of hits, but everything to be seen will be there. The best option is to first choose a small number like 1 or 2, and if no duplications are seen, the user knows there are none. If many are seen, then it might be safer to close the program and up the number to 4 or 5 and see if more duplications appear.
  • Exponent of the E-value above which not to display
    If the E-value of a hit is greater than (worse than) 10 to the power in this box, PickAl will not show it.
  • Restrict search region
    When clicked off, PickAl will look anywhere in the rectangle between two anchors for additional anchors, but when on, it will not look close to the non-anchor corners and some edges. This is to avoid having long gaps in both sequences, which though possible, violates the assumption of colinearity to some extent.
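The repeat and E-value rules described above can be sketched in Python as follows. This is an illustration of the rules as stated, not PickAl's source code: the Hit class and function names are hypothetical, and the sketch assumes that "share at least one nucleotide" means that the two local alignments overlap in either of the two sequences.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    evalue: float
    a_start: int  # inclusive coordinates in the first sequence
    a_end: int
    b_start: int  # inclusive coordinates in the second sequence
    b_end: int


def _overlap(s1, e1, s2, e2):
    """True if two inclusive intervals share at least one position."""
    lo1, hi1 = min(s1, e1), max(s1, e1)
    lo2, hi2 = min(s2, e2), max(s2, e2)
    return lo1 <= hi2 and lo2 <= hi1


def hits_to_display(hits, max_repeats, evalue_exponent):
    """Apply the E-value cutoff and the repeat limit to a list of hits."""
    cutoff = 10.0 ** evalue_exponent
    shown = []
    for hit in hits:
        if hit.evalue > cutoff:
            continue  # worse than 10 to the power in the box
        # Count better hits that overlap this one (assumed: in either sequence).
        better_overlapping = sum(
            1 for other in hits
            if other is not hit
            and other.evalue < hit.evalue
            and (_overlap(hit.a_start, hit.a_end, other.a_start, other.a_end)
                 or _overlap(hit.b_start, hit.b_end, other.b_start, other.b_end))
        )
        if better_overlapping <= max_repeats:
            shown.append(hit)
    return shown
```

With max_repeats set to 0, a hit is dropped as soon as one better hit overlaps it, which matches the clean but duplication-blind display described above; raising the value lets progressively more overlapping hits through.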


    The plot of BLAST hits PickAl now generates the BLAST output parsed file as it reads in the file of hits; watch the lower left corner of the window for updates as to progress. When it is done, it will show you a plot of BLAST hits. The horizontal axis shows the sequence position of the first sequence, and the vertical axis shows the sequence position of the second sequence. BLAST hits are indicated by colored dots on the plot, and the E-value of a hit is indicated by its color, with red ones being of high quality, and green ones of moderate quality, and blue ones of questionable quality.

    Choosing colinear segments Colinear regions will be long diagonals that consist primarily of high-quality (red, orange, and to a lesser extent, green) BLAST hits. Note that they can run like so /, or like so \. The former represents colinear sequences that should be aligned 5' to 3' in one sequence and 5' to 3' in the other; the latter represents colinear sequences that should be aligned 5' to 3' in one sequence and 3' to 5' in the other. Note that sometimes whole pairs of sequences are colinear and other times only portions are. Sometimes colinear regions are interrupted by long gaps in one sequence (or less often both). Occasionally, one will see inversions, segments where one part of the sequence has been flipped around, and these will look like a series of mixed diagonals like this.
           /      \
          \   or   /
         /          \
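The two diagonal directions can be told apart from the coordinates of a hit. The short Python sketch below is an assumption-laden illustration, not PickAl's code: the function name is hypothetical, and it assumes that an opposite-strand hit is reported with start and end running in opposite directions in the two sequences.

```python
def hit_orientation(q_start, q_end, s_start, s_end):
    """Return "/" for a same-direction hit and a backslash otherwise.

    q_* are coordinates in the first sequence (horizontal axis) and
    s_* are coordinates in the second sequence (vertical axis).
    """
    q_dir = q_end - q_start
    s_dir = s_end - s_start
    if q_dir == 0 or s_dir == 0:
        raise ValueError("a hit must span more than one position")
    # Same sign: both sequences read 5' to 3'. Opposite sign: one is reversed.
    return "/" if (q_dir > 0) == (s_dir > 0) else "\\"
```

An inversion would then show up as a run of hits whose orientation differs from that of the hits on either side of it.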
    
    To choose a region to be aligned by COMGAP, click one hit to be at one end of the alignment, then click another hit to be at the other end. If these two hits agree in direction with the direction of the alignment, you will be asked if they are correct. If you answer no, it will ask you to choose the first hit again. If you answer yes, PickAl will create a string of hits between these two that COMGAP will use to span the alignment.

    If COMGAP will not be able to span the alignment, PickAl will ask you if you would like it to try harder. If you answer yes, a series of black windows will open and then disappear as it runs subsequences of your sequences through the BLAST 2 sequences program again. If COMGAP still cannot span the gaps, it will ask you if you would like PickAl to try even harder. It will continue to do so until (1) all gaps are spanned, (2) it is futile to try any longer, or (3) you tell it to stop trying. At this point, PickAl will write the file of anchors, and ask you if you would like to quit or select another segment to be aligned.

    Specifically, MERGE?

    Merge is conceptually a much simpler program than PickAl. Let's say that you'd like to align 3 or more (n) sequences together. (If you're only interested in pair-wise alignment, you do not need MERGE.) Choose one of those sequences to be your guide, and run BLAST 2 sequences (n-1)-times with that sequence as the -j sequence and each of the other sequences as the -i sequence. Now, run each of the resulting files of hits through PickAl and choose the regions to be aligned. MERGE takes the resulting files of anchors and MERGES them together so that COMGAP can now make a global genomic alignment.

    Merge works exclusively from a command- or DOS-Prompt like BLAST. To run MERGE, first place the executable and the anchors files you wish to merge in the same folder. Using the "cd" command in the command-prompt, navigate your way to this folder. (If this confuses you, see the previous discussion of the cd command in the BLAST section.) At the prompt, type:
    merge firstfile.anc+secondfile.anc newfile.anc
    Here firstfile.anc and secondfile.anc represent the names of the old files to merge, and newfile.anc represents the name of the new merged file of anchors. You may merge together more than two files at a time, and there is practically no limit to the number you may merge.
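The multiple-alignment workflow can be summarized with a short Python sketch that prepares the command lines involved. Everything here is illustrative: the function names and the ".blt" naming of the hit files are hypothetical, and joining more than two anchor files with "+" is an extrapolation from the two-file form shown above rather than documented syntax.

```python
def bl2seq_commands_for_guide(guide_seq, other_seqs):
    """One bl2seq call per non-guide sequence, with the guide as -j."""
    commands = []
    for seq in other_seqs:
        # Hypothetical naming scheme for the file of hits.
        hits_file = seq.rsplit(".", 1)[0] + ".blt"
        commands.append(["bl2seq", "-i", seq, "-j", guide_seq,
                         "-o", hits_file, "-p", "blastn"])
    return commands


def merge_command(anchor_files, merged_file):
    """Build the MERGE call for the anchor files written by PickAl."""
    if len(anchor_files) < 2:
        raise ValueError("MERGE needs at least two files of anchors")
    # Assumption: additional files are joined with "+" like the first two.
    return ["merge", "+".join(anchor_files), merged_file]
```

Between the two steps, each file of hits still has to be opened in PickAl by hand so that the regions to be aligned can be chosen and the anchors files written.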

    Specifically, COMGAP?

    COMGAP is also a much simpler program than PickAl. By concatenating together a series of Needleman-Wunsch generated alignments, COMGAP creates alignments of genomic segments. You can run the program by double-clicking on it, or by typing COMGAP in the proper directory of a Command-Prompt. If you do the former, the program will ask you for one thing, the name of the file of anchors, which should contain the names of the files containing the sequences. Aligning things can, at times, be quite time consuming, so please be patient while it runs. If you run the program from the command-prompt, you may type the name of the anchors file immediately after the program call. Example

    The command-prompt can also allow you to change some default alignment parameters. COMGAP uses a transition-transversion matrix to calculate its alignments that gives a score of 6 for identities, 2 for transitions, and 0 for transversions. You can change the gap parameters, which are a penalty of 20 for the first gap, 1 for each subsequent gap, and a maximum penalized length of 30 gaps. These are expressed as the second, third, and fourth arguments, respectively, of the program call. Example
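The default scoring described above can be written down as two small Python functions. This is a sketch of one reading of the text, not COMGAP's code: the function names are hypothetical, only the bases A, C, G and T are handled, and the gap penalty assumes that "maximum penalized length" means positions beyond that length add nothing further to the penalty.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y):
    """Score a pair of bases: 6 identity, 2 transition, 0 transversion."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 6
    # A transition stays within purines or within pyrimidines.
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return 2
    return 0


def gap_penalty(length, first=20, subsequent=1, max_penalized=30):
    """Penalty for a run of `length` gap positions (assumed interpretation)."""
    if length <= 0:
        return 0
    penalized = min(length, max_penalized)
    return first + subsequent * (penalized - 1)
```

Under this reading, a gap of 6 positions costs 25 with the defaults, and any gap of 30 or more positions costs 49. Passing 15, 2 and 100 as the three gap arguments mirrors the values used in the second COMGAP example.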

    What sort of results should I expect?

    First, we'll talk about the alignment quality and then the format of the file. The quality is good. We did an assessment of 17 colinear pairs of sequences with 467 exons. 95% of the exons aligned perfectly on both ends, and 99% matched on one end. This compares to BLASTN aligning 58% perfectly, 85% matching at least one end, and 94% with overlap; consider that was with no BLASTN cutoff, which yields 8080 hits for 467 exons. That's a lot of false positives! Sure, some exons will have more than one BLASTN hit, and there are more conserved features than just exons, but still that's well over 17 hits for every exon. Besides, BLASTN hits with E-values of 1.0 are very unreliable. So let's lower the E-value cutoff to 10E-10, which yields 810 hits. Now all those percentages drop to 52%, 76%, and 81% respectively. No matter how you slice it, you gain by the extra steps of PickAl and COMGAP.

    As for the format of an n-sequence alignment, it comes in pieces of n+1 rows. The very first row has the number of sequences aligned, the date the alignment was run, and n numbers, representing the starting positions of each sequence in their respective order. The next n rows each hold 50 characters of the alignment, one row per sequence. The next line has n numbers, representing the ending positions of each sequence in the previous block. Then there are n more lines of alignment. This repeats until the very end.
          3    12/04/02      98416     18896     44024
    JAR12.M1  TGGTGAAATG CTTAGCCCTA AGTTGGGCTT C------ACA CAGCAGTACA
    JAR12.M2  TGATGAAATG CTCAGCCCCA AGTTGGGCTT C------ACA CAGCCGTACA
    JAR12.H   TGAGGCAAAG CTCAGACTTC ACCTCTACCC CTAAACAAGG CACCCAAACA
         98459     18939     44073
    JAR12.M1  TACACCACAC TGAACAAAGG ACA--CAGAA AGAAGTACAG GCACAAGTAT
    JAR12.M2  TACACCACAC TGAACAAAGG ACA--CAGAA AGAAGTGCAG GCACGAGTAT
    JAR12.H   CACGTCACAG TAAATAAAGG ACATCCAGAA AATATCACAG GCAGAGGTAC
         98507     18987     44123
    JAR12.M1  TTTATTTGGC AATTTCAGCC TGACGTGAAG GGCAGAGTTT TCTACTC-CC
    JAR12.M2  TTTATTTGGC AATTTCAGCC TGACGTGAAG GGCAGAGTTT TCTACTC-CC
    JAR12.H   TTTATTTGGC AATTTTAACA TGACACGTAG AGAAAAGAAC CCTGCCCTCC
         98556     19036     44173
    JAR12.M1  TGCTCCAGTG TCTCCAGCAA CCCCACCTTC TCATCCC--- ----------
    JAR12.M2  TGCTCCAGTG TCTCTAGCAA TCCCACCTTC TCATCCC--- ----------
    JAR12.H   TTCACCAGCC TCCCCAGAAA TCCCACCTTC CTATTTCAAG ACAGAGTAAT
         98593     19073     44223
    JAR12.M1  ---------- ---------- ---------- -------CTT GGCACGGCTC
    JAR12.M2  ---------- ---------- ---------- -------CTT GGCACGGCTC
    JAR12.H   AACAGCACCA TTTTACACGA AAGGGAACAG CCACAGCCTT GGCACCATTT
         98606     19086     44273
    JAR12.M1  CTGACTCCAC ACGCACAGAA --GCAAGAGC TGCAATGCCC ACAGCCCAGC
    JAR12.M2  CTGACTCCAC ACGCACAGAA --GCAAGAGC TGCAATGCCC ACAGCCCGGC
    JAR12.H   CTGGTTCCAC TTTCCATGGA AGGGCAGAGA AGCATTGCTC AAACCCCACC
    .
    .
    .
    .
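A file in this layout can be read back with a few lines of Python. The parser below is a minimal sketch based only on the format description and sample above; the function name is hypothetical, it assumes that sequence names contain no spaces, and it does not handle the rows of dots, which here only mark where the sample was cut short.

```python
def parse_comgap_alignment(text):
    """Parse a COMGAP-style alignment into names, positions and rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Header: sequence count, run date, then one start position per sequence.
    header = lines[0].split()
    n = int(header[0])
    date = header[1]
    starts = [int(x) for x in header[2:2 + n]]

    names, rows, block_ends = [], {}, []
    i = 1
    while i + n <= len(lines):
        # n alignment rows: a name followed by groups of aligned characters.
        for row in lines[i:i + n]:
            fields = row.split()
            name = fields[0]
            if name not in rows:
                names.append(name)
                rows[name] = ""
            rows[name] += "".join(fields[1:])
        i += n
        # One row of n ending positions closes the block, if present.
        if i < len(lines):
            block_ends.append([int(x) for x in lines[i].split()])
            i += 1
    return {"date": date, "names": names, "starts": starts,
            "rows": rows, "block_ends": block_ends}
```

In the first block of the sample, each ending position equals the starting position plus the number of non-gap characters in that row, minus one (for example, 98416 + 44 - 1 = 98459 for JAR12.M1), which gives a simple consistency check on a parsed file.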
    
    The only thing I believe requires more explanation is what happens if the subsequences between anchors cannot be aligned in the available memory. COMGAP includes the last sequence aligned to O's in the other sequences in the alignment; the O's serve as a space holder that stands for any nucleotide or a gap symbol.

    How do I comment?

    Any comments about this software or this page should be sent to
    g e n o i n f o @ g e n o m i c s . u c l a . e d u


    Examples section

    BLAST If you would like to work the example, just substitute the words HumanEx.seq for first_seq and MouseEx.seq for second_seq like so at the Command-prompt; in other words, type what is below as one line.
    C:\Blast_PickAl>bl2seq -i HumanEx.seq -j MouseEx.seq 
                           -o Example.blt -p blastn
    
    and press Enter.

    PickAl Double-click the PickAl icon.
    Click Browse beside the "file of BLAST hits."
    Choose the file named Example.blt.
    Many default names will appear in the other boxes. The ones for the first and second sequence should in this case be right, but one will often like to check this. If they are not right, you may either change them in the text box, or click Browse beside them and find the file.
    You may choose at times to change the name of the two output files, but for now, do not.
    For this example, leave the lowest four fields alone. Click Ok.
    For the rest of the example of PickAl, just play. Read the text in the PickAl section and then follow the directions that appear at the bottom of the plot. Have fun.

    COMGAP, example 1
         C:\>COMGAP Example.anc
    

    COMGAP, example 2 If you wished to change these to say 15 for the first gap, 2 for each subsequent gap, and a maximum penalized length of 100 gaps, then you would enter at the command prompt
         C:\>COMGAP Example.anc 15 2 100
    

    Non-specialist explanation and motivation
    behind PickAl and COMGAP

    During the course of evolution, the mutation process causes DNA sequences to change at random. This by and large causes sequences in diverging species to gradually become more dissimilar. However, some DNA segments become dissimilar more slowly than other DNA segments. This is because these segments code for genes and regulatory elements, which when changed, often work less well than before; these changed defective versions are usually "weeded out" by natural selection, leaving the original sequence. Now consider the sequences which don't code for genes or regulatory elements, which are often called "junk DNA"; these sequences have little purpose, perhaps only to hold space, and can freely change. By comparing the amount of change shown by different regions of related sequences, we have a simple measure for the detection of genes and regulatory elements.

    This approach, called phylogenetic footprinting, depends critically on how the sequences are aligned; we really want to be comparing segments that are related to each other to one another. If we don't, it is like trying to compare the differences between siblings from different families, but mixing up who belongs with whom. Any differences we see between Karin Jorgeson and Pei-Shan Wang don't have anything to do with their siblinghood because they're not siblings!

    Therefore, aligning sequences properly is critical for such an analysis. Keep in mind this isn't trivial, because the regions in which we're interested are around 100 nucleotides long, and genomes like mouse and human contain about 3 billion nucleotides. (Nucleotides are the building blocks of DNA.) That's like trying to match up 30,000,000 pairs of siblings! Imagine everyone in California having a sibling in New York or Pennsylvania and trying to pair them up!

    Now it is a bit simpler than this. DNA segments in somewhat close species, like mouse and human come in segments called syntenic blocks. These are regions in both species 10,000 to 10 million nucleotides long that are related to one another. So now it's like breaking these U. S. regions down into towns and small cities, and saying, "Ok, everyone in Albany, New York has a sibling in Pasadena, California. Find them." And we need to do this about 1,000 times!

    Now it also gets harder. Some segments clearly don't show any relationship to each other, so it is a bit like having only half of the people in each city being related, and the others are just there to distract you from your problem. This makes the number of pairs you're looking for fewer, but now all the others get in the way.

    Now it gets a bit easier still. These syntenic blocks have segments which are colinear, which means that the order of segments in one sequence is the same as in the other. It is a bit like having siblings living on the same street in different cities, and not just that, but living in the same order as you go down that street! Now keep in mind that as you pass Reza Tehrani's place and then Maria Rodriguez's next door and then Cliff Tailfeather's next door to her in Albany, you should find their siblings in Pasadena in the same order on a street in Pasadena, but they may not be next door to one another.

    Last is one way it gets harder. We said above that these regions are about a hundred nucleotides long, but they vary in length from several bases to thousands. We have no problem discerning where Jack Sprat stops and Dela Botchwey begins, but this is not the case with DNA. We also want to know the borders of these segments.

    With all that background, given syntenic regions (Albany and Pasadena in our examples), PickAl and COMGAP allow you to find the colinear segments (streets), and align the DNA sequences (pairing Jamshid & Reza Tehrani and Jose and Maria Rodriguez). What they won't do is tell you exactly where each region we're interested in (Jamshid and Jose) starts and ends, and they definitely won't tell you their professions and skills and habits, which would also be helpful.
