This page provides a detailed step-by-step tutorial for comparing partitioning schemes for DNA sequence alignments using PartitionFinder2. If you just want to try it out, this page isn't for you - there are much simpler quick-start instructions in the PartitionFinder2 manual. This tutorial might help if you are unsure about any particular aspect of your analysis, or you have never thought about partitioning schemes before. It assumes you have PartitionFinder2 installed and running, you can find out how to do this in the PartitionFinder2 manual (in the 'docs' folder when you download PartitionFinder). For the purposes of this tutorial, I'll assume that you're using a Mac, and that you have PartitionFinder2 installed in your Applications folder. If you're using Windows, you'll need to slightly modify section 4 - details are in the manual.
If you haven't already done so, click here to download and install PartitionFinder.
Then, if you haven't used PartitionFinder2 before, make sure that you can run at least one of the example datasets. For instructions, turn to page 7 of the manaul for Mac/Linux systems, and page 9 for Windows systems.
1. Get the dataset
For this tutorial, we'll be analysing a publicly available dataset from bark beetles (Cognato et al 2001). This dataset consists of a nuclear protein coding gene, a mitochondrial protein coding gene, and 16S rRNA. You can download the file we'll be using in this tutorial by clicking here. The original data is available from DataDryad here. I'm going to save the file as "cognato.nex" in a folder on the desktop called "beetles". I'll assume you've done the same.
2. Convert the alignment to phylip format
PartitionFinder2 works with phylip formatted alignments. Right now, our alignment is in nexus format, so we need to convert it. This section describes how do this, but if you want to skip this step, click here to dowload the phylip formatted alignment.
Changing alignment formats can be a notorious pain, but luckily it's made pretty simple by Geneious. If you don't have Geneious, it's free and you can download it from http://www.geneious.com/. And don't worry - the free version is all we need here.
Once you have Geneious, open up the 'cognato.nex' file, then go the 'File' menu, click 'Export', then 'Selected documents...'. You'll get a list of options for the file format. Scroll down and choose 'Phylip (*.phy)'. Click 'OK', then save the file in the "beetles" folder on the desktop as 'cognato.phy'. When you click 'OK' you'll get a warning you can ignore, then an option box which asks how long the names should be in the exported file, choose the 'Relaxed Phylip - Full Length...' option, and leave the other boxes unticked.
3. Set up the partition_finder.cfg file
Next we need to set up the file that tells PartitionFinder2 what kind of analysis we want to do. PartitionFinder2 gets most of its instructions from a simple configuration file called 'partition_finder.cfg'. If you want to skip this step, you can just download the pre-formatted configuration file by clicking here. What follows is a description of how you would set up this file from scratch.
The .cfg file needs to be in the same folder as your alignment file. The easiest way to set up a .cfg file is to first copy the partition_finder.cfg file from the Applications/PartitionFinder/examples/nucleotide folder, and paste it into your folder for this analysis (/Desktop/beetles if you're doing the same as me). Next, we'll edit each of the options in that file to set up our analysis:
- The 'alignment' option tells PartitionFinder2 the name of the alignment, so here we'll set this as follows: 'alignment = cognato.phy; '
- The 'branchlengths' option tells PartitionFinder2 whether to estimate a single set of branchlengths across all partitions, or whether each partition should have its own set of independent branchlengths. It can be difficult to know beforehand which of these options will be the best, but not all phylogenetics programs will allow for unlinked branchlengths. For this reason, we're going to leave this option as 'branchlenghts = linked; '. The PartitionFinder2 manual has a more in-depth explanation of this option.
- The 'models' option defines which substitution models will be analysed for each partition. We're going to choose a small but sensible set of models. Keeping it small is useful here, because it will make our analyses run quickly and this is just an example. So, set this to be
models = GTR, GTR+G, GTR+I+G; This means that PartitionFinder2 will compare these 3 different substitution models for each subset, and select the best one for each.
- 'model_selection' we'll leave as 'AICc'. This means that model selection and partitioning scheme comparison will be performed using the corrected Aikaike Information Criterion. You should always choose this over the AIC, but whether you choose AICc or BIC is still something of a matter of personal choice. It does affect the results sometimes, though usually not by much, and not in a way that usually affects the tree topology.
- The '[data_blocks]' section is the most important. This is where we define the sets of sites on which the entire analysis is based. PartitionFinder2 works by trying out different partitioning schemes based on these data blocks, so the idea is to use the data blocks to define sets of sites that you think might have evolved in a similar way. PartitionFinder2 will never try to subdivide any of your data blocks, so its better to define lots of small data blocks, rather than a few big ones (within reason). A sensible thing to do with protein coding genes is to define one data block for each codon position in each gene. We can find out where the different genes are by looking at the bottom of the 'cognato.nex' file - you'll notice that there are a few lines starting 'charset'. These tell us which sites in the alignment correspond to which of the genes and codon positions. You'll see that the codon positions are already defined for us for the two protein coding genes (ef1a and COI), so we'll use those as is. We'll keep the rRNA gene (16S) as a single data block.
If you are in a situation where you don't know where the codon positions are, it's important that you figure this out and provide the information to PartitionFinder2 in the datablocks section. Failing to define codon positions in protein coding genes can lead to very poor estimates of phylogenetic trees.
The manual contains detailed instructions on defining data blocks. In this case, our data blocks should look like this:
ef1a_1stpos = 1-649\3;
ef1a_2ndpos = 2-649\3;
ef1a_3rdpos = 3-649\3;
COI_1stpos = 650-1415\3;
COI_2ndpos = 651-1415\3;
COI_3rdpos = 652-1415\3;
16S = 1416-1897;
Each of these lines defines a single data block. The syntax is just 'name = start-stop\gap'. Now we have 7 data blocks defined. One key difference between PartitionFinder2 and PartitionFinder1 is that in PartitionFinder2 it's better to make sure that every site in your alignment is defined in a data block. This is because by default PF2 tries to estimate a Maximum Likelihood starting tree from your data using all of the data blocks you've defined, and for that we need every site to be assigned to a data block. The reasons for this are outlined in the PartitionFinder2 paper in MBE.
- The '[schemes]' section tells PartitionFinder2 how to compare partitioning schemes. For this tutorial, we're going to set 'search=greedy'. This tells PartitionFinder2 to search for a good partitioning scheme using a heuristic search algorithm described in the 2012 PartitionFinder paper in MBE. Since we've defined 7 data blocks, it would be possible to analyse all possible partitioning schemes ('search=all'), but this can take quite a long time because there are so many possible partitioning schemes for 7 data blocks. If you have a computer with lots of processors, or time to wait, you might want to try this out. But for most analyses you'd get exactly the same answer with a greedy search as with search=all.
4. Running PartitionFinder
To run PartitionFinder, follow these steps:
- Make sure you've followed the installation instructions in the manual, installing Python 2.7.x using Anaconda (or installing the dependencies yourself if you know how).
- Open up Terminal, which is in the /Applications/Utilities/ folder
- Type "python" into the terminal, followed by a space
- Open up a Finder window and navigate to the PartitionFinder2 program folder (I'm going to assume that this is located in a folder called /Applications/partitionfinder-2.1.0 for now)
- Drag and drop the "PartitionFinder.py" file into the terminal window - you'll see that the full path to the file is filled out automatically
- Drag and drop your analysis folder (for me this is a folder called "beetles" on my desktop) into the terminal
- Now, you terminal should look something like this (the part of the line that says 'Mellisuga:~ robertlanfear$' just describes my computer name and my username, so that will differ for you, unless there has been a strange coincidence):
Mellisuga:~ robertlanfear$ python /Applications/partitionfinder-2.1.0/PartitionFinder.py ~/Desktop/beetles
- Now just hit enter. You should see some information like this start to get printed to the terminal window:
INFO | 2016-12-02 18:52:18,186 | ------------- PartitionFinder 2.1.0 -----------------
INFO | 2016-12-02 18:52:18,186 | You have Python version 2.7
INFO | 2016-12-02 18:52:18,186 | Command-line arguments used: /Applications/partitionfinder-2.1.0/PartitionFinder.py /Users/robertlanfear/Desktop/beetles
INFO | 2016-12-02 18:52:18,186 | ------------- Configuring Parameters -------------
INFO | 2016-12-02 18:52:18,187 | Setting datatype to 'DNA'
INFO | 2016-12-02 18:52:18,187 | Setting phylogeny program to 'phyml'
INFO | 2016-12-02 18:52:18,187 | Program path is here /Applications/partitionfinder-2.1.0/programs
INFO | 2016-12-02 18:52:18,188 | Setting working folder to: '/Users/robertlanfear/Desktop/beetles'
INFO | 2016-12-02 18:52:18,188 | Loading configuration at './partition_finder.cfg'
INFO | 2016-12-02 18:52:18,197 | Setting 'alignment' to 'cognato.phy'
INFO | 2016-12-02 18:52:18,198 | Setting 'branchlengths' to 'linked'
INFO | 2016-12-02 18:52:18,199 | You set 'models' to: GTR, GTR+G, GTR+I+G
INFO | 2016-12-02 18:52:18,215 | This analysis will use the following 3 models of molecular evolution
INFO | 2016-12-02 18:52:18,216 | GTR, GTR+G, GTR+I+G
INFO | 2016-12-02 18:52:18,217 | Setting 'model_selection' to 'aicc'
INFO | 2016-12-02 18:52:18,230 | Setting 'search' to 'greedy'
INFO | 2016-12-02 18:52:18,230 | ------------------------ BEGINNING NEW RUN -------------------------------
INFO | 2016-12-02 18:52:18,231 | Looking for alignment file './cognato.phy'...
INFO | 2016-12-02 18:52:18,238 | Using 4 cpus
INFO | 2016-12-02 18:52:18,238 | Beginning Analysis
INFO | 2016-12-02 18:52:18,263 | Reading alignment file './cognato.phy'
INFO | 2016-12-02 18:52:18,276 | Starting tree will be estimated from the data.
INFO | 2016-12-02 18:52:18,277 | Estimating Maximum Likelihood tree with RAxML fast experimental tree search for ./analysis/start_tree/filtered_source.phy
INFO | 2016-12-02 18:52:18,278 | Using a separate GTR+G model for each data block
INFO | 2016-12-02 18:52:45,662 | ML topology estimation finished
INFO | 2016-12-02 18:52:45,663 | Performing Greedy Analysis
INFO | 2016-12-02 18:52:45,664 | *** Analysing starting scheme ***
INFO | 2016-12-02 18:52:50,057 | Finished subset 1/7, 14.29 percent done
INFO | 2016-12-02 18:52:57,972 | Finished subset 2/7, 28.57 percent done
INFO | 2016-12-02 18:52:59,855 | Finished subset 3/7, 42.86 percent done
INFO | 2016-12-02 18:53:00,728 | Finished subset 4/7, 57.14 percent done
INFO | 2016-12-02 18:53:06,518 | Finished subset 5/7, 71.43 percent done
INFO | 2016-12-02 18:53:07,658 | Finished subset 6/7, 85.71 percent done
INFO | 2016-12-02 18:53:14,037 | Finished subset 7/7, 100.00 percent done
INFO | 2016-12-02 18:53:14,038 | ***Greedy algorithm step 1***
INFO | 2016-12-02 18:53:14,039 | Analysing 21 new subset pairs
INFO | 2016-12-02 18:53:24,687 | Finished subset 1/21, 4.76 percent done
When using the greedy algorithm, PartitionFinder2 might finish before it gets to 100%, that's because the 100% for the greedy algorithm is the worst-case scenario. All you have to do now is wait for the analysis to finish. How long this takes will depend on your computer - but it shouldn't take more than a few minutes.
5. Interpreting the results
Once PartitionFinder2 has finished, the results will be stored in the /Armadillos/analysis/ folder. There is a lot of information stored here, descriptions of what it all is can be found at the end of the PartitionFinder2 manual. The main result is in the 'best_scheme.txt' file. If you open up that file, you will notice that it starts by describing the settings used for the run. Below that, you should see something a bit like this (note that results may differ slightly on different systems, because PhyML works a little different on Linux, Mac and Windows):
Best partitioning scheme
Scheme Name : step_1
Scheme lnL : -18431.2988281
Scheme AICc : 37183.8287775
Number of params : 148
Number of sites : 1897
Number of subsets : 6
Subset | Best Model | # sites | subset id | Partition names
1 | GTR+I+G | 217 | d382a6be22e2495575e4d97003dcc693 | ef1a_1stpos
2 | GTR+I+G | 471 | 1638e275adde7fea81f6d7e3ba7b4d49 | ef1a_2ndpos, COI_2ndpos
3 | GTR+G | 216 | d6496c25c9e9550070851560ef7d9b29 | ef1a_3rdpos
4 | GTR+I+G | 256 | a51426084bd4082ec2c2c49b88ac4628 | COI_1stpos
5 | GTR+G | 255 | 0195ef30a620e50b9307a2be784bf230 | COI_3rdpos
6 | GTR+I+G | 482 | 7b895a8ce777707287ca690df280b5dd | 16S
The first few lines tell you some general information about this partitioning scheme. The useful information is in the table below those lines. Each row in this table tells you about a subset from the best partitioning scheme. First, this table shows you that the best-fit partitioning scheme found by the greedy algorithm has 6 subsets. In this analysis, the best scheme just merges two of the original data blocks that correspond to the 2nd codon positions of the protein-coding genes. This seems sensible, since we know that 2nd codon positions tend to evolve a lot more slowly than other codon positions.
The output file also tells you the results of model selection on each of the subsets. For instance, the first row tells you that for the partition made up of the first codon position of ef1a the best-fit substitution model according to the AICc was the GTR+I+G model. This information tells you everything you'd need to go ahead and use this dataset for a partitioned phylogenetic analysis, for instance using Garli, BEAST, or MrBayes. To help you out with downstream analyses, you'll notice that lower down in the file the partitioning scheme is written in a range of formats suitable for different programs.
If you want more details on the model selection for a particular subset, you can get that very easily. To get more information on the model selection results for a particular subset, just get that subset's name from the 'name' column "best_scheme.txt" (shown above), then look for a file with that name in the analysis/subsets/ folder. This file contains information on each of the models analysed for that subset, ranked in order of their AICc score.
Here are the first few lines from the model selection file that corresponds to the first subset in the analysis above (note that there may be minor differences in the results between operating systems, because of the way that PhyML works on different machines):
Model selection results for subset: d382a6be22e2495575e4d97003dcc693
Subset alignment stored here: ./analysis/phylofiles/d382a6be22e2495575e4d97003dcc693.phy
This subset contains the following data_blocks: ef1a_1stpos
Number of columns in subset: 217
Models are organised according to their AICc scores
Model | Parameters | lnL | AICc | AIC | BIC
GTR+I+G | 10 | -708.347 | 1437.76 | 1436.69 | 1470.49
GTR+G | 9 | -730.689 | 1480.25 | 1479.38 | 1509.8
GTR | 8 | -808.053 | 1632.8 | 1632.11 | 1659.15
These results show that the GTR+I+G model is quite a lot better than the other two models on this subset of sites.
6. What nextPartitionFinder2 provides you with all the information necessary to carry out a partitioned phylogenetic analysis, e.g. in RaxML, BEAST, MrBayes, or Garli. You don't need to do any further model-comparison or partitioning-scheme comparison after running PartitionFinder2. If you're using a program (like RaxML or MrBayes 3.1.2) that doesn't implement all of the 56 substitution models, PartionFinder inlcudes a couple of extra options which might be helpful. For instance there are various pre-defined lists of models, which are described in the manual. This can save you from writing out long model lists. Also note that if you change the model list for an analysis, PF2 will re-use any old results it has, so can be very quick. For example, if you include an extra model (e.g. JC+I+G) in the list in the configuration file, and re-run your analysis, it will be quicker than last time.
PartitionFinder2 also includes the option to use RAxML for all of the model selection and partitioning scheme comparison (which is much faster than PhyML), and also includes two new search algorithms (hcluster and rcluster) for finding partitioning schemes on very large datasets.
Here are a few suggestions of analyses you might try next, to get a feel for how PartitionFinder2 works:
- Try running PartitionFinder2 with a smaller set of models (e.g. "models = GTR+G;"). You'll notice it runs very fast, that's because it has stored all the results of the previous analysis and is using those.
- Try running PartitionFinder2 with the "--raxml" commandline option. If you have been following through this tutorial, you'll notice that this gives you an error - that's because PartitionFinder2 can't re-use data from the previous runs (in which PhyML was used to calculate likelihoods), so it exits without doing anything. This is to make sure that you don't accidentally overwrite your old analyses. Follow the instructions in the error message and add the "--force-restart" option. That will make PartitionFinder2 delete all the stored results and start again. This analysis will be faster than the PhyML analysis (even with the same three GTR models being analysed from scratch) because RAxML is quicker than PhyML.
- Try changing the search option to "search = rcluster;" and running it again. This implements a fast clustering method to search for partitioning schemes. You may get a slightly different result. In general, you should only use this option when the "search=greedy;" option is too slow for your dataset.