PartitionFinder tutorial
This page provides a detailed step-by-step tutorial for comparing partitioning schemes for DNA sequence alignments using PartitionFinder. If you just want to try it out, this page isn't for you - there are much simpler quick-start instructions in the PartitionFinder manual. This tutorial might help if you are unsure about any particular aspect of your analysis, or you have never thought about partitioning schemes before. It assumes you have PartitionFinder installed and running, you can find out how to do this in the PartitionFinder manual (in the 'docs' folder when you download PartitionFinder). For the purposes of this tutorial, I'll assume that you're using a Mac, and that you have PartitionFinder installed in your Applications folder. If you're using Windows, you'll need to slightly modify section 4 - details are in the manual.
1. Get the dataset
For this tutorial, we'll be analysing a publicly available dataset from armadillos (Delsuc et al 2003). This dataset consists of 3 protein coding nuclear genes. To get this dataset, you can either download it from this website by clicking here, or you can download it from the original DataDryad page by clicking on this link: http://datadryad.org/handle/10255/dryad.1839. I'm going to save the file as "delsuc.nex" in a folder on the desktop called "Armadillo". I'll assume you've done the same. 2. Convert the alignment to phylip format
PartitionFinder works with phylip formatted alignments. Right now, our alignment is in nexus format, so we need to convert it. This section describes how do this, but if you want to skip this step, click here to dowload the phylip formatted alignment. 3. Set up the partition_finder.cfg file
Next we need to set up the file that tells PartitionFinder what kind of analysis we want to do. PartitionFinder gets all of its instructions from a simple configuration file called 'partition_finder.cfg'. If you want to skip this step, you can just download the pre-formatted configuration file by clicking here. What follows is a description of how you would set up this file from scratch. 4. Running PartitionFinder
To run PartitionFinder, follow these steps:
5. Interpreting the results
Once PartitionFinder has finished, the results will be stored in the /Armadillos/analysis/ folder. There is a lot of information stored here, descriptions of what it all is can be found at the end of the PartitionFinder manual. The main result is in the 'best_schemes.txt' file. If you open up that file, you should see something a bit like this: 6. What next
PartitionFinder provides you with all the information necessary to carry out a partitioned phylogenetic analysis, e.g. in RaxML, BEAST, MrBayes, or Garli. You don't need to do any further model-comparison or partitioning-scheme comparison after running PartitionFinder. If you're using a program (like RaxML or MrBayes 3.1.2) that doesn't implement all of the 56 substitution models, PartionFinder inlcudes a couple of extra options which might be helpful. For instance you can set 'models=MrBayes' to include only the set of models implemented in MrBayes 3.1.2. If you try that, you'll note that when you do that, the analysis is extremely fast - this is because PartitionFinder is using the stored results from the previous analysis (where you analysed all 56 models), so rather than analysing all the MrBayes models from scratch, it just loads the results from the last run. References
Delsuc, F., M. J. Stanhope, and E. J. Douzery. 2003. Molecular systematics of armadillos (Xenarthra, Dasypodidae): contribution of maximum likelihood and Bayesian analyses of mitochondrial and nuclear genes. Mol Phylogenet Evol 28:261-275.
Partition Finder
If you haven't already done so, click here to download and install PartitionFinder.
One thing you might notice if you scroll to the bottom of the .nex file is that there aren't just 3 nuclear genes in the dataset, there are two mitochondrial genes as well (have a look at the 'charset' lines). For this tutorial, we're going to focus on the three nuclear genes, and ignore the two mitochondrial genes.
Changing alignment formats can be a notorious pain, but luckily it's made pretty simple by Geneious. If you don't have Geneious, it's free and you can download it from http://www.geneious.com/.
Once you have Geneious, open up the 'delsuc.nex' file, then go the 'File' menu, click 'Export', then 'Selected documents...'. You'll get a list of options for the file format. Scroll down and choose 'Phylip (*.phy)'. Click 'OK', then save the file in the "Armadillo" folder on the desktop as 'delsuc.phy'. When you click 'OK' you'll get an option box which asks how long the names should be in the exported file, choose the 'Export full length' option.
The .cfg file needs to be in the same folder as your alignment file. The easiest way to set up a .cfg file is to first copy the partition_finder.cfg file from the Applications/PartitionFinder/example folder, and paste it into your folder for this analysis (/Desktop/Armadillo if you're doing the same as me). Next, we'll edit each of the options in that file to set up our analysis:
Now the .cfg file is set up, we just save it and we're ready to run PartitionFinder.
The first thing to realise before defining the data blocks is that we don't know from the information in 'delsuc.nex' which codon position is which, or even whether each gene is in frame. Luckily, Geneious can come to our rescue here. If you open up the 'delsuc.nex' file in Geneious, you can have a look at the amino-acid translation of each of the sequences. You'll notice that our three genes (bases 1-5113) are all in frame, and translated correctly. You can determine the correct translation by looking up these genes on GenBank, and making sure that the amino-acid sequence shown in genbank matches the one you see in Geneious. You'll also see that after base 5113 the translation goes a bit crazy - that's because this gene is 12SrRNA, which doesn't code for a protein, so the translation here is garbage.
Having confirmed that our three genes of interest are in frame, we can define our nine data blocks. These go on the lines after [data_blocks], and should look like this in your file:
ADRA2B_1 = 1 - 1158\3;
ADRA2B_2 = 2 - 1158\3;
ADRA2B_3 = 3 - 1158\3;
BRCA1_1  = 1159 - 3951\3;
BRCA1_2  = 1160 - 3951\3;
BRCA1_3  = 1161 - 3951\3;
VWF_1    = 3952 - 5113\3;
VWF_2    = 3953 - 5113\3;
VWF_3    = 3954 - 5113\3;
Each of these lines defines a single data block. The syntax is just 'name = start-stop\gap'. Now we have 9 data blocks defined. You'll note that we haven't mentioned the last couple of thousand base-pairs in the alignment. PartitionFinder will ignore these bases, and just perform the analysis on the partitions we've defined here. It would be precisely the same if we went into our alignment file and deleted the unused bases, but PartitionFinder is written so that you don't have to.
Last login: Wed Aug 17 13:24:09 on ttys001
tarquin:~ Rob$ python /Applications/PartitionFinder/PartitionFinder.py /Users/Rob/Desktop/Armadillo
INFO | 2011-08-18 11:43:44,014 | You appear to have 16 cpus
INFO | 2011-08-18 11:43:44,016 | Using folder: '/Users/Rob/Desktop/Armadillo'
INFO | 2011-08-18 11:43:44,016 | Loading configuration at '/Users/Rob/Desktop/Armadillo/partition_finder.cfg'
INFO | 2011-08-18 11:43:44,031 | Setting 'alignment' to 'delsuc.phy'
INFO | 2011-08-18 11:43:44,031 | Setting 'branchlengths' to 'linked'
INFO | 2011-08-18 11:43:44,032 | Setting 'models' to 'all'
INFO | 2011-08-18 11:43:44,032 | Setting 'model_selection' to 'bic'
INFO | 2011-08-18 11:43:44,046 | Setting 'search' to 'greedy'
INFO | 2011-08-18 11:43:44,046 | Beginning Analysis
You'll then notice that you get a warning that you haven't included all the alignment sites in your analysis. That's OK, we meant to do that because we're ignoring the mitochondrial genes at the end of the alignment file. After that, PartitionFinder will keep you updated with its progress. When using the greedy heuristic algorithm, PartitionFinder might finish before it gets to 100%, that's because the 100% for the greedy algorithm is the worst-case scenario. All you have to do now is wait for the analysis to finish. How long this takes will depend on your computer - it might take anything from 10 minutes to a few hours.
Best scheme according to Greedy algorithm, analysed with bic
Scheme Name : 70
Scheme lnL : -16862.47009
Scheme AIC : 33850.94018
Scheme AICc : 33852.537328
Scheme BIC : 34262.9313005
Num params : 63
Num sites : 5113
Num subsets : 6
Subset | Best Model | Subset Partitions | Subset Sites
1 | TrN+G | ADRA2B_1, VWF_1 | 1-1158\3, 3952-5113\3
2 | TVMef+G | ADRA2B_2, VWF_2 | 2-1158\3, 3953-5113\3
3 | TVM+I+G | ADRA2B_3, VWF_3 | 3-1158\3, 3954-5113\3
4 | HKY+I | BRCA1_1 | 1159-3951\3
5 | HKY+G | BRCA1_2 | 1160-3951\3
6 | HKY+G | BRCA1_3 | 1161-3951\3
The first few lines tell you some general information about this partitioning scheme. The useful information is in the table below those lines. Each row in this table tells you about a subset from the best partitioning scheme. First, this table shows you that the best-fit partitioning scheme found by the greedy algorithm has 6 subsets. In this partitioning scheme each of the three codon positions from ADRA2B and VWF are merged together, but each codon position from BRCA1 is treated separately. This scheme tells you something interesting - for some reason, the substitutional process underlying the evolution of the ADRA2B and VWF genes seem to be similar, but different from the substitutional processes underlying the evolution of BRCA1.
The output file also tells you the results of model selection on each of the subsets. For instance, the first row tells you that for the partition made up of the first codon position of ADRA2B and VWF the best-fit substitution model according to the BIC was the TrN+G model. This tells you everything you'd need to go ahead and use this dataset for a partitioned phylogenetic analysis, for instance using Garli, BEAST, or MrBayes.
If you want more details on the model selection for a particular subset, you can get that very easily. In the full "best_schemes.txt" file, the name of the subset alignment file is also provided in the final column, I just didn't paste it in above. So, to get more information on the model selection results for a particular subset, just get that subsets name from the 'name' column "best_schemes.txt", then look for a file with that name in the analysis/subsets/ folder. This file contains information on each of the 56 models analysed for that subset, ranked in order of their BIC score. For instance if you try this for subset 3, above, you'll see that the TVM+I+G is the best-scoring model, followed closely by the GTR+I+G model.
