Information and ideas in evolutionary research & academia.

Grab & Graph GBIF Biodiversity Data Using R

6/2/2019

R + GBIF = rgbif
Here is a tutorial for the R package 'rgbif', which lets you access specimen information in the Global Biodiversity Information Facility (GBIF) database. GBIF has hundreds of millions of species occurrence records from around the globe, open for anyone to use.

How did these records get into GBIF in the first place? The data come from many sources -- various museums, universities, and other institutions. Specimen label data have been recorded and digitized in spreadsheets, and this info is contributed to GBIF. I wrote a script in R to demonstrate how to import records for a desired taxon found in a region, and then I use R functions to display the data.

Orasema

Lophyrocera

Kapala

As shown in the tutorial, I accessed GBIF records for all the Eucharitidae collected in the US. Accessing the data is rather straightforward. Manipulating the data for display takes a few more functions. The records at right are grouped by genus, with any N/As renamed as 'no genus ID'. The genera were then sorted by their number of records. I didn't clean the data, which would be smart to do if you're using it for a project.
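The grouping step described above can be sketched in a few lines. This is only an illustration in Python, not the R script from the tutorial; the record list and the function name genus_counts are hypothetical stand-ins for real GBIF output.

```python
from collections import Counter

# Hypothetical stand-ins for GBIF occurrence records (real records carry many more fields).
records = [
    {"genus": "Orasema"},
    {"genus": "Kapala"},
    {"genus": "Orasema"},
    {"genus": None},          # a record with no genus-level identification
    {"genus": "Lophyrocera"},
    {"genus": "Orasema"},
]

def genus_counts(recs):
    """Count records per genus, relabel missing genera, and sort by count (largest first)."""
    labels = [r.get("genus") or "no genus ID" for r in recs]
    return Counter(labels).most_common()

for genus, n in genus_counts(records):
    print(f"{genus}: {n}")
```

No cleaning is done here either, so misspelled or outdated genus names would be counted as separate groups.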

rgbif R script

Below I display the number of eucharitid records per year in two different ways.
On the left is a violin plot, which I think is a neat way to look at how data are distributed. The width of the violin is relative to the number of yearly records -- not the actual values, but a probability density distribution. The boxplot within shows the median value as a white dot. I excluded zeros here, so we aren't seeing the years where no Eucharitidae were recorded (this keeps the script slightly simpler).
On the right is a chronological view of records per year, excluding years with zero records. I've distinguished the years having more than ten records by using a darker bar color.
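As a rough sketch of the tallying behind the chronological plot (again in Python rather than the tutorial's R, with made-up years and a hypothetical function name), the per-year counts and the more-than-ten flag could be computed like this:

```python
from collections import Counter

# Hypothetical collection years; None stands in for a record with no year.
years = [1977, 1985, 1985, 2001, None, 2001, 2001] + [2010] * 12

def records_per_year(year_values, threshold=10):
    """Return (year, count, shade) tuples in chronological order.

    Years with zero records never appear in the counter, so they are
    excluded automatically. Years with more than `threshold` records
    get the darker shade.
    """
    counts = Counter(y for y in year_values if y is not None)
    return [
        (year, n, "dark" if n > threshold else "light")
        for year, n in sorted(counts.items())
    ]

for row in records_per_year(years):
    print(row)
```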

I enjoyed exploring rgbif and thought it turned into a good introduction to how to plot data from a biodiversity database. I hope you try it out, too! Also, here's another rgbif tutorial I recently found that looks pretty useful; it focuses more on the data manipulation than the data display.


Combining UCEs and transcriptomes

10/30/2018


Phylogenomic methods are incredibly popular, and there are various baits and probes for capturing different pieces of the insect genome for analysis. One of the next aspects of 'big data' will be to develop ways to combine all of these sources!

My previous post covered a method for visualizing data on tree tips -- specifically I showed a plot of the number of loci recovered after doing a quick combination of genomes, transcriptomes, and UCEs (using the Phyluce pipeline). I'm happy to write that our research paper on the combination of phylogenomic data has been published in Molecular Phylogenetics and Evolution. Bossert et al. 2019 report successful results when three types of datasets are combined for Apidae (the largest bee family).

Bossert et al. figure 1. Graphical summary of the workflow developed for combining genome, transcriptome, and UCE data, exemplified for the widely shared HIPK2 gene of the honey bee.

Summary figure of the phylogeny of the bee family Apidae. The topology was in agreement with many of the previously published higher-level groupings. Despite widespread acknowledgement of a non-monophyletic Apinae, no taxonomic changes were proposed until now. Find the revised Apidae classification in Appendix A!

The trick to getting the transcriptomes to align without an excessive amount of error was to use the available genomes as a backbone. A transcriptome fragment of the same length as a UCE in reality covers a longer region of the genome, because introns are excluded from these coding regions. By grabbing a long piece of the genome, exons at the ends of the transcriptome fragments could 'stretch out' and would not be misaligned with the ends of the UCEs.

Around the time our work came out, another group published a paper with a similar objective, using the Hemiptera UCE probe set. These authors had an alternative approach to combining transcriptomes and UCEs -- using tblastx to search for homologous loci of UCEs in transcriptomes.

references:
Bossert, S., Murray, E.A., Almeida, E.A.B., Brady, S.G., Blaimer, B.B. & Danforth, B.N. (2019) Combining transcriptomes and ultraconserved elements to illuminate the phylogeny of Apidae. Molecular Phylogenetics and Evolution, 130, 121-131. doi.org/10.1016/j.ympev.2018.10.012
Kieran, T.J., Gordon, E.R., Forthman, M., Hoey-Chamberlain, R., Kimball, R.T., Faircloth, B.C., Weirauch, C. & Glenn, T.C. (2019) Insight from an ultraconserved element bait set designed for Hemipteran phylogenetics integrated with genomic resources. Molecular Phylogenetics and Evolution, 130, 297-303.


Visualizing data -- matrix completeness

9/26/2018


Phylogenomic datasets can be gappier than matrices concatenated from a small number of hand-selected, Sanger-sequenced genes. Here is a nice way to visualize the percent data coverage and see the distribution of missing data across the tree.

I have a dataset of hundreds of loci acquired using ultraconserved element (UCE) probe matching. Besides my own data, there are other types of data that can potentially be incorporated (published genomes, transcriptomes, etc.). I want to see how the topology may be sensitive to different data and parameter permutations. I can use this handy barplot visual for dataset exploration, such as identifying problem clades with low data coverage.

A phylogeny of some aculeates with the percent of the loci present in the data matrix. Genomes are labeled, TS = transcriptomes, and EX#s are UCEs. The dashed line is the median value of loci completeness. Taxa above the median have green bars, and those below have blue.

To make the figure to the right: import your tree & data into R and adapt this phytools script.
In the tree shown at left, I've included data from genomes, transcriptomes ('TS'), & UCE probes matched in vitro to DNA extracts ('EX') (data in Dryad; Branstetter et al. 2017a). Full genomes should theoretically match all UCE probes, and we are recovering these genome-based taxa at 98-100% completeness, dependent on parameters in phyluce. Using default matching settings, our UCE probes match to transcriptomes (data in Dryad; Lopez-Osorio et al. 2017) at a low rate, which seems to cause artificially long terminal branches in this case.

Programs and packages involved in producing this figure:

I used the Hym-v2 probe set (Dryad; Branstetter et al. 2017b) to match UCE probes to published assemblies. The assemblies with an 'EX' in the name are from Branstetter et al. 2017a. All probe-matching and matrix-building was done using phyluce. I used RAxML (on CIPRES) to build a concatenated phylogeny from >400 genes (the tree shown is pruned from >100 taxa).
I discovered last year that Windows 10 users can click a box to install the Bash shell for Ubuntu (a beta version) and thereby use Linux command-line tools. I used some of these tools to build the gene completeness data, since I haven't seen a different way to calculate this. I sometimes end up doing 'MacGyver Scripting' (TM). It gets the job done, but is not the most elegant. I used the command-line tool 'awk' to record the names of all taxa in each individual locus alignment file and then counted the occurrences of each to output a sorted table of the number of loci for each taxon. I could've put this directly into R, but I opened it in Excel and converted the number of loci to a percent, based on the taxon with the highest number of loci set at 100% (which was conveniently first on the list). This is the data table I loaded into R.
PLOTTING: From there, I was going to use the ggtree facet_plot function to display the phylogeny and data completeness. But I found something that seemed even better (and was less of a struggle than getting ggtree dependencies to work) -- a script on Liam Revell's blog for the phytools function plotTree.barplot. I removed the argument for converting the data to log scale and tweaked a few small things. I like that the median value is marked on the graph, too.
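For anyone who would rather skip the awk-and-Excel detour, here is a minimal sketch of the same bookkeeping in Python. It is not the script I ran: it assumes each locus alignment is FASTA text whose header lines are the taxon names, and the function names and the tiny example alignments are hypothetical. It counts loci per taxon, rescales so the best-sampled taxon is 100%, and assigns the green/blue bar colors relative to the median.

```python
from collections import Counter
from statistics import median

def taxa_in_fasta(text):
    """Return the set of taxon names (FASTA header lines) found in one alignment."""
    return {line[1:].strip() for line in text.splitlines() if line.startswith(">")}

def completeness(alignments):
    """Percent of loci per taxon, scaled so the taxon with the most loci is 100%."""
    counts = Counter()
    for text in alignments:
        counts.update(taxa_in_fasta(text))
    top = max(counts.values())
    return {taxon: 100.0 * n / top for taxon, n in counts.items()}

def bar_colors(percent):
    """Green for taxa above the median, blue otherwise.

    Treating a taxon exactly at the median as blue is an arbitrary choice here.
    """
    mid = median(percent.values())
    return {taxon: ("green" if p > mid else "blue") for taxon, p in percent.items()}

# Hypothetical alignments, one string per locus.
loci = [
    ">Apis\nACGT\n>Bombus\nACGT\n>TS_Vespula\nAC-T\n",
    ">Apis\nGGCA\n>Bombus\nGGCA\n",
    ">Apis\nTTAG\n>Bombus\nTTAG\n",
    ">Apis\nCCGA\n",
]

percent = completeness(loci)
colors = bar_colors(percent)
for taxon in sorted(percent, key=percent.get, reverse=True):
    print(taxon, percent[taxon], colors[taxon])
```

In practice you would read each alignment file from the phyluce output folder instead of using inline strings, and adjust the parsing if your alignments are in nexus or phylip format.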

Load your tree and your data (count data, continuous measurements, etc.) into R, use the handy script from Liam Revell's site, and in short time, you'll have a beautiful and informative figure!

references:
Branstetter, M.G., Danforth, B.N., Pitts, J.P., Faircloth, B.C., Ward, P.S., Buffington, M.L., Gates, M.W., Kula, R.R. & Brady, S.G. (2017a) Phylogenomic insights into the evolution of stinging wasps and the origins of ants and bees. Current Biology, 27, 1019-1025.
Branstetter, M.G., Longino, J.T., Ward, P.S. & Faircloth, B.C. (2017b) Enriching the ant tree of life: enhanced UCE bait set for genome-scale phylogenetics of ants and other Hymenoptera. Methods in Ecology and Evolution, 8, 768-776.
Faircloth, B.C. (2015) PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics, btv646.
Lopez-Osorio, F., Pickett, K.M., Carpenter, J.M., Ballif, B.A. & Agnarsson, I. (2017) Phylogenomic analysis of yellowjackets and hornets (Hymenoptera: Vespidae, Vespinae). Molecular Phylogenetics and Evolution, 107, 10-15.


Animate stacks of images in ImageJ

10/29/2017


Use the free photo processing program, Fiji, to make a gif out of your stack of z-stepped images! You'll produce a nice little video clip that scrolls through all of the layered shots that you've taken of your specimen.

The images were taken on a Zeiss Stemi SV 6 microscope with an Axiocam 105 color camera. There are a total of 56 images. We don't have an automatic z-stepper for taking stacks of images, so we manually roll the focus knob to capture each of the 56 shots.

Thanks to my labmate, Silas Bossert, for the species ID. Anthidium oblongatum is not a native bee, but is introduced, as is the flower on which I caught it, Lotus corniculatus.

Anthidium oblongatum (Megachilidae) from Ithaca, NY

make a gif using ImageJ and your stack of images:

download Fiji, which is a distribution of ImageJ especially for image processing
make a folder only containing your images (I used tifs)
open the Fiji executable and then go to FILE > Import > Image Sequence; you can just click on one image and click 'open' (you can't select them all, but it will tell you how many images it counts in the folder) --> from here a scrollable window with your images will open
to save, go to FILE > Save As > Animated GIF (scroll down)

The 56 images stacked with Zerene Stacker

Also note:

Watch the animation in Fiji by going to: IMAGE > Stacks > Animation > Start Animation.
It's a nicer video to save as an AVI file (but not as easy for the web).
My first gif was too large, but I reduced the size of the final file by shrinking the image. Just go to IMAGE > Adjust > Size. I reduced the images shown here from a width of 2560 pixels to 1500. The final gif was 71 MB, which is still a big file, but I didn't want the bee to look any fuzzier than it is naturally!


Fixed & Starting Trees in BEAST 2

7/29/2017


Setting a starting tree in the program BEAST can be a complicated issue, and I've been asked about troubleshooting for it. Here is a full XML file with annotations, as an example of how to designate a starting tree and how to force BEAST to keep it as a fixed topology.

BEAST is a widely used phylogenetic dating program and has an excellent GUI in BEAUti, where users can control almost all of the parameters and inputs they'd need. BEAUti is the front-end program that produces the XML file that is then used by BEAST for tree estimation. XML stands for 'eXtensible Markup Language', which is both human and machine readable, and is similar to HTML.

One piece that must be manually edited in a text editor is the user-specified starting tree, if desired. Why use a starting tree? For large and difficult datasets, one can start in the best area of parameter space, so that the Markov chain isn't wasting time jumping around to sample the presumably 'incorrect' topologies. I'm not sure as to how much a starting tree increases efficiency, since alternate topologies can still be sampled (it's certainly not needed for small, straightforward datasets), but I may update my opinions in the future based on the success of trying to manipulate phylogenomic data.

There are two different XML editing tasks I'll cover.

ONE: writing the code to provide BEAST with a starting tree (topology and branch lengths). This is data-dependent, so you'll first have to estimate a starting tree that agrees with run parameters.

TWO: fixing the tree topology so that all branch lengths are calculated on only the input topology. This is really easy! You just comment out four lines of code so the topology isn't ever updated.

Setting a user-specified starting topology
The default starting tree is a random tree, which is coded in an element (content & attributes surrounded by an opening and closing tag) named 'init', our tree initializer. The whole element needs to be replaced by a user-specified starting tree in newick format. Take a look at the example document!

Fixing the starting tree topology
There are four lines to comment out: the operators for subtree-slide, wide & narrow exchange, and Wilson-Balding. Removing these four operators prevents the topology from updating, but still allows for estimation of the node ages (i.e., branch lengths will be modified even though the topology will not). In an XML file, comments are surrounded by '<!--' and '-->', which means they will not be processed.
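Editing by hand in a text editor is the simplest route, but if you have many XML files, the same change can be scripted. The sketch below is a hedged illustration in Python, not part of the example file: it assumes the four topology operators are written as self-closing 'operator' elements whose 'spec' attribute is SubtreeSlide, Exchange (used for both narrow and wide), or WilsonBalding, which is how I understand BEAUti writes them. The function name and the sample XML fragment are hypothetical, so check the operator lines in your own file before relying on it.

```python
import re

# Assumed 'spec' values of the operators that change tree topology.
TOPOLOGY_SPECS = ("SubtreeSlide", "Exchange", "WilsonBalding")

# Matches a self-closing <operator .../> element with one of the specs above,
# allowing an optional package prefix such as "some.package.SubtreeSlide".
OPERATOR_RE = re.compile(
    r'<operator\b[^>]*\bspec="(?:[\w.]*\.)?(?:%s)"[^>]*/>' % "|".join(TOPOLOGY_SPECS)
)

def fix_topology(xml_text):
    """Wrap each topology-changing operator in an XML comment.

    Run this only on a file where these operators are not already
    commented out, because XML comments cannot be nested.
    """
    return OPERATOR_RE.sub(lambda m: "<!-- " + m.group(0) + " -->", xml_text)

# Hypothetical fragment for illustration; ids and weights are made up.
sample = """<operator id="TreeScaler.t:demo" spec="ScaleOperator" scaleFactor="0.5" tree="@Tree.t:demo" weight="3.0"/>
<operator id="SubtreeSlide.t:demo" spec="SubtreeSlide" tree="@Tree.t:demo" weight="15.0"/>
<operator id="Narrow.t:demo" spec="Exchange" tree="@Tree.t:demo" weight="15.0"/>
<operator id="Wide.t:demo" spec="Exchange" isNarrow="false" tree="@Tree.t:demo" weight="3.0"/>
<operator id="WilsonBalding.t:demo" spec="WilsonBalding" tree="@Tree.t:demo" weight="3.0"/>"""

fixed = fix_topology(sample)
print(fixed)
```

The scale operator in the sample is left untouched, which is the point: operators that only change node ages keep running while the topology stays fixed.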

An example XML file is here for download. Open it in a text editor to view the annotations! It was written in BEAST 2.4, and it will open in BEAUti and run with no further modifications.

beast_fix_starting_tree_murray.xml

There are a couple of nice sites with information on user input starting trees, yet translating that to your own data can still be a bit of a struggle. I hope here I could add a bit of guidance on the issue by providing an annotated XML file to help clarify the changes needed.

There is now an updated and more-detailed tutorial at: http://www.beast2.org/fix-starting-tree/.
Also, thanks to great info from: http://www.northernbotanist.com/?page_id=732.


PhyloBlog

Elizabeth A. Murray, PHYLOGENETICS AND EVOLUTION of Hymenoptera
