Interacting with the BioTIME database

BioTIME consortium

02/04/2018


This tutorial guides users interacting with BioTIME, the largest compilation of biodiversity time series containing raw data on species identities and abundances from ecological assemblages (Dornelas et al). The database can be accessed through ZENODO and the BioTIME website.
We provide code and examples to calculate simple summary statistics describing the data, and to produce descriptive plots and maps.


The BioTIME database

BioTIME is a relational database built using the mySQL language, and contains eleven linked tables as illustrated in Fig. 1. Within each table, columns contain the field names, whilst rows contain the records or data for each column.
Each record has a unique identifier or primary key, and queries written in SQL (Structured Query Language) are used to communicate with the database.


BioTIME is available in two formats:
  1. MySQL database (.sql) file
  2. Comma delimited (.csv) files consisting of:
    • full query by study taken from the raw data table within the database
    • associated metadata for each dataset
    • full list of citations
Below we provide instructions and code for both formats. First, we show how to explore the database using MySQL language, and how to produce several outputs using the .sql file. It is possible to query the databse in order to produce downloadable csv files, containing data from the several tables. However, please note that the query that creates the rawdata csv file can take several hours to run, so we provide the file for download and you can skip to using the csv files.


Code to generate the query used in the second part of this document - for use after connection is made


fullQuery<-dbGetQuery(myConnection, "SELECT allrawdata.STUDY_ID, allrawdata.DAY, allrawdata.MONTH, allrawdata.YEAR, allrawdata.SAMPLE_DESC, allrawdata.PLOT, allrawdata.ID_SPECIES, allrawdata.LATITUDE, 
allrawdata.LONGITUDE, sum(allrawdata.ABUNDANCE), sum(allrawdata.BIOMASS), species.GENUS, species.SPECIES, species.GENUS_SPECIES from allrawdata inner join site on allrawdata.STUDY_ID=site.STUDY_ID inner join 
species on allrawdata.ID_SPECIES=species.ID_SPECIES group by concat(allrawdata.STUDY_ID, allrawdata.DAY, allrawdata.MONTH, allrawdata.YEAR, allrawdata.SAMPLE_DESC, allrawdata.ID_SPECIES, allrawdata.LATITUDE, 
allrawdata.LONGITUDE)")


Figure 1 - BioTIME database structure, showing the linking field between the tables; for a full list of the fields and their description see Supplementary Materials in Dornelas et al.


1.Using the MySQL database

The R package RMySQL allows a connection to any MySQL database using R (Studio). In order to use this, you must first download and extract the .sql file and import it to MySQL.
MySQL is open source, and is readily available for download and installation both as MySQL Server or through the installation of larger open source applications such as XAMPP.


Load the required libraries
library(RMySQL)
library(ggplot2)
library(mapdata)
library(vegan)
library(dplyr)


Create a connection to the database

myConnection <- dbConnect(MySQL(), user="username", host="hostname", dbname="database", password="password")
This section should be run as code once the appropriate names have been added.


Some simple queries to explore the database

### list the tables in the database
dbListTables(myConnection)
##  [1] "abundance"  "allrawdata" "biomass"    "citation1"  "contacts"  
##  [6] "curation"   "datasets"   "methods"    "sample"     "site"      
## [11] "species"
### or
dbGetQuery(myConnection, "show tables")
##    Tables_in_biopart
## 1          abundance
## 2         allrawdata
## 3            biomass
## 4          citation1
## 5           contacts
## 6           curation
## 7           datasets
## 8            methods
## 9             sample
## 10              site
## 11           species
### list the fields with their associated information in a chosen table
dbGetQuery(myConnection, "show columns from contacts")
##         Field         Type Null Key Default          Extra
## 1 ID_CONTACTS      int(11)   NO PRI    <NA> auto_increment
## 2    STUDY_ID      int(11)  YES MUL    <NA>               
## 3   CONTACT_1 varchar(500)  YES        <NA>               
## 4   CONTACT_2 varchar(500)  YES        <NA>               
## 5 CONT_1_MAIL  varchar(60)   NO        <NA>               
## 6 CONT_2_MAIL  varchar(60)  YES        <NA>               
## 7     LICENSE varchar(200)   NO        <NA>               
## 8    WEB_LINK varchar(200)  YES        <NA>               
## 9 DATA_SOURCE varchar(250)   NO        <NA>
### get the list of fields from a chosen table
dbListFields(myConnection, "site")
##  [1] "ID_SITE"        "STUDY_ID"       "REALM"          "CLIMATE"       
##  [5] "GENERAL_TREAT"  "TREATMENT"      "TREAT_COMMENTS" "TREAT_DATE"    
##  [9] "CEN_LATITUDE"   "CEN_LONGITUDE"  "HABITAT"        "PROTECTED_AREA"
## [13] "AREA"           "BIOME_MAP"
### read entire tables - N.B. not recommended unless the number of records is known and is not too large! 
dbReadTable(myConnection, "biomass")
##   ID_BIOMASS     BIOMASS_TYPE
## 1          1               NA
## 2          2             Size
## 3          3           Weight
## 4          9 Relative biomass
## 5         10           Volume
## 6         12            Cover
dbReadTable(myConnection, "abundance")
##   ID_ABUNDANCE   ABUNDANCE_TYPE
## 1            1            Count
## 2            5 Presence/Absence
## 3            6        MeanCount
## 4            7          Density
## 5            8               NA


Build queries and wrap them using the dbGetQuery function


There are two ways to select the number of studies held in BioTIME
#### query the datasets table for the number of entries, as each record corresponds to each individual study
result<-dbGetQuery(myConnection, "SELECT count(*) FROM datasets")
## [1] "There are  361  studies in the database"
#### or interrogate the raw data table for the unique number of Study IDs
result2 <- dbGetQuery(myConnection, "SELECT count(distinct STUDY_ID) FROM allrawdata")
## [1] "There are  361  studies in the database"


The above queries were run using the count command to return a single numerical value, but queries can also be used to return blocks of data for analyses, for instance
dbGetQuery(myConnection, "SELECT distinct ID_SPECIES FROM allrawdata inner join site on allrawdata.STUDY_ID = site.STUDY_ID where site.CLIMATE = 'Polar'")
would return a list of all the species found in Polar regions, instead of the number of species.


Using query results to create maps and plots


#### select all the central latitudes and longitudes for each study, along with the taxonomic classification and total number of records
result<-dbGetQuery(myConnection, "SELECT CENT_LAT, CENT_LONG, TAXA, TOTAL from datasets")


Create the base map
#### draw a basic world map, add "y" or "n" for display of tropics and polar latitudes

drawWorld<-function(lats) {
  world_map<-map_data("world")
  
    g1<-ggplot()+coord_fixed()+xlab("")+ylab("")
    g1<-g1+geom_polygon(data=world_map, aes(x=long, y=lat, group=group), colour="gray60", fill="gray60")
    g1<-g1+theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), 
    panel.background=element_rect(fill="white", colour="white"), axis.line=element_line(colour="white"),
    legend.position="none",axis.ticks=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank())
    
    if(lats=="y") {
        g1<-g1+geom_hline(yintercept=23.5, colour="red")+geom_hline(yintercept =-23.5, colour="red")
        g1<-g1+geom_hline(yintercept=66.5, colour="darkblue")+geom_hline(yintercept =-66.5, colour="darkblue")
    }
    else { return(g1) }
        return(g1)
}


Build a colour palette for the taxa
taxaCol<-c('#ffffff','#ffffbf','#5e4fa2','#f46d43','#3288bd','#abdda4','#a8c614','#d53e4f','#66c2a5','#e6f598','#fee08b','#9e0142','#fdae61')


Create world map with study central locations, with points coloured by taxa and sized by number of records
#### add the point coordinates to the blank map (without polar and tropic latitudinal lines)

points<-drawWorld("n")+geom_point(data=result, aes(x=CENT_LONG, y=CENT_LAT, colour=TAXA, size=TOTAL), alpha=I(0.7))
points<-points+scale_colour_manual(values= taxaCol)+scale_size(range=c(3, 10))
points

Basic theme for ggplots with no grid and no axes titles (useful for barplots)
#### theme no grid
themeNoGrid<-function() {
   theme_bw()+
        theme(axis.text=element_text(size=16,color="black"),
        axis.title=element_text(size=0,face="bold"), 
        legend.position="none",     
        plot.title=element_text(size=18,face="bold", hjust=0.5), 
        plot.background=element_blank(),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.border=element_blank(),
        axis.line=element_line(color="black")
    )
}


Query the database to select the proportion of studies that fall into each taxonomic group
#### get studies by taxa
taxaFile<-dbGetQuery(myConnection, "SELECT distinct TAXA, ((count(TAXA)/361)*100) as valueX from datasets group by TAXA")


Function to create a barplot for the taxa, which takes variables for data, colour and position of y axis
taxaPlot<-function(x, cols, pos) {
    plot<-ggplot(x, aes(factor(TAXA), valueX, fill=TAXA))+
      geom_bar(stat="identity", width=0.93,position=position_dodge(width=1),colour="black")+
      scale_fill_manual(values= cols)+
      scale_y_continuous(position=pos)+themeNoGrid()+
      theme(axis.text.x=element_blank())+
      geom_text(aes(label=TAXA), vjust=-1, hjust=0.7, angle=0.5)
    return(plot)
}
taxaPlot(taxaFile, taxaCol, "left") ##axis is on the left


Create a similar plot using climate information, rather than taxa and omitting the record number variable
#### select studies by climate, use an inner join here to link the datasets and site tables
result<-dbGetQuery(myConnection, "SELECT datasets.CENT_LAT, datasets.CENT_LONG, site.CLIMATE FROM datasets inner join site on site.STUDY_ID = datasets.STUDY_ID")
Colour palette for the climate
###colours for the climate
climCol<-c('#3288bd','#66c2a5','#abdda4','#e6f598','#fdae61','#f46d43')


Create map with study central locations, coloured by climate and with the tropical and polar lines of latitude overlaid
#### add the point co-ordinates to the blank map (including polar and tropic latitudinal lines)

points<-drawWorld("y")+geom_point(data=result, aes(x=result$CENT_LONG, y=result$CENT_LAT, colour=result$CLIMATE), size=5, alpha=I(0.7))
points<-points+scale_colour_manual(values= climCol)
points