The Tree of Life is a database of specimens sequenced as part of Kew’s efforts to build a comprehensive evolutionary tree of life for flowering plants.
This package accesses data from the Tree of Life Explorer, an output of the Plant and Fungal Trees of Life Project (PAFTOL). The data in the Tree of Life is generated by target sequence capture using the universal Angiosperms353 probe set.
The Tree of Life contains information about specimens that have been sequenced and genes recovered in the process. It lets you download sequence data for the specimens, as well as alignments and trees for the genes.
The Tree of Life Explorer lets users view the tree of life constructed from the current dataset of samples.
You can view it using kewr by loading it:
tree <- load_tol()
#> No encoding supplied: defaulting to UTF-8.
tree
#> <ToL tree url: https://treeoflife.kew.org/api/tree>
#> Preview:
#> (((17419_false_Gnetales_Gnetaceae_Gnetum_montanum:0.00000,18215_false_Ephedrales_Ephedraceae_Ephedra
This reads the tree in as a single string, so you will need another package (e.g., ape) to parse and view it.
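As a minimal sketch of that parsing step, the snippet below uses ape::read.tree() on a tiny hand-written Newick string. The string and its tip names are made up for illustration and are not the real ToL tree. The text held by the object returned from load_tol() could be passed to the text argument in the same way, assuming you have pulled it out as a character string.

```r
library(ape)

# Illustrative Newick string (made up, not real ToL output).
newick <- "((Taxon_A:0.1,Taxon_B:0.2):0.05,Taxon_C:0.3);"

# read.tree() parses Newick text into a "phylo" object.
phy <- read.tree(text = newick)

# Inspect the parsed tree: number of tips and their labels.
Ntip(phy)
phy$tip.label
```

Once parsed, the phylo object can be drawn with plot(phy).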
The Tree of Life contains information about the specimens that have been sequenced to construct the tree. The long-term aim is to sample at least one species from every flowering plant genus. This means that, typically, there will be one specimen per species.
You can search this information using the search_tol function. There is no filtering or keyword-search functionality, so a query is just the name of an order, family, genus, or species. For example, to get all specimens for the genus Myrcia:
specimens <- search_tol("Myrcia")
#> No encoding supplied: defaulting to UTF-8.
specimens
#> <ToL search: 'Myrcia'>
#> total results: 17
#> returned results: 17
#> total pages: 1
#> current page: 1
#> List of 1
#> $ :List of 20
#> ..$ age : NULL
#> ..$ collector : chr "Lima, D. F."
#> ..$ collector_no : chr "504"
#> ..$ country : NULL
#> ..$ fasta_file_url : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5034774.Myrcia_albotomentosa.a353.fasta"
#> ..$ gene_stats :List of 2
#> ..$ genus :List of 3
#> ..$ herbcat_url : NULL
#> ..$ id : int 2717
#> ..$ is_suspicious_placement: logi FALSE
#> ..$ material_source :List of 2
#> ..$ museum_barcode : NULL
#> ..$ project :List of 2
#> ..$ raw_reads :List of 1
#> ..$ sequence_id : int 5528
#> ..$ species :List of 2
#> ..$ specimen_reference : chr "Lima, D. F. 504 (K)"
#> ..$ specimen_source : chr "RBGKew DNA Bank"
#> ..$ taxonomy :List of 4
#> ..$ voucher_no : NULL
Searching works by exact matching, and the taxonomy follows WCVP, so only accepted names will work. For example, if we misspell Myrcia we get nothing:
search_tol("Mercya")
#> No encoding supplied: defaulting to UTF-8.
#> <ToL search: 'Mercya'>
#> total results: 0
#> returned results: 0
#> total pages: 0
#> current page: 1
#> list()
And if we search for an outdated synonym we get nothing:
search_tol("Gomidesia")
#> No encoding supplied: defaulting to UTF-8.
#> <ToL search: 'Gomidesia'>
#> total results: 0
#> returned results: 0
#> total pages: 0
#> current page: 1
#> list()
But searching by higher taxonomy will work:
specimens <- search_tol("Myrtaceae")
#> No encoding supplied: defaulting to UTF-8.
specimens
#> <ToL search: 'Myrtaceae'>
#> total results: 171
#> returned results: 50
#> total pages: 4
#> current page: 1
#> List of 1
#> $ :List of 20
#> ..$ age : int 1996
#> ..$ collector : chr "Chase, M.W."
#> ..$ collector_no : chr "10349"
#> ..$ country : NULL
#> ..$ fasta_file_url : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5033686.Acca_sellowiana.a353.fasta"
#> ..$ gene_stats :List of 2
#> ..$ genus :List of 3
#> ..$ herbcat_url : NULL
#> ..$ id : int 2660
#> ..$ is_suspicious_placement: logi FALSE
#> ..$ material_source :List of 2
#> ..$ museum_barcode : NULL
#> ..$ project :List of 2
#> ..$ raw_reads :List of 1
#> ..$ sequence_id : int 5471
#> ..$ species :List of 2
#> ..$ specimen_reference : chr "Chase, M.W. 10349 (K)"
#> ..$ specimen_source : chr "RBGKew DNA Bank"
#> ..$ taxonomy :List of 4
#> ..$ voucher_no : NULL
To get all these results, we can either increase the limit in the search function:
myrts_all <- search_tol("Myrtaceae", limit=500)
#> No encoding supplied: defaulting to UTF-8.
myrts_all
#> <ToL search: 'Myrtaceae'>
#> total results: 171
#> returned results: 171
#> total pages: 1
#> current page: 1
#> List of 1
#> $ :List of 20
#> ..$ age : int 1996
#> ..$ collector : chr "Chase, M.W."
#> ..$ collector_no : chr "10349"
#> ..$ country : NULL
#> ..$ fasta_file_url : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5033686.Acca_sellowiana.a353.fasta"
#> ..$ gene_stats :List of 2
#> ..$ genus :List of 3
#> ..$ herbcat_url : NULL
#> ..$ id : int 2660
#> ..$ is_suspicious_placement: logi FALSE
#> ..$ material_source :List of 2
#> ..$ museum_barcode : NULL
#> ..$ project :List of 2
#> ..$ raw_reads :List of 1
#> ..$ sequence_id : int 5471
#> ..$ species :List of 2
#> ..$ specimen_reference : chr "Chase, M.W. 10349 (K)"
#> ..$ specimen_source : chr "RBGKew DNA Bank"
#> ..$ taxonomy :List of 4
#> ..$ voucher_no : NULL
Or do paged searching:
myrts1 <- search_tol("Myrtaceae")
#> No encoding supplied: defaulting to UTF-8.
myrts2 <- request_next(myrts1)
#> No encoding supplied: defaulting to UTF-8.
myrts2
#> <ToL search: 'Myrtaceae'>
#> total results: 171
#> returned results: 50
#> total pages: 4
#> current page: 2
#> List of 1
#> $ :List of 20
#> ..$ age : NULL
#> ..$ collector : chr "J.E.Q. Faria"
#> ..$ collector_no : chr "1250"
#> ..$ country : NULL
#> ..$ fasta_file_url : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5034263.Eugenia_bimarginata.a353.fasta"
#> ..$ gene_stats :List of 2
#> ..$ genus :List of 3
#> ..$ herbcat_url : NULL
#> ..$ id : int 8160
#> ..$ is_suspicious_placement: logi FALSE
#> ..$ material_source :List of 2
#> ..$ museum_barcode : NULL
#> ..$ project :List of 2
#> ..$ raw_reads :List of 1
#> ..$ sequence_id : int 7470
#> ..$ species :List of 2
#> ..$ specimen_reference : chr "J.E.Q. Faria 1250 (UB)"
#> ..$ specimen_source : chr "RBGKew DNA Bank"
#> ..$ taxonomy :List of 4
#> ..$ voucher_no : NULL
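The page counts in these printouts follow from the total number of results and the page size. The arithmetic below is a small illustration only. It assumes a page size of 50, which is inferred from the "returned results" value shown above rather than taken from the API documentation.

```r
# Values read off the Myrtaceae search output above.
total_results <- 171L
page_size <- 50L  # assumption: inferred from "returned results: 50"

# Number of pages needed to hold every result.
n_pages <- ceiling(total_results / page_size)

# Assign each result to a page and count how many land on each one.
page_of <- ceiling(seq_len(total_results) / page_size)
results_per_page <- tabulate(page_of)

n_pages
results_per_page
```

Under that assumption the last page is only partly filled, so paging through the results takes one request per page, while raising the limit above the total fetches everything in one request.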
And we can tidy our results into a dataframe:
tidied <- tidy(myrts_all)
tidied
#> # A tibble: 171 × 20
#> age collector collector_no country fasta_file_url gene_stats genus
#> <int> <chr> <chr> <list> <chr> <list> <lis>
#> 1 1996 Chase, M.W. 10349 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 2 1981 Hensold, N.;… 20845 <tibble… http://sftp.kew.o… <tibble [… <tib…
#> 3 1985 Rodd, A.N.; … 4984 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 4 2017 Maurin, O. 4360 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 5 2008 J.U. 404 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 6 1985 Clark, M. 52 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 7 2011 Smith, R.J.,… 294 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 8 NA Landrum, L.R. 12251 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 9 NA NA NA <NULL> http://sftp.kew.o… <tibble [… <tib…
#> 10 2005 Johnstone, R. 1544 <NULL> http://sftp.kew.o… <tibble [… <tib…
#> # … with 161 more rows, and 13 more variables: herbcat_url <chr>, id <int>,
#> # is_suspicious_placement <lgl>, material_source <list>,
#> # museum_barcode <chr>, project <list>, raw_reads <list>, sequence_id <int>,
#> # species <list>, specimen_reference <chr>, specimen_source <chr>,
#> # taxonomy <list>, voucher_no <chr>
Some information is nested inside the tidied dataframe, but we can get to it by unnesting:
library(dplyr)
library(tidyr)
tidied %>%
  select(id, raw_reads, taxonomy) %>%
  unnest(cols=c(taxonomy, raw_reads), names_sep="_")
#> # A tibble: 171 × 10
#> id raw_reads_data_access_id raw_reads_data_… raw_reads_id raw_reads_reads…
#> <int> <chr> <chr> <int> <int>
#> 1 2660 ERX4840033 https://www.ebi… 1102 2218140
#> 2 4278 ERX4840494 https://www.ebi… 1760 1363846
#> 3 4221 ERX4840466 https://www.ebi… 1708 911144
#> 4 1610 ERX4890378 https://www.ebi… 567 3707086
#> 5 2692 ERX4840052 https://www.ebi… 1124 2090550
#> 6 4222 ERX4840467 https://www.ebi… 1709 2914784
#> 7 2707 ERX4841120 https://www.ebi… 1136 1819906
#> 8 4591 ERX4841264 https://www.ebi… 1809 1932262
#> 9 2682 ERX4840048 https://www.ebi… 1119 1047180
#> 10 4224 ERX4840468 https://www.ebi… 1710 1044822
#> # … with 161 more rows, and 5 more variables:
#> # raw_reads_sequence_platform <chr>, taxonomy_family <chr>,
#> # taxonomy_genus <chr>, taxonomy_order <chr>, taxonomy_species <chr>
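To see what names_sep does in isolation, here is a toy example. The tibble is made up for illustration and is not taken from the ToL API. Each nested column's name is prefixed to the names of its inner columns, which is why the output above has columns such as taxonomy_family.

```r
library(tibble)
library(tidyr)

# Toy data: one list-column holding a one-row tibble per record.
toy <- tibble(
  id = c(1L, 2L),
  taxonomy = list(
    tibble(family = "Myrtaceae", genus = "Acca"),
    tibble(family = "Myrtaceae", genus = "Myrcia")
  )
)

# Unnest the list-column; inner names become "<outer>_<inner>".
flat <- unnest(toy, cols = taxonomy, names_sep = "_")
names(flat)
```

Without names_sep, the inner columns would keep their bare names (family, genus), which can clash when several nested columns share inner names such as id.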
The Tree of Life also contains information about the genes captured during sequencing. These can be accessed using the search_tol function:
genes_all <- search_tol(genes=TRUE, limit=500)
#> No encoding supplied: defaulting to UTF-8.
tidy(genes_all)
#> # A tibble: 353 × 16
#> alignment_file_url average_contig_l… average_contig_l… exemplar_access…
#> <chr> <dbl> <dbl> <chr>
#> 1 http://sftp.kew.org/pub… 493. 86.3 Q8GWR1
#> 2 http://sftp.kew.org/pub… 584. 51.7 Q8H1R4
#> 3 http://sftp.kew.org/pub… 453. 55.2 Q8LEF6
#> 4 http://sftp.kew.org/pub… 487. 55.1 Q9FZ49
#> 5 http://sftp.kew.org/pub… 645. 60.8 P04747
#> 6 http://sftp.kew.org/pub… 790. 68.4 Q9ZUC1
#> 7 http://sftp.kew.org/pub… 641. 53.6 Q8VY89
#> 8 http://sftp.kew.org/pub… 969. 65.2 Q9LRZ3
#> 9 http://sftp.kew.org/pub… 344. 67.4 Q9FIG9
#> 10 http://sftp.kew.org/pub… 612. 90.4 F4JUL9
#> # … with 343 more rows, and 12 more variables: exemplar_hyperlink <chr>,
#> # exemplar_name <chr>, exemplar_species <chr>, fasta_file_url <chr>,
#> # genera_count <int>, id <int>, internal_name <chr>, newick_file <chr>,
#> # newick_file_path_name <chr>, sequence_count <int>, species_count <int>,
#> # tree_file_url <chr>
Genes cannot currently be queried, so the best approach is to retrieve all of them.
Information about a single specimen or gene can be looked up using its ID:
specimen <- lookup_tol("2660")
#> No encoding supplied: defaulting to UTF-8.
specimen
#> <ToL specimen id: 2660>
#> Species: Acca sellowiana
#> Family: Myrtaceae
#> Order: Myrtales
#> Collector: Chase, M.W.
#> Project: PAFTOL
#> No. of reads: 2,218,140
#> Sequencing platform: MiSeq
#> Suspicious placement: FALSE
gene <- lookup_tol("51", type="gene")
#> No encoding supplied: defaulting to UTF-8.
gene
#> <ToL gene id: 51>
#> Exemplar name: AAAS
#> Exemplar source species: Arabidopsis thaliana (Mouse-ear cress)
#> No. species: 2905
#> No. genera: 2294
#> Avg. recovered length: 493.4949
#> Avg. % recovered: 86.3
Records returned by search_tol and lookup_tol contain links to data files on an SFTP server. You can load these into R using the load_tol function. As you saw at the top of this vignette, if you don’t provide a URL to load_tol, it will load the whole Tree of Life tree file.
To load a sequence file for a particular specimen:
load_tol(specimen$fasta_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL fasta url: http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5033686.Acca_sellowiana.a353.fasta>
#> Preview:
#> >6483 Gene_Name:RPL13 Species:Acca_sellowiana Repository:INSDC Sequence_ID:ERR5033686
#> GGACAGTTAGTTGT
To load a sequence file for a gene:
load_tol(gene$fasta_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL fasta url: http://sftp.kew.org/pub/paftol/current_release/fasta/by_gene/5328.dna.fasta>
#> Preview:
#> >5328 Gene_Name:AAAS Species:Thaumatococcus_daniellii Repository:INSDC Sequence_ID:ERR5034476
#> GAGCGG
Or the alignment file:
load_tol(gene$alignment_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL fasta url: http://sftp.kew.org/pub/paftol/current_release/fasta/alignments/5328.dna.aln.fasta>
#> Preview:
#> >5328 Gene_Name:AAAS Species:Ceratophyllum_demersum Repository:INSDC Sequence_ID:SRR7451107
#> --------
Or the gene tree:
load_tol(gene$tree_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL tree url: http://sftp.kew.org/pub/paftol/current_release/tree/gene/5328.tree>
#> Preview:
#> (Ceratophyllales_Ceratophyllaceae_Ceratophyllum_demersum_SRR7451107:0.3494554101,(((((((((((((((((((
All files are returned as strings, so you will need to parse them to use them downstream.
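As a rough sketch of that parsing step for FASTA text, the base R helper below splits a string into a named character vector of sequences. parse_fasta is a hypothetical helper written for this example, not part of kewr, and the input string is made up. It assumes you have already extracted the file text as a single character string. For real analyses, a dedicated package is likely a better choice.

```r
# Hypothetical helper (not part of kewr): parse FASTA text into a
# named character vector, one element per record.
parse_fasta <- function(text) {
  lines <- strsplit(text, "\r?\n")[[1]]
  lines <- lines[nzchar(lines)]          # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # record index for every line
  headers <- sub("^>", "", lines[is_header])
  # Group sequence lines by record, keeping records with no sequence.
  groups <- split(
    lines[!is_header],
    factor(record[!is_header], levels = seq_along(headers))
  )
  seqs <- vapply(groups, paste, character(1), collapse = "")
  names(seqs) <- headers
  seqs
}

# Made-up FASTA text with two records; the first spans two lines.
fasta_text <- paste(
  ">seq1 Gene_Name:GENE_A Species:Example_one",
  "GGACAG",
  "TTAG",
  ">seq2 Gene_Name:GENE_A Species:Example_two",
  "GAGCGG",
  sep = "\n"
)

seqs <- parse_fasta(fasta_text)
nchar(seqs)
```

Multi-line sequences are joined into one string per record, and the full header line (minus the leading ">") is used as the name.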
If you want to download these files directly, you can use the download_tol function.