ToL • kewr

library(kewr)
library(dplyr)
library(tidyr)

The Tree of Life is a database of specimens sequenced as part of Kew’s efforts to build a comprehensive evolutionary tree of life for flowering plants.

This package accesses data from the Tree of Life Explorer, an output of the Plant and Fungal Trees of Life Project (PAFTOL). The data in the Tree of Life is generated by target sequence capture using the universal Angiosperm353 probe set.

The Tree of Life contains information about specimens that have been sequenced and genes recovered in the process. It lets you download sequence data for the specimens, as well as alignments and trees for the genes.

Viewing the Tree of Life

The Tree of Life Explorer lets users view the tree of life constructed from the current dataset of samples.

You can view it using kewr by loading it in:

tree <- load_tol()
#> No encoding supplied: defaulting to UTF-8.
tree
#> <ToL tree url: https://treeoflife.kew.org/api/tree>
#> Preview:
#> (((17419_false_Gnetales_Gnetaceae_Gnetum_montanum:0.00000,18215_false_Ephedrales_Ephedraceae_Ephedra

load_tol reads the tree as a single Newick string, so you need another package (e.g., ape) to parse and view it.
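As a rough sketch of that parsing step, the block below reads a Newick string with ape and inspects the result. The string here is a small made-up tree, not real ToL data, and how the raw string is pulled out of the object returned by load_tol is not shown because it depends on the structure of that object.

```r
library(ape)

# A toy Newick string standing in for the ToL tree (labels are made up)
newick <- "((sp_A:0.1,sp_B:0.2):0.05,sp_C:0.3);"

# read.tree() parses Newick text into a "phylo" object
phy <- read.tree(text = newick)

# Inspect the parsed tree
Ntip(phy)       # number of tips
phy$tip.label   # tip labels as a character vector

# Draw the tree with ape's default plotting method
plot(phy)
```

The same calls should apply to the full tree once you have its Newick text as a character string.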

Searching ToL for specimens

The Tree of Life contains information about the specimens that have been sequenced to construct the tree. The long-term aim is to sample at least one species from every flowering plant genus. This means that, typically, there will be one specimen per species.

You can search this information using the search_tol function. There is no filtering or keyword-search functionality, so queries are just the name of an order/family/genus/species. For example, to get all specimens for the genus Myrcia:

specimens <- search_tol("Myrcia")
#> No encoding supplied: defaulting to UTF-8.
specimens
#> <ToL search: 'Myrcia'>
#> total results: 17
#> returned results: 17
#> total pages: 1
#> current page: 1
#> List of 1
#>  $ :List of 20
#>   ..$ age                    : NULL
#>   ..$ collector              : chr "Lima, D. F."
#>   ..$ collector_no           : chr "504"
#>   ..$ country                : NULL
#>   ..$ fasta_file_url         : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5034774.Myrcia_albotomentosa.a353.fasta"
#>   ..$ gene_stats             :List of 2
#>   ..$ genus                  :List of 3
#>   ..$ herbcat_url            : NULL
#>   ..$ id                     : int 2717
#>   ..$ is_suspicious_placement: logi FALSE
#>   ..$ material_source        :List of 2
#>   ..$ museum_barcode         : NULL
#>   ..$ project                :List of 2
#>   ..$ raw_reads              :List of 1
#>   ..$ sequence_id            : int 5528
#>   ..$ species                :List of 2
#>   ..$ specimen_reference     : chr "Lima, D. F. 504 (K)"
#>   ..$ specimen_source        : chr "RBGKew DNA Bank"
#>   ..$ taxonomy               :List of 4
#>   ..$ voucher_no             : NULL

Searching works by exact matching, and the taxonomy follows WCVP, so only accepted names will work. For example, if we misspell Myrcia, we get nothing:

search_tol("Mercya")
#> No encoding supplied: defaulting to UTF-8.
#> <ToL search: 'Mercya'>
#> total results: 0
#> returned results: 0
#> total pages: 0
#> current page: 1
#>  list()

And if we search for an outdated synonym we get nothing:

search_tol("Gomidesia")
#> No encoding supplied: defaulting to UTF-8.
#> <ToL search: 'Gomidesia'>
#> total results: 0
#> returned results: 0
#> total pages: 0
#> current page: 1
#>  list()

But searching by higher taxonomy will work:

specimens <- search_tol("Myrtaceae")
#> No encoding supplied: defaulting to UTF-8.
specimens
#> <ToL search: 'Myrtaceae'>
#> total results: 171
#> returned results: 50
#> total pages: 4
#> current page: 1
#> List of 1
#>  $ :List of 20
#>   ..$ age                    : int 1996
#>   ..$ collector              : chr "Chase, M.W."
#>   ..$ collector_no           : chr "10349"
#>   ..$ country                : NULL
#>   ..$ fasta_file_url         : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5033686.Acca_sellowiana.a353.fasta"
#>   ..$ gene_stats             :List of 2
#>   ..$ genus                  :List of 3
#>   ..$ herbcat_url            : NULL
#>   ..$ id                     : int 2660
#>   ..$ is_suspicious_placement: logi FALSE
#>   ..$ material_source        :List of 2
#>   ..$ museum_barcode         : NULL
#>   ..$ project                :List of 2
#>   ..$ raw_reads              :List of 1
#>   ..$ sequence_id            : int 5471
#>   ..$ species                :List of 2
#>   ..$ specimen_reference     : chr "Chase, M.W. 10349 (K)"
#>   ..$ specimen_source        : chr "RBGKew DNA Bank"
#>   ..$ taxonomy               :List of 4
#>   ..$ voucher_no             : NULL

Only the first 50 of the 171 results were returned. To get all of them, we can either increase the limit in the search function:

myrts_all <- search_tol("Myrtaceae", limit=500)
#> No encoding supplied: defaulting to UTF-8.
myrts_all
#> <ToL search: 'Myrtaceae'>
#> total results: 171
#> returned results: 171
#> total pages: 1
#> current page: 1
#> List of 1
#>  $ :List of 20
#>   ..$ age                    : int 1996
#>   ..$ collector              : chr "Chase, M.W."
#>   ..$ collector_no           : chr "10349"
#>   ..$ country                : NULL
#>   ..$ fasta_file_url         : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5033686.Acca_sellowiana.a353.fasta"
#>   ..$ gene_stats             :List of 2
#>   ..$ genus                  :List of 3
#>   ..$ herbcat_url            : NULL
#>   ..$ id                     : int 2660
#>   ..$ is_suspicious_placement: logi FALSE
#>   ..$ material_source        :List of 2
#>   ..$ museum_barcode         : NULL
#>   ..$ project                :List of 2
#>   ..$ raw_reads              :List of 1
#>   ..$ sequence_id            : int 5471
#>   ..$ species                :List of 2
#>   ..$ specimen_reference     : chr "Chase, M.W. 10349 (K)"
#>   ..$ specimen_source        : chr "RBGKew DNA Bank"
#>   ..$ taxonomy               :List of 4
#>   ..$ voucher_no             : NULL

Or do paged searching:

myrts1 <- search_tol("Myrtaceae")
#> No encoding supplied: defaulting to UTF-8.
myrts2 <- request_next(myrts1)
#> No encoding supplied: defaulting to UTF-8.
myrts2
#> <ToL search: 'Myrtaceae'>
#> total results: 171
#> returned results: 50
#> total pages: 4
#> current page: 2
#> List of 1
#>  $ :List of 20
#>   ..$ age                    : NULL
#>   ..$ collector              : chr "J.E.Q. Faria"
#>   ..$ collector_no           : chr "1250"
#>   ..$ country                : NULL
#>   ..$ fasta_file_url         : chr "http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5034263.Eugenia_bimarginata.a353.fasta"
#>   ..$ gene_stats             :List of 2
#>   ..$ genus                  :List of 3
#>   ..$ herbcat_url            : NULL
#>   ..$ id                     : int 8160
#>   ..$ is_suspicious_placement: logi FALSE
#>   ..$ material_source        :List of 2
#>   ..$ museum_barcode         : NULL
#>   ..$ project                :List of 2
#>   ..$ raw_reads              :List of 1
#>   ..$ sequence_id            : int 7470
#>   ..$ species                :List of 2
#>   ..$ specimen_reference     : chr "J.E.Q. Faria 1250 (UB)"
#>   ..$ specimen_source        : chr "RBGKew DNA Bank"
#>   ..$ taxonomy               :List of 4
#>   ..$ voucher_no             : NULL
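The page counts reported above follow directly from the total and the page size. The sketch below reproduces that arithmetic offline, assuming a page size of 50 (which is what the output suggests for the default). The record ids are simulated, and n_pages is a hypothetical helper, not part of kewr.

```r
# Hypothetical helper: pages needed for a given total and page size
n_pages <- function(total, per_page) ceiling(total / per_page)

n_pages(171, 50)    # default page size, as in the paged search
n_pages(171, 500)   # limit large enough to fit everything on one page

# Simulate paging through 171 records, 50 at a time
ids <- seq_len(171)
pages <- split(ids, ceiling(seq_along(ids) / 50))

# Each element is one page; the last page holds the remainder
lengths(pages)
```

This is why raising the limit above the total collapses the search to a single page, while the default needs repeated calls to request_next.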

And we can tidy our results into a dataframe:

tidied <- tidy(myrts_all)
tidied
#> # A tibble: 171 × 20
#>      age collector     collector_no country  fasta_file_url     gene_stats genus
#>    <int> <chr>         <chr>        <list>   <chr>              <list>     <lis>
#>  1  1996 Chase, M.W.   10349        <NULL>   http://sftp.kew.o… <tibble [… <tib…
#>  2  1981 Hensold, N.;… 20845        <tibble… http://sftp.kew.o… <tibble [… <tib…
#>  3  1985 Rodd, A.N.; … 4984         <NULL>   http://sftp.kew.o… <tibble [… <tib…
#>  4  2017 Maurin, O.    4360         <NULL>   http://sftp.kew.o… <tibble [… <tib…
#>  5  2008 J.U.          404          <NULL>   http://sftp.kew.o… <tibble [… <tib…
#>  6  1985 Clark, M.     52           <NULL>   http://sftp.kew.o… <tibble [… <tib…
#>  7  2011 Smith, R.J.,… 294          <NULL>   http://sftp.kew.o… <tibble [… <tib…
#>  8    NA Landrum, L.R. 12251        <NULL>   http://sftp.kew.o… <tibble [… <tib…
#>  9    NA NA            NA           <NULL>   http://sftp.kew.o… <tibble [… <tib…
#> 10  2005 Johnstone, R. 1544         <NULL>   http://sftp.kew.o… <tibble [… <tib…
#> # … with 161 more rows, and 13 more variables: herbcat_url <chr>, id <int>,
#> #   is_suspicious_placement <lgl>, material_source <list>,
#> #   museum_barcode <chr>, project <list>, raw_reads <list>, sequence_id <int>,
#> #   species <list>, specimen_reference <chr>, specimen_source <chr>,
#> #   taxonomy <list>, voucher_no <chr>

Some information is nested inside the tidied dataframe, but we can get to it by unnesting:

tidied %>%
  select(id, raw_reads, taxonomy) %>%
  unnest(cols=c(taxonomy, raw_reads), names_sep="_")
#> # A tibble: 171 × 10
#>       id raw_reads_data_access_id raw_reads_data_… raw_reads_id raw_reads_reads…
#>    <int> <chr>                    <chr>                   <int>            <int>
#>  1  2660 ERX4840033               https://www.ebi…         1102          2218140
#>  2  4278 ERX4840494               https://www.ebi…         1760          1363846
#>  3  4221 ERX4840466               https://www.ebi…         1708           911144
#>  4  1610 ERX4890378               https://www.ebi…          567          3707086
#>  5  2692 ERX4840052               https://www.ebi…         1124          2090550
#>  6  4222 ERX4840467               https://www.ebi…         1709          2914784
#>  7  2707 ERX4841120               https://www.ebi…         1136          1819906
#>  8  4591 ERX4841264               https://www.ebi…         1809          1932262
#>  9  2682 ERX4840048               https://www.ebi…         1119          1047180
#> 10  4224 ERX4840468               https://www.ebi…         1710          1044822
#> # … with 161 more rows, and 5 more variables:
#> #   raw_reads_sequence_platform <chr>, taxonomy_family <chr>,
#> #   taxonomy_genus <chr>, taxonomy_order <chr>, taxonomy_species <chr>
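One thing to watch when unnesting is that list-columns with NULL entries, such as country in the output above, cause rows to be dropped by default. The sketch below shows this on a small mock tibble. The values and the name field inside country are made up for illustration and may not match the real nested structure.

```r
library(dplyr)
library(tidyr)

# Mock data: a nested column where most entries are NULL
mock <- tibble(
  id = c(1L, 2L, 3L),
  country = list(NULL, tibble(name = "Brazil"), NULL)
)

# Default behaviour: rows whose nested value is NULL are dropped
dropped <- mock %>%
  unnest(cols = country, names_sep = "_")
nrow(dropped)

# keep_empty = TRUE keeps every row and fills the gaps with NA
kept <- mock %>%
  unnest(cols = country, names_sep = "_", keep_empty = TRUE)
nrow(kept)
names(kept)
```

Using keep_empty = TRUE is usually the safer choice when you want one row per specimen regardless of missing metadata.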

Getting gene information

The Tree of Life also contains information about the genes captured during sequencing. These can be accessed using the search_tol function:

genes_all <- search_tol(genes=TRUE, limit=500)
#> No encoding supplied: defaulting to UTF-8.
tidy(genes_all)
#> # A tibble: 353 × 16
#>    alignment_file_url       average_contig_l… average_contig_l… exemplar_access…
#>    <chr>                                <dbl>             <dbl> <chr>           
#>  1 http://sftp.kew.org/pub…              493.              86.3 Q8GWR1          
#>  2 http://sftp.kew.org/pub…              584.              51.7 Q8H1R4          
#>  3 http://sftp.kew.org/pub…              453.              55.2 Q8LEF6          
#>  4 http://sftp.kew.org/pub…              487.              55.1 Q9FZ49          
#>  5 http://sftp.kew.org/pub…              645.              60.8 P04747          
#>  6 http://sftp.kew.org/pub…              790.              68.4 Q9ZUC1          
#>  7 http://sftp.kew.org/pub…              641.              53.6 Q8VY89          
#>  8 http://sftp.kew.org/pub…              969.              65.2 Q9LRZ3          
#>  9 http://sftp.kew.org/pub…              344.              67.4 Q9FIG9          
#> 10 http://sftp.kew.org/pub…              612.              90.4 F4JUL9          
#> # … with 343 more rows, and 12 more variables: exemplar_hyperlink <chr>,
#> #   exemplar_name <chr>, exemplar_species <chr>, fasta_file_url <chr>,
#> #   genera_count <int>, id <int>, internal_name <chr>, newick_file <chr>,
#> #   newick_file_path_name <chr>, sequence_count <int>, species_count <int>,
#> #   tree_file_url <chr>

Genes cannot currently be queried, so the simplest approach is to retrieve all of them and filter locally.
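Once all genes are in a data frame, ordinary dplyr verbs can stand in for a query. The sketch below uses a mock tibble with a few of the column names shown in the output above. The rows are invented for illustration (only the pairing of id 51 with AAAS appears elsewhere in this vignette), so swap in tidy(genes_all) for real use.

```r
library(dplyr)

# Mock stand-in for tidy(genes_all); values are illustrative only
genes <- tibble(
  id = c(51L, 52L, 53L),
  exemplar_name = c("AAAS", "GENE_B", "GENE_C"),
  species_count = c(2905L, 2500L, 1200L)
)

# Find a gene by its exemplar name
aaas <- genes %>%
  filter(exemplar_name == "AAAS")

# Keep genes recovered from many species, most widely recovered first
well_sampled <- genes %>%
  filter(species_count >= 2000) %>%
  arrange(desc(species_count))

aaas
well_sampled
```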

Looking up a record

Information about a single specimen or gene can be looked up using its ID:

specimen <- lookup_tol("2660")
#> No encoding supplied: defaulting to UTF-8.
specimen
#> <ToL specimen id: 2660>
#> Species: Acca sellowiana
#> Family: Myrtaceae
#> Order: Myrtales
#> Collector: Chase, M.W.
#> Project: PAFTOL
#> No. of reads: 2,218,140
#> Sequencing platform: MiSeq
#> Suspicious placement: FALSE

gene <- lookup_tol("51", type="gene")
#> No encoding supplied: defaulting to UTF-8.
gene
#> <ToL gene id: 51>
#> Exemplar name: AAAS
#> Exemplar source species: Arabidopsis thaliana (Mouse-ear cress)
#> No. species: 2905
#> No. genera: 2294
#> Avg. recovered length: 493.4949
#> Avg. % recovered: 86.3

Loading data

Records returned by search_tol and lookup_tol contain links to data files on an SFTP server. You can load these into R using the load_tol function. As you saw at the top of this vignette, if you don’t provide any URL to load_tol, it will load the whole Tree of Life tree file.

To load a sequence file for a particular specimen:

load_tol(specimen$fasta_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL fasta url: http://sftp.kew.org/pub/paftol/current_release/fasta/by_recovery/INSDC.ERR5033686.Acca_sellowiana.a353.fasta>
#> Preview:
#> >6483 Gene_Name:RPL13 Species:Acca_sellowiana Repository:INSDC Sequence_ID:ERR5033686
#> GGACAGTTAGTTGT

To load a sequence file for a gene:

load_tol(gene$fasta_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL fasta url: http://sftp.kew.org/pub/paftol/current_release/fasta/by_gene/5328.dna.fasta>
#> Preview:
#> >5328 Gene_Name:AAAS Species:Thaumatococcus_daniellii Repository:INSDC Sequence_ID:ERR5034476
#> GAGCGG

Or the alignment file:

load_tol(gene$alignment_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL fasta url: http://sftp.kew.org/pub/paftol/current_release/fasta/alignments/5328.dna.aln.fasta>
#> Preview:
#> >5328 Gene_Name:AAAS Species:Ceratophyllum_demersum Repository:INSDC Sequence_ID:SRR7451107
#> --------

Or the gene tree:

load_tol(gene$tree_file_url)
#> No encoding supplied: defaulting to UTF-8.
#> <ToL tree url: http://sftp.kew.org/pub/paftol/current_release/tree/gene/5328.tree>
#> Preview:
#> (Ceratophyllales_Ceratophyllaceae_Ceratophyllum_demersum_SRR7451107:0.3494554101,(((((((((((((((((((

All files are returned as strings, so you will need to parse them to use them downstream.
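For FASTA content, a minimal base R parser might look like the sketch below. parse_fasta is a hypothetical helper, not part of kewr, and the input is a short made-up string in the same header style as the previews above. Dedicated packages for sequence data will be more robust for real analyses.

```r
# Hypothetical helper: turn FASTA text into a named character vector
parse_fasta <- function(txt) {
  lines <- strsplit(txt, "\r?\n")[[1]]
  lines <- lines[nzchar(lines)]           # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)             # record index for every line
  headers <- sub("^>", "", lines[is_header])

  # Join the sequence lines belonging to each record
  seqs <- vapply(
    split(lines[!is_header],
          factor(record[!is_header], levels = seq_along(headers))),
    paste, character(1), collapse = ""
  )
  names(seqs) <- headers
  seqs
}

# Made-up FASTA text with two records, one split over two lines
txt <- ">s1 Gene_Name:AAAS\nACGT\nTTGA\n>s2 Gene_Name:RPL13\nGGCC\n"
seqs <- parse_fasta(txt)

# Sequence lengths, and gene names pulled from the headers
nchar(seqs)
gene_names <- sub(".*Gene_Name:(\\S+).*", "\\1", names(seqs))
gene_names
```

Headers with no sequence lines come back as empty strings rather than being dropped, so the output always has one element per record.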

If you want to download these files directly, you can use the download_tol function.