vignettes/articles/retrieve-all-query-results.Rmd
To reduce the load on the servers, some of the Kew resources limit the number of results returned for a query. This tutorial will demonstrate how to download all the results for a query in a way that (hopefully) shouldn’t upset the servers.
Possibly the simplest option is to just tell the resource that you want more results.
By default, the search functions in kewr set the maximum number of results to 50. You can raise this limit so that a single request returns everything you need.
For instance, I know for sure that there are fewer than 2000 accepted species in the genus Myrcia. If I want to get a list of all these species from WCVP, I can, therefore, increase the maximum number of results to 2000.
results <- search_wcvp(query=list(genus="Myrcia"),
                       filters=c("accepted", "species"),
                       limit=2000)
results
#> <WCVP search: genus='Myrcia' filters: 'accepted, species'>
#> total results: 767
#> returned results: 767
#> total pages: 1
#> current page: 1
#> List of 1
#> $ :List of 9
#> ..$ id : chr "165525-2"
#> ..$ fqId : chr "urn:lsid:ipni.org:names:165525-2"
#> ..$ url : chr "/taxon/165525-2"
#> ..$ display : chr "<b><i>Myrcia abbotiana</i> (Urb.) Alain</b>"
#> ..$ accepted: logi TRUE
#> ..$ family : chr "Myrtaceae"
#> ..$ name : chr "Myrcia abbotiana"
#> ..$ author : chr "(Urb.) Alain"
#> ..$ rank : chr "Species"
We can see from the results object that we have a single page of results that contains the entries for all 767 accepted species in the genus.
However, this only really works when two things are true: you know an upper bound on the number of results, and that upper bound is reasonably small.
This strategy worked in this case because I knew there definitely weren’t more than 2000 accepted species, and 2000 is a relatively small number as things go. If there are more results than I expected, I run the risk of missing some entries. If my expected number of results is too big, say 20,000 or even 200,000, the request might time out before I get anything back.
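One way to guard against missing entries is to compare the number of results the server says exist with the number that actually came back. The sketch below uses plain lists as stand-ins for a real search result. The helper is_complete and the field names total and results are assumptions for illustration, so check them against the object you actually get back.

```r
# Hypothetical helper: TRUE when we received at least as many records
# as the server reported.
# Assumption: the result is a list with `total` (count reported by the
# server) and `results` (list of returned records).
is_complete <- function(res) {
  length(res$results) >= res$total
}

# Mock objects standing in for real search results
full <- list(total = 3, results = list("a", "b", "c"))
truncated <- list(total = 5, results = list("a", "b", "c"))

is_complete(full)       # TRUE
is_complete(truncated)  # FALSE
```

If a check like this comes back FALSE, the limit was too low and some entries are missing.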
The other way to get all of your results is to iterate over all the pages of your request.
Making multiple smaller requests avoids the request hanging because you asked for too much data. However, some resources could have rate-limiting enabled, which means they will block you if you make too many requests in a certain time period. Therefore, you need to balance the size of each request against the number of requests you make.
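To get a feel for that balance, it can help to work out how many requests a given page size implies. This is a small sketch in base R. The helpers pages_needed and waiting_time are hypothetical, and the one-second pause is an assumed value rather than anything kewr documents.

```r
# Number of requests needed to page through `total` results
pages_needed <- function(total, limit) {
  ceiling(total / limit)
}

# Rough time spent waiting between requests, assuming a fixed pause
# (in seconds) before every request after the first
waiting_time <- function(total, limit, pause = 1) {
  (pages_needed(total, limit) - 1) * pause
}

pages_needed(767, 100)   # 8
pages_needed(767, 2000)  # 1
waiting_time(767, 100)   # 7
```

Smaller pages mean more requests and more time spent waiting, while larger pages mean fewer requests that each take longer to come back.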
One way to make multiple requests is with a for loop.
To get started, we’ll make our first request outside of the for loop. This way, we can see how many pages we need to loop over. I’ve chosen a limit of 100 results per page here.
query <- list(genus="Myrcia")
filters <- c("accepted", "species")
r <- search_wcvp(query, filters=filters, limit=100)
r
#> <WCVP search: genus='Myrcia' filters: 'accepted, species'>
#> total results: 767
#> returned results: 100
#> total pages: 8
#> current page: 1
#> List of 1
#> $ :List of 9
#> ..$ id : chr "165525-2"
#> ..$ fqId : chr "urn:lsid:ipni.org:names:165525-2"
#> ..$ url : chr "/taxon/165525-2"
#> ..$ display : chr "<b><i>Myrcia abbotiana</i> (Urb.) Alain</b>"
#> ..$ accepted: logi TRUE
#> ..$ family : chr "Myrtaceae"
#> ..$ name : chr "Myrcia abbotiana"
#> ..$ author : chr "(Urb.) Alain"
#> ..$ rank : chr "Species"
Before we get the rest of the results in a for loop, it’s worth tidying our first result into a data frame, which we’ll then add all our subsequent results to.
results <- tidy(r)
Now we can loop through and get the rest of our query results.
IMPORTANT: making too many requests in a short period of time to POWO can cause problems for the server. By default, the request_next function adds in a little waiting period before making a new request. But you might get back an error if you’re asking for lots of things one after the other.
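library(dplyr)  # for bind_rows()

# one request per remaining page; seq_len() also copes with a single page
for (i in seq_len(r$pages - 1)) {
  r <- request_next(r)
  new_results <- tidy(r)
  results <- bind_rows(results, new_results)
}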
head(results)
#> # A tibble: 6 × 9
#> id fqId url display accepted family name author rank
#> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
#> 1 165525-2 urn:lsi… /taxon… <b><i>Myrci… TRUE Myrta… Myrci… (Urb.) … Spec…
#> 2 77107082-1 urn:lsi… /taxon… <b><i>Myrci… TRUE Myrta… Myrci… (O.Berg… Spec…
#> 3 77191340-1 urn:lsi… /taxon… <b><i>Myrci… TRUE Myrta… Myrci… (Alain)… Spec…
#> 4 165530-2 urn:lsi… /taxon… <b><i>Myrci… TRUE Myrta… Myrci… Borhidi Spec…
#> 5 77199392-1 urn:lsi… /taxon… <b><i>Myrci… TRUE Myrta… Myrci… (Urb.) … Spec…
#> 6 77191341-1 urn:lsi… /taxon… <b><i>Myrci… TRUE Myrta… Myrci… A.R.Lou… Spec…
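If you have a lot of pages, growing a data frame inside the loop can get slow, because the accumulated rows are copied on every iteration. An alternative pattern is to store each page in a list and combine them once at the end. The sketch below shows the pattern with fetch_page, a made-up stand-in that builds fake pages locally instead of calling a Kew resource.

```r
# Hypothetical stand-in for one page request; returns a small data frame
# holding the row numbers that would fall on that page.
fetch_page <- function(page, limit = 100, total = 767) {
  start <- (page - 1) * limit + 1
  end <- min(page * limit, total)
  data.frame(id = start:end, page = page)
}

n_pages <- ceiling(767 / 100)
pages <- vector("list", n_pages)  # pre-allocate one slot per page

for (i in seq_len(n_pages)) {
  pages[[i]] <- fetch_page(i)
}

# Combine once at the end; dplyr::bind_rows(pages) would do the same job
all_results <- do.call(rbind, pages)
nrow(all_results)  # 767
```

With real requests, each slot in the list would hold the tidied result of one page instead of a fake data frame.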
We can check we have all the results by looking at the number of rows in our results data frame:
nrow(results)
#> [1] 767
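If a request does come back with an error because you’ve asked for too much too quickly, one option is to wait and try again. Below is a minimal sketch of that idea. The helper with_retry is hypothetical and not part of kewr, and the number of tries and the pauses are arbitrary choices.

```r
# Hypothetical helper: call `f()` and, if it errors, wait and try again,
# doubling the pause after each failed attempt.
with_retry <- function(f, max_tries = 3, pause = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(pause)
      pause <- pause * 2
    }
  }
  stop("request failed after ", max_tries, " tries: ", conditionMessage(result))
}

# Demo with a made-up function that fails twice before succeeding
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("too many requests")
  "ok"
}

with_retry(flaky, pause = 0.01)
```

In the loop above, this could wrap the call as with_retry(function() request_next(r)), assuming request_next signals an R error when a request fails.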