library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(duckdb)
#> Loading required package: DBI
library(duckplyr)
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
#>
#> Attaching package: 'duckplyr'
#> The following object is masked from 'package:duckdb':
#>
#> read_csv_duckdb
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:arrow':
#>
#> duration
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(ncdf4)
library(rerddap)
library(rerddapUtils)The R package
rerddapUtils is a set of four main functions designed to
work extend the rerddap package to provide capabilities
requested by users, and that meet specialized needs which are better
dealt with in a separate package, and also provides two helper functions
that estimate the size of a proposed rerddap::griddap()
request.
The first function griddap_season():
griddap_season <- function(datasetx, ..., fields = 'all', stride = 1, season = NULL, fmt = "nc", url = eurl(), store = disk(), read = TRUE, callopts = list())extracts data only for the period during the year defined by ‘season’.
The second function griddap_split():
griddap_split <- function(datasetx, ..., fields = 'all', stride = 1, request_split = NULL, fmt = "nc", url = eurl(), store = disk(), read = TRUE, callopts = list(), aggregate_file = NULL)splits a very large request into pieces defined by ‘request_split’, and then aggregates the parts either in memory, in a duckdb database, or in a netcdf4 file. This allows for requests larger than 2GB to be made in an appropriate manner.
Note that griddap_season() and
griddap_split() are designed to mostly have the same
interface as rerddao::griddap(), except for one or two
additional function arguments. Also some arguments are interpreted
differently, such as in both functions the argument ‘store = disk()’ is
ignored. In griddap_season() the ‘fmt’ argument is ignored
while ‘season’ defines the days during the year to make the extract.
In griddap_split() the argument ‘fmt’ defines whether
the aggregate download should be stored in memory as a dataframe, in a
duckdb database file, or in a netcdf4 file. For the duckdb and nectdf4
options, ‘aggregate_file’ lets you define where to write the file. If
‘aggregate_file’ is not defined a temporary file is created following
proper R guidelines, and in both cases
the path to the file is returned.
The other two functions are designed to make it easier to work with projected datasets:
latlon_to_xy <- function (dataInfo, latitude, longitude, yName = 'latitude', xName = 'longitude', crs = NULL)which for a given dataset will convert a latitude and longitude
request into the projected coordinates to be used in
rerddao::griddap(), while the fourth function:
xy_to_latlon <- function (resp, yName = 'cols', xName = 'rows', crs = NULL)does the reverse, given an rerddao::griddap() extract
with projected coordinates, the coordinates are translated into latitude
and longitude values.
The two helper functions work with rerddap::griddap()
and griddap_split():
estimate_griddap_size <- function(info, ..., fields = "all",cstride = 1L, spacing = list(), verbose = TRUE)and
estimate_griddap_split_size <- function(size_est, splits, verbose = TRUE)estimate_griddap_size() provides a rough estimate of the
size of a download from an rerddap::griddap() request, and
estimate_griddap_split_size() takes the output from
’estimate_griddap_size() and estimates the size of the
download for each request for a given split.
This vignette assumes familiarity with the various ‘rerddap’ functions and how to use them. More information about the various ‘rerddap’ functions can be found in the documentation for that package.
griddap_season()The function griddap_season() works exactly like the
function rerddap::griddap except that there is one extra
argument, “season”, which restricts the extract to the part of the year
defined by “season”. griddap_season() also ignores several
of the rerddap::griddap options, such as it ignores the
“fmt” option, and only saves to a standard rerddap::griddap
output structure.
The argument “season” is a character string such that for a given
“year”, paste0(year, '-', season) produces a valid ISO
datetime string. To test that “season” is properly formed, use
lubridate::as_datetime():
year <- '2023'
season <- c('02-01', '06-01')
# test that "season" is defined properly
test_time <- paste0(year, '-', season)
lubridate::as_datetime(test_time)
#> [1] "2023-02-01 UTC" "2023-06-01 UTC"The example only includes month and day, but hours etc can be included as long as the final result, when a year is appended, produces a valid datetime.
As an example of using griddap_season() and how the
function compares to using rerddap::griddap(), suppose for
an rerddap::griddap() extract given by:
wind_info <- rerddap::info('erdQMekm14day')
extract <- rerddap::griddap(wind_info,
time = c('2015-01-01','2019-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current'
)it is desired to restrict the extract to March 1 to May 1 inclusive
in each year. Then the following modification of the above using
griddap_season()produces the desired result:
wcnURL <- "https://coastwatch.pfeg.noaa.gov/erddap/"
wind_info <- rerddap::info('erdQMekm14day', url = wcnURL)
season <- c('03-01', '05-01')
season_extract <- griddap_season(wind_info,
time = c('2015-01-01','2019-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
season = season
)To check that the extract has been limited to the given season:
extract_times <- lubridate::as_datetime(season_extract$data$time)
extract_months <- lubridate::month(extract_times)
unique(extract_months)Only results for months 3, 4, and 5 are included, as per the values of “season”. May is included because the definition of “season” is inclusive, hence May 1 will be included.
griddap_split()The function griddap_split() allows for
rerddap::griddap() requests that are larger than the 2GB
limit by splitting the request into parts and then combining the results
either in memory, in a dataframe, in a netcdf4 file, or into a duckdb
database file. If the “in memory” option is chosen, there is no check
that there is adequate memory, so it is possible to crash both R and
your computer. The nectdf4 and duckdb options depend more on having
adequate disk space, not memory. To check if the proposed split is small
enough, see estimate_griddap_split_size() below.
griddap_split() uses the same arguments as
rerddap::griddap() except that:
The argument “split” is a named list that defines the number of splits in each dimension, not the values where to make the splits, and a value must be given for each coordinate in the dataset. For the dataset ‘erdQMekm14day’ above, there are four coordinates, “time”, “altitude”, “latitude” and “longitude”, so to break the request into five “time” parts and two “latitude” and “longitude” parts (if you remember your combinatorics the original request will be broken into 20 separate requests) use:
Note that:
will fail, even though “altitude” is dimension one. A value must be given for all of the coordinates in the dataset.
“fmt” defines how the split extracts are combined to give one result.
If fmt = “memory” then the results are combined into a usual
rerddap::griddap() dataframe in memory, and that structure
is returned by the function. If fmt = “nc” then the extracts are
combined into a netcdf4 file, and a path to the netcdf4 file is returned
by the function. If fmt = “duckdb” then the extracts are combined into a
duckdb database file, and a path to the duckdb file is returned by the
function.
As in the previous example, given a proposed
rerddap::griddap() extract:
wind_info <- rerddap::info('erdQMekm14day')
extract <- rerddap::griddap(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current'
)then the function estimate_griddap_size() provides a
rough estimate of the size of the file to be downloaded, (usually a bit
on the low side, but good enough to see if the download size approaches
2GB):
sz <- estimate_griddap_size(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
)In the example the information is printed out and stored in the variable “sz”, when not set to a variable the information is just printed out. In this case the download size would be about 13.5 MB.
To obtain an estimate of the download size of each file in a proposed split:
and each split will be about 714KB in size.
Given that the requested splits are small enough, to make the request for the split “request_split” defined above and stored in memory:
wcnURL <- "https://coastwatch.pfeg.noaa.gov/erddap/"
wind_info <- rerddap::info('erdQMekm14day', url = wcnURL)
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "memory"
)
str(split_extract)On return, “split_extract” contains the usual
rerddap::griddap() dataframe. Again, there is no check that
there is enough memory to do this operation, so it is possible to crash
R (or your computer) if not used
carefully. It is common for computers these days to have 16GB of RAM or
more, so extracts a good bit larger than 2GB can be stored in
memory.
If instead it is desired to store the result in a netcdf4 file:
wcnURL <- "https://coastwatch.pfeg.noaa.gov/erddap/"
wind_info <- rerddap::info('erdQMekm14day', url = wcnURL)
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "nc"
)The argument “aggregate_file” is not set in the example above, so the netcdf4 file is written to a temporary location and the path to that file returned by the function. The reason for this is it is bad form for an R package to write to a user’s space without permission (in fact it is prohibited by CRAN). To set the location where the netcdf4 file is written, set “aggregate_file” to the name of the file, in which case the file is written in the present working directory, or else set “aggregate_file” as the full path to the file. As an example:
wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "nc",
aggregate_file = 'wind_extract.nc'
)To access the data in the netcdf4 file, any of the various R packages that can read netcdf4 files can be used, but it is recommend to only use packages that can read in subsets of the file, as the data may well be too large for memory. For example, to open the the netcdf4 file created above using ‘ncdf4’ and viewing a summary:
To make the same extract and save to a duckbb database file:
wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "duckdb",
aggregate_file = 'wind_extract.duckdb'
)There are several packages that can read “duckdb” database files, among them the packages ‘dplyr’ and ‘duckplyr’ (note extracts that write to a “duckdb” database file always write to a table named “extract”:
con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
tbl(con_db, "extract") |>
head(5) |>
collect()
dbDisconnect(con_db, shutdown=TRUE)The default “duckdb” format is used because it allows for incremental additions to the dataset on disk, so datasets can be larger than memory. A popular format for storing data is “parquet”, or if it is desired to encode geometry information into the data, then the “geoparquet” format. To copy the duckdb file to parquet:
con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
# Use DuckDB's COPY command to write directly to Parquet file without loading into R
query <- sprintf("COPY extract TO '%s' (FORMAT 'parquet')", "wind_extract.parquet")
DBI::dbExecute(con_db, query)
dbDisconnect(con_db, shutdown=TRUE)The resulting parquet files are half the size of the duckdb file, or smaller. While these examples are for writing the duckdb database to the parquet format, similar steps can be used if the data are in tibbles, but it may require reading the entire dataset into memory.
latlon_to_xy() and
xy_to_latlon()Extracting projected data in ERDDAP™ (that is not on a
latitude-longitude grid) can be difficult because most people do not
think in terms of the projection in defining the region to make an
extract, and similary once an extract has been made, it is often
desirable to know the coordinates in terms of latitude and longitude.
The functions latlon_to_xy() and
xy_to_latlon() are designed to help with this process.
In both cases the functions try to determine the dataset projection
either from the dataset summary given by rerddap::info()
when projecting the values of a latitude-longitude bounding box, or when
an extract has been made, from the metadata in the extract. This is done
by searching for any of the terms:
proj_strings <- c('proj4text', 'projection', 'proj4string', 'grid_mapping_epsg_code',
'WKT', 'proj_crs_code',
'grid_mapping_epsg_code', 'grid_mapping_proj4',
'grid_mapping_proj4_params', 'grid_mapping_proj4text')in the metadata. If this is not found, then the user has to provide the “CRS”, which can be in any of the forms above. For example the Ice dataset ‘nsidcG02202v4sh1day’ on the Coastwatch Central ERDDAP™ is projected. The bounding box of interest is:
then the bounding box in projected coordinates is found from:
myURL <- 'https://coastwatch.noaa.gov/erddap/'
myInfo <- rerddap::info('noaacwVIIRSn20icethickNP06Daily', url = myURL)
coords <- latlon_to_xy(myInfo, longitude, latitude)
coordsSimilary, given an extract on a projected coordinate system, the values on terms of latitude and longitude can be found by:
rows <- c( -889533.8, -469356.9)
cols <- c(622858.3, 270983.4)
myURL <- 'https://coastwatch.noaa.gov/erddap/'
myInfo <- rerddap::info('noaacwVIIRSn20icethickNP06Daily', url = myURL)
proj_extract <- rerddap::griddap(myInfo,
time = c('2023-01-30T00:00:00Z', '2023-01-30T00:00:00Z'),
rows = rows,
cols = cols,
fields = 'IceThickness',
url = myURL
)
test <- xy_to_latlon(proj_extract)
head(test)Note that xy_to_latlon() also works with the output from
the functions rerddapXtracto::rxtracto(),
rerddapXtracto::rxtracto_3D() and
`rerddapXtracto::rxtractogon().