--- title: "Input data requirements" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Input data requirements} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(width = 10000) ``` The input data required for `greatR` can be either a data frame, a list of data frames, or a list of vectors, as shown below: ```{r fig-input-diagram, echo=FALSE, fig.align='center', out.width='75%'} knitr::include_graphics("figures/input_diagram.png") ``` ## A single data frame input The single data frame input or the data frame in the list must contain gene expression time-course data with all replicates. The illustrative diagram below shows the required structure of the `input`. ```{r fig-input-table, echo=FALSE, fig.align='center', out.width='75%'} knitr::include_graphics("figures/input_table.png") ``` This data frame must contain **reference** and **query** expression data, and the following five columns: - `gene_id`: locus name or unique ID of each gene. - `accession`: accession or name of the reference and query data. - `timepoint`: time points of the gene expression data. - `expression_value`: desired expression values or measure of the abundance of gene or transcripts. This value can be RPM, RPKM, FPKM, TPM, TMM, or raw read counts. - `replicate`: biological replicate ID for an expression value at a particular time point. The example below shows how the `input` data should look like: ```{r brapa-data-kable, echo=FALSE} # Load a data frame from the sample data b_rapa_data <- system.file("extdata/brapa_arabidopsis_data.csv", package = "greatR") |> data.table::fread() b_rapa_data[, .SD[1:2], by = accession][, .(gene_id, accession, timepoint, expression_value, replicate)] |> knitr::kable() ``` ## A list of data frames as an input If users do not have the data **reference** and **query** joined with the IDs mapped into one single data frame, there is an option of having the input data of a list of data frames. As shown in the illustrative diagram below, the list must contain both **reference** and **query** data frames with the columns as required in the single data frame input (see previous section). ```{r fig-input-list-tables, echo=FALSE, fig.align='center', out.width='85%'} knitr::include_graphics("figures/input_list_tables.png") ``` The example below shows how the `input` list of data frames should look like: ```{r brapa-data-list-df-read, include=FALSE} # Load a data frame from the sample data brapa_ref_data <- system.file("extdata/brapa_SOC1_data.csv", package = "greatR") |> data.table::fread() ara_query_data <- system.file("extdata/arabidopsis_SOC1_data.csv", package = "greatR") |> data.table::fread() list_df <- list( reference = brapa_ref_data, query = ara_query_data ) ``` ```r # Load data frames from the sample data brapa_ref_data <- system.file("extdata/brapa_SOC1_data.csv", package = "greatR") |> data.table::fread() ara_query_data <- system.file("extdata/arabidopsis_SOC1_data.csv", package = "greatR") |> data.table::fread() list_df <- list( reference = brapa_ref_data, query = ara_query_data ) list_df #> $reference #> gene_id accession timepoint expression_value replicate #> #> 1: BRAA03G023790.3C Ro18 11 1.984367 ERR_ro18_rna_seq_v3_R18A1_1 #> 2: BRAA03G023790.3C Ro18 11 1.474974 ERR_ro18_rna_seq_v3_R18A1_2 #> 3: BRAA03G023790.3C Ro18 11 2.194917 ERR_ro18_rna_seq_v3_R18A1_3 #> 4: BRAA03G023790.3C Ro18 29 113.797721 ERR_ro18_rna_seq_v3_R18A10_1 #> 5: BRAA03G023790.3C Ro18 29 94.650207 ERR_ro18_rna_seq_v3_R18A10_2 #> 6: BRAA03G023790.3C Ro18 29 129.176178 ERR_ro18_rna_seq_v3_R18A10_3 #> #> $query #> gene_id accession timepoint expression_value replicate #> #> 1: AT2G45660 Col0 15 76.95936 ERR_ds_klepikova_SRR1688425 #> 2: AT2G45660 Col0 14 81.96151 ERR_ds_klepikova_SRR1688328 #> 3: AT2G45660 Col0 16 59.24077 ERR_ds_klepikova_SRR1688427 #> 4: AT2G45660 Col0 15 68.85581 ERR_ds_klepikova_SRR1688426 #> 5: AT2G45660 Col0 12 64.21780 ERR_ds_klepikova_SRR2106520 #> 6: AT2G45660 Col0 10 72.98476 ERR_ds_klepikova_SRR1661475 ``` Note here that the elements of the list must be named **reference** and **query**, the order of the element will not effect the registration process. ## A list of vectors as an input As an alternative of a list of data frame as input data, as shown above, users can also have a list of numerical vectors as their input. The illustrative diagram below shows how the input should look like: ```{r fig-input-list-vectors, echo=FALSE, fig.align='center', out.width='38%'} knitr::include_graphics("figures/input_list_vectors.png") ``` The example below shows how the `input` list of vectors can be defined as an input: ```{r brapa-data-list-num} # Define expression value vectors ref_expressions <- c(1.9, 3.1, 7.8, 31.6, 33.7, 31.5, 131.4, 107.5, 116.7, 112.5, 109.7, 57.4, 50.9) query_expressions <- c(14, 12.1, 15.9, 47, 30.9, 50.5, 80.1, 67.4, 72.9, 61.7) list_vector <- list( reference = ref_expressions, query = query_expressions ) list_vector ``` Note that since there is no ID specified on the vectors, `greatR` will assign a unique ID for the reference and query pair data (see more in the [register data > Using other inputs ](https://ruthkr.github.io/greatR/articles/register-data.html#registration-results-1) article).