#' Process PULSE data from a single experiment  (`STEPS 1-6`)
#'
#' @description
#' **ALL STEPS EXECUTED SEQUENTIALLY**
#'
#' * `step 1` -- [pulse_read()]
#' * `step 2` -- [pulse_split()]
#' * `step 3` -- [pulse_optimize()]
#' * `step 4` -- [pulse_heart()]
#' * `step 5` -- [pulse_doublecheck()]
#' * `step 6` -- [pulse_choose_keep()]
#'
#' * `extra step` -- [pulse_normalize()]
#' * `extra step` -- [pulse_summarise()]
#'
#' * `visualization` -- [pulse_plot()] and [pulse_plot_raw()]
#'
#' This is a wrapper function that provides a shortcut to running all 6 steps of the PULSE multi-channel data processing pipeline in sequence, namely `pulse_read()` >> `pulse_split()` >> `pulse_optimize()` >> `pulse_heart()` >> `pulse_doublecheck()` >> `pulse_choose_keep()`.
#'
#' Please note that the `heartbeatr` package is designed specifically for PULSE systems commercialized by the non-profit co-op ElectricBlue (https://electricblue.eu/pulse) and is likely to fail if data from any other system is used as input without matching file formatting.
#'
#' `PULSE()` takes a vector of `paths` to PULSE csv files produced by a PULSE system during **a single experiment** (either multi-channel or one-channel, but never both at the same time) and automatically computes the heartbeat frequencies in all target channels across user-defined time windows. The entire workflow may take less than 5 minutes to run on a small dataset (a few hours of data) if `params` are chosen with speed in mind and the code is run on a modern machine. Conversely, large datasets (spanning several days) may take hours or even days to run. In extreme situations, datasets may be too large for the machine to handle (due to memory limitations), and it may be better to process the data in batches (check [PULSE_by_chunks()] and consider implementing a parallel computing strategy).
#'
#' @inheritParams pulse_read
#' @inheritParams pulse_split
#' @inheritParams pulse_optimize
#' @inheritParams pulse_heart
#' @inheritParams pulse_doublecheck
#' @inheritParams pulse_choose_keep
#' @param doublecheck logical, defaults to `TRUE`; should [pulse_doublecheck()] be used? (it is rare, but there are instances when it should be disabled).
#' @param discard_channels character vector containing the names of channels to be discarded from the analysis. `discard_channels` is forced to lowercase, but other than that, the **exact** names must be provided. Discarding unused channels can greatly speed up the workflow!
#' @param raw_v_smoothed logical, defaults to `TRUE`; indicates whether or not to also compute heart rates before applying smoothing; this will increase the quality of the output but also double the processing time.
#' @param keep_raw_data logical, defaults to `TRUE`; If set to `FALSE`, `$data` is set to `FALSE` (i.e., raw data is discarded), dramatically reducing the amount of disk space required to store the final output (usually, by two orders of magnitude). HOWEVER, note that it won't be possible to use `pulse_plot_raw()` anymore!
#' @param process_large logical, defaults to `FALSE`; if set to `FALSE` and the dataset used as input is large (i.e., combined file size greater than `max_dataset_size`, which defaults to 20 MB, roughly equivalent to three files each with a full hour of PULSE data), `PULSE` will not process the data and will instead suggest the use of [PULSE_by_chunks()], which is designed to handle large datasets; if set to `TRUE`, `PULSE` will attempt to process the dataset, but the system's memory may become overloaded and R may never finish the job.
#' @param max_dataset_size numeric, defaults to `20`. The maximum combined size (in MB) that the files in `paths` can have when `process_large` is set to `FALSE`. If the dataset exceeds this size, data processing is aborted with a message explaining the possible remedies. This is a fail-safe to prevent `PULSE` from being asked to process a dataset that is larger than the user's machine can handle, a situation that typically leads to a stall (R doesn't fail, it just keeps trying without any progress being made). The conservative default of `20` allows only about 3 hours' worth of data to be processed (a PULSE csv file with 1 hour of data typically takes up to 7 MB). If the machine has a large amount of RAM available, a higher value can be used. Alternatively, consider using the function [PULSE_by_chunks()] instead.
#'
#' @section One experiment:
#' The `heartbeatr` workflow must be applied to a single experiment each time. By *experiment* we mean a collection of PULSE data where all the relevant parameters are invariant, including (but not limited to):
#' * the version of the firmware installed in the PULSE device (multi-channel or one-channel)
#' * the names of all channels (including unused channels)
#' * the frequency at which data was captured
#'
#' Note also that even if two PULSE systems have been used in the same *scientific experiment*, data from each device must be processed independently, and only merged at the end. There's no drawback in doing so; it is simply how data must be processed by the [`heartbeatr-package`].
#'
#' @section Normalizing and summarising data:
#' [pulse_normalize()] and [pulse_summarise()] aren't included in [PULSE()] because they aren't essential to the PULSE data processing pipeline and choosing values for their parameters requires an initial look at the data. However, it is very often crucial to normalize the heart rate estimates so that comparisons across individuals can be made more reliably, and it is also often important to manage the number of data points produced before running statistical analyses, to avoid oversampling. Users should therefore consider running the output from [PULSE()] through both of these functions before considering the data fully processed and ready for subsequent analysis. Check both functions for additional details on their role in the processing pipeline (`?pulse_normalize` and `?pulse_summarise`).
#'
#' @section Additional details:
#' Check the help files of the underlying functions to obtain additional details about each of the steps implemented under `PULSE()`, namely:
#' * [pulse_read()] describes constraints to the type of files that can be read with the [`heartbeatr-package`] and explains how time zones are handled.
#' * [pulse_split()] provides important advice on how to set `window_width_secs` and `window_shift_secs`, what to expect when lower/higher values are used, and explains how to easily run the [`heartbeatr-package`] with parallel computing.
#' * [pulse_optimize()] explains in detail how the optimization process (interpolation + smoothing) behaves and how it impacts the performance of the analysis.
#' * [pulse_heart()] outlines the algorithm used to identify peaks in the heart beat wave data and some of its limitations.
#' * [pulse_doublecheck()] explains the method used to detect situations when the algorithm's processing resulted in a heart beat frequency double the real value.
#' * [pulse_choose_keep()] selects the best estimates when `raw_v_smoothed = TRUE` and classifies data points as `keep` or `reject`.
#'
#' @section Also check:
#' * [pulse_normalize()] for important info about individual variations on baseline heart rate.
#' * [pulse_summarise()] for important info about oversampling and strategies to handle that.
#' * [PULSE_by_chunks()] for processing large datasets.
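#'
#' To anticipate whether a dataset is likely to be rejected as too large, the combined file size can be computed beforehand with the same expression that `PULSE()` applies to `paths`. The sketch below is only illustrative: the two temporary csv files are hypothetical stand-ins for real PULSE files, and the `too_large` variable is not part of the package.
#'
#' ```r
#' # hypothetical input: two small temporary csv files standing in for PULSE files
#' paths <- file.path(tempdir(), c("pulse_a.csv", "pulse_b.csv"))
#' for (p in paths) writeLines(rep("0,0", 1000), p)
#'
#' # combined size in MB, rounded up
#' dataset_size_MB <- ceiling(sum(file.size(paths) / 1024 / 1024))
#'
#' # compare against the limit that would be passed as `max_dataset_size`
#' max_dataset_size <- 20
#' too_large <- dataset_size_MB > max_dataset_size
#' if (too_large) message("consider using PULSE_by_chunks() instead")
#' ```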
#'
#' @section BPM:
#' To convert to Beats Per Minute (bpm), simply multiply `hz` and `ci` by 60.
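#'
#' A minimal sketch of this conversion, assuming a data frame with `hz` and `ci` columns (the values below are made up, and the column names `bpm` and `bpm_ci` are illustrative only):
#'
#' ```r
#' # hypothetical stand-in for the output of PULSE(), reduced to the relevant columns
#' heart_rates <- data.frame(hz = c(0.5, 1.2), ci = c(0.02, 0.05))
#'
#' # convert both the estimate and its confidence interval from Hz to bpm
#' heart_rates$bpm    <- heart_rates$hz * 60
#' heart_rates$bpm_ci <- heart_rates$ci * 60
#' ```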
#'
#' @return
#' A tibble with nrows = (number of channels) * (number of windows in `pulse_data_split`) and 13 columns:
#' * `i`, the order of each time window
#' * `smoothed`, logical flagging smoothed data
#' * `id`, PULSE channel IDs
#' * `time`, time at the center of each time window
#' * `data`, a list of tibbles with raw PULSE data for each combination of channel and window, with columns `time`, `val` and `peak` (`TRUE` in rows corresponding to wave peaks)
#' * `hz`, heartbeat rate estimate (in Hz)
#' * `n`, number of wave peaks identified
#' * `sd`, standard deviation of the intervals between wave peaks
#' * `ci`, confidence interval (hz ± ci)
#' * `keep`, logical indicating whether data points meet N and SD criteria
#' * `d_r`, ratio of consecutive asymmetric peaks
#' * `d_f`, logical flagging data points where heart beat frequency is likely double the real value
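#'
#' As a usage sketch, estimates classified as `keep` can be retained with plain subsetting. The data frame below is a hypothetical stand-in for the output of `PULSE()` (values are made up and only a few columns are shown):
#'
#' ```r
#' # hypothetical stand-in for the output of PULSE()
#' heart_rates <- data.frame(
#'   id   = c("c01", "c01", "c02"),
#'   hz   = c(0.52, 1.10, 0.48),
#'   keep = c(TRUE, FALSE, TRUE)
#' )
#'
#' # retain only the rows where `keep` is TRUE
#' kept <- heart_rates[heart_rates$keep, ]
#' ```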
#'
#' @export
#'
#' @seealso
#'  * [approx()] is used by [pulse_interpolate()] for the linear interpolation of PULSE data
#'  * [ksmooth()] is used by [pulse_smooth()] for the kernel smoothing of PULSE data
#'  * [pulse_read()], [pulse_split()], [pulse_optimize()], [pulse_heart()], [pulse_doublecheck()] and [pulse_choose_keep()] are the functions used in the complete `heartbeatr` processing workflow
#'  * [pulse_normalize()] and [pulse_summarise()] are important post-processing functions
#'  * [pulse_plot()] and [pulse_plot_raw()] can be used to inspect the processed data
#'
#' @examples
#' ## Begin prepare data ----
#' paths <- pulse_example()
#' chn <- paste0("c", formatC(1:10, width = 2, flag = "0"))
#' ## End prepare data ----
#'
#' # Execute the entire PULSE data processing pipeline with only one call
#' PULSE(
#'   paths,
#'   discard_channels = chn[-8],
#'   raw_v_smoothed   = FALSE,
#'   show_progress    = FALSE
#' )
#'
#' # Equivalent to...
#' x <- pulse_read(paths)
#' multi <- x$multi
#' x$data <- x$data[,c("time", "c08")]
#' x <- pulse_split(x)
#' x <- pulse_optimize(x, raw_v_smoothed = FALSE, multi = multi)
#' x <- pulse_heart(x)
#' x <- pulse_doublecheck(x)
#' x <- pulse_choose_keep(x)
#' x
#'
PULSE <- function(
		paths,
		window_width_secs = 30, window_shift_secs = 60,
		min_data_points = 0.8, interpolation_freq = 40, bandwidth = 0.2,
		doublecheck = TRUE, lim_n = 3, lim_sd = 0.75,
		raw_v_smoothed = TRUE, correct = TRUE, discard_channels = NULL, keep_raw_data = TRUE,
		subset = 0, subset_seed = NULL, subset_reindex = FALSE,
		process_large = FALSE, show_progress = TRUE, max_dataset_size = 20)
{
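	# if a single folder is supplied, use all csv files inside it
	# (matching is case-insensitive, but the original paths are preserved so that they remain valid on case-sensitive file systems)
	if (length(paths) == 1 && dir.exists(paths)) {
		paths <- dir(paths, pattern = "\\.csv$", full.names = TRUE, ignore.case = TRUE)
	}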

	dataset_size_MB <- ceiling(sum(file.size(paths) / 1024 / 1024))
	if (dataset_size_MB > max_dataset_size && !process_large) cli::cli_abort(c(
		"i" = "datasets larger than {max_dataset_size} MB may cause R to stall on most machines",
		"x" = cli::col_red("the dataset provided has a combined size of {dataset_size_MB} MB"),
		"i" = "set 'process_large = TRUE' to proceed with 'PULSE()' anyway",
		"v" = cli::col_green("alternatively, consider using the function 'PULSE_by_chunks()' instead [PREFERRED]")
	))

	prog_format <- "{cli::pb_bar} {cli::pb_percent} [{cli::pb_elapsed}] - {cli::pb_status}"
	t0 <- Sys.time()
	if (show_progress) cli::cli_progress_step("loading data for analysis")

	## CHECKS INITIATED ## ------------------- ##

	# pulse_read
	checks <- pulse_read_checks(paths)
	if (!checks$ok) {
		stop(checks$msg)
	}

	# pulse_split
	stopifnot(is.numeric(window_width_secs))
	stopifnot(length(window_width_secs) == 1)
	stopifnot(is.numeric(window_shift_secs))
	stopifnot(length(window_shift_secs) == 1)
	stopifnot(is.numeric(min_data_points))
	stopifnot(dplyr::between(min_data_points, 0, 1))

	# pulse_optimize
	stopifnot(is.numeric(interpolation_freq))
	stopifnot(length(interpolation_freq) == 1)
	if (!(interpolation_freq == 0 || interpolation_freq >= 40)) cli::cli_abort("interpolation_freq must be zero or a value >= 40")
	stopifnot(is.numeric(bandwidth))
	stopifnot(length(bandwidth) == 1)

	# pulse
	if (!is.null(discard_channels)) stopifnot(is.character(discard_channels))

	## CHECKS COMPLETED ## ------------------- ##

	# read data
	pulse_data <- pulse_read(
		paths,
		msg = FALSE
	)

	# discard unused/unwanted channels
	if (pulse_data$multi) {
		if (!is.null(discard_channels)) {
			discard_channels <- stringr::str_to_lower(discard_channels)
			not_match <- discard_channels[!(discard_channels %in% colnames(pulse_data$data))]
			if (length(not_match)) cli::cli_abort(stringr::str_c("\n  --> [x] all elements of 'discard_channels' must be exact matches to a channel ID\n  --> [i] offending elements: ", stringr::str_c(not_match, collapse = ", ")))

			dups <- discard_channels[duplicated(discard_channels)]
			if (length(dups)) cli::cli_warn(stringr::str_c("  --> [x] all elements of 'discard_channels' should be unique channel IDs\n  --> [i] duplicated elements: ", stringr::str_c(dups, collapse = ", "), "\n  --> [i] work not interrupted, but consider revising 'discard_channels'"))

			pulse_data$data <- dplyr::select(pulse_data$data, -dplyr::any_of(discard_channels))
		}

		# split data
		pulse_data_split <- pulse_split(
			pulse_data,
			window_width_secs = window_width_secs,
			window_shift_secs = window_shift_secs,
			min_data_points   = min_data_points,
			subset         = subset,
			subset_seed    = subset_seed,
			subset_reindex = subset_reindex,
			msg = FALSE
		)
	} else {
		pulse_data_split <- tibble::tibble(
			smoothed = FALSE,
			data     = pulse_data$data$data
		) %>%
			tibble::rowid_to_column("i")
	}


	# optimize
	pulse_data_optimized <- pulse_optimize(
		pulse_data_split,
		interpolation_freq = interpolation_freq,
		bandwidth          = bandwidth,
		raw_v_smoothed     = raw_v_smoothed,
		multi              = pulse_data$multi
	)

	# heart rate
	if (show_progress) cli::cli_progress_step("computing heart rates")
	heart_rates <- pulse_heart(
		pulse_data_optimized,
		msg = FALSE,
		show_progress = show_progress
	)

	# correct
	if (show_progress) cli::cli_progress_step("finalizing")
	if (doublecheck) {
		heart_rates <- pulse_doublecheck(
			heart_rates,
			correct = correct
		)
	}

	# filter
	heart_rates <- pulse_choose_keep(
		heart_rates,
		lim_n   = lim_n,
		lim_sd  = lim_sd
	)

	# bind
	if (!pulse_data$multi) {
		heart_rates <- dplyr::bind_cols(
			dplyr::select(pulse_data$data, -data, -time),
			dplyr::select(heart_rates, -id)
		) %>%
			dplyr::relocate(i)
	}

	# return
	if (!keep_raw_data) heart_rates$data <- FALSE
	if (show_progress) cli::cli_progress_done()
	t1 <- Sys.time() - t0
	units(t1) <- "mins"
	t1 <- as.numeric(t1) %>% round(2)
	if (show_progress) cli::cli_alert("completed: {cli::col_red(format(Sys.time(), '%Y-%m-%d %H:%M:%S'))}")
	if (show_progress) cli::cli_alert("[elapsed: {cli::col_red(t1)} mins]")

	heart_rates
}

#' Process PULSE data file by file  (`STEPS 1-6`)
#'
#' @description
#' This function runs `PULSE()` on a few files at a time (see `chunks`), instead of attempting to read all files at once. This is required when datasets are too large (more than 20-30 files), as otherwise the system may become stuck due to the amount of data that needs to be kept in memory. Because the results for each chunk of files are saved to a `job_folder`, `PULSE_by_chunks()` has the added benefit of allowing the entire job to be stopped and resumed, so that progress already made is preserved even if a crash occurs.
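#'
#' The way files are grouped into chunks can be sketched with the same `split()` expression used internally. The values of `n` and `chunks` below are arbitrary and serve only as an illustration:
#'
#' ```r
#' n      <- 5  # number of csv files found in `folder` (arbitrary value)
#' chunks <- 2  # number of files processed together in each cycle
#'
#' # assign a chunk number to each file index, then group the indices by chunk
#' chunks_split <- split(1:n, rep(1:ceiling(n / chunks), each = chunks)[1:n])
#'
#' # files 1-2 form the first chunk, files 3-4 the second, and file 5 the third
#' ```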
#'
#' @inheritParams PULSE
#' @param folder the path to a folder where several PULSE files are stored
#' @param allow_dir_create logical, defaults to `FALSE`. Only when set to `TRUE` does `PULSE_by_chunks()` actually do anything. This is to force the user to accept that a job_folder will be created inside of the `folder` supplied - without this folder `PULSE_by_chunks()` cannot operate. It is STRONGLY advised to maintain a copy of the dataset being processed to avoid any inadvertent data loss. By setting `allow_dir_create` to `TRUE` the user is taking responsibility for the management of their files.
#' @param chunks numeric, defaults to `2`. Corresponds to the number of files processed at once during each `for` cycle; higher numbers result in a quicker and more efficient operation, but shouldn't be set too high, as otherwise the system may become overwhelmed once more (which is what `PULSE_by_chunks()` is designed to avoid).
#' @param bind_data logical, defaults to `TRUE`. If set to `TRUE`, after processing all chunks, `PULSE_by_chunks()` will try to read all files in the job_folder and return a single unified tibble with all data. Please be aware that there's a possibility that if the dataset is very large, the machine may become overwhelmed and crash due to lack of memory (still, all files stored in the job_folder will remain intact, and code may be written to analyze data also in chunks). If set to `FALSE`, `PULSE_by_chunks()` will return nothing after completing the processing of all files in the dataset, and the user must instead manually handle the reading and collating of all processed data in the job_folder.
#'
#' @return
#' A tibble with nrows = (number of channels) * (number of windows in `pulse_data_split`) and 13 columns:
#' * `i`, the order of each time window
#' * `smoothed`, logical flagging smoothed data
#' * `id`, PULSE channel IDs
#' * `time`, time at the center of each time window
#' * `data`, a list of tibbles with raw PULSE data for each combination of channel and window, with columns `time`, `val` and `peak` (`TRUE` in rows corresponding to wave peaks)
#' * `hz`, heartbeat rate estimate (in Hz)
#' * `n`, number of wave peaks identified
#' * `sd`, standard deviation of the intervals between wave peaks
#' * `ci`, confidence interval (hz ± ci)
#' * `keep`, logical indicating whether data points meet N and SD criteria
#' * `d_r`, ratio of consecutive asymmetric peaks
#' * `d_f`, logical flagging data points where heart beat frequency is likely double the real value
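#'
#' When `bind_data = FALSE`, the per-chunk files saved in the job folder have to be collated by the user. A minimal sketch of one way to do so is shown below, assuming the folder name (`ongoing_job`) and file naming (`chunk_*.RDS`) used by `PULSE_by_chunks()`; the two tiny data frames written first are hypothetical stand-ins for real chunk outputs.
#'
#' ```r
#' # hypothetical job folder holding two stand-ins for processed chunks
#' job_folder <- file.path(tempdir(), "ongoing_job")
#' dir.create(job_folder, showWarnings = FALSE)
#' saveRDS(data.frame(i = 1:2, hz = c(0.5, 0.6)), file.path(job_folder, "chunk_1.RDS"))
#' saveRDS(data.frame(i = 3:4, hz = c(0.7, 0.8)), file.path(job_folder, "chunk_2.RDS"))
#'
#' # list the chunk files in order, read each one and bind them row-wise
#' fns <- sort(list.files(job_folder, pattern = "^chunk_.*\\.RDS$", full.names = TRUE))
#' all_chunks <- do.call(rbind, lapply(fns, readRDS))
#' ```
#'
#' For very large jobs it may be preferable to read and analyze one chunk file at a time rather than binding everything in memory.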
#'
#' @export
#'
#' @seealso
#'  * [PULSE()] for all the relevant information about the processing of `PULSE` data
#'
#' @examples
#' ##
PULSE_by_chunks <- function(
		folder,
		allow_dir_create = FALSE, chunks = 2, bind_data = TRUE,
		window_width_secs = 30, window_shift_secs = 60, min_data_points = 0.8,
		interpolation_freq = 40, bandwidth = 0.2,
		doublecheck = TRUE, lim_n = 3, lim_sd = 0.75,
		raw_v_smoothed = TRUE, correct = TRUE,
		discard_channels = NULL, keep_raw_data = TRUE,
		show_progress = TRUE)
	{
	if (!allow_dir_create) cli::cli_abort(c(
		"x" = cli::col_red("need explicit permission to create the job folder"),
		"v" = cli::col_green("set 'allow_dir_create = TRUE' to proceed")
		))

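	# list all csv files in the folder (case-insensitive match on the extension);
	# the original paths are preserved so that they remain valid on case-sensitive file systems
	paths <- dir(
		folder,
		pattern = "\\.csv$",
		full.names = TRUE, recursive = TRUE, ignore.case = TRUE
	)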

	if (!all(purrr::map_lgl(paths, is.pulse))) cli::cli_abort("not all files in the target folder are PULSE files")

	job_folder <- file.path(folder, "ongoing_job")
	dir.create(job_folder, showWarnings = FALSE)

	n <- length(paths)
	chunks_split <- tibble::tibble(
		split = split(1:n, rep(1:ceiling(n/chunks), each = chunks)[1:n])
	) %>%
		tibble::rowid_to_column("row") %>%
		dplyr::mutate(row = formatC(row, width = nchar(nrow(.)), flag = "0")) %>%
		dplyr::mutate(fn = file.path(job_folder, paste0("chunk_", row, ".RDS")))

	for (i in 1:nrow(chunks_split)) {
		fn <- chunks_split$fn[i]
		if (!file.exists(fn)) {
			if (show_progress) cli::cli_alert("--------------------------------------")
			if (show_progress) cli::cli_alert("chunk {chunks_split$row[i]}/{nrow(chunks_split)} (n files = {length(chunks_split$split[[i]])})")
			x <- PULSE(
				paths = paths[chunks_split$split[[i]]],
				window_width_secs  = window_width_secs,
				window_shift_secs  = window_shift_secs,
				min_data_points    = min_data_points,
				interpolation_freq = interpolation_freq,
				bandwidth          = bandwidth,
				doublecheck        = doublecheck,
				raw_v_smoothed     = raw_v_smoothed,
				lim_n              = lim_n,
				lim_sd             = lim_sd,
				correct            = correct,
				discard_channels   = discard_channels,
				keep_raw_data      = keep_raw_data,
				process_large      = TRUE,
				show_progress      = show_progress
			)
			saveRDS(x, fn)
		}
	}

	cli::cli_bullets(c(
		"v" = "all files read and heart rates computed",
		"i" = "data stored in: {job_folder}"
	))

	if (bind_data) {
		purrr::map_dfr(chunks_split$fn, readRDS)
	} else {
		cli::cli_warn("data not bound together because user set 'bind_data = FALSE'")
		NULL
	}
}
