SedimentR: An R package for identifying major stratigraphic structures in sediment cores using XRF geochemical data

Overview

The sedimentR package provides tools to process, denoise, and analyse XRF datasets.
This vignette shows the core workflow, key functions, and practical examples.

sedimentR provides tools for:

Importing XRF or geochemical datasets
Scaling and transforming geochemical variables
Exploring proxy ratios
Performing clustering analyses
Visualizing sediment cores with aligned depth scales
Integrating sediment core photographs

Installation

install.packages("devtools")

devtools::install_github("romduc74/SedimentR", build_vignettes = TRUE)

Workflow

Automated workflow

sedicore()

The automated workflow runs the steps listed below in a fully guided, interactive manner, ensuring consistent processing of sediment-core XRF datasets from raw import to facies interpretation.

1)- Load the dataset and create geochemical ratios

(with an optional XRF denoising module)

2)- Scale the variables

(CLR transformation and log-transform for ratios)

3)- Select the most informative variables

using PCA and loading-threshold filtering

4)- Run clustering

through a complete K-means or hierarchical clustering-based facies identification procedure.

5)- Plot the core

generating the depth-aligned geochemical facies plot, the virtual core, and (optionally) the sediment core photograph

`sedimentR`:

sedicore() runs the full automated workflow provided by the sedimentR package, offering a complete geochemical processing pipeline for sediment cores. The function guides the user through all analytical stages, from raw data import to the final clustering and interpretation of geochemical facies.

The workflow begins with an interactive data-loading step using the charger() function. Depending on the file format (CSV or Excel), the user can specify the separator, encoding, or select the appropriate Excel sheet. If no file path is supplied, a file-selection dialog is opened to allow the user to choose the dataset.

An optional second stage applies XRF noise-reduction methods (EEMD decomposition and IMF extraction), producing cleaned and denoised geochemical signals.

The next step involves transforming the dataset through centered log-ratio (CLR) scaling and the creation of log-transformed elemental ratios, handled by scale_data().

Variable selection is then performed by prop.select(), which runs a principal component analysis (PCA). Only variables with loadings above a user-defined threshold are retained, so that the selection focuses on the variables that contribute most to the retained components.

Finally, sedicore() executes the full clustering module implemented in xrf_clust(), including K-means and hierarchical clustering, proxy contribution analysis, identification of cluster-driving variables, and optional visualizations (e.g. PCA, Clustering, Virtual Core or Sediment core picture).

The entire process is fully guided: each step prints informative progress messages, invites user interaction when needed, and stores intermediate outputs in the global environment for further visualization or analysis.

Step-by-step workflow:

library(sedimentR)

vignette("introduction", package = "sedimentR")

charger()

scale_data()

prop.select()

xrf_clust()

`charger()`:

The charger() function allows you to import geochemical data from CSV or Excel files, create log-transformed ratios, and optionally clean missing values.

It can be used in both interactive and non-interactive modes.

# --- 1. Interactive mode ---

# If no path is provided, a file selection dialog will open
# You will then be prompted for CSV separator or Excel sheet

charger()

# --- 2. Non-interactive mode: CSV file ---

# Load a CSV file directly, specify separator and encoding

charger(
  path = "data/my_data.csv",
  sep = ",",
  fileEncoding = "UTF-8"
)

# --- 3. Non-interactive mode: Excel file ---

# Load an Excel file, specify sheet

charger(
  path = "data/my_data.xlsx",
  sheet = "Sheet1"
)

1. Integration of `xrf_noise` within `charger()` function:

After loading your data with charger(), you have the option to perform noise reduction on the XRF data immediately after loading. This is done by calling the XRF noise detection & denoising module. The module computes a noise score for each variable, removes the first high-frequency IMFs considered as noise, and produces a cleaned and denoised dataset. This step is fully interactive, allowing you to set thresholds and review which variables are retained.

Ratios creation:

The charger() function also allows the user to create geochemical ratios during the data-import step. This operation is entirely optional and can be skipped at any time. When enabled, the user is free to define any combination of elemental ratios, depending on the analytical objectives or the sedimentological processes they wish to highlight.

Ratio creation is not restricted to predefined pairs: the user may specify as many ratios as needed, using any elements present in the dataset. This flexibility gives full control over the geochemical transformations applied to the data and allows the workflow to adapt to a wide range of research contexts, from detrital proxies to redox indicators or biogenic markers.
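The ratio step can be pictured with a small base-R sketch. It does not call charger(); `add_log_ratio()` is a hypothetical helper, the counts are made up, and the `log_Fe_Ti` style of column name is assumed from the examples given for scale_data().

```r
# Hypothetical helper, not part of sedimentR: appends log(num / den) as a new column.
add_log_ratio <- function(df, num, den) {
  stopifnot(num %in% names(df), den %in% names(df))
  ratio <- df[[num]] / df[[den]]
  # Non-positive or non-finite ratios cannot be log-transformed; mark them as NA.
  ratio[!is.finite(ratio) | ratio <= 0] <- NA_real_
  df[[paste("log", num, den, sep = "_")]] <- log(ratio)
  df
}

# Toy XRF counts (made-up values, for illustration only)
xrf <- data.frame(
  depth = c(0.5, 1.0, 1.5),
  Fe = c(1200, 1500, 900),
  Ti = c(300, 250, 300),
  Ca = c(5000, 4200, 0),
  K  = c(800, 700, 650)
)

xrf <- add_log_ratio(xrf, "Fe", "Ti")
xrf <- add_log_ratio(xrf, "Ca", "K")
print(xrf)
```

Zero counts (the last Ca value here) yield NA rather than -Inf, which keeps later PCA or clustering steps from failing silently on infinite values.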

`scale_data()`:

Once the data are loaded, the next analytical steps involve transforming the variables. Individual elemental variables are processed using a Centered Log-Ratio (CLR) transformation, which is commonly applied to compositional geochemical datasets. In contrast, user-defined elemental ratios are log-transformed, ensuring consistent normalization and allowing them to be directly comparable within the workflow.

If you have previously run charger() (and optionally created ratios), scale_data() will automatically:

1)- detect which columns correspond to single elements

2)- detect which columns correspond to ratios (e.g., log_Fe_Ti, log_Ca_K)

3)- apply the appropriate transformation

4)- return a clean, transformed dataset ready for PCA

This transformed dataset is typically the one you will use for PCA (prop.select()) and clustering (xrf_clust()).

# --- 1. How should the function be used? ---

scale_data()

1. CLR equation used in `scale_data()`:

$$ y_{ij} = \log\left(\frac{x_{ij}}{g_i}\right), \qquad g_i = \left( \prod_{j=1}^{q} x_{ij} \right)^{1/q} $$

$x_{ij}$: raw value of element $j$ at depth $i$
$q$: number of elements used in the CLR transformation at depth $i$
$g_i$: geometric mean of all elements measured at depth $i$
$y_{ij}$: CLR-transformed value for element $j$ at depth $i$
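As an illustration of the equation above, the following base-R sketch computes a row-wise CLR. It is not the code used inside scale_data(); `clr_rows()` is a hypothetical helper and the element values are made up. It relies on the identity $\log(x_{ij}/g_i) = \log x_{ij} - \frac{1}{q}\sum_{j} \log x_{ij}$.

```r
# Hypothetical helper, not the package's own code: row-wise centered log-ratio.
clr_rows <- function(x) {
  x <- as.matrix(x)
  stopifnot(all(x > 0))  # CLR is undefined for zero or negative values
  log_x <- log(x)
  # log(g_i) is the mean of the logs in row i, so subtract it from each row
  sweep(log_x, 1, rowMeans(log_x), "-")
}

# Toy element counts at three depths (made-up values)
elements <- data.frame(
  Fe = c(1200, 1500, 900),
  Ti = c(300, 250, 300),
  Ca = c(5000, 4200, 3100)
)

y <- clr_rows(elements)
print(round(y, 3))
print(rowSums(y))  # CLR rows sum to zero, up to floating-point error
```

A quick sanity check on any CLR implementation is that each transformed row sums to zero, since the geometric mean is removed from every element at that depth.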

`prop.select()`: Selection of variables

# --- 1. How should the function be used? ---

prop.select()

1. Cumulative variance method:

The prop.select() function performs a Principal Component Analysis (PCA) on a dataset of normalized variables to identify the most informative elements. By examining the contribution of each variable to the selected principal components, the function highlights those that best capture the structure and variance of the data. Users can choose the number of components to retain using cumulative variance, broken-stick, or manual selection methods and can define a percentile threshold to automatically select the most significant variables. Overall, prop.select() provides an interactive workflow to reduce dimensionality and extract key variables for subsequent analyses, such as clustering or pattern identification.

The proportion of variance for each component $j$:

$$ \lambda_j = \frac{\sigma_j^2}{\sum_{i=1}^{p} \sigma_i^2} $$

where $\sigma_j^2$ is the eigenvalue of component $j$ and $p$ the total number of components.

Cumulative variance up to component $k$:

$$ V_k = \sum_{j=1}^{k} \lambda_j $$

Component selection criterion:

$$ V_k \geq \theta $$

where $\theta$ is the chosen threshold (e.g., $\theta = 0.9$).
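The criterion can be reproduced with `prcomp()` from base R. This is a minimal sketch on random toy data, not the internals of prop.select(); it assumes that the smallest $k$ satisfying $V_k \geq \theta$ is the one retained.

```r
set.seed(1)
# Toy data: 50 depths x 5 already-scaled variables (random, illustration only)
x <- matrix(rnorm(250), nrow = 50, ncol = 5,
            dimnames = list(NULL, c("Fe", "Ti", "Ca", "K", "Si")))
x[, "Ti"] <- x[, "Fe"] + rnorm(50, sd = 0.2)  # add some correlated structure

pca <- prcomp(x, center = TRUE, scale. = FALSE)

lambda <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per component
V <- cumsum(lambda)                     # cumulative variance V_k
theta <- 0.9                            # assumed threshold
k <- which(V >= theta)[1]               # smallest k with V_k >= theta (assumption)

print(round(rbind(lambda = lambda, V = V), 3))
print(k)
```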

2. The broken-stick method:

The broken-stick model provides theoretical variance proportions based on a random partitioning of the total variance among all principal components. These values are then compared to the observed variance proportions from the Principal Component Analysis (PCA). Components are retained only when their observed variance exceeds the corresponding broken-stick expectation.

$$ b_j = \frac{1}{p} \sum_{k=j}^{p} \frac{1}{k} $$

where $p$ is the total number of principal components and $k$ the summation index. The $j$-th principal component is considered significant if its variance proportion $\lambda_j$ satisfies the following condition:

$$ \lambda_j > b_j $$

That is, the explained variance of component $j$ must be greater than the variance expected under the broken-stick model.
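The broken-stick expectations are easy to compute directly. The sketch below is illustrative only: `broken_stick()` is a hypothetical helper, not a sedimentR function, and the variance proportions are made up.

```r
# Hypothetical helper: expected variance proportions under the broken-stick model.
broken_stick <- function(p) {
  # b_j = (1/p) * sum_{k=j}^{p} 1/k, computed with a reversed cumulative sum
  rev(cumsum(rev(1 / seq_len(p)))) / p
}

lambda <- c(0.60, 0.22, 0.10, 0.05, 0.03)  # made-up variance proportions
b <- broken_stick(length(lambda))

retained <- which(lambda > b)  # components whose variance exceeds the expectation
print(round(rbind(lambda = lambda, b = b), 4))
print(retained)
```

The expectations $b_j$ sum to one, like the observed proportions $\lambda_j$, so the comparison is made on the same scale.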

3. Loadings threshold:

Once the number of principal components to retain has been determined, the next step is to evaluate the contribution of each variable to these components through its loadings, which identifies the most influential variables. $L_{ij}$ is defined as the loading of the $i$-th variable on the $j$-th principal component in the PCA. For each component $j$, the most influential variables (in absolute value) are identified.

To this end, the absolute weight of each loading on component $j$ is calculated:

$$ w_{ij} = |L_{ij}| $$

The threshold $S_j$ is then set to a percentile of these weights:

$$ S_j = \mathrm{Quantile}_{p}(w_{1j}, w_{2j}, \dots, w_{nj}) $$

where $n$ is the number of variables and $p$ is the chosen percentile (here $p$ denotes a probability, not the number of components). For example, for the 90th percentile of the weights, $p = 0.90$.

Finally, variable $i$ is selected for component $j$ if:

$$ w_{ij} \geq S_j $$

Thus, only variables whose contribution exceeds the chosen percentile threshold for each component are retained.
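The thresholding can be sketched with `prcomp()` and `quantile()` from base R. This is an illustration on random toy data rather than the code of prop.select(). The number of retained components `k` is fixed by hand, and taking the union of the variables selected on each component is an assumption of this sketch.

```r
set.seed(42)
# Toy data: 60 depths x 6 scaled variables (random, illustration only)
x <- matrix(rnorm(60 * 6), nrow = 60,
            dimnames = list(NULL, c("Fe", "Ti", "Ca", "K", "Si", "Mn")))
pca <- prcomp(x)

k <- 2     # number of retained components (assumed here)
p <- 0.90  # percentile used for the threshold

w <- abs(pca$rotation[, seq_len(k), drop = FALSE])  # w_ij = |L_ij|
S <- apply(w, 2, quantile, probs = p)               # S_j, one threshold per component

# Variables with w_ij >= S_j, listed per component
selected_by_pc <- lapply(seq_len(k), function(j) rownames(w)[w[, j] >= S[j]])
names(selected_by_pc) <- colnames(w)

selected <- unique(unlist(selected_by_pc))  # union across components (assumption)
print(selected_by_pc)
print(selected)
```

With few variables and a high percentile, only the strongest loading on each component passes the threshold, so lowering `p` is the natural way to keep more variables.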

`xrf_clust()`: Clustering on selected variables

# --- 1. How should the function be used? ---

xrf_clust()

The xrf_clust() function performs clustering analysis on a dataset of selected variables. It determines the optimal number of clusters, assigns observations to clusters, analyzes cluster drivers, and provides several visualization options. The main enhancement of this approach is the coupling of PCA-based variable selection with the estimation of the optimal number of clusters. The best number of clusters is determined using the set of validity indices available in the NbClust package (Charrad et al., 2014), providing a data-driven basis for k-means or hierarchical clustering.

The user may also choose the number of clusters manually, for example based on the number of sedimentary facies visually identified in the core. Both approaches, automatic and manual, are fully supported. Comparing the automatically and manually selected numbers of clusters helps determine which configuration yields the most reliable and interpretable results.

To assess the robustness of the clustering solution, four complementary indices are calculated: Silhouette, Davies–Bouldin, Calinski–Harabasz, and bootstrap stability.
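To make the idea of index-based selection concrete, the sketch below chooses the number of clusters with a single index, Calinski–Harabasz, computed from `kmeans()` output in base R, and then compares the k-means partition with a Ward hierarchical one. It is a simplified stand-in: it uses neither NbClust nor xrf_clust(), `calinski_harabasz()` is a hypothetical helper, and the data are synthetic.

```r
set.seed(123)
# Toy data: three synthetic "facies" in two scaled variables (illustration only)
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = -3, sd = 0.3), ncol = 2)
)

# Hypothetical helper: CH = (between SS / (k - 1)) / (within SS / (n - k))
calinski_harabasz <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

ks <- 2:6
ch <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  calinski_harabasz(fit, nrow(x))
})
best_k <- ks[which.max(ch)]  # higher CH means tighter, better separated clusters

km <- kmeans(x, centers = best_k, nstart = 25)
hc <- cutree(hclust(dist(x), method = "ward.D2"), k = best_k)

print(data.frame(k = ks, CH = round(ch, 1)))
print(table(kmeans = km$cluster, hclust = hc))
```

When both methods recover the same groups, the cross-table has one non-zero cell per row and column; cluster labels themselves are arbitrary and may differ between methods.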

Visualization

Finally, one of the key strengths of the SedimentR package and notably the xrf_clust() function is the ability to visually compare a virtual core constructed from clustering results with the actual sediment core photograph. This comparison is complemented by the display of elemental variations along the core depth. The integrative approach of sedimentR enhances the interpretability of sediment core analyses and provides a clear, comprehensive view of geochemical and sedimentological variability.
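A virtual core is essentially the sequence of cluster labels drawn as coloured bands along depth. The sketch below shows one way to build such bands with base R graphics. It is not the plotting code of xrf_clust(); `cluster_bands()` is a hypothetical helper, the depths, labels, and colours are made up, and placing band boundaries halfway between neighbouring samples is an assumption.

```r
# Hypothetical helper: collapse consecutive samples with the same cluster into depth bands.
cluster_bands <- function(depth, cluster) {
  stopifnot(length(depth) == length(cluster), !is.unsorted(depth))
  n <- length(depth)
  # Boundaries between samples are placed halfway between neighbouring depths
  edges <- c(depth[1], (depth[-1] + depth[-n]) / 2, depth[n])
  runs <- rle(as.vector(cluster))
  end <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1
  data.frame(top = edges[start], bottom = edges[end + 1], cluster = runs$values)
}

depth   <- seq(0, 45, by = 5)               # made-up depths (mm)
cluster <- c(1, 1, 1, 2, 2, 3, 3, 3, 1, 1)  # made-up cluster labels
bands <- cluster_bands(depth, cluster)
print(bands)

# Draw the virtual core as stacked rectangles, with depth increasing downwards
cols <- c("grey80", "tan", "sienna")
plot(0, type = "n", xlim = c(0, 1), ylim = rev(range(depth)),
     xaxt = "n", xlab = "", ylab = "Depth (mm)", main = "Virtual core")
rect(0, bands$top, 1, bands$bottom, col = cols[bands$cluster], border = NA)
```

Because the y-axis shares the depth scale of the measurements, the same axis limits can be reused for element profiles or a core photograph placed alongside.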

Data available

An example dataset is available online via Zenodo (https://doi.org/10.5281/zenodo.20527130).

Getting help & contributing

If you encounter a clear bug, or have a question or suggestion, please either open an issue or send an email to Romain Ducruet (romain.ducruet@gmail.com) and Amaury Bardelle (amaury.bardelle@icloud.com).
