
SedimentR: An R package for identifying major stratigraphic structures in sediment cores using XRF geochemical data

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Overview

The sedimentR package provides tools to process, denoise and analyse XRF datasets.
This vignette shows the core workflow, key functions, and practical examples.

sedimentR provides tools for:

  • Importing XRF or geochemical datasets

  • Scaling and transforming geochemical variables

  • Exploring proxy ratios

  • Performing clustering analyses

  • Visualizing sediment cores with aligned depth scales

  • Integrating sediment core photographs

Installation

install.packages("devtools")

devtools::install_github("romduc74/SedimentR", build_vignettes = TRUE)

Workflow

Automated workflow

sedicore()

The automated workflow reproduces these steps in a fully guided, interactive manner, ensuring consistent processing of sediment-core XRF datasets from raw import to facies interpretation.

1)- Load the dataset and create geochemical ratios

(with an optional XRF denoising module)

2)- Scale the variables

(CLR transformation and log-transform for ratios)

3)- Select the most informative variables

using PCA and loading-threshold filtering

4)- Run clustering

through a complete K-means or hierarchical clustering-based facies identification procedure.

5)- Plot the core

generating the depth-aligned geochemical facies plot, the virtual core, and (optionally) the sediment core photograph

sedimentR:

sedicore() runs the full automated workflow provided by the sedimentR package, offering a complete geochemical processing pipeline for sediment cores. The function guides the user through all analytical stages, from raw data import to the final clustering and interpretation of geochemical facies.

The workflow begins with an interactive data-loading step using the charger() function. Depending on the file format (CSV or Excel), the user can specify the separator, encoding, or select the appropriate Excel sheet. If no file path is supplied, a file-selection dialog is opened to allow the user to choose the dataset.

An optional second stage applies XRF noise-reduction methods (EEMD decomposition and IMF extraction), producing cleaned and denoised geochemical signals.

The next step involves transforming the dataset through centered log-ratio (CLR) scaling and the creation of log-transformed elemental ratios, handled by scale_data().

Variable selection is then performed by the prop.select() function using principal component analysis (PCA). Only variables with loadings above a user-defined threshold are retained, so that the selection focuses on geochemically meaningful contributors.

Finally, sedicore() executes the full clustering module implemented in xrf_clust(), including K-means and hierarchical clustering, proxy contribution analysis, identification of cluster-driving variables, and optional visualizations (e.g. PCA plot, clustering plot, virtual core, or sediment core photograph).

The entire process is fully guided: each step prints informative progress messages, invites user interaction when needed, and stores intermediate outputs in the global environment for further visualization or analysis.

Step-by-step workflow:

library(sedimentR)

vignette("introduction", package = "sedimentR")

charger()

scale_data()

prop.select()

xrf_clust()

charger():

The charger() function allows you to import geochemical data from CSV or Excel files, create log-transformed ratios, and optionally clean missing values.

It can be used in both interactive and non-interactive modes.

# --- 1. Interactive mode ---

# If no path is provided, a file selection dialog will open
# You will then be prompted for CSV separator or Excel sheet

charger()

# --- 2. Non-interactive mode: CSV file ---

# Load a CSV file directly, specify separator and encoding

charger(
  path = "data/my_data.csv",
  sep = ",",
  fileEncoding = "UTF-8"
)

# --- 3. Non-interactive mode: Excel file ---

# Load an Excel file, specify sheet

charger(
  path = "data/my_data.xlsx",
  sheet = "Sheet1"
)
1. Integration of xrf_noise within charger() function:

After loading your data with charger(), you have the option to perform noise reduction on the XRF data immediately after loading. This is done by calling the XRF noise detection & denoising module. The module computes a noise score for each variable, removes the first high-frequency IMFs considered as noise, and produces a cleaned and denoised dataset. This step is fully interactive, allowing you to set thresholds and review which variables are retained.

Ratios creation:

The charger() function also allows the user to create geochemical ratios during the data-import step. This operation is entirely optional and can be skipped at any time. When enabled, the user is free to define any combination of elemental ratios, depending on the analytical objectives or the sedimentological processes they wish to highlight.

Ratio creation is not restricted to predefined pairs: the user may specify as many ratios as needed, using any elements present in the dataset. This flexibility gives full control over the geochemical transformations applied to the data and allows the workflow to adapt to a wide range of research contexts, from detrital proxies to redox indicators or biogenic markers.
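
As a rough illustration of what such ratios look like in practice, the base R sketch below builds two log-ratios by hand. It is not the code used inside charger(): the data frame `xrf`, its values, and the helper `add_log_ratio()` are hypothetical, and the `log_Fe_Ti` naming pattern is only borrowed from the column names quoted in the scale_data() section.

```r
# Hypothetical XRF counts; column names and values are illustrative only
xrf <- data.frame(
  depth = c(10, 20, 30, 40),
  Fe = c(1200, 1500, 900, 1100),
  Ti = c(300, 250, 300, 275),
  Ca = c(5000, 4000, 6000, 5500),
  K  = c(500, 800, 600, 550)
)

# Helper (not part of sedimentR): append log(numerator / denominator)
add_log_ratio <- function(data, numerator, denominator) {
  stopifnot(all(c(numerator, denominator) %in% names(data)))
  new_name <- paste("log", numerator, denominator, sep = "_")
  data[[new_name]] <- log(data[[numerator]] / data[[denominator]])
  data
}

xrf <- add_log_ratio(xrf, "Fe", "Ti")
xrf <- add_log_ratio(xrf, "Ca", "K")
names(xrf)
```

Any pair of elements present in the table can be passed to the helper, which mirrors the free choice of ratios described above.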

scale_data():

Once the data are loaded, the next analytical steps involve transforming the variables. Individual elemental variables are processed using a Centered Log-Ratio (CLR) transformation, which is commonly applied to compositional geochemical datasets. In contrast, user-defined elemental ratios are log-transformed, ensuring consistent normalization and allowing them to be directly comparable within the workflow.

If you have previously run charger() (and optionally created ratios), scale_data() will automatically:

1)- detect which columns correspond to single elements

2)- detect which columns correspond to ratios (e.g., log_Fe_Ti, log_Ca_K)

3)- apply the appropriate transformation

4)- return a clean, transformed dataset ready for PCA

This transformed dataset is typically the one you will use for PCA (prop.select()) and clustering (xrf_clust()).
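
The detection step can be pictured with a few lines of base R. This is only a sketch under two assumptions that the package may not share: ratio columns are recognised by a `log_` prefix (as in the examples above) and the depth column is called `depth`. The table `xrf` is hypothetical.

```r
# Hypothetical table: one depth column, three elements, one ratio
xrf <- data.frame(
  depth = c(10, 20, 30),
  Fe = c(1200, 1500, 900),
  Ti = c(300, 250, 300),
  Ca = c(5000, 4000, 6000),
  log_Fe_Ti = log(c(4, 6, 3))
)

# Assumption: ratio columns start with "log_", the depth column is "depth"
is_ratio <- grepl("^log_", names(xrf))
ratio_cols <- names(xrf)[is_ratio]
element_cols <- setdiff(names(xrf)[!is_ratio], "depth")

ratio_cols
element_cols
```

Splitting the columns this way makes it possible to send single elements to one transformation and ratios to another.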

# --- 1. How should the function be used? ---

scale_data()

1. CLR equation used in scale_data():

$$ y_{ij} = \log\left(\frac{x_{ij}}{g_i}\right), \qquad g_i = \left( \prod_{j=1}^{q} x_{ij} \right)^{1/q} $$

  • $x_{ij}$ : raw value of element $j$ at depth $i$
  • $q$ : number of elements used in the CLR transformation at depth $i$
  • $g_i$ : geometric mean of all elements measured at depth $i$
  • $y_{ij}$ : CLR-transformed value for element $j$ at depth $i$
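
The CLR equation can be checked numerically with a short base R sketch. The function `clr_transform()` and the table `counts` are hypothetical and are not taken from the package; the sketch assumes strictly positive values, so zeros would have to be handled beforehand.

```r
# CLR: log of each value divided by the geometric mean of its row (depth)
clr_transform <- function(x) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x), all(x > 0))  # logs need strictly positive values
  log_x <- log(x)
  # log(x_ij / g_i) = log(x_ij) - mean over j of log(x_ij)
  sweep(log_x, 1, rowMeans(log_x), "-")
}

# Hypothetical counts for three elements at three depths
counts <- data.frame(
  Fe = c(1200, 1500, 900),
  Ti = c(300, 250, 300),
  Ca = c(5000, 4000, 6000)
)
clr_values <- clr_transform(counts)
rowSums(clr_values)  # each row sums to zero, up to rounding error
```

Working on the log scale avoids computing the product in $g_i$ directly, which can overflow for large counts.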

prop.select(): Selection of variables

# --- 1. How should the function be used? ---

prop.select()


1. Cumulative variance method:

The prop.select() function performs a Principal Component Analysis (PCA) on a dataset of normalized variables to identify the most informative elements. By examining the contribution of each variable to the selected principal components, the function highlights those that best capture the structure and variance of the data. Users can choose the number of components to retain using cumulative variance, broken-stick, or manual selection methods and can define a percentile threshold to automatically select the most significant variables. Overall, prop.select() provides an interactive workflow to reduce dimensionality and extract key variables for subsequent analyses, such as clustering or pattern identification.

  • The proportion of variance for each component $j$:

$$ \lambda_j = \frac{\sigma_j^2}{\sum_{i=1}^{p} \sigma_i^2} $$

where $\sigma_j^2$ is the eigenvalue of component $j$ and $p$ the total number of components.

  • Cumulative variance up to component $k$:

$$ V_k = \sum_{j=1}^{k} \lambda_j $$

  • Component selection criterion:

$$ V_k \geq \theta $$

where $\theta$ is the chosen threshold (e.g., $\theta = 0.9$).
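
The cumulative variance criterion can be sketched with `prcomp()` from base R. The data are simulated and stand in for a scaled geochemical table; the variable names are illustrative, and the code is not extracted from prop.select().

```r
set.seed(42)
# Simulated stand-in for a scaled geochemical table (200 depths, 6 variables)
x <- matrix(rnorm(200 * 6), ncol = 6)
x[, 2] <- x[, 1] + 0.3 * x[, 2]  # introduce some correlation

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# lambda_j: proportion of variance carried by each component
lambda <- pca$sdev^2 / sum(pca$sdev^2)

# V_k: cumulative variance; keep the smallest k with V_k >= theta
cum_var <- cumsum(lambda)
theta <- 0.9
n_keep <- which(cum_var >= theta)[1]
n_keep
```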

2. The broken-stick method:

The broken-stick model provides theoretical eigenvalues based on a random partitioning of the total variance among all principal components. These values are then compared to the observed eigenvalues from Principal Component Analysis (PCA). Components are retained only when their observed variance exceeds the corresponding broken-stick expectation.

$$ b_j = \frac{1}{p} \sum_{k=j}^{p} \frac{1}{k} $$

where $p$ is the total number of principal components and $k$ the summation index. The $j$-th principal component is considered significant if its variance proportion $\lambda_j$ satisfies the following condition:

$$ \lambda_j > b_j $$

That is, the explained variance of component (j) must be greater than the variance expected under the broken-stick model.
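
A minimal base R sketch of this comparison is given below. The helper `broken_stick()` and the example proportions are hypothetical. The sketch also assumes that only the leading run of components exceeding their expectation is kept, which is a common convention but is not stated by the package.

```r
# Broken-stick expectations: b_j = (1 / p) * sum_{k = j}^{p} 1 / k
broken_stick <- function(p) {
  rev(cumsum(rev(1 / seq_len(p)))) / p
}

# Hypothetical variance proportions from a three-component PCA
lambda <- c(0.70, 0.20, 0.10)
b <- broken_stick(length(lambda))

# Assumption: keep leading components until the first one with lambda_j <= b_j
exceeds <- lambda > b
n_keep <- if (all(exceeds)) length(lambda) else which(!exceeds)[1] - 1
n_keep
```

With three components the expectations are 11/18, 5/18 and 2/18, so only the first component of this example passes the test.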

3. Loadings threshold:

Once the number of principal components to retain has been determined, the next step is to evaluate the contribution of each variable to these components through their loadings, which identifies the most influential variables. $L_{ij}$ is defined as the loading of the $i$-th variable on the $j$-th principal component in the PCA. For each component $j$, the most influential variables (in absolute value) are identified.

To this end, the absolute weight of the loadings on each component $j$ is calculated:

$$ w_{ij} = |L_{ij}| $$

The threshold $S_j$ is then set to a chosen percentile of these weights:

$$ S_j = \operatorname{Quantile}_{p}(w_{1j}, w_{2j}, \dots, w_{nj}) $$

where $n$ is the number of variables and $p$ is the chosen percentile (here $p$ is a probability, not the number of components used above). For example, for the 90th percentile of the weights, $p = 0.90$.

Finally, variable $i$ is selected for component $j$ if:

$$ w_{ij} \geq S_j $$

Thus, only variables whose contribution exceeds the chosen percentile threshold for each component are retained.
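
The percentile rule can be sketched in base R as follows. The helper `select_by_loadings()` and the loading matrix are hypothetical. The sketch uses R's default quantile definition and keeps a variable as soon as it passes the threshold on at least one retained component; both choices are assumptions rather than documented behaviour of prop.select().

```r
# Keep variables whose absolute loading reaches the p-quantile of a component
select_by_loadings <- function(loadings, p = 0.90) {
  w <- abs(loadings)                                   # w_ij = |L_ij|
  thresholds <- apply(w, 2, quantile, probs = p, names = FALSE)  # S_j
  above <- sweep(w, 2, thresholds, ">=")               # w_ij >= S_j
  rownames(w)[rowSums(above) > 0]
}

# Hypothetical loadings of four variables on two retained components
loadings <- matrix(
  c(0.9, 0.1, 0.2, 0.3,
    0.1, 0.8, 0.2, 0.3),
  ncol = 2,
  dimnames = list(c("Fe", "Ti", "Ca", "K"), c("PC1", "PC2"))
)
selected <- select_by_loadings(loadings, p = 0.75)
selected
```

With a `prcomp()` result, the loading matrix would be the first columns of its `rotation` element, for example `pca$rotation[, 1:2, drop = FALSE]`.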

xrf_clust(): Clustering on selected variables

# --- 1. How should the function be used? ---

xrf_clust()

The xrf_clust() function performs clustering analysis on a dataset of selected variables. It determines the optimal number of clusters, assigns observations to clusters, analyses cluster drivers, and offers several visualization options. Its main enhancement is the coupling of PCA-based variable selection with the estimation of the optimal number of clusters. The best number of clusters is determined using the set of validity indices available in the NbClust package (Charrad et al., 2014), providing a data-driven choice for k-means or hierarchical clustering.

The user may also choose the number of clusters manually, for example based on the number of sedimentary facies visually identified in the core. Both approaches, automatic and manual, are fully supported. Comparing validation metrics for the automatic and manually selected numbers of clusters helps determine which configuration yields the most reliable and interpretable results.

To assess the robustness of the clusters, four complementary indices are calculated: Silhouette, Davies–Bouldin, Bootstrap Stability and Calinski–Harabasz Index.
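
To make the idea of index-based selection concrete, the base R sketch below scores k-means solutions with the Calinski-Harabasz index alone and then cuts a Ward hierarchical tree at the same number of clusters. It is a simplified stand-in: it does not call NbClust or xrf_clust(), the data are simulated, and the helper `calinski_harabasz()` is hypothetical.

```r
set.seed(123)
# Simulated stand-in for selected, transformed variables: three groups
centers <- rbind(c(0, 0), c(5, 5), c(10, 0))
x <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(50, centers[g, 1], 0.5), rnorm(50, centers[g, 2], 0.5))
}))

# Calinski-Harabasz index from the sums of squares of a kmeans fit
calinski_harabasz <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Score candidate numbers of clusters and keep the best one
k_values <- 2:6
ch <- sapply(k_values, function(k) {
  calinski_harabasz(kmeans(x, centers = k, nstart = 25), nrow(x))
})
best_k <- k_values[which.max(ch)]

# Hierarchical alternative: Ward linkage cut at the same number of clusters
hc_labels <- cutree(hclust(dist(x), method = "ward.D2"), k = best_k)
table(hc_labels)
```

A manually chosen number of clusters can be scored with the same helper, which gives a simple way to compare it with the automatic choice.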


Visualization

Finally, one of the key strengths of the SedimentR package and notably the xrf_clust() function is the ability to visually compare a virtual core constructed from clustering results with the actual sediment core photograph. This comparison is complemented by the display of elemental variations along the core depth. The integrative approach of sedimentR enhances the interpretability of sediment core analyses and provides a clear, comprehensive view of geochemical and sedimentological variability.
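
One way to picture a virtual core is as a stack of depth intervals that share the same cluster label. The base R sketch below builds such intervals and defines a small plotting helper. The depths, labels and the function `plot_virtual_core()` are hypothetical, and this is not the plotting code of xrf_clust().

```r
# Hypothetical depths (mm) and cluster labels, ordered from top to bottom
depth <- seq(10, 70, by = 10)
cluster <- c(1, 1, 2, 2, 2, 1, 1)

# Collapse consecutive identical labels into stratigraphic units
runs <- rle(cluster)
last <- cumsum(runs$lengths)
first <- last - runs$lengths + 1
top <- depth[first]
# Each unit ends where the next one starts; the last ends at the deepest sample
bottom <- c(top[-1], depth[length(depth)])
units <- data.frame(top = top, bottom = bottom, cluster = runs$values)

# Draw one coloured band per unit, with depth increasing downwards
plot_virtual_core <- function(units) {
  plot(NULL, xlim = c(0, 1), ylim = rev(range(units$top, units$bottom)),
       xaxt = "n", xlab = "", ylab = "Depth (mm)")
  rect(0, units$top, 1, units$bottom, col = units$cluster + 1, border = NA)
}

units
```

Calling `plot_virtual_core(units)` draws the bands, which can then be placed next to a core photograph or elemental profiles plotted on the same depth axis.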


Data available

An example dataset is available online via Zenodo (https://doi.org/10.5281/zenodo.20527130).

Getting help & contributing

If you encounter a clear bug or have a question or suggestion, please either open an issue or send an email to Romain Ducruet (romain.ducruet@gmail.com) and Amaury Bardelle (amaury.bardelle@icloud.com).
