
Feat/sample and analytics #372

Open
qplevier wants to merge 4 commits into ckan:master from GenomicDataInfrastructure:feat/sample-and-analytics

@qplevier

This pull request implements improved handling and serialization of the sample and analytics fields in DCAT-AP and HealthDCAT-AP profiles, aligning them with the latest Euro DCAT-AP 3 and HealthDCAT-AP specifications. It also enhances agent serialization for HealthDCAT-AP with support for additional properties. The changes update both parsing and serialization logic, as well as relevant tests and schemas.

DCAT-AP and Euro DCAT-AP 3:

  • Added support for the sample field as a list of URIs in both parsing (parse_dataset) and serialization (graph_from_catalog, _graph_from_dataset_v3), including schema updates and test coverage.
  • Refactored handling of distributions to consistently collect and serialize their URIs in a new distribution field, and improved compatibility tweaks for legacy support.
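
As a rough illustration of the sample round trip described above, the sketch below uses plain tuples in place of an rdflib graph. The function names `sample_triples` and `parse_samples` are hypothetical and are not the profile's actual methods; only the ADMS `sample` property URI comes from the vocabulary itself.

```python
# Sketch only: triples are plain tuples standing in for an rdflib graph.
ADMS_SAMPLE = "http://www.w3.org/ns/adms#sample"


def sample_triples(dataset_uri, dataset_dict):
    """Yield one (subject, predicate, object) triple per non-empty sample URI."""
    for sample_uri in dataset_dict.get("sample", []):
        if sample_uri:
            yield (dataset_uri, ADMS_SAMPLE, sample_uri)


def parse_samples(triples, dataset_uri):
    """Collect the sample URIs attached to a dataset back into a list."""
    return [o for s, p, o in triples if s == dataset_uri and p == ADMS_SAMPLE]


dataset = {"sample": ["https://example.org/dist/1", "", "https://example.org/dist/2"]}
triples = list(sample_triples("https://example.org/dataset/a", dataset))
round_tripped = parse_samples(triples, "https://example.org/dataset/a")
```

Empty values are skipped on the way out, so the parsed list contains only the two real URIs.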

HealthDCAT-AP:

  • Added support for the analytics field as a list of URIs, including parsing, serialization, and test updates.
  • Enhanced agent serialization to include publisherNote and publisherType properties, with multilingual support and proper RDF output.
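
A minimal sketch of how multilingual agent properties might expand into one language-tagged value per translation. The namespace string, the dict keys `publisher_note` and `publisher_type`, and the function name are all assumptions made for illustration; check them against the profile code before relying on them.

```python
# Assumed namespace and property names; verify against the HealthDCAT-AP profile.
HEALTH_NS = "http://healthdataportal.eu/ns/health#"


def agent_triples(agent_uri, agent):
    """Build (subject, predicate, value, language) tuples for the extra agent fields."""
    triples = []
    # publisherNote: one language-tagged value per non-empty translation
    for lang, text in sorted(agent.get("publisher_note", {}).items()):
        if text:
            triples.append((agent_uri, HEALTH_NS + "publisherNote", text, lang))
    # publisherType: assumed to be a URI, so it carries no language tag
    if agent.get("publisher_type"):
        triples.append(
            (agent_uri, HEALTH_NS + "publisherType", agent["publisher_type"], None)
        )
    return triples


agent = {
    "publisher_note": {"en": "National health institute", "nl": "Nationaal gezondheidsinstituut", "fr": ""},
    "publisher_type": "https://example.org/publisher-type/academia",
}
result = agent_triples("https://example.org/agent/1", agent)
```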

Testing and Configuration:

  • Updated and added tests to reflect the new handling of sample and analytics fields, and adjusted test configuration for database connectivity.

These changes ensure better compliance with the latest DCAT-AP standards and improve interoperability and data quality for CKAN-based data catalogs.

Extract distribution parsing into _parse_distribution and use it when building dataset dicts; collect distribution URIs into dataset_dict["distribution"] and emit DCAT.distribution triples when graphing. Add ADMS.sample handling in DCAT-AP3 parsing and round-trip graph serialization for dataset samples. Extend the HealthDCAT-AP profile to parse/serialize analytics distributions and include publisherNote/publisherType on agents, with helpers to read/write those properties. Update test to use .get() for analytics presence. Overall, this reduces duplicated distribution parsing code and adds support for sample, analytics, and agent metadata.
Add handling for the dataset 'sample' property in the European DCAT-AP 3 profile by using _add_list_triples_from_dict with ADMS.sample (allowing URI or literal). Update schema (dcat_ap_full.yaml) to include a 'sample' dataset field. Adjust tests: remove legacy v2 sample assertions, add parsing/serialization checks for v3, and update the example dataset JSON to include sample values. These changes enable parsing and serializing ADMS.sample values for DCAT-AP v3 datasets.
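
The "URI or literal" choice mentioned in the commit message above can be pictured with a small stand-alone helper. This is a sketch under assumptions, not the extension's implementation: `as_uri_or_literal` is a hypothetical name, and treating only http(s) values with a host as URIs is a simplification.

```python
from urllib.parse import urlparse


def as_uri_or_literal(value):
    """Tag a value as ("uri", ...) when it looks like an http(s) URL, else ("literal", ...)."""
    stripped = value.strip()
    parsed = urlparse(stripped)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return ("uri", stripped)
    return ("literal", value)


samples = ["https://example.org/dist/sample-1", "sample file on request"]
typed = [as_uri_or_literal(v) for v in samples]
```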

@amercader (Member) left a comment

@qplevier The sample and analytics handling looks good, but I'm not sure about the logic around distribution URIs.

Comment on lines +224 to +227
    distribution_uris.append(str(distribution))

if distribution_uris:
    dataset_dict["distribution"] = distribution_uris

I'm confused about this. Why do we need it?

for dist_uri in dataset_dict.get("distribution", []):
    if dist_uri:
        g.add((dataset_ref, DCAT.distribution, URIRef(dist_uri)))

Why is this done? IIUC, dataset["distribution"] is an internal property used during parsing that should not be part of the output dataset_dict, so it shouldn't be available when serializing (or at least we shouldn't rely on it being present).
Besides, the reference between the dataset and each distribution is already added on line 608.
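
To illustrate the point about not relying on a parse-time key, here is a hedged sketch that derives the dataset-to-distribution links from the dataset's resources only. It uses plain tuples instead of an rdflib graph; the `distribution_links` name and the fallback URI pattern are hypothetical and do not reflect how the extension actually builds resource URIs.

```python
DCAT_DISTRIBUTION = "http://www.w3.org/ns/dcat#distribution"


def distribution_links(dataset_uri, dataset_dict):
    """Link the dataset to each resource, ignoring any parse-time "distribution" key."""
    links = []
    for resource in dataset_dict.get("resources", []):
        # Hypothetical fallback: build a URI from the resource id when none is stored.
        uri = resource.get("uri") or f"{dataset_uri}/resource/{resource['id']}"
        links.append((dataset_uri, DCAT_DISTRIBUTION, uri))
    return links


dataset = {
    "distribution": ["https://example.org/stale"],  # internal key, deliberately unused
    "resources": [{"id": "r1", "uri": "https://example.org/dist/1"}, {"id": "r2"}],
}
links = distribution_links("https://example.org/dataset/a", dataset)
```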

# Resources
distribution_uris = []
for distribution in self._distributions(dataset_ref):

I know the method is long, but I'd like to keep this in the _parse_dataset_base() method for now so as not to break other profiles.
