Feature/dq by bidhan-nagarro · Pull Request #444 · unicef/giga-dagster

bidhan-nagarro · 2026-04-01T12:01:11Z

What type of PR is this?

build: Commits that affect build components such as the build tool, dependencies, or project version
chore: Miscellaneous commits (e.g. modifying .gitignore)
ci: Special build commits that affect the CI/CD pipeline
docs: Commits that affect documentation only
feat: Commits that add a new feature
fix: Commits that fix a bug
perf: Special refactor commits that improve performance
refactor: Commits that rewrite or restructure code without changing any behaviour
revert: Commits that revert another commit/PR, usually autogenerated on GitHub or created with git revert
style: Special refactor commits that edit the code to comply with a code style, linter, or formatter
test: Commits that add missing tests or correct existing tests

Summary

What does this PR do

How to test

Instructions on how to test
Specify which files to review
etc.

Link to Jira/Asana/Airtable task (if applicable)

placeholder

Wireframe screenshot/screencap (if applicable)

placeholder

Implementation screenshot/screencap (if applicable)

placeholder

brianmusisi

I've added a bunch of comments. In the end, the only areas we need to change are the ones where I've written "We need this".

brianmusisi · 2026-04-21T13:14:52Z

 from pyspark.sql.types import NullType, StructType
 from sqlalchemy import select, update
 from src.constants import DataTier
+from src.data_quality_checks.dq_context import DQContext, DQMode


No changes in this file or this flow

brianmusisi · 2026-04-21T13:15:11Z

    dt = DeltaTable.forName(s, f"school_master.{config.country_code}")
    master = dt.toDF()
-
+    dq_context = DQContext(


Same here, no changes in this file or this flow

brianmusisi · 2026-04-21T13:15:42Z

    TimestampType,
 )
 from src.constants import DataTier, constants
+from src.data_quality_checks.dq_context import DQContext, DQMode


No changes to this file/flow, this only affects school_geolocation

brianmusisi · 2026-04-21T13:15:51Z

    )
    columns = get_schema_columns(s, f"coverage_{source}")
    df = add_missing_columns(df, columns)
+


brianmusisi · 2026-04-21T13:16:48Z

    else:
        silver = s.createDataFrame(s.sparkContext.emptyRDD(), schema=schema)
-
+    dq_context = DQContext(


No changes here

brianmusisi · 2026-04-21T13:20:47Z

            country_code = filename_components.country_code
            metadata = adls_file_client.fetch_metadata_for_blob(adls_filepath) or {}

+            # Only process files with dq_mode="master" for the full merge pipeline.


This is not relevant because the approval workflow from Giga Sync only sends data that can be merged. This only applies before data has been sent for approval

brianmusisi · 2026-04-21T13:21:48Z

            size = properties.size

+            run_key = str(path)
+            try:


It's not clear what this is doing

brianmusisi · 2026-04-21T13:23:00Z


    renamed_bronze = casted_bronze.withColumnRenamed("signature", "dq_signature")

+    dq_context = DQContext(


All the DQContext changes do not seem necessary in this PR.

brianmusisi · 2026-04-21T13:26:43Z

-    elif dataset_type == "geolocation":
-        if mode == UploadMode.CREATE.value:
+
+    elif dq_context.dataset_type == "geolocation":


We need this:
Here, we need to run this only if the DQMode is "master".

Alternatively, set the resulting is_not_create and is_not_update columns to 0 so that they don't cause critical errors when the DQMode is not "master".
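A minimal sketch of the first option, gating the create/update cross-checks on the DQ mode. The `DQMode` enum here is a hypothetical stand-in for the one imported from `src.data_quality_checks.dq_context`; the `UPLOADED` member name and the helper `should_run_create_update_checks` are assumptions for illustration, not code from this PR.

```python
from enum import Enum


class DQMode(Enum):
    # Hypothetical stand-in for src.data_quality_checks.dq_context.DQMode.
    # "master" and "uploaded" are the mode values mentioned in this PR;
    # the member name UPLOADED is an assumption.
    MASTER = "master"
    UPLOADED = "uploaded"


def should_run_create_update_checks(dataset_type: str, dq_mode: DQMode) -> bool:
    """Return True only when the checks against silver should run.

    The create/update checks compare the upload with existing silver data,
    so they are only meaningful for geolocation data in master mode.
    """
    return dataset_type == "geolocation" and dq_mode == DQMode.MASTER
```

With a guard like this, the `elif` branch for geolocation would simply be skipped for uploaded-file runs, so no `is_not_create` or `is_not_update` columns are produced that could be counted as critical errors.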

brianmusisi · 2026-04-21T13:27:19Z

    schema = StructType(columns)

-    if check_table_exists(s, schema_name, country_code, DataTier.SILVER):
+    if dq_mode == DQMode.MASTER and check_table_exists(


We need this:

This requires no change

brianmusisi

Please make the requested changes

brianmusisi

One thing to add: when running on just the uploaded file, geolocation_staging shouldn't be run. We don't want to add data to the staging tables for approval.

bidhan-nagarro · 2026-04-23T10:03:39Z

One thing to add: when running on just the uploaded file, geolocation_staging shouldn't be run. We don't want to add data to the staging tables for approval.

@brianmusisi changes done.

brianmusisi · 2026-06-04T12:56:52Z

+        # For assessment-only (uploaded mode), skip cross-checks against silver
+        # but ensure the columns exist to avoid downstream errors.
+        if dq_context.upload_mode == UploadMode.CREATE.value:
+            df = df.withColumn("dq_is_not_create", f.lit(0))
+        elif dq_context.upload_mode == UploadMode.UPDATE.value:
+            df = df.withColumn("dq_is_not_update", f.lit(0))
+


We actually don't want to do this, so we can skip it

brianmusisi · 2026-06-04T12:58:40Z

+
+def row_level_checks_internal(
+    df: sql.DataFrame,
+    dq_context: DQContext,


Because we've added dq_context, it will break the uses of this function in row_level_checks for the master and reference checks
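One way to avoid breaking the existing callers is to make the new parameter optional with a default that preserves the old behaviour. The sketch below is an assumption-laden illustration: `DQContext` is reduced to two hypothetical fields, and a list of dicts stands in for the Spark DataFrame so the example runs without PySpark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DQContext:
    # Hypothetical, reduced stand-in for the PR's DQContext.
    dataset_type: str
    dq_mode: str = "master"


def row_level_checks_internal(
    rows: list[dict],
    dataset_type: str,
    dq_context: Optional[DQContext] = None,
) -> list[dict]:
    """Tag each row with the DQ mode it was checked under.

    `rows` stands in for the Spark DataFrame used in the real function.
    """
    # Fall back to master-mode behaviour when no context is supplied, so
    # callers that predate dq_context (e.g. the master and reference
    # checks) keep working unchanged.
    if dq_context is None:
        dq_context = DQContext(dataset_type=dataset_type)
    return [{**row, "dq_mode": dq_context.dq_mode} for row in rows]
```

Existing call sites can keep calling `row_level_checks_internal(rows, "master")`, while the geolocation flow passes `dq_context=` explicitly.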

brianmusisi · 2026-06-04T13:00:15Z

            context.log.info(f"FILE: {path}")
            yield RunRequest(
-                run_key=str(path),
+                run_key=f"{path}:{last_modified}:{dq_triggered_at}",


Changing the run key will make all previous uploaded files run again since they will look new. I have seen this happen on dev already. Whatever we do, we should not change this
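To illustrate why this happens: Dagster sensors deduplicate `RunRequest`s by `run_key`, so a key that has never been seen before always launches a run. The sketch below simulates that bookkeeping in plain Python; the file paths, the timestamp, and the helper `new_run_keys` are hypothetical and only show the effect of changing the key format.

```python
def new_run_keys(candidate_keys: list[str], seen_keys: set[str]) -> list[str]:
    """Return the keys that have not been seen before (these would run)."""
    return [key for key in candidate_keys if key not in seen_keys]


# Keys recorded by earlier sensor ticks, using the original str(path) format.
# The paths are hypothetical.
seen = {"raw/uploads/a.csv", "raw/uploads/b.csv"}

old_format = ["raw/uploads/a.csv", "raw/uploads/b.csv"]
# The proposed format appends extra fields, so no key matches a stored one.
new_format = [f"{path}:2026-04-01T00:00:00Z:None" for path in old_format]

rerun_old = new_run_keys(old_format, seen)  # nothing is re-run
rerun_new = new_run_keys(new_format, seen)  # every file looks new
```

Keeping `run_key=str(path)` leaves the stored keys valid, which is why the original format should stay as it is.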

brianmusisi

Take a look at the comments I've added

bidhan-nagarro and others added 5 commits February 27, 2026 15:53

feat: Data Quality dagster Changes

219f5e9

feat: Data Quality dagster Changes

22a5a82

Merge branch 'main' into feature/dq

a7bae4c

fix: improve processing time for large files (#445)

96970d4

Merge branch 'main' into feature/dq

43a3aa6

bidhan-nagarro requested a review from brianmusisi April 2, 2026 05:15

feat: pre-commit fix

728acec

brianmusisi force-pushed the main branch from 96970d4 to fda829a Compare April 2, 2026 12:21

brianmusisi force-pushed the main branch from 4af2c06 to b1b12c3 Compare April 9, 2026 12:44

bidhan-nagarro added 2 commits April 20, 2026 10:23

Merge branch 'main' into feature/dq

7dcc4d9

fix: DQ fixes

896cb18

brianmusisi reviewed Apr 21, 2026

brianmusisi requested changes Apr 21, 2026

bidhan-nagarro added 2 commits April 21, 2026 20:39

fix: review comments addressed

096d9ff

fix: review comments addressed

5bfc3b8

bidhan-nagarro requested a review from brianmusisi April 21, 2026 15:10

bidhan-nagarro added 4 commits April 22, 2026 15:09

fix: DQ fixes

848e743

fix: DQ fixes

6874b40

fix: dq fixes

f6aa504

fix: dq fixes

b88a7a9

brianmusisi reviewed Apr 23, 2026

fix: skip staging check for dq_mode=uploaded

b30a332

bidhan-nagarro requested a review from brianmusisi April 23, 2026 10:03

bidhan-nagarro added 5 commits April 24, 2026 16:29

Merge branch 'main' into feature/dq

bd74529

fix: dq fixes

9bb5fb8

Merge branch 'main' into feature/dq

4d1d5c0

Merge branch 'main' into feature/dq

054a3b4

fix: DQ enhancements

f58f55b

bidhan-nagarro force-pushed the feature/dq branch from da88992 to f58f55b Compare May 22, 2026 04:06

fix: DQ enhancements

f238920

brianmusisi force-pushed the main branch from 99b17b2 to 8597c46 Compare May 22, 2026 16:21

brianmusisi force-pushed the main branch from 69038e9 to 5c2a59f Compare May 29, 2026 17:07

brianmusisi requested changes Jun 4, 2026



2 participants