Feature/tech 6338/download dq kit by gauravgupta-nagarro · Pull Request #461 · unicef/giga-dagster

gauravgupta-nagarro · 2026-05-28T15:55:49Z

What type of PR is this?

build: Commits that affect build components like build tool, dependencies, project
version
chore: Miscellaneous commits (e.g. modifying .gitignore)
ci: Commits are special build commits that affect the CI/CD pipeline
docs: Commits that affect documentation only
feat: Commits that add a new feature
fix: Commits that fix a bug
perf: Commits are special refactor commits that improve performance
refactor: Commits that rewrite/restructure your code, however does not change any
behaviour
revert: Commits that revert another commit/PR, usually can be autogenerated on
GitHub or using git revert
style: Commits are special refactor commits that edit the code to comply with a
code style, linter, or formatter
test: Commits that add missing tests or correcting existing tests

Summary

What does this PR do

How to test

Instructions on how to test
Specify which files to review
etc.

Link to Jira/Asana/Airtable task (if applicable)

placeholder

Wireframe screenshot/screencap (if applicable)

placeholder

Implementation screenshot/screencap (if applicable)

placeholder


brianmusisi

Made some change requests and also added some requests for clarification

brianmusisi · 2026-06-02T10:54:25Z


+    # Add NULL columns for DQ flags only if the Delta table schema already has them.
+    # This avoids adding them to new tables in staging/production.
+    if check_table_exists(s, schema_name, country_code, DataTier.SILVER):


We shouldn't need to do this. Downstream, we use the mergeSchema option for all of these tables. For now, we should mark these columns as nullable in the schema; when the data doesn't exist in the table, the values will be null and merging will still be possible.

I temporarily added this to make the flow work, as there was a column mismatch in the tables. When we update the schema file, the migrate schema sensor runs but doesn't delete previously added columns.

Removing it.

brianmusisi · 2026-06-02T10:55:45Z


    renamed_bronze = casted_bronze.withColumnRenamed("signature", "dq_signature")

+    # Diagnose duplicates at the start of DQ pipeline


What does this do?

This was added for logging purposes on dev, as I was not seeing the expected result.
Removing it.

brianmusisi · 2026-06-02T10:55:58Z


    dq_results = dq_results.withColumnRenamed("dq_signature", "signature")

+    # Check for duplicates after row_level_checks


And what does this do?

This was added for logging purposes on dev, as I was not seeing the expected result.
Removing it.

brianmusisi · 2026-06-02T10:56:22Z


    context.log.info("Create a new dataframe with only the relevant columns")
+    context.log.info(
+        f"Input DQ results: {geolocation_data_quality_results.count()} rows"


This too, do we need this?

This was added for logging purposes on dev, as I was not seeing the expected result.
Removing it.

brianmusisi · 2026-06-02T10:57:17Z

-        geolocation_data_quality_results, uploaded_columns, mode
+        geolocation_data_quality_results, uploaded_columns, mode, context
    )
+    context.log.info(f"After extract_relevant_columns: {df.count()} rows")


Same here. Counts can be costly on large datasets, especially without checkpointing or caching. Do we need this?

This was added for logging purposes on dev, as I was not seeing the expected result.
Removing it.
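
If a row count is ever needed again for debugging, one option is to gate it behind an explicit flag so the expensive `count()` never runs by default. This is only a sketch: `log_row_count`, the `enabled` flag, and `FakeDataFrame` are hypothetical names that are not part of the PR, and the stand-in class only mimics the `count()` method of a Spark DataFrame.

```python
import logging

logger = logging.getLogger("dq_debug")


def log_row_count(df, label: str, enabled: bool = False):
    """Log the row count of `df` only when `enabled` is True.

    Returns the count, or None when counting was skipped.
    """
    if not enabled:
        return None  # skip the potentially expensive count() action
    row_count = df.count()
    logger.info("%s: %d rows", label, row_count)
    return row_count


class FakeDataFrame:
    """Hypothetical stand-in that records how often count() is called."""

    def __init__(self, rows):
        self.rows = rows
        self.count_calls = 0

    def count(self):
        self.count_calls += 1
        return len(self.rows)


df = FakeDataFrame([1, 2, 3])
skipped = log_row_count(df, "After extract_relevant_columns")
counted = log_row_count(df, "After extract_relevant_columns", enabled=True)
```

With the flag off, `count()` is never triggered, so the default pipeline run pays no extra cost.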

brianmusisi · 2026-06-02T13:10:07Z

+    return df[df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)]
+
+
+def _get_map_bounds(


If we don't have latitude/longitude in the dataframe, shouldn't we skip this entirely and not create the map in the first place? That way there's no need to get these bounds.

Also, let's use the vectorized pandas versions of these operations instead. Doing passed_df["latitude"].mean() or passed_df["latitude"].min() is much faster and cleaner than this.

Sure, I will remove it. Could you please have a look at the map itself first? I expect the map is not final yet; once it is, I will refactor this file.
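
For reference, a minimal sketch of the vectorized approach suggested above, assuming a pandas DataFrame with `latitude` and `longitude` columns. The function name `compute_map_bounds` and its return shape are hypothetical and not taken from the PR.

```python
import pandas as pd


def compute_map_bounds(df: pd.DataFrame):
    """Return ((min_lat, min_lon), (max_lat, max_lon)), or None if no map can be drawn."""
    # Without coordinate columns there is nothing to plot, so skip the map entirely.
    if not {"latitude", "longitude"}.issubset(df.columns):
        return None
    # Keep only rows with valid coordinates; NaN values fail `between` and are dropped.
    valid = df[df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)]
    if valid.empty:
        return None
    # Vectorized min/max over whole columns instead of row-by-row loops.
    south_west = (float(valid["latitude"].min()), float(valid["longitude"].min()))
    north_east = (float(valid["latitude"].max()), float(valid["longitude"].max()))
    return south_west, north_east


schools = pd.DataFrame(
    {"latitude": [1.5, -3.0, 95.0], "longitude": [30.0, 36.5, 10.0]}
)
bounds = compute_map_bounds(schools)
```

Returning `None` lets the caller decide not to build the map at all when coordinates are absent or all invalid.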

brianmusisi · 2026-06-02T13:10:26Z

+    )
+
+
+def _fmt(value) -> str:


Let's give this and the other functions clearer, more descriptive names.

Also, I'm not sure what this does.

It is a helper function that returns the values rendered on the map and in the tooltip.

brianmusisi · 2026-06-02T13:11:39Z

+def _fmt_int(value) -> str:
+    """Format population counts with thousands separators."""
+    formatted = _fmt(value)
+    if formatted == "N/A":
+        return formatted
+    try:
+        return f"{int(float(formatted)):,}"
+    except (TypeError, ValueError):
+        return formatted
+
+
+def _flag(value) -> str:
+    """Convert raw int DQ flag (1/0/None) to true/false string."""
+    if value is None:
+        return "N/A"
+    try:
+        if pd.isna(value):
+            return "N/A"
+    except (TypeError, ValueError):
+        pass
+    try:
+        return "true" if int(float(value)) == 1 else "false"
+    except (TypeError, ValueError):
+        return "N/A"


Let's name these better, and also confirm whether and why they are needed.
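
As an illustration of the naming request, here is one possible shape for these helpers using only the standard library. The names `is_missing`, `format_dq_flag`, and `format_population` are hypothetical. Unlike the original `_fmt_int`, this sketch returns "N/A" for non-numeric input rather than echoing the value back, and it detects missing values with `math.isnan` instead of `pd.isna`.

```python
import math


def is_missing(value) -> bool:
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def format_dq_flag(value) -> str:
    """Render a raw DQ flag (1/0/None) as "true", "false", or "N/A"."""
    if is_missing(value):
        return "N/A"
    try:
        return "true" if int(float(value)) == 1 else "false"
    except (TypeError, ValueError, OverflowError):
        return "N/A"


def format_population(value) -> str:
    """Render a count with thousands separators, or "N/A" if missing or non-numeric."""
    if is_missing(value):
        return "N/A"
    try:
        return f"{int(float(value)):,}"
    except (TypeError, ValueError, OverflowError):
        return "N/A"
```

For example, `format_dq_flag("0")` gives "false" and `format_population(1234567.0)` gives "1,234,567".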

brianmusisi · 2026-06-02T13:13:13Z

+        return "N/A"
+
+
+def _build_popup(


Let's add a docstring here and to the other functions. Make clear what each function does, what inputs it takes, and what it returns. Also use descriptive names.

brianmusisi · 2026-06-02T13:13:31Z

+        return "N/A"
+
+
+def _build_popup(


What does this function do? And can we use Jinja templates instead of this?

It is used to create the map pin tooltip.
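
A rough sketch of what the Jinja suggestion could look like, assuming the `jinja2` package is available. The template fields (`name`, `latitude`, `longitude`) and the function name `render_school_popup` are hypothetical; the actual tooltip fields in the PR may differ.

```python
from jinja2 import Template

# Hypothetical popup markup; autoescape guards against HTML in school names.
POPUP_TEMPLATE = Template(
    "<b>{{ name }}</b><br>"
    "Latitude: {{ latitude }}<br>"
    "Longitude: {{ longitude }}",
    autoescape=True,
)


def render_school_popup(name: str, latitude: float, longitude: float) -> str:
    """Return the HTML shown in a map pin tooltip for one school."""
    return POPUP_TEMPLATE.render(name=name, latitude=latitude, longitude=longitude)


popup_html = render_school_popup("A & B School", 1.5, 30.0)
```

Keeping the markup in a template separates presentation from the Python logic, and autoescaping handles characters such as `&` in school names.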

brianmusisi

Added more comments.

The rest of the map logic looks fine, apart from the comments I made. I would love to see an example of the map to compare.

brianmusisi · 2026-06-03T15:06:15Z

+    from src.constants import constants
+    from src.utils.map_generator import generate_school_map_html
+
+    _ = geolocation_dq_schools_passed_human_readable


Since you're using this just to get the counts, you can use the result of the data_quality_results filter for the counts you need. Reading from ADLS directly is not required.

brianmusisi · 2026-06-03T15:07:12Z

+    from src.utils.dq_kit_generator import generate_dq_kit_zip_bytes
+
+    # `geolocation_school_map` is consumed only as a dependency marker.
+    _ = geolocation_school_map


You don't need to do this. We can leave it unused in the function.

@brianmusisi Done. I also refactored the map_generator.py file.

brianmusisi and others added 18 commits March 27, 2026 10:49

chore: upgrade packages for pydantic 2 support

8a8b0bf

feat: TECH-6601 - Add Giga Spatial Lib & DQ Checks

0e78224

feat: TECH-6601 - school_area_type_smod removed & rural_urban renamed…

7e88c90

… to rurban_detected

feat: Synced with main branch

1519391

feat: TECH - 6601: Pydentic Issue fixed

528251f

feat: temp logs added

2472039

feat: Merged main

8c0d4b5

feat: TECH - 6338 Basic Map and DQ Kit Zip file logic added

71fab6b

feat: TECH - 6338 Maps Generation logic fixed

294f747

feat: TECH - 6338 DQ Kit generation issue fixed

3206e89

feat: TECH - 6338: File path issue fixed

3f18131

feat: TECH - 6338: School Master added in DQ Kit

25a0d54

feat: TECH - 6338 Maps updated

a4c471b

feat: TECH - 6338 Maps updated

29a9cdd

feat: TECH - 6338 Maps updated

57272f5

feat: TECH - 6338 Maps updated

71b7dc9

feat: TECH - 6338 Maps updated

c2e98b0

feat: TECH 6338 Reading passed and failed rows from ADLS

810f980

gauravgupta-nagarro requested review from Javiershenbc and brianmusisi May 28, 2026 15:56

Gaurav Gupta added 3 commits May 28, 2026 21:40

feat: TECH 6338 Reading passed and failed rows from ADLS

df39853

feat: TECH 6338 Reading passed and failed rows from ADLS

5338b4f

feat: TECH 6338 Reading passed and failed rows from ADLS

dc1faa5

brianmusisi force-pushed the main branch from 69038e9 to 5c2a59f Compare May 29, 2026 17:07

brianmusisi requested changes Jun 2, 2026


feat: TECH 6338 Removed Unnecessary Logging

7cede00

brianmusisi requested changes Jun 3, 2026


Gaurav Gupta added 3 commits June 4, 2026 10:50

feat: TECH 6338 Removed marker dependency

1584b2a

feat: TECH 6338 Removed marker dependency

84c42bc

feat: TECH 6338 Removed marker dependency

b99dad4

Gaurav Gupta added 2 commits June 4, 2026 16:15

feat: TECH 6338 Map generataion code refactored

f875e77

feat: TECH 6338 Map generataion code refactored

c61a652


