Skip to content

Feature/tech 6338/download dq kit#461

Open
gauravgupta-nagarro wants to merge 27 commits into
mainfrom
feature/TECH-6338/download-dq-kit
Open

Feature/tech 6338/download dq kit#461
gauravgupta-nagarro wants to merge 27 commits into
mainfrom
feature/TECH-6338/download-dq-kit

Conversation

@gauravgupta-nagarro
Copy link
Copy Markdown
Collaborator

What type of PR is this?

  • build: Commits that affect build components like build tool, dependencies, project
    version
  • chore: Miscellaneous commits (e.g. modifying .gitignore)
  • ci: Commits are special build commits that affect the CI/CD pipeline
  • docs: Commits that affect documentation only
  • feat: Commits that add a new feature
  • fix: Commits that fix a bug
  • perf: Commits are special refactor commits that improve performance
  • refactor: Commits that rewrite/restructure your code, however does not change any
    behaviour
  • revert: Commits that revert another commit/PR, usually can be autogenerated on
    GitHub or using git revert
  • style: Commits are special refactor commits that edit the code to comply with a
    code style, linter, or formatter
  • test: Commits that add missing tests or correcting existing tests

Summary

What does this PR do

How to test

  1. Instructions on how to test
  2. Specify which files to review
  3. etc.

Link to Jira/Asana/Airtable task (if applicable)

placeholder

Wireframe screenshot/screencap (if applicable)

placeholder

Implementation screenshot/screencap (if applicable)

placeholder

Copy link
Copy Markdown
Collaborator

@brianmusisi brianmusisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some change requests and also added some requests for clarification

Comment thread dagster/src/assets/common/assets.py Outdated

# Add NULL columns for DQ flags only if the Delta table schema already has them.
# This avoids adding them to new tables in staging/production.
if check_table_exists(s, schema_name, country_code, DataTier.SILVER):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need to do this. Downstream for all these tables, we use the mergeSchema option. For now, we should set these columns as nullable in the schema, and then when the data doesn't exist in the table it will be null and merging will be possible

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I temporarily added this to make the flow work as there was columns mismatch in the tables. When we update the schema file migrate schema sensor run but doesn't delete previously added columns.

Removing it.


renamed_bronze = casted_bronze.withColumnRenamed("signature", "dq_signature")

# Diagnose duplicates at the start of DQ pipeline
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added for logging purpose on dev as I was not seeing expected result.
Removing it.


dq_results = dq_results.withColumnRenamed("dq_signature", "signature")

# Check for duplicates after row_level_checks
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what does this do?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added for logging purpose on dev as I was not seeing expected result.
Removing it.


context.log.info("Create a new dataframe with only the relevant columns")
context.log.info(
f"Input DQ results: {geolocation_data_quality_results.count()} rows"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too, do we need this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added for logging purpose on dev as I was not seeing expected result.
Removing it.

geolocation_data_quality_results, uploaded_columns, mode
geolocation_data_quality_results, uploaded_columns, mode, context
)
context.log.info(f"After extract_relevant_columns: {df.count()} rows")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Counts can be costly on large datasets, especially without checkpointing/cacheing. Do we need this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added for logging purpose on dev as I was not seeing expected result.
Removing it.

Comment thread dagster/src/utils/map_generator.py Outdated
return df[df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)]


def _get_map_bounds(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have latitude/longitude in the dataframe, shouldn't we just not do this at all, and not create the map in the first place? That way there no need to get these bounds?

Also, let's use vectorized versions of these commands that use pandas instead. Doing passed_df["latitude"].mean() or passed_df['latitude'].min() is much faster and cleaner than this

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will remove it I would request please first have a look at the map itself coz I am hoping the map is not final once map is final, I will refactor this file.

Comment thread dagster/src/utils/map_generator.py Outdated
)


def _fmt(value) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's name this and any other functions better and clearer

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, not sure what this does

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helper function to return values rendered on Map and Tooltip.

Comment thread dagster/src/utils/map_generator.py Outdated
Comment on lines +73 to +96
def _fmt_int(value) -> str:
"""Format population counts with thousands separators."""
formatted = _fmt(value)
if formatted == "N/A":
return formatted
try:
return f"{int(float(formatted)):,}"
except (TypeError, ValueError):
return formatted


def _flag(value) -> str:
"""Convert raw int DQ flag (1/0/None) to true/false string."""
if value is None:
return "N/A"
try:
if pd.isna(value):
return "N/A"
except (TypeError, ValueError):
pass
try:
return "true" if int(float(value)) == 1 else "false"
except (TypeError, ValueError):
return "N/A"
Copy link
Copy Markdown
Collaborator

@brianmusisi brianmusisi Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's name these better, but also confirm why/if these are needed

Comment thread dagster/src/utils/map_generator.py Outdated
return "N/A"


def _build_popup(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a docstring here and to any other functions. And Make it clear what it does, any inputs and what it returns. And use descriptive names.

Comment thread dagster/src/utils/map_generator.py Outdated
return "N/A"


def _build_popup(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this function do? and can we use Jinja templates instead of this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used to create Map Pin Tooltip.

Copy link
Copy Markdown
Collaborator

@brianmusisi brianmusisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more comments.

The rest of the map logic looks fine, except the comments I made. Would love to see an example of the map to compare

from src.constants import constants
from src.utils.map_generator import generate_school_map_html

_ = geolocation_dq_schools_passed_human_readable
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SInce you're using this just to get the counts, you can use the result of data_quality_results filter for the counts you need. Reading from ADLS directly is not required

from src.utils.dq_kit_generator import generate_dq_kit_zip_bytes

# `geolocation_school_map` is consumed only as a dependency marker.
_ = geolocation_school_map
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to do this. we can leave it unused in the function

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brianmusisi done, also refactored map_generator.py file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants