Feature/tech 6338/download dq kit#461
Conversation
… to rurban_detected
brianmusisi
left a comment
There was a problem hiding this comment.
I made some change requests and also asked for clarification in a few places.
|
|
||
| # Add NULL columns for DQ flags only if the Delta table schema already has them. | ||
| # This avoids adding them to new tables in staging/production. | ||
| if check_table_exists(s, schema_name, country_code, DataTier.SILVER): |
There was a problem hiding this comment.
We shouldn't need to do this. Downstream, all of these tables are written with the mergeSchema option. For now, we should declare these columns as nullable in the schema; when the data doesn't exist in the table, the values will be null and the merge will still be possible.
There was a problem hiding this comment.
I added this temporarily to make the flow work, because there was a column mismatch in the tables. When we update the schema file, the migrate schema sensor runs but doesn't delete previously added columns.
Removing it.
|
|
||
| renamed_bronze = casted_bronze.withColumnRenamed("signature", "dq_signature") | ||
|
|
||
| # Diagnose duplicates at the start of DQ pipeline |
There was a problem hiding this comment.
This was added for logging on dev because I was not seeing the expected result.
Removing it.
|
|
||
| dq_results = dq_results.withColumnRenamed("dq_signature", "signature") | ||
|
|
||
| # Check for duplicates after row_level_checks |
There was a problem hiding this comment.
And what does this do?
There was a problem hiding this comment.
This was added for logging on dev because I was not seeing the expected result.
Removing it.
|
|
||
| context.log.info("Create a new dataframe with only the relevant columns") | ||
| context.log.info( | ||
| f"Input DQ results: {geolocation_data_quality_results.count()} rows" |
There was a problem hiding this comment.
This too, do we need this?
There was a problem hiding this comment.
This was added for logging on dev because I was not seeing the expected result.
Removing it.
| geolocation_data_quality_results, uploaded_columns, mode | ||
| geolocation_data_quality_results, uploaded_columns, mode, context | ||
| ) | ||
| context.log.info(f"After extract_relevant_columns: {df.count()} rows") |
There was a problem hiding this comment.
Same here. Counts can be costly on large datasets, especially without checkpointing/caching. Do we need this?
There was a problem hiding this comment.
This was added for logging on dev because I was not seeing the expected result.
Removing it.
| return df[df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)] | ||
|
|
||
|
|
||
| def _get_map_bounds( |
There was a problem hiding this comment.
If we don't have latitude/longitude in the dataframe, shouldn't we skip this entirely and not create the map in the first place? That way there's no need to get these bounds.
Also, let's use the vectorized pandas versions of these operations instead. Doing passed_df["latitude"].mean() or passed_df["latitude"].min() is much faster and cleaner than this.
There was a problem hiding this comment.
Sure, I will remove it. I would ask that you first have a look at the map itself, because I expect the map is not final yet. Once the map is final, I will refactor this file.
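For reference, a minimal sketch of the vectorized approach suggested above. It assumes a pandas DataFrame with latitude and longitude columns; the name compute_map_bounds and the shape of the returned dict are hypothetical and not taken from the PR. Returning None when coordinates are missing lets the caller skip map creation altogether.

```python
from typing import Optional

import pandas as pd


def compute_map_bounds(passed_df: pd.DataFrame) -> Optional[dict]:
    """Return the map centre and bounding box, or None if no coordinates exist.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    # No coordinate columns at all: signal the caller to skip the map.
    if not {"latitude", "longitude"}.issubset(passed_df.columns):
        return None

    # Coerce to numeric and drop rows where either coordinate is missing.
    coords = (
        passed_df[["latitude", "longitude"]]
        .apply(pd.to_numeric, errors="coerce")
        .dropna()
    )
    if coords.empty:
        return None

    lat, lon = coords["latitude"], coords["longitude"]
    # Vectorized reductions replace any per-row Python loop.
    return {
        "center": (float(lat.mean()), float(lon.mean())),
        "south_west": (float(lat.min()), float(lon.min())),
        "north_east": (float(lat.max()), float(lon.max())),
    }
```

A caller could then write `bounds = compute_map_bounds(passed_df)` and return early when `bounds is None`, so the map is never built for data without coordinates.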
| ) | ||
|
|
||
|
|
||
| def _fmt(value) -> str: |
There was a problem hiding this comment.
Let's give this and the other functions clearer, more descriptive names.
There was a problem hiding this comment.
Also, I'm not sure what this does.
There was a problem hiding this comment.
It is a helper function that returns the values rendered on the map and in the tooltip.
| def _fmt_int(value) -> str: | ||
| """Format population counts with thousands separators.""" | ||
| formatted = _fmt(value) | ||
| if formatted == "N/A": | ||
| return formatted | ||
| try: | ||
| return f"{int(float(formatted)):,}" | ||
| except (TypeError, ValueError): | ||
| return formatted | ||
|
|
||
|
|
||
| def _flag(value) -> str: | ||
| """Convert raw int DQ flag (1/0/None) to true/false string.""" | ||
| if value is None: | ||
| return "N/A" | ||
| try: | ||
| if pd.isna(value): | ||
| return "N/A" | ||
| except (TypeError, ValueError): | ||
| pass | ||
| try: | ||
| return "true" if int(float(value)) == 1 else "false" | ||
| except (TypeError, ValueError): | ||
| return "N/A" |
There was a problem hiding this comment.
Let's name these better, and also confirm whether and why they are needed.
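As an illustration of the naming request, here is a hedged sketch of how the helpers shown in the diff could look with descriptive names and docstrings. The names is_missing, format_dq_flag, and format_population are hypothetical. The sketch also swaps pd.isna for a standard-library check, which is an assumption: it only covers None and float NaN, not every value pd.isna treats as missing.

```python
import math
from typing import Any

NOT_AVAILABLE = "N/A"


def is_missing(value: Any) -> bool:
    """Return True for None and float NaN (a narrower check than pd.isna)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def format_dq_flag(value: Any) -> str:
    """Convert a raw DQ flag (1, 0, or missing) to "true", "false", or "N/A"."""
    if is_missing(value):
        return NOT_AVAILABLE
    try:
        return "true" if int(float(value)) == 1 else "false"
    except (TypeError, ValueError, OverflowError):
        # Non-numeric or non-finite input cannot be read as a flag.
        return NOT_AVAILABLE


def format_population(value: Any) -> str:
    """Format a population count with thousands separators, or "N/A" if missing."""
    if is_missing(value):
        return NOT_AVAILABLE
    try:
        return f"{int(float(value)):,}"
    except (TypeError, ValueError, OverflowError):
        # Fall back to the raw text when the value is not numeric.
        return str(value)
```

With names like these, the call sites in the popup builder say what each value is without the reader having to open the helper.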
| return "N/A" | ||
|
|
||
|
|
||
| def _build_popup( |
There was a problem hiding this comment.
Let's add a docstring here and to the other functions. Make it clear what each function does, what inputs it takes, and what it returns. Also use descriptive names.
| return "N/A" | ||
|
|
||
|
|
||
| def _build_popup( |
There was a problem hiding this comment.
What does this function do? And can we use Jinja templates instead of this?
There was a problem hiding this comment.
It is used to create the map pin tooltip.
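A minimal sketch of the Jinja approach asked about above, assuming the jinja2 package is available. The template text, the name render_school_popup, and the (label, value) row structure are all hypothetical; the actual fields shown in the tooltip are not visible in this thread.

```python
from typing import Sequence, Tuple

from jinja2 import Environment

# autoescape=True HTML-escapes interpolated values such as school names.
_env = Environment(autoescape=True)

# Hypothetical popup layout: a bold title followed by one line per field.
POPUP_TEMPLATE = _env.from_string(
    "<b>{{ school_name }}</b><br>"
    "{% for label, value in rows %}"
    "{{ label }}: {{ value }}<br>"
    "{% endfor %}"
)


def render_school_popup(school_name: str, rows: Sequence[Tuple[str, str]]) -> str:
    """Render the HTML for a map pin tooltip.

    Args:
        school_name: Title shown at the top of the popup.
        rows: (label, value) pairs, already formatted as display strings.

    Returns:
        An HTML fragment suitable for a map marker popup.
    """
    return POPUP_TEMPLATE.render(school_name=school_name, rows=rows)
```

Keeping the markup in a template separates layout from the formatting helpers, and autoescaping avoids hand-built HTML strings breaking on characters such as & or <.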
brianmusisi
left a comment
There was a problem hiding this comment.
Added more comments.
The rest of the map logic looks fine, apart from the points I commented on. I would love to see an example of the map to compare.
| from src.constants import constants | ||
| from src.utils.map_generator import generate_school_map_html | ||
|
|
||
| _ = geolocation_dq_schools_passed_human_readable |
There was a problem hiding this comment.
Since you're using this just to get the counts, you can filter data_quality_results and use that for the counts you need. Reading from ADLS directly is not required.
| from src.utils.dq_kit_generator import generate_dq_kit_zip_bytes | ||
|
|
||
| # `geolocation_school_map` is consumed only as a dependency marker. | ||
| _ = geolocation_school_map |
There was a problem hiding this comment.
You don't need to do this. We can leave it unused in the function.
There was a problem hiding this comment.
@brianmusisi Done. I also refactored the map_generator.py file.
What type of PR is this?
build: Commits that affect build components like build tool, dependencies, project version
chore: Miscellaneous commits (e.g. modifying .gitignore)
ci: Commits are special build commits that affect the CI/CD pipeline
docs: Commits that affect documentation only
feat: Commits that add a new feature
fix: Commits that fix a bug
perf: Commits are special refactor commits that improve performance
refactor: Commits that rewrite/restructure your code, however does not change any behaviour
revert: Commits that revert another commit/PR, usually can be autogenerated on GitHub or using git revert
style: Commits are special refactor commits that edit the code to comply with a code style, linter, or formatter
test: Commits that add missing tests or correcting existing tests
Summary
What does this PR do
How to test
Link to Jira/Asana/Airtable task (if applicable)
placeholder
Wireframe screenshot/screencap (if applicable)
placeholder
Implementation screenshot/screencap (if applicable)
placeholder