Feature/dq#444
brianmusisi
left a comment
I've added a bunch of comments. In the end, the places where I've written "We need this"
are the only areas we need to change.
| from pyspark.sql.types import NullType, StructType | ||
| from sqlalchemy import select, update | ||
| from src.constants import DataTier | ||
| from src.data_quality_checks.dq_context import DQContext, DQMode |
No changes in this file or this flow
| dt = DeltaTable.forName(s, f"school_master.{config.country_code}") | ||
| master = dt.toDF() | ||
|
|
||
| dq_context = DQContext( |
Same here, no changes in this file or this flow
| TimestampType, | ||
| ) | ||
| from src.constants import DataTier, constants | ||
| from src.data_quality_checks.dq_context import DQContext, DQMode |
No changes to this file/flow, this only affects school_geolocation
| ) | ||
| columns = get_schema_columns(s, f"coverage_{source}") | ||
| df = add_missing_columns(df, columns) | ||
|
|
| else: | ||
| silver = s.createDataFrame(s.sparkContext.emptyRDD(), schema=schema) | ||
|
|
||
| dq_context = DQContext( |
| country_code = filename_components.country_code | ||
| metadata = adls_file_client.fetch_metadata_for_blob(adls_filepath) or {} | ||
|
|
||
| # Only process files with dq_mode="master" for the full merge pipeline. |
This is not relevant here, because the approval workflow from Giga Sync only sends data that can be merged. The dq_mode distinction only applies before data has been sent for approval.
| size = properties.size | ||
|
|
||
| run_key = str(path) | ||
| try: |
It's not clear what this is doing
|
|
||
| renamed_bronze = casted_bronze.withColumnRenamed("signature", "dq_signature") | ||
|
|
||
| dq_context = DQContext( |
None of the DQContext changes seem necessary in this PR.
| elif dataset_type == "geolocation": | ||
| if mode == UploadMode.CREATE.value: | ||
|
|
||
| elif dq_context.dataset_type == "geolocation": |
We need this:
Here, we should only run this if the DQMode is "master".
Alternatively, set the resulting is_not_create and is_not_update columns to 0 so that they don't cause critical errors when DQMode is not master.
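A rough sketch of the two options, using plain Python dicts in place of Spark rows so that it runs on its own. The `DQMode` values, the `school_id` key, the `silver_ids` lookup, the `neutralize` flag, and the meaning of the flags (1 = row fails the check) are all assumptions made for illustration, not the project's actual code.

```python
from enum import Enum


class DQMode(Enum):
    # Hypothetical values; the real enum lives in src.data_quality_checks.dq_context.
    MASTER = "master"
    UPLOADED = "uploaded"


def flag_create_update(rows, silver_ids, dq_mode, neutralize=False):
    """Return copies of rows with is_not_create / is_not_update flags.

    Assumed semantics: 1 means the row fails the check, 0 means it passes.
    """
    out = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        if dq_mode is DQMode.MASTER:
            # Cross-check against silver only in master mode.
            exists = row["school_id"] in silver_ids
            row["is_not_create"] = int(exists)      # already in silver: not a valid create
            row["is_not_update"] = int(not exists)  # missing from silver: not a valid update
        elif neutralize:
            # Second option: keep the columns but force them to 0 so they
            # can never raise a critical error outside master mode.
            row["is_not_create"] = 0
            row["is_not_update"] = 0
        # First option (neutralize=False): skip the check and add no columns.
        out.append(row)
    return out
```

With the first option, any downstream code that reads these columns has to tolerate them being absent; with the second, the columns always exist but carry no information outside master mode.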
| schema = StructType(columns) | ||
|
|
||
| if check_table_exists(s, schema_name, country_code, DataTier.SILVER): | ||
| if dq_mode == DQMode.MASTER and check_table_exists( |
We need this:
This requires no change
brianmusisi
left a comment
Please make the requested changes
brianmusisi
left a comment
One thing to add: when running on just the uploaded file, geolocation_staging shouldn't be run. We don't want to add data to the staging tables for approval.
@brianmusisi changes done.
da88992 to f58f55b
| # For assessment-only (uploaded mode), skip cross-checks against silver | ||
| # but ensure the columns exist to avoid downstream errors. | ||
| if dq_context.upload_mode == UploadMode.CREATE.value: | ||
| df = df.withColumn("dq_is_not_create", f.lit(0)) | ||
| elif dq_context.upload_mode == UploadMode.UPDATE.value: | ||
| df = df.withColumn("dq_is_not_update", f.lit(0)) | ||
|
|
We actually don't want to do this, so we can skip it
|
|
||
| def row_level_checks_internal( | ||
| df: sql.DataFrame, | ||
| dq_context: DQContext, |
Because dq_context has been added as a parameter, this will break the existing uses of this function in row_level_checks for the master and reference checks.
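One possible way to keep the existing callers working is to make the new parameter optional, sketched below. The full signature of row_level_checks_internal is not shown in this thread, so the parameter list, the `DQContext` stand-in, and the returned dict are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DQContext:
    # Hypothetical stand-in; the real class lives in src.data_quality_checks.dq_context.
    dataset_type: str
    dq_mode: str = "master"


def row_level_checks_internal(rows, dataset_type: str,
                              dq_context: Optional[DQContext] = None):
    """Assumed signature: dq_context defaults to None so older calls keep working."""
    if dq_context is None:
        # Callers that predate dq_context get the previous (master) behaviour.
        dq_context = DQContext(dataset_type=dataset_type)
    return {
        "dataset_type": dq_context.dataset_type,
        "dq_mode": dq_context.dq_mode,
        "row_count": len(rows),
    }


# An existing-style call without dq_context still works:
legacy = row_level_checks_internal([1, 2], "master")
# A new-style call passes the context explicitly:
scoped = row_level_checks_internal(
    [1], "geolocation", dq_context=DQContext("geolocation", "uploaded")
)
```

The alternative is to update every call site in row_level_checks to pass a context, which is more explicit but touches the master and reference flows.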
| context.log.info(f"FILE: {path}") | ||
| yield RunRequest( | ||
| run_key=str(path), | ||
| run_key=f"{path}:{last_modified}:{dq_triggered_at}", |
Changing the run key will make all previously uploaded files run again, since they will look new. I have seen this happen on dev already. Whatever we do, we should not change this.
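To illustrate why: Dagster uses run_key for idempotence, so a sensor does not launch a second run for a key it has already used. The simulation below models that with a plain set and does not import Dagster; the file paths, dates, and the `new_run_keys` helper are made up for the example.

```python
def new_run_keys(files, seen_run_keys, make_key):
    """Return the run keys a sensor would launch, skipping keys already seen."""
    launched = []
    for path, last_modified in files:
        key = make_key(path, last_modified)
        if key not in seen_run_keys:
            launched.append(key)
    return launched


# Two files that were already processed under the old key format.
files = [("raw/a.csv", "2024-01-01"), ("raw/b.csv", "2024-01-02")]
seen = {"raw/a.csv", "raw/b.csv"}


def old_format(path, last_modified):
    return str(path)


def new_format(path, last_modified):
    return f"{path}:{last_modified}"


unchanged = new_run_keys(files, seen, old_format)   # nothing relaunches
relaunched = new_run_keys(files, seen, new_format)  # every file looks new
```

Under the old format nothing is relaunched, while under the new format none of the keys match the recorded ones, so both files are picked up again.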
brianmusisi
left a comment
Take a look at the comments I've added
What type of PR is this?
build: Commits that affect build components like build tool, dependencies, project version
chore: Miscellaneous commits (e.g. modifying .gitignore)
ci: Commits are special build commits that affect the CI/CD pipeline
docs: Commits that affect documentation only
feat: Commits that add a new feature
fix: Commits that fix a bug
perf: Commits are special refactor commits that improve performance
refactor: Commits that rewrite/restructure your code, however does not change any behaviour
revert: Commits that revert another commit/PR, usually can be autogenerated on GitHub or using git revert
style: Commits are special refactor commits that edit the code to comply with a code style, linter, or formatter
test: Commits that add missing tests or correcting existing tests
Summary
What does this PR do
How to test