-
-
Notifications
You must be signed in to change notification settings - Fork 96
fix(scripts): return None from get_docx_document on missing template #2546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 28 commits
edc7e2e
34d3910
6a8ef05
879d567
ecb4357
299d536
9595f2c
45d2071
4aab375
28b8ae8
53fe8d8
016b6bf
508bd15
1ea5e89
b7f8722
a6135eb
2728bc0
d392d63
79cba48
37a4f4c
68a9f58
9080fc0
df47860
9529b67
ea5e479
58e5326
69d67b5
794d15c
95c5789
b3c417d
165d873
4ea8948
27c20c4
3d4adf9
3c16538
a021076
715322e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert this as well. |
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert. |
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please revert this. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| ## Describe the bug | ||
|
|
||
| In `scripts/capec_map_enricher.py`, the `extract_capec_names()` function correctly guards against missing top-level keys (`Attack_Pattern_Catalog`, `Attack_Patterns`, `Attack_Pattern`) but then directly accesses `catalog["Categories"]["Category"]` without any guard check. If the CAPEC JSON data has no `"Categories"` key, the script crashes immediately with an unhandled `KeyError`. | ||
|
|
||
| ## Affected file | ||
|
|
||
| `scripts/capec_map_enricher.py` — `extract_capec_names()` function | ||
|
|
||
| ## Code snippet (problematic section) | ||
|
|
||
| ```python | ||
| # After all the earlier guards for Attack_Pattern... | ||
| # Lines ~62-67: NO guard for "Categories" key | ||
| categories = catalog["Categories"]["Category"] # KeyError if "Categories" is absent | ||
| for category in categories: | ||
| if "_ID" in category and "_Name" in category: | ||
| capec_id = int(category["_ID"]) | ||
| capec_name = category["_Name"] | ||
| capec_names[capec_id] = capec_name | ||
| ``` | ||
|
|
||
| ## Expected behavior | ||
|
|
||
| The function should check for the existence of `"Categories"` and `"Category"` before accessing them, consistent with the defensive guards already applied to `Attack_Pattern_Catalog`, `Attack_Patterns`, and `Attack_Pattern` earlier in the same function. A warning should be logged if absent rather than raising an unhandled exception. | ||
|
|
||
| ## Proposed fix | ||
|
|
||
| ```python | ||
| if "Categories" not in catalog: | ||
| logging.warning("No 'Categories' key found in catalog") | ||
| else: | ||
| categories_section = catalog["Categories"] | ||
| if "Category" not in categories_section: | ||
| logging.warning("No 'Category' key found in categories section") | ||
| elif not isinstance(categories_section["Category"], list): | ||
| logging.warning("'Category' is not a list") | ||
| else: | ||
| for category in categories_section["Category"]: | ||
| if "_ID" in category and "_Name" in category: | ||
| capec_id = int(category["_ID"]) | ||
| capec_name = category["_Name"] | ||
| capec_names[capec_id] = capec_name | ||
| ``` | ||
|
|
||
| ## Steps to reproduce | ||
|
|
||
| 1. Provide a `3000.json` where `Attack_Pattern_Catalog` has no `"Categories"` key | ||
| 2. Run: `python scripts/capec_map_enricher.py` | ||
| 3. Observe unhandled `KeyError: 'Categories'` | ||
|
|
||
| ## Additional context | ||
|
|
||
| This is inconsistent with the existing defensive coding style used for all preceding key accesses in the same function. The fix should follow the same pattern already established. |
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| ## Describe the bug | ||
|
|
||
| In `scripts/check_translations.py`, the `get_file_groups()` method splits filenames by `-` and takes `parts[-1]` as the language code, then only includes files where `len(lang) == 2`. This logic breaks entirely for multi-component locale codes like `no-nb`, `pt-pt`, and `pt-br` — which are all listed as supported languages in `LANGUAGE_CHOICES` in `convert.py`. | ||
|
|
||
| ## Affected file | ||
|
|
||
| `scripts/check_translations.py` — `get_file_groups()` method | ||
|
|
||
| ## Code snippet (problematic section) | ||
|
|
||
| ```python | ||
| def get_file_groups(self) -> Dict[str, List[Path]]: | ||
| file_groups = defaultdict(list) | ||
| for yaml_file in self.source_dir.glob("*-*.yaml"): | ||
| parts = yaml_file.stem.split("-") | ||
| if len(parts) >= 3: | ||
| lang = parts[-1] # <-- for 'webapp-cards-2.2-no-nb.yaml', lang = 'nb' (WRONG) | ||
| base_name = "-".join(parts[:-1]) # base_name = 'webapp-cards-2.2-no' (WRONG) | ||
| if "cards" in base_name and len(lang) == 2: # 'nb' passes len check, but base_name is corrupt | ||
| file_groups[base_name].append(yaml_file) | ||
| return file_groups | ||
| ``` | ||
|
|
||
| ## What happens for `webapp-cards-2.2-no-nb.yaml` | ||
|
|
||
| - `parts` = `['webapp', 'cards', '2.2', 'no', 'nb']` | ||
| - `lang` = `'nb'` (should be `'no-nb'`) | ||
| - `base_name` = `'webapp-cards-2.2-no'` (should be `'webapp-cards-2.2'`) | ||
| - This file gets placed into its own broken group `'webapp-cards-2.2-no'`, never compared against the English reference `'webapp-cards-2.2'` | ||
|
|
||
| The same problem affects `pt-pt` (lang=`'pt'`, base misidentified) and `pt-br` (lang=`'br'`). | ||
|
|
||
| ## Expected behavior | ||
|
|
||
| Files with multi-component locale codes (`no-nb`, `pt-pt`, `pt-br`) should be correctly grouped with their English base counterpart. The language code should be extracted as the full locale string (e.g., `no-nb`), not just the last hyphen-separated segment. | ||
|
|
||
| ## Proposed fix | ||
|
|
||
| Detect multi-component locale codes explicitly before splitting, for example by checking a known set of multi-part locales: | ||
|
|
||
| ```python | ||
| MULTI_PART_LOCALES = {"no-nb", "pt-pt", "pt-br"} | ||
|
|
||
| def get_file_groups(self) -> Dict[str, List[Path]]: | ||
| file_groups = defaultdict(list) | ||
| for yaml_file in self.source_dir.glob("*-*.yaml"): | ||
| if "archive" in str(yaml_file): | ||
| continue | ||
| stem = yaml_file.stem | ||
| lang = None | ||
| base_name = None | ||
| for locale in MULTI_PART_LOCALES: | ||
| if stem.endswith("-" + locale): | ||
| lang = locale | ||
| base_name = stem[: -(len(locale) + 1)] | ||
| break | ||
| if lang is None: | ||
| parts = stem.split("-") | ||
| if len(parts) >= 3: | ||
| lang = parts[-1] | ||
| base_name = "-".join(parts[:-1]) | ||
| if base_name and lang and "cards" in base_name and (len(lang) == 2 or lang in MULTI_PART_LOCALES): | ||
| file_groups[base_name].append(yaml_file) | ||
| return file_groups | ||
| ``` | ||
|
|
||
| ## Additional context | ||
|
|
||
| `ConvertVars.LANGUAGE_CHOICES` in `convert.py` explicitly includes `"no-nb"`, `"pt-pt"`, and `"pt-br"`. The translation checker therefore silently skips validation for these locales even though they are officially supported. |
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| ## Describe the bug | ||
|
|
||
| In `scripts/convert.py`, the `get_docx_document()` function is supposed to open and return a `.docx` template file. When the file does not exist, it logs an error but **returns a blank `docx.Document()` instead of `None`**. Since a blank `docx.Document()` is always truthy in Python, the caller's `if doc:` check always passes — causing the script to silently replace text in and save an empty document to the output path, producing a corrupted zero-content output file with no further error. | ||
|
|
||
| ## Affected file | ||
|
|
||
| `scripts/convert.py` — `get_docx_document()` and its call site in `create_edition_from_template()` | ||
|
|
||
| ## Code snippet | ||
|
|
||
| ```python | ||
| def get_docx_document(docx_file: str) -> Any: | ||
| import docx | ||
| if os.path.isfile(docx_file): | ||
| return docx.Document(docx_file) | ||
| else: | ||
| logging.error("Could not find file at: %s", str(docx_file)) | ||
| return docx.Document() # <-- blank Document, always truthy! | ||
|
|
||
| # Call site: | ||
| doc = get_docx_document(template_doc) | ||
| if doc: # always True — even for a blank Document returned on error | ||
| doc = replace_docx_inline_text(doc, language_dict) | ||
| doc.save(output_file) # saves an empty file with no content | ||
| ``` | ||
|
|
||
| ## Expected behavior | ||
|
|
||
| When the template file does not exist: | ||
| - `get_docx_document()` should return `None` | ||
| - The caller should detect `None` and skip the output file creation, or raise a clear error | ||
|
|
||
| ## Proposed fix | ||
|
|
||
| ```python | ||
| def get_docx_document(docx_file: str) -> Any: | ||
| import docx | ||
| if os.path.isfile(docx_file): | ||
| return docx.Document(docx_file) | ||
| else: | ||
| logging.error("Could not find file at: %s", str(docx_file)) | ||
| return None # caller already checks `if doc:` | ||
| ``` | ||
|
|
||
| ## Steps to reproduce | ||
|
|
||
| 1. Specify a non-existent or missing template `.docx` file path | ||
| 2. Run `python scripts/convert.py -lt cards -l en -v 2.2` | ||
| 3. No error is raised; an empty `.docx` file is written to the output folder | ||
|
|
||
| ## Additional context | ||
|
|
||
| The docstring does not indicate that a blank fallback document will be returned on failure, making this behavior surprising and hard to diagnose. Returning `None` is cleaner, consistent with the existing `if doc:` guard in the caller, and avoids silently writing corrupt output files. |
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. okay, giveme a min |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| ## Describe the bug | ||
|
|
||
| In `scripts/convert_asvs.py`, the `create_level_summary()` function opens a file using `f = open(...)` without a `with` context manager. If any exception is raised inside the loop body (e.g., a missing dictionary key during iteration), the file handle `f` is never explicitly closed, leaking the OS file descriptor. On batch runs that generate many ASVS taxonomy pages, this can exhaust system file handle limits. | ||
|
|
||
| ## Affected file | ||
|
|
||
| `scripts/convert_asvs.py` — `create_level_summary()` function (and similar patterns elsewhere in the file) | ||
|
|
||
| ## Code snippet (problematic section) | ||
|
|
||
| ```python | ||
| def create_level_summary(level: int, arr: List[dict[str, Any]]) -> None: | ||
| topic = "" | ||
| category = "" | ||
| os.mkdir(Path(convert_vars.args.output_path, f"level-{level}-controls")) | ||
| f = open(Path(convert_vars.args.output_path, f"level-{level}-controls/index.md"), "w", encoding="utf-8") | ||
| # ^ File opened without 'with' — if any exception occurs below, f is never closed | ||
| f.write(f"# Level {level} controls\n\n") | ||
| f.write(f"Level {level} contains {len(arr)} controls listed below: \n\n") | ||
| for link in arr: | ||
| if link["topic"] != topic: # KeyError here would leak file handle | ||
| topic = link["topic"] | ||
| f.write(f"## {topic}\n\n") | ||
| ... | ||
| f.close() # only reached if no exception — not guaranteed | ||
| ``` | ||
|
|
||
| ## Expected behavior | ||
|
|
||
| All file handles opened by the script should use `with open(...) as f:` context manager so that the file is guaranteed to be closed even if an exception is raised within the block. | ||
|
|
||
| ## Proposed fix | ||
|
|
||
| ```python | ||
| def create_level_summary(level: int, arr: List[dict[str, Any]]) -> None: | ||
| topic = "" | ||
| category = "" | ||
| os.mkdir(Path(convert_vars.args.output_path, f"level-{level}-controls")) | ||
| with open(Path(convert_vars.args.output_path, f"level-{level}-controls/index.md"), "w", encoding="utf-8") as f: | ||
| f.write(f"# Level {level} controls\n\n") | ||
| f.write(f"Level {level} contains {len(arr)} controls listed below: \n\n") | ||
| for link in arr: | ||
| if link["topic"] != topic: | ||
| topic = link["topic"] | ||
| f.write(f"## {topic}\n\n") | ||
| ... | ||
| ``` | ||
|
|
||
| The same pattern should be applied to other `open()` calls in `scripts/convert_asvs.py` and `scripts/convert_capec.py` that do not use `with` statements. | ||
|
|
||
| ## Additional context | ||
|
|
||
| Python's `with` statement is the standard and recommended way to handle file I/O. Using bare `open()`/`f.close()` pairs is error-prone; if any code path between `open()` and `close()` raises an exception, the file is never properly flushed and closed. This is a well-known resource leak anti-pattern (CWE-775: Missing Release of File Descriptor or Handle after Effective Lifetime). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1,4 @@ | ||
| [mypy] | ||
|
|
||
| [mypy-docx] | ||
| ignore_missing_imports = True |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -11,7 +11,7 @@ | |
| import zipfile | ||
| import xml.etree.ElementTree as ElTree | ||
| from defusedxml import ElementTree as DefusedElTree | ||
| from typing import Any, Dict, List, Tuple, cast | ||
| from typing import Any, Dict, List, Optional, Tuple, cast | ||
| from operator import itemgetter | ||
| from itertools import groupby | ||
| from pathlib import Path | ||
|
|
@@ -381,9 +381,12 @@ def create_edition_from_template( | |
| if file_extension == ".docx": | ||
| # Get the input (template) document | ||
| doc = get_docx_document(template_doc) | ||
| if doc: | ||
| if doc is not None: | ||
| doc = replace_docx_inline_text(doc, language_dict) | ||
| doc.save(output_file) | ||
| else: | ||
| logging.error("Cannot create output file: template document not found at %s", template_doc) | ||
| return | ||
|
Comment on lines
+384
to
+389
|
||
| else: | ||
| save_odt_file(template_doc, language_dict, output_file) | ||
|
|
||
|
|
@@ -648,15 +651,14 @@ def get_document_paragraphs(doc: Any) -> List[Any]: | |
| return paragraphs | ||
|
|
||
|
|
||
| def get_docx_document(docx_file: str) -> Any: | ||
| """Open the file and return the docx document.""" | ||
| import docx # type: ignore | ||
| def get_docx_document(docx_file: str) -> Optional[Any]: | ||
| """Open the file and return the docx document, or None if the file is not found.""" | ||
| import docx | ||
|
|
||
| if os.path.isfile(docx_file): | ||
| return docx.Document(docx_file) | ||
| else: | ||
| logging.error("Could not find file at: %s", str(docx_file)) | ||
| # Create a blank document if it fails | ||
| return docx.Document() | ||
|
|
||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this.