Account for multiple filters matching on the same token#105
Account for multiple filters matching on the same token#105stevepolitodesign wants to merge 3 commits into
Conversation
| let(:austin_location) { build_entity(text: "Austin", tag: :location) } | ||
|
|
||
| before do | ||
| stub_ner_entities(austin_person, austin_location) |
There was a problem hiding this comment.
I should check to see what happens if we swap the order:
| stub_ner_entities(austin_person, austin_location) | |
| stub_ner_entities(austin_location, austin_person) |
There was a problem hiding this comment.
What does _ner_ mean in the method name?
There was a problem hiding this comment.
Named entity recognition. I learned about it from using MITIE Ruby.
There was a problem hiding this comment.
Got it! You've probably explained it before, which makes me think, could this method use the full name instead of the acronym so it's more accessible to the reader?
There was a problem hiding this comment.
In this case I wanted to be consistent with how MITIE Ruby names things, since it has a Mitie::NER class.
| result = TopSecret::Text.filter( | ||
| "Primary 192.168.1.1, backup 192.168.1.1.", | ||
| custom_filters: [ip_filter, server_filter] | ||
| ) |
There was a problem hiding this comment.
Same here.
| result = TopSecret::Text.filter( | |
| "Primary 192.168.1.1, backup 192.168.1.1.", | |
| custom_filters: [ip_filter, server_filter] | |
| ) | |
| result = TopSecret::Text.filter( | |
| "Primary 192.168.1.1, backup 192.168.1.1.", | |
| custom_filters: [server_filter, ip_filter] | |
| ) |
There was a problem hiding this comment.
Pull request overview
This PR updates TopSecret::Text redaction so that when the same token is matched by multiple filters, each occurrence can be labeled according to the filter that matched it (rather than having the last-applied filter label overwrite all occurrences).
Changes:
- Added new specs covering “same value matched by multiple filters” for both NER and regex filters.
- Replaced the previous one-pass substitution logic with a new
TopSecret::Text::Substitutionhelper. - Documented the fix in the Unreleased changelog.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
spec/top_secret/text_spec.rb |
Adds regression tests for multi-filter / same-token matching cases. |
lib/top_secret/text/substitution.rb |
Introduces new substitution strategy for handling multiple labels per value. |
lib/top_secret/text.rb |
Routes substitution through the new Substitution class. |
CHANGELOG.md |
Notes the behavioral fix under Unreleased. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return labels.last(1) if occurrences.zero? | ||
|
|
||
| labels.last(occurrences) |
This edge-case was uncovered when running the `/review` command.
| ip_filter = TopSecret::Filters::Regex.new( | ||
| label: "IP_ADDRESS", | ||
| regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/ | ||
| ) | ||
| server_filter = TopSecret::Filters::Regex.new( | ||
| label: "SERVER", | ||
| regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/ | ||
| ) |
There was a problem hiding this comment.
To make it clearer that these are identical expressions, what do you think of this?
| ip_filter = TopSecret::Filters::Regex.new( | |
| label: "IP_ADDRESS", | |
| regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/ | |
| ) | |
| server_filter = TopSecret::Filters::Regex.new( | |
| label: "SERVER", | |
| regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/ | |
| ) | |
| regex = /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/ | |
| ip_filter = TopSecret::Filters::Regex.new( | |
| label: "IP_ADDRESS", | |
| regex: | |
| ) | |
| server_filter = TopSecret::Filters::Regex.new( | |
| label: "SERVER", | |
| regex: | |
| ) |
Relates to: #51
Prior to this commit, if multiple filters matched the same token, filter
precedence would account for the labeling:
This commit ensures each filter is labeled correctly.