A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
It's helpful for:
- Precision and Recall: Word matching is a retrieval process. LOGICAL matching improves precision, while TEXT VARIATIONS matching improves recall.
- Content Filtering: Detecting and filtering out offensive or sensitive words.
- Search Engines: Improving search results by identifying relevant keywords.
- Text Analysis: Extracting specific information from large volumes of text.
- Spam Detection: Identifying spam content in emails or messages.
- ···
Construction
┌─────────────────────────────────────────────────────────────┐
│ Rules ──▶ parse & dedup ──▶ transform trie ──▶ AC automata │
└─────────────────────────────────────────────────────────────┘
Query
┌─────────────────────────────────────────────────────────────┐
│ Text ──▶ walk trie ──▶ scan variants ──▶ evaluate ──▶ hits │
│              │               │                             │
│       transform text   AC automaton                        │
│       (reuse shared    (bytewise or                        │
│        prefixes)        charwise)                          │
└─────────────────────────────────────────────────────────────┘
All sub-patterns are deduplicated into a single Aho-Corasick automaton for O(N) text scanning. Text transformations share a prefix trie so VariantNorm|Delete reuses the VariantNorm result. For simple literal matchers without transforms, is_match delegates directly to the AC automaton — skipping TLS state setup entirely.
For the full narrative walkthrough, see the Design Document.
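The prefix sharing described above can be illustrated with a small, standard-library-only sketch. It is not the library's implementation: the `Step` enum, `apply`, and `transform_all` are hypothetical names, the two steps are simplified stand-ins for real transforms, and a `HashMap` keyed by chain prefix takes the place of a real trie.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for two transform steps (not the library's types).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Step {
    Lowercase,   // stand-in for a normalization-style step
    DeletePunct, // stand-in for a Delete-style step
}

/// Applies a single step to `text`.
fn apply(step: Step, text: &str) -> String {
    match step {
        Step::Lowercase => text.to_lowercase(),
        Step::DeletePunct => text.chars().filter(|c| c.is_alphanumeric()).collect(),
    }
}

/// Produces one text variant per chain, computing every shared prefix only once.
/// Returns the variants and the number of step applications actually performed.
fn transform_all(text: &str, chains: &[Vec<Step>]) -> (Vec<String>, usize) {
    // Cache keyed by chain prefix; the empty prefix is the raw input.
    let mut cache: HashMap<Vec<Step>, String> = HashMap::new();
    cache.insert(Vec::new(), text.to_string());

    let mut applied = 0;
    let mut out = Vec::with_capacity(chains.len());
    for chain in chains {
        for len in 1..=chain.len() {
            let prefix = &chain[..len];
            if !cache.contains_key(prefix) {
                // Build on the result of the one-step-shorter prefix.
                let prev = cache[&chain[..len - 1]].clone();
                cache.insert(prefix.to_vec(), apply(chain[len - 1], &prev));
                applied += 1;
            }
        }
        out.push(cache[chain.as_slice()].clone());
    }
    (out, applied)
}
```

With the chains `[Lowercase]` and `[Lowercase, DeletePunct]`, the second chain reuses the cached `Lowercase` result, so only two step applications are performed instead of three.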
- Text Transformation:
  - VariantNorm: CJK variant normalization.
    Example: 測試 -> 测试
  - Delete: Remove specific characters.
    Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters.
    Example: ABⅣ①℉ -> ab41°f
  - Romanize: Convert CJK characters to space-separated romanized form (Pinyin, Romaji, RR) for fuzzy matching.
    Example: 西安 -> xi an, matches 洗按 -> xi an, but not 先 -> xian
  - RomanizeChar: Convert CJK characters to romanized form without boundary spaces.
    Example: 西安 -> xian, matches 洗按 and 先 -> xian
  - EmojiNorm: Convert emoji to English words (CLDR short names) and strip modifiers.
    Example: 👍🏽 -> thumbs_up, 🔥 -> fire
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - & (AND): hello&world matches "hello world" and "world,hello"
  - | (OR): color|colour matches "color" and "colour"
  - ~ (NOT): hello~helloo~hhello matches "hello" but not "helloo" and "hhello"
  - \b (word boundary): \bcat\b matches "the cat" but not "concatenate"
  - Repeated segments: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  - Combined: color|colour&bright~dark matches "bright color" but not "dark colour"
- Efficient Handling of Large Word Lists: Optimized for performance.
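To make the AND, NOT, and repetition semantics concrete, here is a minimal sketch using only the standard library. It is an illustration of the rules as listed, not how the library evaluates them: `rule_matches` is a hypothetical helper, it uses plain substring search rather than an automaton, it ignores `|` and `\b`, and it assumes that all NOT segments follow the AND segments and that a single occurrence of a NOT segment vetoes the rule.

```rust
use std::collections::HashMap;

/// Hypothetical helper: evaluates a rule made of `&` (AND) and `~` (NOT)
/// segments against `text` using plain substring search.
fn rule_matches(rule: &str, text: &str) -> bool {
    // Everything before the first `~` is the AND part; the rest are NOT segments.
    let mut parts = rule.split('~');
    let and_part = parts.next().unwrap_or("");

    // Count how many times each AND segment is listed in the rule.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for seg in and_part.split('&').filter(|s| !s.is_empty()) {
        *required.entry(seg).or_insert(0) += 1;
    }

    // Each AND segment must occur at least as often as it is listed.
    // `str::matches` counts non-overlapping occurrences.
    let and_ok = required
        .iter()
        .all(|(seg, &need)| text.matches(*seg).count() >= need);

    // Assumption: any occurrence of a NOT segment vetoes the rule.
    let not_ok = parts
        .filter(|s| !s.is_empty())
        .all(|seg| !text.contains(seg));

    and_ok && not_ok
}
```

Under these assumptions, `rule_matches("无&法&无&天", "无无法天")` is true because 无 occurs twice, while `rule_matches("无&法&无&天", "无法天")` is false.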
Rust
# Cargo.toml
[dependencies]
matcher_rs = "0.15"

use matcher_rs::{ProcessType, SimpleMatcherBuilder};

let matcher = SimpleMatcherBuilder::new()
    .add_word(ProcessType::None, 1, "hello&world") // AND: both must appear
    .add_word(ProcessType::None, 2, "color|colour") // OR: either spelling
    .add_word(ProcessType::None, 3, r"\bcat\b") // word boundary
    .build()
    .unwrap();
assert!(matcher.is_match("hello, world!"));
assert!(matcher.is_match("nice colour"));
assert!(!matcher.is_match("concatenate")); // "cat" not a whole word

See the Rust README for full docs.
Python
pip install matcher_py

from matcher_py import ProcessType, SimpleMatcherBuilder

builder = SimpleMatcherBuilder()
builder.add_word(ProcessType.NONE, 1, "hello&world") # AND: both must appear
builder.add_word(ProcessType.NONE, 2, "color|colour") # OR: either spelling
builder.add_word(ProcessType.NONE, 3, r"\bcat\b") # word boundary
matcher = builder.build()
assert matcher.is_match("hello, world!")
assert matcher.is_match("nice colour")
assert not matcher.is_match("concatenate")  # "cat" not a whole word

See the Python README for full docs.
Java
import com.matcherjava.SimpleMatcher;
import com.matcherjava.SimpleMatcherBuilder;
import com.matcherjava.extensiontypes.ProcessType;
try (SimpleMatcher matcher = new SimpleMatcherBuilder()
        .add(ProcessType.NONE, 1, "hello&world") // AND: both must appear
        .add(ProcessType.NONE, 2, "color|colour") // OR: either spelling
        .add(ProcessType.NONE, 3, "\\bcat\\b") // word boundary
        .build()) {
    assert matcher.isMatch("hello, world!");
    assert matcher.isMatch("nice colour");
    assert !matcher.isMatch("concatenate"); // "cat" not a whole word
}

See the Java README for full docs.
C
#include "matcher_c.h"
void* builder = init_simple_matcher_builder();
simple_matcher_builder_add_word(builder, PROCESS_TYPE_NONE, 1, "hello&world"); // AND
simple_matcher_builder_add_word(builder, PROCESS_TYPE_NONE, 2, "color|colour"); // OR
simple_matcher_builder_add_word(builder, PROCESS_TYPE_NONE, 3, "\\bcat\\b"); // word boundary
void* matcher = simple_matcher_builder_build(builder);
simple_matcher_is_match(matcher, "hello, world!"); // true — AND
simple_matcher_is_match(matcher, "nice colour"); // true — OR
simple_matcher_is_match(matcher, "concatenate"); // false — word boundary
drop_simple_matcher(matcher);

See the C README for full docs.
git clone https://github.com/Lips7/Matcher.git
cd Matcher
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly -y
just build

This builds all packages and copies the dynamic libraries to the right locations. You can also run cargo build --release directly; the C and Java libraries will be in target/release/:
- libmatcher_c.so / libmatcher_c.dylib / matcher_c.dll
- libmatcher_java.so / libmatcher_java.dylib / matcher_java.dll
- EmojiNorm + Delete don't compose: Delete strips emoji codepoints before EmojiNorm can convert them to words. Use EmojiNorm | Normalize instead.
- Romanize vs RomanizeChar: Romanize adds boundary spaces (西安 → xi an) so homophones like 洗按 match but 先 doesn't. RomanizeChar omits spaces (xian) for fuzzier matching.
- None is standalone-only: None means "match raw input, no transforms." Combining it with any transform is redundant and the None bit is silently stripped. Delete already scans both the original and delete-transformed text (since patterns are stored verbatim and may contain deletable characters).
- Repeated AND segments count repetitions: 无&法&无&天 requires 无 to appear at least twice in the text.
- \b is per-sub-pattern, not per-rule: \bcat\b&dog requires "cat" as a whole word but "dog" as a substring.
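
The difference between Romanize and RomanizeChar can be sketched with a toy example. Everything here is hypothetical: `pinyin` is a five-entry table invented for illustration, `romanize` is not the library's function, and the sketch assumes that the spaced form marks each syllable with a single leading space (the exact spacing used by the library may differ).

```rust
/// Hypothetical five-entry romanization table; a real table covers far more characters.
fn pinyin(c: char) -> Option<&'static str> {
    match c {
        '西' | '洗' => Some("xi"),
        '安' | '按' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

/// Romanizes `text`. With `spaced`, every syllable gets a leading boundary space
/// (Romanize-style); without it, syllables are concatenated (RomanizeChar-style).
fn romanize(text: &str, spaced: bool) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match pinyin(c) {
            Some(syllable) => {
                if spaced {
                    out.push(' ');
                }
                out.push_str(syllable);
            }
            // Characters outside the table pass through unchanged.
            None => out.push(c),
        }
    }
    out
}
```

In the spaced form, 西安 and 洗按 both become " xi an", which is not a substring of " xian" (先), so the syllable boundary is preserved. In the unspaced form all three inputs collapse to "xian", which is why RomanizeChar matches more loosely.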
Please refer to benchmarks for details.