
ignore: scope compiled parent matchers by root#3420

Merged
BurntSushi merged 2 commits into
BurntSushi:masterfrom
jelle-openai:jelle/codex/ignore-multiroot-determinism
Jun 5, 2026

Conversation

@jelle-openai

Fixes #3419.

Comment thread crates/ignore/src/dir.rs Outdated
};
let mut compiled = self.0.compiled.write().unwrap();
if let Some(weak) = compiled.get(parent.as_os_str()) {
if let Some(weak) = compiled.get(&key) {
Collaborator

I'm not very familiar with this code base, but this makes me wonder if it's still worth caching here.

For the roots /tmp/project/src and /tmp/project/tests:

  • Before: The entry for /tmp/project was shared
  • After: Each root uses its own pre-built Ignore, making the caching here useless except for the case where someone calls add_parents twice with the exact same paths

The caching here also served two purposes:

  • Avoid parsing the same ignore files multiple times
  • Deduplicate errors (because we only built the ignore once)
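Both purposes follow from keying the cache by directory alone. A minimal sketch of that idea, assuming a hypothetical `ParentCache` and `CompiledDir` (neither is the crate's real type) and a plain line split standing in for gitignore parsing:

```rust
use std::collections::HashMap;
use std::ffi::OsString;
use std::path::Path;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock, Weak};

/// Hypothetical stand-in for the compiled matcher of one directory.
struct CompiledDir {
    patterns: Vec<String>,
}

/// Hypothetical cache keyed by directory only, not by walk root.
#[derive(Default)]
struct ParentCache {
    compiled: RwLock<HashMap<OsString, Weak<CompiledDir>>>,
    /// Counts how often the "parse" step actually ran.
    parses: AtomicUsize,
}

impl ParentCache {
    fn get_or_compile(&self, dir: &Path, gitignore: &str) -> Arc<CompiledDir> {
        let mut compiled = self.compiled.write().unwrap();
        // Reuse the entry if some walker still holds a strong reference.
        if let Some(hit) = compiled.get(dir.as_os_str()).and_then(Weak::upgrade) {
            return hit;
        }
        // Cache miss: parse once. Any parse error would be reported here once.
        self.parses.fetch_add(1, Ordering::SeqCst);
        let entry = Arc::new(CompiledDir {
            patterns: gitignore.lines().map(str::to_string).collect(),
        });
        compiled.insert(dir.as_os_str().to_os_string(), Arc::downgrade(&entry));
        entry
    }
}
```

With this shape, two lookups for /tmp/project return the same `Arc` and bump the counter once. If the key also included the root, every new root would be a miss and the file would be parsed again.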

Now, the main reason sharing the Ignore across base paths is incorrect is that we set ignore.absolute_base to Some(absolute_base).

I haven't traced through all the changes necessary, but I think a better solution is to decouple absolute_base from what we store in the cache, e.g. by:

  • Moving absolute_base from IgnoreInner to the Ignore struct (makes cloning more expensive)
  • Use a different struct for caching
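The first option could look roughly like the sketch below. It assumes a hypothetical, stripped-down `IgnoreInner` and `Ignore`; the field layout and the `with_base` and `absolutize` helpers are illustrative and not taken from the crate:

```rust
use std::path::{Path, PathBuf};
use std::sync::Arc;

/// Hypothetical root-independent state: this is what the cache would share.
struct IgnoreInner {
    dir: PathBuf,
}

/// Hypothetical handle: the shared matcher plus a per-root absolute base.
#[derive(Clone)]
struct Ignore {
    inner: Arc<IgnoreInner>,
    absolute_base: Option<Arc<PathBuf>>,
}

impl Ignore {
    /// Pair an already cached matcher with the root currently being walked.
    fn with_base(inner: Arc<IgnoreInner>, base: &Path) -> Ignore {
        Ignore { inner, absolute_base: Some(Arc::new(base.to_path_buf())) }
    }

    /// The directory whose ignore files the shared matcher was built from.
    fn dir(&self) -> &Path {
        &self.inner.dir
    }

    /// Resolve a walk-relative path against this handle's own root.
    fn absolutize(&self, path: &Path) -> PathBuf {
        match &self.absolute_base {
            Some(base) if path.is_relative() => base.join(path),
            _ => path.to_path_buf(),
        }
    }
}
```

Two handles built for /tmp/project/src and /tmp/project/tests then point at the same `IgnoreInner` but resolve `a.rs` to different absolute paths. The cost mentioned above shows up in `Clone`: in this sketch each clone copies two `Arc` pointers instead of one.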

Owner

@BurntSushi BurntSushi May 27, 2026

Yeah I think @MichaReiser is right here. In particular, I think this solution will fall apart when there are a lot of roots. For example, rg foo bar/** where bar contains many paths.

I'm on mobile so I don't have a good idea right now of what the simplest change is yet. But Micha's ideas at a very high level seem plausible.

Author

Thanks, I pushed a followup aligning with Micha's first suggestion. This largely restores the performance of the pre-existing code while still fixing the bug.

Comment thread crates/ignore/src/dir.rs Outdated
Arc<IgnoreInner>,
// Parent matchers are cached independently of the root they are used
// from, but matching them still needs the current root to rewrite paths.
Option<Arc<PathBuf>>,
Collaborator

If using a named field is too disruptive, can we document what root means in this context?

Author

Moved to named fields, wasn't too bad

Comment thread crates/ignore/src/dir.rs Outdated
}
}

/// An iterator over all parents of an ignore matcher, including itself.
Collaborator

This comment here is now misplaced

Author

Thanks, moved

@BurntSushi
Owner

I got this benchmark from Codex that I think I believe. It finds a regression on the initial commit in this PR, but not on master or on the last commit in this PR. Do:

$ mkdir /dev/shm/ripgrep-pr3420
$ cd /dev/shm/ripgrep-pr3420
$ touch create
$ chmod +x create
$ $EDITOR create
$ cat create
#!/usr/bin/env bash

set -euo pipefail

D=./

mkdir -p "$D/.git"

for i in $(seq 1 2000); do
  printf 'ignored_%04d/**\n' "$i"
done > "$D/.gitignore"

for r in $(seq -w 0 49); do
  mkdir -p "$D/root$r/src"
  for f in $(seq -w 0 9); do
    printf 'needle-free content %s %s\n' "$r" "$f" \
      > "$D/root$r/src/file$f.txt"
  done
done
$ ./create
$ tree -a | wc -l
606
$ hyperfine --warmup 3 --runs 10 --ignore-failure '/path/to/master/target/release/rg -n NOMATCH root*' '/path/to/commit-13134/target/release/rg -n NOMATCH root*' '/path/to/commit-3e32/target/release/rg -n NOMATCH root*'

[.. snip ..]

Summary
  /tmp/rg-master-src/target/release/rg -n NOMATCH root* ran
    1.00 ± 0.26 times faster than /tmp/rg-3e32-src/target/release/rg -n NOMATCH root*
   25.51 ± 5.77 times faster than /tmp/rg-13134-src/target/release/rg -n NOMATCH root*

I find this quite promising!

@MichaReiser MichaReiser force-pushed the jelle/codex/ignore-multiroot-determinism branch 2 times, most recently from ca8f3c2 to 9818a2c Compare June 4, 2026 16:23
@MichaReiser MichaReiser force-pushed the jelle/codex/ignore-multiroot-determinism branch from 9818a2c to aff90bc Compare June 4, 2026 16:25
Parent matchers are cached by directory, but they also stored the
canonicalized path passed to `Ignore::add_parents`. Reusing cached
matchers while walking another root could therefore rewrite paths
relative to the wrong root and apply scoped parent gitignore rules
incorrectly.

Keep the cached matcher chain independent of the walk root. Carry the
absolute base path in `Ignore` instead, and add a regression test that
checks cached parent matchers can be reused across roots without sharing
their path semantics.

Fixes BurntSushi#3419
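The failure mode described in this commit message can be illustrated with a deliberately simplified model. `ParentRule` and `absolutize` below are hypothetical and do not reproduce the crate's matching logic; they only show what happens when a walk-relative path is rewritten against another root's base:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, heavily simplified parent rule: ignore everything under
/// `<dir>/<anchored>`, roughly what an anchored line in `<dir>/.gitignore` does.
struct ParentRule {
    dir: PathBuf,
    anchored: PathBuf,
}

impl ParentRule {
    /// Match an absolute path by first making it relative to the rule's dir.
    fn is_ignored(&self, absolute: &Path) -> bool {
        absolute
            .strip_prefix(&self.dir)
            .map(|rel| rel.starts_with(&self.anchored))
            .unwrap_or(false)
    }
}

/// Rewrite a walk-relative path against the given root's absolute base.
fn absolutize(base: &Path, relative: &Path) -> PathBuf {
    base.join(relative)
}
```

Take a rule in /tmp/project anchored at `src/generated` and the relative path `generated/out.rs`. Rewritten against /tmp/project/tests it is kept, but rewritten against a stale /tmp/project/src base it is ignored, even though the file lives under tests. Carrying the base in the per-root handle, as the commit does, avoids the stale case in this model.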
@MichaReiser MichaReiser force-pushed the jelle/codex/ignore-multiroot-determinism branch from aff90bc to b6722c1 Compare June 4, 2026 16:38
Comment thread crates/ignore/src/dir.rs

#[derive(Clone, Copy)]
pub(crate) struct IgnoreRef<'a> {
inner: &'a IgnoreInner,
Collaborator

We use a named field here to reduce churn in the diff

@MichaReiser
Collaborator

I reworded the commits. I also had codex write a few benchmarks and verified that none of them regress (beyond noise). We could consider adding those benchmarks to the repo, but I won't do so as part of this PR.

@BurntSushi I think this is good to go

Benchmarks
/*!
This module benchmarks ignore file matching.
*/
#![feature(test)]

extern crate test;

use std::{
    fmt::Write as _,
    fs,
    path::{Path, PathBuf},
    sync::atomic::{AtomicUsize, Ordering},
    time::{SystemTime, UNIX_EPOCH},
};

use ignore::{WalkBuilder, WalkState};

const GITIGNORE_PATTERNS: usize = 100;
const DEEP_DIRS: usize = 64;
const LAYERED_DIRS: usize = 32;
const LAYERED_PARENTS: usize = 8;
const MANY_ROOTS: usize = 64;
const PARTIALLY_SHARED_GROUPS: usize = 8;
const PARTIALLY_SHARED_ROOTS_PER_GROUP: usize = 8;
const WIDE_DIRS: usize = 128;

struct TempDir(PathBuf);

impl TempDir {
    fn new() -> TempDir {
        let nanos =
            SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos();
        let path = std::env::temp_dir()
            .join(format!("ignore-bench-{}-{nanos}", std::process::id(),));
        fs::create_dir(&path).unwrap();
        TempDir(path)
    }

    fn path(&self) -> &Path {
        &self.0
    }
}

impl Drop for TempDir {
    fn drop(&mut self) {
        fs::remove_dir_all(&self.0).unwrap();
    }
}

struct ParentGitignore {
    _temp_dir: TempDir,
    builder: WalkBuilder,
    roots: usize,
}

impl ParentGitignore {
    fn new(roots: usize) -> ParentGitignore {
        let gitignore = gitignore_patterns("ignored");
        ParentGitignore::with_paths(&gitignore, |project| {
            (0..roots).map(|i| project.join(format!("root-{i:03}"))).collect()
        })
    }

    fn empty(roots: usize) -> ParentGitignore {
        ParentGitignore::with_paths("", |project| {
            (0..roots).map(|i| project.join(format!("root-{i:03}"))).collect()
        })
    }

    fn partially_shared() -> ParentGitignore {
        let gitignore = gitignore_patterns("ignored");
        ParentGitignore::with_paths(&gitignore, |project| {
            (0..PARTIALLY_SHARED_GROUPS)
                .flat_map(|group| {
                    (0..PARTIALLY_SHARED_ROOTS_PER_GROUP).map(move |root| {
                        project
                            .join(format!("group-{group:03}"))
                            .join(format!("root-{root:03}"))
                    })
                })
                .collect()
        })
    }

    fn deep_kept() -> ParentGitignore {
        ParentGitignore::with_paths("ignored/**\n", |project| {
            let root = project.join("root");
            let mut current = root.clone();
            for i in 0..DEEP_DIRS {
                current = current.join(format!("dir-{i:03}"));
                fs::create_dir_all(&current).unwrap();
            }
            fs::write(current.join("file"), "").unwrap();
            vec![root]
        })
    }

    fn layered_parent_gitignores_misses() -> ParentGitignore {
        let gitignore = gitignore_patterns("ignored-project");
        ParentGitignore::with_paths(&gitignore, |project| {
            let mut parent = project.to_path_buf();
            for layer in 0..LAYERED_PARENTS {
                parent = parent.join(format!("parent-{layer:03}"));
                fs::create_dir_all(&parent).unwrap();
                fs::write(
                    parent.join(".gitignore"),
                    gitignore_patterns(&format!("ignored-{layer:03}")),
                )
                .unwrap();
            }
            let root = parent.join("root");
            for i in 0..LAYERED_DIRS {
                let dir = root.join(format!("dir-{i:03}"));
                fs::create_dir_all(&dir).unwrap();
                fs::write(dir.join("file"), "").unwrap();
            }
            vec![root]
        })
    }

    fn wide_kept() -> ParentGitignore {
        ParentGitignore::with_paths("ignored/**\n", |project| {
            let root = project.join("root");
            for i in 0..WIDE_DIRS {
                let dir = root.join(format!("dir-{i:03}"));
                fs::create_dir_all(&dir).unwrap();
                fs::write(dir.join("file"), "").unwrap();
            }
            vec![root]
        })
    }

    fn with_parent_matches(roots: usize) -> ParentGitignore {
        ParentGitignore::with_paths("root-*/ignored/\n", |project| {
            let paths = (0..roots)
                .map(|i| project.join(format!("root-{i:03}")))
                .collect::<Vec<_>>();
            for path in &paths {
                fs::create_dir_all(path.join("ignored")).unwrap();
                fs::create_dir_all(path.join("kept")).unwrap();
                fs::write(path.join("ignored/file"), "").unwrap();
                fs::write(path.join("kept/file"), "").unwrap();
            }
            paths
        })
    }

    fn with_paths(
        gitignore: &str,
        paths: impl FnOnce(&Path) -> Vec<PathBuf>,
    ) -> ParentGitignore {
        let temp_dir = TempDir::new();
        let project = temp_dir.path().join("project");
        fs::create_dir_all(project.join(".git")).unwrap();
        fs::write(project.join(".gitignore"), gitignore).unwrap();

        let paths = paths(&project);
        for path in &paths {
            fs::create_dir_all(path).unwrap();
        }

        let mut builder = WalkBuilder::new(&paths[0]);
        for path in &paths[1..] {
            builder.add(path);
        }
        builder.max_depth(Some(0)).git_global(false).git_exclude(false);
        ParentGitignore { _temp_dir: temp_dir, builder, roots: paths.len() }
    }

    fn walk(&self) {
        assert_eq!(self.roots, self.builder.build().count());
    }

    fn walk_parallel(&self, expected_entries: usize) {
        let entries = AtomicUsize::new(0);
        self.builder.build_parallel().run(|| {
            Box::new(|_| {
                entries.fetch_add(1, Ordering::Relaxed);
                WalkState::Continue
            })
        });
        assert_eq!(expected_entries, entries.load(Ordering::Relaxed));
    }

    fn walk_entries(&self, expected_entries: usize) {
        assert_eq!(expected_entries, self.builder.build().count());
    }
}

fn gitignore_patterns(prefix: &str) -> String {
    let mut gitignore = String::new();
    for i in 0..GITIGNORE_PATTERNS {
        writeln!(gitignore, "{prefix}/{i:03}/**").unwrap();
    }
    gitignore
}

#[bench]
fn add_parents_one_root(b: &mut test::Bencher) {
    let fixture = ParentGitignore::new(1);

Owner

@BurntSushi BurntSushi left a comment

I love the fix! Rather elegant. Thank you @jelle-openai and @MichaReiser for getting this over the finish line. :-)

@BurntSushi BurntSushi merged commit 43e2f08 into BurntSushi:master Jun 5, 2026
22 checks passed
@BurntSushi
Owner

This PR is on crates.io in ignore 0.4.26.

@MichaReiser
Collaborator

This PR is on crates.io in ignore 0.4.26.

Wow, this was quick. Thank you!

@jelle-openai jelle-openai deleted the jelle/codex/ignore-multiroot-determinism branch June 5, 2026 16:41

Development

Successfully merging this pull request may close these issues.

Nondeterminism in ignore::WalkBuilder parallel multi-root walk

3 participants