Skip to content

feat[Go]: implement POST /api/v1/files/link-to-datasets#15674

Open
hunnyboy1217 wants to merge 2 commits into
infiniflow:mainfrom
hunnyboy1217:feat/go-file-link-to-datasets
Open

feat[Go]: implement POST /api/v1/files/link-to-datasets#15674
hunnyboy1217 wants to merge 2 commits into
infiniflow:mainfrom
hunnyboy1217:feat/go-file-link-to-datasets

Conversation

@hunnyboy1217
Copy link
Copy Markdown
Contributor

@hunnyboy1217 hunnyboy1217 commented Jun 4, 2026

What problem does this PR solve?

Closes #15673 — ports the Python file2document_api.py convert() endpoint to Go.

Method Path Handler
POST /api/v1/files/link-to-datasets FileHandler.LinkToDatasets

Type of change

  • New Feature (non-breaking change which adds functionality)

Implementation notes

Files changed:

internal/service/file2document.go  – new service (File2DocumentService)
internal/dao/file2document.go      – added Create method
internal/handler/file.go           – FileHandler gains file2DocumentService;
                                     LinkToDatasets HTTP handler
internal/router/router.go          – route registered

Functional parity table:

Concern Go behaviour
Required fields file_ids and kb_ids both required; missing either → CodeDataError mirroring Python @validate_request
File existence fileDAO.GetByIDs(fileIDs) builds a set; any missing ID → "File not found!"
KB existence kbDAO.GetByID(kbID) per KB; missing → "Can't find this dataset!"
Folder expansion getAllInnermostFileIDs recursively calls fileDAO.ListByParentID — mirrors FileService.get_all_innermost_file_ids
File permissions checkFileTeamPermission: file.TenantID == userID OR user in tenant's team — mirrors check_file_team_permission
KB permissions checkKBTeamPermission: kb.TenantID == userID OR user in tenant's team — mirrors check_kb_team_permission
Fire-and-forget go convertFiles(...) goroutine after all validation passes — mirrors loop.run_in_executor(None, _convert_files, …)
Conversion convertFiles: for each file → delete existing mappings + hard-delete old documents → create new Document in each target KB → create File2Document mapping — mirrors Python _convert_files
getParser Extension-based lookup with fallback to kb.ParserID — mirrors FileService.get_parser
Immediate return true returned to caller as soon as goroutine is scheduled

Ports Python file2document_api.py convert() endpoint to Go,
closing issue infiniflow#15673.

  POST /api/v1/files/link-to-datasets   (FileHandler.LinkToDatasets)

New files:
  internal/service/file2document.go  – File2DocumentService with:
    LinkToDatasets    – validates files/KBs, expands folders, checks
                        team permissions, schedules _convertFiles goroutine
    convertFiles      – deletes existing docs+mappings per file, then creates
                        fresh documents in each target KB and new mappings
    getAllInnermostFileIDs – recursive folder expansion (mirrors Python
                        FileService.get_all_innermost_file_ids)
    checkFileTeamPermission / checkKBTeamPermission – mirrors Python
                        check_file_team_permission / check_kb_team_permission
    getParser         – maps (file type, extension, kb.parser_id) → parser ID

Changed files:
  internal/dao/file2document.go  – added Create method
  internal/handler/file.go       – FileHandler gains file2DocumentService;
                                   LinkToDatasets HTTP handler added
  internal/router/router.go      – route registered in /api/v1/files group

Behaviour matches Python exactly:
  - file_ids and kb_ids are required; missing either → error
  - All files validated before any work begins
  - All KBs validated before any work begins
  - Folder IDs expanded to leaf-file IDs recursively
  - File and KB team permissions checked for every expanded file
  - Goroutine (fire-and-forget) for the blocking DB work matches Python's
    loop.run_in_executor pattern
  - Returns true immediately without waiting for conversion
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 4, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds File2DocumentService to link files/folders to knowledge bases, expand folders, enforce tenant/team permissions, and asynchronously convert and relink documents; exposes POST /api/v1/files/link-to-datasets, adds File2Document DAO create, and documents insertion/removal operations.

Changes

File-to-Dataset Linking Workflow

Layer / File(s) Summary
Service errors, type, constructor
internal/service/file2document.go
Exported sentinel errors; File2DocumentService type and NewFile2DocumentService() that wires required DAOs and DocumentService.
LinkToDatasets request flow and orchestration
internal/service/file2document.go
LinkToDatasets validates file and KB existence, expands folder inputs to leaf file IDs, deduplicates IDs, enforces per-file and per-KB tenant/team authorization, and launches async convertFiles.
convertFiles: remove old mappings, create documents, and remap
internal/service/file2document.go
For each file: load/remove existing File2Document mappings and documents (via DocumentService.RemoveDocumentKeepFile), delete mappings, create new entity.Document per target KB with parser/Kb config/location/pipeline, insert via DocumentService.InsertDocument, and create new File2Document mappings.
Helpers: folder expansion, auth, parser & dedupe
internal/service/file2document.go
getAllInnermostFileIDs for recursive folder traversal; checkFileTeamPermission/checkKBTeamPermission for tenant/team checks; getParser for extension→parser mapping with KB fallback; dedupeStrings.
Document service persistence ops
internal/service/document.go
RemoveDocumentKeepFile deletes document tasks/engine data and the document row while keeping file/storage/mappings; InsertDocument adds a document row and atomically increments KB doc_num in a transaction.
File2Document DAO Create Method
internal/dao/file2document.go
Adds Create method persisting *entity.File2Document via DB.Create(mapping).
HTTP handler and router
internal/handler/file.go, internal/router/router.go
FileHandler gains file2DocumentService field and initializes it; LinkToDatasets handler binds request, validates non-empty file_ids and kb_ids, delegates to service, maps sentinel errors to common.CodeDataError (others to common.CodeServerError); router registers POST /api/v1/files/link-to-datasets.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

💞 feature, ci, size:XL

Suggested reviewers

  • Haruko386
  • JinHai-CN

Poem

🐰 I hop through code with gentle paws,

Files to datasets, I mend the cause.
Folders expand, parsers align,
Documents reborn, knowledge to shine.
Hooray — new links, a tidy sign!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: implementing the POST endpoint for linking files to datasets, matching the primary objective of the pull request.
Description check ✅ Passed The description covers the problem statement, type of change, implementation notes with file structure, and a detailed functional parity table comparing Go to Python behavior.
Linked Issues check ✅ Passed The PR implements the POST /api/v1/files/link-to-datasets endpoint [#15673] with all required components: service, DAO, handler, router integration, folder expansion, permission checks, parser selection, and fire-and-forget conversion model matching Python behavior.
Out of Scope Changes check ✅ Passed All changes are focused on the file2document endpoint implementation across service, DAO, handler, and router layers with supporting document service methods; no unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/file2document.go`:
- Around line 145-223: The convertFiles flow deletes existing document mappings
and documents then recreates documents per-kb without transactional guarantees,
causing inconsistent state if any step fails; modify convertFiles to use a
per-file transactional approach (or an atomic two-phase approach): for each
fileID wrap operations in a DB transaction (or create new documents and
file2document mappings first using documentDAO.Create and
file2DocumentDAO.Create, verify all creates succeeded, then call
documentDAO.Delete and file2DocumentDAO.DeleteByFileID or replace old mappings),
and ensure error handling rolls back the transaction or aborts cleanup on
failure; locate and update the logic around file2DocumentDAO.DeleteByFileID,
documentDAO.Delete, documentDAO.Create, and file2DocumentDAO.Create to implement
the transactional/atomic reorder and proper rollback.
- Around line 94-106: The loop that expands folder IDs (using filesSet,
FileTypeFolder and s.getAllInnermostFileIDs) can produce duplicate leaf IDs when
requests mix folders and direct file IDs; change the collection logic to
deduplicate by using a set (map[string]struct{}) named e.g. seen to track added
IDs and only append to allFileIDs the first time an ID is encountered (for both
inner results from getAllInnermostFileIDs and direct ids), ensuring each leaf
file is scheduled for conversion exactly once.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b79c150d-0fd8-492c-921a-620b2dd65e92

📥 Commits

Reviewing files that changed from the base of the PR and between 96a4166 and 2bfbc00.

📒 Files selected for processing (4)
  • internal/dao/file2document.go
  • internal/handler/file.go
  • internal/router/router.go
  • internal/service/file2document.go

Comment thread internal/service/file2document.go
Comment thread internal/service/file2document.go Outdated
Comment on lines +145 to +223
mappings, err := s.file2DocumentDAO.GetByFileID(fileID)
if err != nil {
common.Warn("convertFiles: GetByFileID failed", zap.String("fileID", fileID), zap.Error(err))
}
for _, m := range mappings {
if m.DocumentID == nil {
continue
}
doc, err := s.documentDAO.GetByID(*m.DocumentID)
if err != nil || doc == nil {
continue
}
// Get tenant from KB.
kb, err := s.kbDAO.GetByID(doc.KbID)
if err != nil || kb == nil {
continue
}
// Hard-delete document (ignoring chunk store cleanup for simplicity).
if _, err := s.documentDAO.Delete(*m.DocumentID); err != nil {
common.Warn("convertFiles: Delete document failed",
zap.String("docID", *m.DocumentID), zap.Error(err))
}
}
if err := s.file2DocumentDAO.DeleteByFileID(fileID); err != nil {
common.Warn("convertFiles: DeleteByFileID failed", zap.String("fileID", fileID), zap.Error(err))
}

// Get source file.
file, err := s.fileDAO.GetByID(fileID)
if err != nil || file == nil {
continue
}

// Create document + mapping in each target KB.
for _, kbID := range kbIDs {
kb, ok := func() (*entity.Knowledgebase, bool) {
kb, err := s.kbDAO.GetByID(kbID)
return kb, err == nil && kb != nil
}()
if !ok {
continue
}

parserID := getParser(file.Type, file.Name, kb.ParserID)
suffix := strings.TrimPrefix(filepath.Ext(file.Name), ".")
doc := &entity.Document{
ID: common.GenerateUUID(),
KbID: kb.ID,
ParserID: parserID,
ParserConfig: entity.JSONMap(kb.ParserConfig),
CreatedBy: userID,
Type: file.Type,
Name: &file.Name,
Suffix: suffix,
Size: file.Size,
}
if file.Location != nil {
doc.Location = file.Location
}
if kb.PipelineID != nil {
doc.PipelineID = kb.PipelineID
}

if err := s.documentDAO.Create(doc); err != nil {
common.Warn("convertFiles: Create document failed",
zap.String("kbID", kbID), zap.String("fileID", fileID), zap.Error(err))
continue
}

mapping := &entity.File2Document{
ID: common.GenerateUUID(),
FileID: &fileID,
DocumentID: &doc.ID,
}
if err := s.file2DocumentDAO.Create(mapping); err != nil {
common.Warn("convertFiles: Create file2document mapping failed",
zap.String("fileID", fileID), zap.String("docID", doc.ID), zap.Error(err))
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Current delete-then-recreate flow can leave inconsistent file/document state.

convertFiles performs destructive deletes with warn-and-continue behavior and no per-file transaction. If any intermediate step fails, you can end up with removed mappings and missing recreated links (or orphaned docs).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/file2document.go` around lines 145 - 223, The convertFiles
flow deletes existing document mappings and documents then recreates documents
per-kb without transactional guarantees, causing inconsistent state if any step
fails; modify convertFiles to use a per-file transactional approach (or an
atomic two-phase approach): for each fileID wrap operations in a DB transaction
(or create new documents and file2document mappings first using
documentDAO.Create and file2DocumentDAO.Create, verify all creates succeeded,
then call documentDAO.Delete and file2DocumentDAO.DeleteByFileID or replace old
mappings), and ensure error handling rolls back the transaction or aborts
cleanup on failure; locate and update the logic around
file2DocumentDAO.DeleteByFileID, documentDAO.Delete, documentDAO.Create, and
file2DocumentDAO.Create to implement the transactional/atomic reorder and proper
rollback.

Copy link
Copy Markdown
Contributor

@Hz-186 Hz-186 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hunnyboy1217

The Go background conversion directly deletes and creates document rows through DAO calls. Python uses DocumentService.remove_document() and DocumentService.insert(), which update dataset/document counters and clean up related tasks, chunks, metadata, thumbnails, and graph/ index state.

In my local test, Go created the new document and file2document mapping, but the target KB doc_num stayed at 0, confirming the counter update path is skipped.

There are also response-shape mismatches for errors. Missing file_ids returns Python code=101, data=null with the validate_request message, while Go returns code=102, data=false and exposes the Gin binding error. 401 and file-not-found responses also omit/change data.

Please route document removal/insertion through equivalent service logic and align the error response shapes with Python.

…w#15674)

Applies the final-review feedback from Hz-186 and CodeRabbit.

Hz-186 — counters not updated / error shapes:
  - convertFiles now routes document removal/insertion through DocumentService
    instead of raw DAO calls, so KB doc_num/chunk_num/token_num counters are
    updated (target KB doc_num previously stayed at 0):
      * new DocumentService.RemoveDocumentKeepFile — mirrors Python
        remove_document: deletes tasks, doc-store chunks/metadata, and the
        document row with counter decrement, WITHOUT touching the file record
        or its mappings (caller clears mappings via delete_by_file_id)
      * new DocumentService.InsertDocument — mirrors Python insert: creates the
        document row and increments KB doc_num in a single transaction
  - error response shapes aligned with Python:
      * missing file_ids/kb_ids now returns code=101 (ARGUMENT_ERROR), data=null
        with the validate_request-style message "required argument are missing:
        <fields>; " (was code=102, data=false)
      * all error paths use jsonError → data=null (was data=false); file/dataset
        not-found and no-authorization map to code=102 like Python's
        get_data_error_result in convert()

CodeRabbit — robustness:
  - no longer propagate raw DAO/runtime errors through the API: the service
    returns sentinel errors (ErrLinkFileNotFound / ErrLinkDatasetNotFound /
    ErrLinkNoAuthorization / ErrLinkInternal); internal failures surface a
    generic "Internal server error." message instead of leaking DB details
  - deduplicate expanded file IDs after folder expansion so a leaf file reached
    via both a folder and a direct ID (or overlapping folders) is converted once
  - InsertDocument wraps the document row + counter bump in one transaction so a
    failed insert cannot leave a stale counter

No route or request/response field changes beyond the corrected error codes.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
internal/service/file2document.go (1)

221-235: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback the new document when mapping creation fails.

InsertDocument commits before file2DocumentDAO.Create(mapping). If the mapping insert fails, this leaves a live document in the KB with doc_num already incremented but no file2document row, so future relinks cannot find and clean it up.

💡 Proposed fix
 			if err := s.file2DocumentDAO.Create(mapping); err != nil {
 				common.Warn("convertFiles: Create file2document mapping failed",
 					zap.String("fileID", fileID), zap.String("docID", doc.ID), zap.Error(err))
+				if delErr := s.documentSvc.DeleteDocument(doc.ID); delErr != nil {
+					common.Warn("convertFiles: rollback orphan document failed",
+						zap.String("docID", doc.ID), zap.Error(delErr))
+				}
+				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/file2document.go` around lines 221 - 235, When file2document
mapping creation fails after a successful s.documentSvc.InsertDocument(doc),
delete or rollback the newly created document to avoid orphaned docs: after
s.file2DocumentDAO.Create(mapping) returns an error, call the document removal
API (e.g., s.documentSvc.DeleteDocument or the appropriate delete/remove method)
with doc.ID, log both the mapping create error and any delete error, and ensure
the code continues (or retries) only after attempting the cleanup; update the
existing warn log block to perform this delete using the same doc.ID and keep
the mapping creation and ID generation (entity.File2Document,
common.GenerateUUID) as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/document.go`:
- Around line 288-296: In DocumentService.InsertDocument's transaction block,
guard the KB counter update by capturing the update result instead of only
checking Error: call tx.Model(&entity.Knowledgebase{}).Where("id = ?",
doc.KbID).Update("doc_num", gorm.Expr("doc_num + 1")) into a result variable,
return an error and rollback if result.Error != nil, and also return an error
when result.RowsAffected == 0 (indicating no KB row was updated) so the
transaction fails when the KB is missing (follow the same pattern used in
deleteDocRecordWithCounters); reference the tx variable, doc.KbID, and the
Update("doc_num", gorm.Expr("doc_num + 1")) call when making the change.

In `@internal/service/file2document.go`:
- Around line 143-145: Deduplicate the kb IDs before launching the background
goroutine: build a unique slice from the validated kbMap keys (instead of using
raw req.KbIDs) and pass that deduplicated slice into the goroutine that runs the
conversion; update the usage of kbIDs in the anonymous func launched in
file2document.go so it iterates over the unique KB IDs (derived from kbMap) to
avoid creating duplicate documents/mappings.

---

Duplicate comments:
In `@internal/service/file2document.go`:
- Around line 221-235: When file2document mapping creation fails after a
successful s.documentSvc.InsertDocument(doc), delete or rollback the newly
created document to avoid orphaned docs: after
s.file2DocumentDAO.Create(mapping) returns an error, call the document removal
API (e.g., s.documentSvc.DeleteDocument or the appropriate delete/remove method)
with doc.ID, log both the mapping create error and any delete error, and ensure
the code continues (or retries) only after attempting the cleanup; update the
existing warn log block to perform this delete using the same doc.ID and keep
the mapping creation and ID generation (entity.File2Document,
common.GenerateUUID) as-is.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 986c869b-e234-4eb4-95e0-9b61e287a1ad

📥 Commits

Reviewing files that changed from the base of the PR and between 2bfbc00 and bc345fd.

📒 Files selected for processing (3)
  • internal/handler/file.go
  • internal/service/document.go
  • internal/service/file2document.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/handler/file.go

Comment on lines +288 to +296
return dao.DB.Transaction(func(tx *gorm.DB) error {
if err := tx.Create(doc).Error; err != nil {
return fmt.Errorf("failed to create document: %w", err)
}
if err := tx.Model(&entity.Knowledgebase{}).
Where("id = ?", doc.KbID).
Update("doc_num", gorm.Expr("doc_num + 1")).Error; err != nil {
return fmt.Errorf("failed to increment doc_num for KB %s: %w", doc.KbID, err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Document / Knowledgebase model definitions =="
rg -nC2 'type Document struct|type Knowledgebase struct|KbID|foreignKey|constraint' --type=go

echo
echo "== Migrations / schema references for documents.kb_id =="
rg -nC2 'documents|knowledgebase|kb_id|FOREIGN KEY|foreign key|constraint' --type=go --type=sql

Repository: infiniflow/ragflow

Length of output: 50375


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== 1) Inspect Document entity KbID gorm tags =="
sed -n '1,120p' internal/entity/document.go | cat -n

echo
echo "== 2) Inspect knowledgebase entity (for delete behavior / tags) =="
sed -n '1,140p' internal/entity/kb.go | cat -n

echo
echo "== 3) Inspect the function doing tx.Create + KB doc_num increment =="
sed -n '240,340p' internal/service/document.go | cat -n

echo
echo "== 4) Inspect the test that mentions inserting a doc with a missing KB row =="
rg -n "orphan-doc|no-such-kb|kb_id that has no KB" internal/service/document_test.go
# Print around the referenced region without dumping whole file
start=680
end=780
sed -n "${start},${end}p" internal/service/document_test.go | cat -n

echo
echo "== 5) Search repo for actual FK constraints referencing kb_id -> knowledgebase/id =="
rg -n --hidden --glob '*.{sql,psql,postgres,mysql,sqlite,duckdb,migration,ddl}' "FOREIGN KEY" .
rg -n --hidden --glob '*.sql' "kb_id[^\\n]*REFERENCES|REFERENCES[^\\n]*kb_id|kb_id[^\\n]*references|references[^\\n]*kb_id" .
rg -n --hidden --glob '*.go' "foreignKey|constraint:|Constraints|OnDelete|OnUpdate|references|REFERENCES" internal/entity internal/dao

Repository: infiniflow/ragflow

Length of output: 21725


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== 1) Find setupServiceTestDB definition (DB type + FK enforcement) =="
rg -n "func setupServiceTestDB|setupServiceTestDB\\(" internal/service/document_test.go internal/service -S
rg -n "PRAGMA foreign_keys|foreign_keys=on|ForeignKey|constraint" internal/service -S

echo
echo "== 2) Inspect the test DB setup file(s) =="
# Prefer the file that contains setupServiceTestDB
file="$(rg -n "func setupServiceTestDB" -S internal/service | head -n1 | awk -F: '{print $1}')"
if [ -n "${file:-}" ]; then
  echo "Using: $file"
  sed -n '1,220p' "$file" | cat -n
fi

echo
echo "== 3) Search for migrations/DDL mentioning document.kb_id foreign keys (kb_id -> knowledgebase.id) =="
rg -n --hidden --glob '*.{sql,psql,postgres,mysql,sqlite,migration,ddl}' "document.*kb_id|documents.*kb_id|kb_id.*REFERENCES|REFERENCES.*knowledgebase|knowledgebase.*id" .
rg -n --hidden --glob '*.go' "AutoMigrate|Migrate|schema|foreignKey|constraint|OnDelete|OnUpdate" internal -S

Repository: infiniflow/ragflow

Length of output: 13752


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== A) Check if any SQL/DDL/migrations exist in repo =="
fd -t f -i -d 6 -e sql -e ddl -e psql -e postgres -e sqlite -e mysql -e migration -e go -a --max-results 2000 . > /tmp/ddl_files.txt || true
wc -l /tmp/ddl_files.txt || true
head -n 50 /tmp/ddl_files.txt || true

echo
echo "== B) Search repo for FK definitions involving kb_id -> knowledgebase/id =="
rg -n --hidden --glob '*.{sql,ddl,migration,psql,postgres,mysql,sqlite}' "kb_id.*REFERENCES|REFERENCES.*kb_id|document.*kb_id|documents.*kb_id|knowledgebase\\.id" . || true

echo
echo "== C) Inspect deleteDocRecordWithCounters for RowsAffected handling =="
rg -n "deleteDocRecordWithCounters\\(" internal/service/document.go
sed -n '90,170p' internal/service/document.go | cat -n

Repository: infiniflow/ragflow

Length of output: 6259


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== 1) Show deleteDocRecordWithCounters function body =="
ln=$(rg -n "func \\(s \\*DocumentService\\) deleteDocRecordWithCounters" -n internal/service/document.go | head -n1 | cut -d: -f1)
echo "function starts at line: $ln"
start=$((ln-20))
end=$((ln+80))
sed -n "${start},${end}p" internal/service/document.go | cat -n

echo
echo "== 2) Inspect docker init.sql for any FK/constraint on document/kb =="
# Only show relevant lines to keep output small
rg -n "document|documents|knowledgebase|knowledgebase\\.id|kb_id|FOREIGN KEY|REFERENCES" docker/init.sql docker/oceanbase/init.d/vec_memory.sql 2>/dev/null || true
# broaden to all docker init scripts
rg -n "FOREIGN KEY|REFERENCES|documents|knowledgebase|kb_id" docker -S || true

Repository: infiniflow/ragflow

Length of output: 4922


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== 1) Search docker/init.sql (and all docker init scripts) for FK/REFERENCES =="
rg -n "FOREIGN KEY|REFERENCES" docker -S || true

echo
echo "== 2) Search for document table DDL + kb_id inside init scripts =="
rg -n "CREATE TABLE.*document|CREATE TABLE.*documents|\\bdocument\\b" docker -S || true
rg -n "kb_id" docker -S || true

echo
echo "== 3) Search repo-wide for DDL that mentions knowledgebase/doc_num/documents.kb_id =="
rg -n --hidden --glob '*.{sql,ddl,psql,postgres,mysql,sqlite}' "knowledgebase|doc_num|CREATE TABLE.*document|\\bdocuments\\b|\\bkb_id\\b" . || true

Repository: infiniflow/ragflow

Length of output: 420


Guard the KB doc_num increment with RowsAffected in DocumentService.InsertDocument.
entity.Document.KbID has no gorm FK/constraint tags, and the test suite inserts an “orphan” document (KbID: "no-such-kb") successfully, so the DB-level FK enforcement isn’t reliably present. As a result, the current Update(...).Error can be nil even when the KB row doesn’t exist (0 rows updated), leaving the counters inconsistent. Capture the update result and roll back when RowsAffected == 0 (similar to the deleteDocRecordWithCounters pattern).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/document.go` around lines 288 - 296, In
DocumentService.InsertDocument's transaction block, guard the KB counter update
by capturing the update result instead of only checking Error: call
tx.Model(&entity.Knowledgebase{}).Where("id = ?", doc.KbID).Update("doc_num",
gorm.Expr("doc_num + 1")) into a result variable, return an error and rollback
if result.Error != nil, and also return an error when result.RowsAffected == 0
(indicating no KB row was updated) so the transaction fails when the KB is
missing (follow the same pattern used in deleteDocRecordWithCounters); reference
the tx variable, doc.KbID, and the Update("doc_num", gorm.Expr("doc_num + 1"))
call when making the change.

Comment on lines +143 to +145
// ── 6. Run conversion in background (fire-and-forget) ────────────────────
kbIDs := req.KbIDs
go func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deduplicate kb_ids before scheduling conversion.

kbMap collapses duplicates for validation, but the goroutine still receives raw req.KbIDs. A payload like ["kb-a", "kb-a"] will create two documents and two mappings for every file in the same dataset.

💡 Proposed fix
-	kbIDs := req.KbIDs
+	kbIDs := dedupeStrings(req.KbIDs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/file2document.go` around lines 143 - 145, Deduplicate the kb
IDs before launching the background goroutine: build a unique slice from the
validated kbMap keys (instead of using raw req.KbIDs) and pass that deduplicated
slice into the goroutine that runs the conversion; update the usage of kbIDs in
the anonymous func launched in file2document.go so it iterates over the unique
KB IDs (derived from kbMap) to avoid creating duplicate documents/mappings.

@hunnyboy1217
Copy link
Copy Markdown
Contributor Author

Hi, @Hz-186 ,
Updated based on your review!

Copy link
Copy Markdown
Contributor

@Hz-186 Hz-186 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Hz-186 Hz-186 added the ci Continue Integration label Jun 5, 2026
@Haruko386 Haruko386 marked this pull request as draft June 5, 2026 05:01
@Haruko386 Haruko386 marked this pull request as ready for review June 5, 2026 05:01
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.16%. Comparing base (461c190) to head (bc345fd).
⚠️ Report is 17 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #15674   +/-   ##
=======================================
  Coverage   93.16%   93.16%           
=======================================
  Files          10       10           
  Lines         717      717           
  Branches      118      118           
=======================================
  Hits          668      668           
  Misses         29       29           
  Partials       20       20           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@hunnyboy1217
Copy link
Copy Markdown
Contributor Author

Hi, @Haruko386,
All CI tests are successfully passed.

@hunnyboy1217
Copy link
Copy Markdown
Contributor Author

Hi, @yingfeng , Could you please review this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Continue Integration size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Subtask]: Implement File link-to-datasets API in Go

2 participants