Skip to content

feat[Go]: implement chunk REST APIs (list, add, update, switch)#15670

Open
hunnyboy1217 wants to merge 1 commit into
infiniflow:mainfrom
hunnyboy1217:feat/go-chunk-rest-apis
Open

feat[Go]: implement chunk REST APIs (list, add, update, switch)#15670
hunnyboy1217 wants to merge 1 commit into
infiniflow:mainfrom
hunnyboy1217:feat/go-chunk-rest-apis

Conversation

@hunnyboy1217

@hunnyboy1217 hunnyboy1217 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Closes #15668 — ports four missing chunk REST endpoints from Python (api/apps/restful_apis/chunk_api.py) to the Go API layer.

The following were already in Go: GET .../chunks/:chunk_id and DELETE .../chunks. This PR adds the remaining four:

Method Path Handler Python source
GET /api/v1/datasets/:dataset_id/documents/:document_id/chunks ListChunksREST list_chunks
POST /api/v1/datasets/:dataset_id/documents/:document_id/chunks AddChunk add_chunk
PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks/:chunk_id UpdateChunkREST update_chunk
PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks SwitchChunks switch_chunks

Type of change

  • New Feature (non-breaking change which adds functionality)

Implementation notes

Files changed:

internal/service/chunk.go  – helpers + four service methods
internal/handler/chunk.go  – four HTTP handlers
internal/router/router.go  – route registration

Functional parity table:

Concern Go behaviour
Dataset access resolveDatasetAccess iterates user tenants, calls kbDAO.GetByIDAndTenantID — mirrors KnowledgebaseService.accessible
Document ownership documentDAO.GetByID + doc.KbID == datasetID — mirrors DocumentService.query(id=…, kb_id=…)
Chunk ID generation xxhash.Sum64String(content + documentID) — same as Python xxhash.xxh64(…).hexdigest()
Tokenisation tokenizer.Tokenize + tokenizer.FineGrainedTokenize (existing Go tokenizer)
Embedding getEmbeddingModelForKB resolves KB → embedding model (same lookup chain as RetrievalTest); 0.1 * docNameVec + 0.9 * contentVec weighted average
Document stats gorm.Expr("chunk_num + 1") / gorm.Expr("token_num + ?", count) via documentDAO.UpdateByID — mirrors DocumentService.increment_chunk_num
UpdateChunkREST Re-embeds only when content or questions changes; all other fields (available, positions, tag_kwd, tag_feas) updated without re-embedding
SwitchChunks Bulk UpdateChunks with available_int per chunk — mirrors Python's docStoreConn.update loop
Response shape AddChunk returns {"chunk": {…}} with Python key_mapping field renames; ListChunksREST mirrors _strip_chunk_runtime_fields output

@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 4, 2026
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds dataset/document-scoped chunk REST endpoints (list with pagination/filters, add with deterministic IDs + embeddings, partial update with conditional re-embed, and bulk availability toggle), router wiring, access/embedding helpers, and response normalization.

Changes

Chunk REST API Implementation

Layer / File(s) Summary
Handler endpoints and imports
internal/handler/chunk.go
Adds strconv import and four Gin handlers: ListChunksREST, AddChunk, UpdateChunkREST, SwitchChunks that parse path/query/body and forward requests to ChunkService.
Router wiring
internal/router/router.go
Registers collection and item routes under /api/v1/datasets/:dataset_id/documents/:document_id/chunks for list, add, switch, get, patch, and delete.
Service imports & helpers
internal/service/chunk.go
Adds time, xxhash/v2, gorm imports and implements resolveDatasetAccess, getEmbeddingModelForKB, embedTexts, estimateTokenCount, and weightedVec.
ListChunks (service + handler)
internal/service/chunk.go, internal/handler/chunk.go
ListChunksREST builds a doc-engine search scoped to dataset/document with pagination, optional keywords and available filter; includes orSlice, orStr, intToBool normalizers and handler-side page/page_size parsing/capping.
AddChunk (service + handler)
internal/service/chunk.go, internal/handler/chunk.go
AddChunk validates ownership/content, computes deterministic chunkID via xxhash(content+documentID), tokenizes/normalizes questions/keywords, generates embeddings from [doc.Name, embedInput], stores a weighted vector key, inserts chunk into doc engine, and updates parent document counters; handler binds AddChunkRequest.
UpdateChunk (service + handler)
internal/service/chunk.go, internal/handler/chunk.go
UpdateChunkREST validates ownership and chunk scope, applies partial updates, re-embeds only when content or questions change, and persists updates; handler binds UpdateChunkRESTRequest.
SwitchChunks (service + handler)
internal/service/chunk.go, internal/handler/chunk.go
SwitchChunks validates chunk_ids, resolves ownership, computes available_int, iterates updating each chunk's availability while logging per-item failures; handler validates payload and returns result.

Sequence Diagram — AddChunk flow

sequenceDiagram
  participant Client
  participant ChunkHandler
  participant ChunkService
  participant DocEngine
  participant EmbeddingService
  Client->>ChunkHandler: POST /datasets/:ds/documents/:doc/chunks {AddChunkRequest}
  ChunkHandler->>ChunkService: AddChunk(datasetID, documentID, userID, req)
  ChunkService->>DocEngine: Verify document belongs to KB / insert chunk
  ChunkService->>EmbeddingService: embedTexts([doc.Name, embedInput])
  EmbeddingService-->>ChunkService: vector
  ChunkService->>DocEngine: store chunk + q_<len>_vec
  ChunkService-->>ChunkHandler: {"chunk": {...}}
  ChunkHandler-->>Client: HTTP 200 {code,data,message}
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

💞 feature

Suggested reviewers

  • Haruko386
  • Hz-186
  • JinHai-CN

"A Rabbit's Note on Chunks"
I hopped through code and seeded names,
Deterministic hashes, tiny claims.
Embeddings hum and counters climb,
Chunks now line up, neat and fine.
✨🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately and concisely describes the main change: implementation of four chunk REST API endpoints (list, add, update, switch) in Go.
Description check ✅ Passed The PR description comprehensively addresses the template by explaining the problem solved (porting four Python endpoints to Go), specifying the type of change (New Feature), documenting all four endpoints with paths and handlers, and providing detailed implementation notes including functional parity with Python behavior.
Linked Issues check ✅ Passed The PR implements four of the seven chunk endpoints required by #15668: ListChunksREST, AddChunk, UpdateChunkREST, and SwitchChunks. The implementation achieves behavioral parity with Python sources via dataset/document access validation, chunk ID generation, tokenization, embedding, and document stat updates.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing the four chunk REST endpoints required by #15668 (list, add, update, switch). The modifications to internal/handler/chunk.go, internal/service/chunk.go, and internal/router/router.go are directly aligned with the PR objectives and do not introduce unrelated changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
internal/service/chunk.go (1)

1360-1361: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Same panic risk with tokenizer.Tokenize result cast.

This has the same issue as in AddChunk - direct type assertion without safety check.

🛡️ Proposed fix
 	// Tokenize.
 	contentLtks, _ := tokenizer.Tokenize(content)
-	contentSmLtks, _ := tokenizer.FineGrainedTokenize(contentLtks.(string))
+	contentLtksStr, ok := contentLtks.(string)
+	if !ok {
+		contentLtksStr = ""
+	}
+	contentSmLtks, _ := tokenizer.FineGrainedTokenize(contentLtksStr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1360 - 1361, The code unsafely
type-asserts tokenizer.Tokenize(...) result to string (contentLtks.(string))
which can panic; change the call sites (where contentLtks and contentSmLtks are
set) to use the two-value type assertion or a type switch to verify the returned
value is a string before using it, and propagate or handle errors from
tokenizer.Tokenize and tokenizer.FineGrainedTokenize instead of ignoring them;
update the logic around variables contentLtks and contentSmLtks and the calls to
tokenizer.Tokenize and tokenizer.FineGrainedTokenize so you return/log an error
or bail out if the result type is unexpected.
🧹 Nitpick comments (1)
internal/handler/chunk.go (1)

367-381: 💤 Low value

Consider using strconv.Atoi instead of json.Number for query parameter parsing.

Using json.Number(v).Int64() for string-to-int conversion is unconventional. strconv.Atoi or strconv.ParseInt is more idiomatic for parsing query parameters and avoids the intermediate json.Number type.

♻️ Suggested refactor
 	page := 1
 	if v := c.Query("page"); v != "" {
-		if p, err := json.Number(v).Int64(); err == nil && p > 0 {
+		if p, err := strconv.Atoi(v); err == nil && p > 0 {
 			page = int(p)
 		}
 	}
 	pageSize := 30
 	if v := c.Query("page_size"); v != "" {
-		if ps, err := json.Number(v).Int64(); err == nil && ps > 0 {
+		if ps, err := strconv.Atoi(v); err == nil && ps > 0 {
 			if ps > 100 {
 				ps = 100
 			}
-			pageSize = int(ps)
+			pageSize = ps
 		}
 	}

This would also require adding "strconv" to the imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/handler/chunk.go` around lines 367 - 381, The current parsing of
query params for pagination (page and pageSize variables in
internal/handler/chunk.go) uses json.Number(v).Int64(), which is unconventional;
replace those conversions with strconv.Atoi (or strconv.ParseInt) to parse
c.Query("page") and c.Query("page_size"), keep the same validation (positive
values and cap pageSize to 100), and add "strconv" to imports; ensure you still
default page=1 and pageSize=30 when parsing fails or values are empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/chunk.go`:
- Around line 1431-1459: SwitchChunks currently updates chunks by id without
verifying they belong to documentID; modify SwitchChunks to include the document
ownership check: when building the update condition passed to
s.docEngine.UpdateChunks include "doc_id": documentID (in addition to "id":
chunkID) or first fetch the chunk via s.docEngine (e.g., GetChunk/GetChunks) and
verify chunk.DocID equals documentID before calling UpdateChunks; ensure you
still respect tenantID/indexName/datasetID and keep available_int logic and
error logging (use the same condition/update variable names to locate the
change).

---

Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1360-1361: The code unsafely type-asserts tokenizer.Tokenize(...)
result to string (contentLtks.(string)) which can panic; change the call sites
(where contentLtks and contentSmLtks are set) to use the two-value type
assertion or a type switch to verify the returned value is a string before using
it, and propagate or handle errors from tokenizer.Tokenize and
tokenizer.FineGrainedTokenize instead of ignoring them; update the logic around
variables contentLtks and contentSmLtks and the calls to tokenizer.Tokenize and
tokenizer.FineGrainedTokenize so you return/log an error or bail out if the
result type is unexpected.

---

Nitpick comments:
In `@internal/handler/chunk.go`:
- Around line 367-381: The current parsing of query params for pagination (page
and pageSize variables in internal/handler/chunk.go) uses
json.Number(v).Int64(), which is unconventional; replace those conversions with
strconv.Atoi (or strconv.ParseInt) to parse c.Query("page") and
c.Query("page_size"), keep the same validation (positive values and cap pageSize
to 100), and add "strconv" to imports; ensure you still default page=1 and
pageSize=30 when parsing fails or values are empty.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85724050-8e76-4982-be04-53bbaa180426

📥 Commits

Reviewing files that changed from the base of the PR and between 98f2a2e and 722d990.

📒 Files selected for processing (3)
  • internal/handler/chunk.go
  • internal/router/router.go
  • internal/service/chunk.go

Comment thread internal/service/chunk.go Outdated
Comment thread internal/service/chunk.go
@hunnyboy1217

Copy link
Copy Markdown
Contributor Author

Hi, @Haruko386 ,
Please review this PR.

@hunnyboy1217 hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 722d990 to 52752d2 Compare June 8, 2026 05:22
@Haruko386

Copy link
Copy Markdown
Member

Hi @hunnyboy1217 I will review your code soon

@Haruko386 Haruko386 assigned Haruko386 and unassigned Haruko386 Jun 8, 2026
@Haruko386 Haruko386 self-requested a review June 8, 2026 05:45
@Haruko386 Haruko386 marked this pull request as draft June 8, 2026 07:12
@Haruko386 Haruko386 added the ci Continue Integration label Jun 8, 2026
@Haruko386 Haruko386 marked this pull request as ready for review June 8, 2026 07:13
@dosubot dosubot Bot added the 💞 feature Feature request, pull request that fullfill a new feature. label Jun 8, 2026
@Haruko386 Haruko386 removed the 💞 feature Feature request, pull request that fullfill a new feature. label Jun 8, 2026

@Haruko386 Haruko386 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hunnyboy1217 plz fix CI error first

@Hz-186

Hz-186 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@hunnyboy1217
and fix the conflicts plz

@hunnyboy1217 hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 64e7070 to fef32ae Compare June 8, 2026 10:12

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
internal/service/chunk.go (1)

1458-1474: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope the bulk update to the requested document.

documentID is never used after access resolution, and each update filters only by id. This endpoint is still not document-scoped, so supplied chunk IDs from other documents in the same dataset can be toggled here.

🩹 Proposed fix
 	for _, chunkID := range chunkIDs {
-		condition := map[string]interface{}{"id": chunkID}
+		condition := map[string]interface{}{"id": chunkID, "doc_id": documentID}
 		update := map[string]interface{}{"available_int": availInt}
 		if err := s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID); err != nil {
 			common.Warn("SwitchChunks: failed to update chunk", zap.String("chunkID", chunkID), zap.Error(err))
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1458 - 1474, The update loop
currently uses condition := {"id": chunkID} so it ignores the resolved document
scope (documentID) and can flip chunks from other documents; change the
condition used in s.docEngine.UpdateChunks to include the document identifier
(e.g., add "document_id": documentID or the appropriate field name) so the bulk
update in the loop (function using resolveDatasetAccess, variables chunkIDs,
available, indexName) is scoped to the requested document, leaving indexName and
datasetID usage unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/chunk.go`:
- Around line 1394-1438: When re-embedding (in the block using
getEmbeddingModelForKB, embedTexts, weightedVec) preserve existing persisted
questions if req.Questions is nil: read d["question_kwd"] (or the stored
question list) into the local questions variable when req.Questions == nil so
embedInput uses the persisted questions rather than empty content; if
req.Questions is provided keep current behavior and update d["question_kwd"] as
you already do. Ensure questions is populated before building embedInput and
before computing the weighted vector so vectors stay consistent with persisted
question_kwd.
- Around line 1285-1289: The code currently swallows errors from
s.documentDAO.UpdateByID after a successful chunk insert; capture and handle the
UpdateByID error instead of ignoring it — either wrap the chunk creation and the
UpdateByID call in a single DB transaction and rollback on any error, or check
the returned error from s.documentDAO.UpdateByID and return it (or propagate a
wrapped error) so the API reports failure; ensure you use the existing chunk
insert call and s.documentDAO.UpdateByID symbols and do not leave the counter
update silent on failure.
- Around line 1124-1136: The REST payload builds the result map using write-side
field names (e.g. "content_with_weight", "important_kwd", "question_kwd",
"docnm_kwd") which causes normalized search-hit data to be dropped; update the
map construction around the result variable so it reads the normalized
search-hit keys used elsewhere in this file's List/Get paths (e.g. "content",
"important_keywords", "questions", "docnm") while still using the existing
helpers (orSlice, orStr, intToBool) and datasetID/image_id handling; locate the
result map creation in chunk.go and replace the write-side keys with the
normalized names so responses include existing chunk values.
- Around line 1000-1022: getEmbeddingModelForKB currently errors when no
KB-specific embedding is set; instead reuse the existing resolveEmbeddingModel
fallback: when embdID is empty after the KB lookup in getEmbeddingModelForKB,
call resolveEmbeddingModel(tenantID) (or the service method that implements that
tenant-default resolution) to obtain the tenant default embdID/model, handle its
error, and then call modelProviderSvc.GetEmbeddingModel(tenantID, embdID).
Update error paths to preserve original error context from resolveEmbeddingModel
and remove the unconditional "no embedding model configured" error in favor of
the resolved result.

---

Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1458-1474: The update loop currently uses condition := {"id":
chunkID} so it ignores the resolved document scope (documentID) and can flip
chunks from other documents; change the condition used in
s.docEngine.UpdateChunks to include the document identifier (e.g., add
"document_id": documentID or the appropriate field name) so the bulk update in
the loop (function using resolveDatasetAccess, variables chunkIDs, available,
indexName) is scoped to the requested document, leaving indexName and datasetID
usage unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ece3b9fa-5f66-4d97-99e7-9c47e62197b9

📥 Commits

Reviewing files that changed from the base of the PR and between 64e7070 and 660805c.

📒 Files selected for processing (3)
  • internal/handler/chunk.go
  • internal/router/router.go
  • internal/service/chunk.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/handler/chunk.go

Comment thread internal/service/chunk.go
Comment on lines +1000 to +1022
func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {
tenantLLMDAO := dao.NewTenantLLMDAO()
modelProviderSvc := NewModelProviderService()

var embdID string
var err error
if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 {
_, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID)
} else if kb.EmbdID != "" {
parts := strings.Split(kb.EmbdID, "@")
if len(parts) == 2 && parts[1] != "" {
_, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding)
} else {
_, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding)
}
}
if err != nil {
return nil, fmt.Errorf("failed to resolve embedding model: %w", err)
}
if embdID == "" {
return nil, fmt.Errorf("no embedding model configured for dataset")
}
return modelProviderSvc.GetEmbeddingModel(tenantID, embdID)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reuse the existing embedding-model fallback here.

resolveEmbeddingModel already falls back to the tenant default when a KB has no explicit embedding model, but getEmbeddingModelForKB returns an error instead. That makes AddChunk and UpdateChunkREST fail for datasets that still rely on the tenant default embedding configuration.

🩹 Proposed fix
 func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {
-	tenantLLMDAO := dao.NewTenantLLMDAO()
-	modelProviderSvc := NewModelProviderService()
-
-	var embdID string
-	var err error
-	if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 {
-		_, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID)
-	} else if kb.EmbdID != "" {
-		parts := strings.Split(kb.EmbdID, "@")
-		if len(parts) == 2 && parts[1] != "" {
-			_, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding)
-		} else {
-			_, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding)
-		}
-	}
-	if err != nil {
-		return nil, fmt.Errorf("failed to resolve embedding model: %w", err)
-	}
-	if embdID == "" {
-		return nil, fmt.Errorf("no embedding model configured for dataset")
-	}
-	return modelProviderSvc.GetEmbeddingModel(tenantID, embdID)
+	return s.resolveEmbeddingModel(tenantID, kb)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {
tenantLLMDAO := dao.NewTenantLLMDAO()
modelProviderSvc := NewModelProviderService()
var embdID string
var err error
if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 {
_, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID)
} else if kb.EmbdID != "" {
parts := strings.Split(kb.EmbdID, "@")
if len(parts) == 2 && parts[1] != "" {
_, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding)
} else {
_, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding)
}
}
if err != nil {
return nil, fmt.Errorf("failed to resolve embedding model: %w", err)
}
if embdID == "" {
return nil, fmt.Errorf("no embedding model configured for dataset")
}
return modelProviderSvc.GetEmbeddingModel(tenantID, embdID)
func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {
return s.resolveEmbeddingModel(tenantID, kb)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1000 - 1022, getEmbeddingModelForKB
currently errors when no KB-specific embedding is set; instead reuse the
existing resolveEmbeddingModel fallback: when embdID is empty after the KB
lookup in getEmbeddingModelForKB, call resolveEmbeddingModel(tenantID) (or the
service method that implements that tenant-default resolution) to obtain the
tenant default embdID/model, handle its error, and then call
modelProviderSvc.GetEmbeddingModel(tenantID, embdID). Update error paths to
preserve original error context from resolveEmbeddingModel and remove the
unconditional "no embedding model configured" error in favor of the resolved
result.

Comment thread internal/service/chunk.go
Comment on lines +1124 to +1136
result := map[string]interface{}{
"id": chunk["id"],
"content": chunk["content_with_weight"],
"document_id": chunk["doc_id"],
"docnm_kwd": chunk["docnm_kwd"],
"important_keywords": orSlice(chunk["important_kwd"]),
"questions": orSlice(chunk["question_kwd"]),
"tag_kwd": orSlice(chunk["tag_kwd"]),
"dataset_id": datasetID,
"image_id": orStr(chunk["img_id"]),
"available": intToBool(chunk["available_int"]),
"positions": orSlice(chunk["position_int"]),
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize search hits before building the REST payload.

This code reads write-side field names (content_with_weight, important_kwd, question_kwd, docnm_kwd), but the existing List/Get paths in this file normalize raw search-hit fields like content, important_keywords, questions, and docnm. As written, REST list responses will drop those values for existing chunks.

🩹 Proposed fix
 		result := map[string]interface{}{
 			"id":                 chunk["id"],
-			"content":            chunk["content_with_weight"],
+			"content":            utility.FirstNonNil(chunk["content_with_weight"], chunk["content"]),
 			"document_id":        chunk["doc_id"],
-			"docnm_kwd":          chunk["docnm_kwd"],
-			"important_keywords": orSlice(chunk["important_kwd"]),
-			"questions":          orSlice(chunk["question_kwd"]),
+			"docnm_kwd":          utility.FirstNonNil(chunk["docnm_kwd"], chunk["docnm"]),
+			"important_keywords": orSlice(utility.FirstNonNil(chunk["important_kwd"], chunk["important_keywords"])),
+			"questions":          orSlice(utility.FirstNonNil(chunk["question_kwd"], chunk["questions"])),
 			"tag_kwd":            orSlice(chunk["tag_kwd"]),
 			"dataset_id":         datasetID,
 			"image_id":           orStr(chunk["img_id"]),
 			"available":          intToBool(chunk["available_int"]),
 			"positions":          orSlice(chunk["position_int"]),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1124 - 1136, The REST payload builds
the result map using write-side field names (e.g. "content_with_weight",
"important_kwd", "question_kwd", "docnm_kwd") which causes normalized search-hit
data to be dropped; update the map construction around the result variable so it
reads the normalized search-hit keys used elsewhere in this file's List/Get
paths (e.g. "content", "important_keywords", "questions", "docnm") while still
using the existing helpers (orSlice, orStr, intToBool) and datasetID/image_id
handling; locate the result map creation in chunk.go and replace the write-side
keys with the normalized names so responses include existing chunk values.

Comment thread internal/service/chunk.go
Comment on lines +1285 to +1289
// Increment document chunk_num and token_num.
_ = s.documentDAO.UpdateByID(documentID, map[string]interface{}{
"chunk_num": gorm.Expr("chunk_num + 1"),
"token_num": gorm.Expr("token_num + ?", tokenCount),
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't swallow the document-counter write failure.

If the chunk insert succeeds but UpdateByID fails, chunk_num and token_num drift from the actual chunk store while the API still returns success. This silently breaks document stats and makes retries/reconciliation ambiguous.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1285 - 1289, The code currently
swallows errors from s.documentDAO.UpdateByID after a successful chunk insert;
capture and handle the UpdateByID error instead of ignoring it — either wrap the
chunk creation and the UpdateByID call in a single DB transaction and rollback
on any error, or check the returned error from s.documentDAO.UpdateByID and
return it (or propagate a wrapped error) so the API reports failure; ensure you
use the existing chunk insert call and s.documentDAO.UpdateByID symbols and do
not leave the counter update silent on failure.

Comment thread internal/service/chunk.go
Comment on lines +1394 to +1438
questions := []string{}
if req.Questions != nil {
for _, q := range req.Questions {
if q = strings.TrimSpace(q); q != "" {
questions = append(questions, q)
}
}
d["question_kwd"] = questions
d["question_tks"] = strings.Join(questions, "\n")
}

if req.Available != nil {
avInt := 0
if *req.Available {
avInt = 1
}
d["available_int"] = avInt
}
if req.Positions != nil {
d["position_int"] = req.Positions
}
if req.TagKwd != nil {
d["tag_kwd"] = req.TagKwd
}
if req.TagFeas != nil {
d["tag_feas"] = req.TagFeas
}

// Re-embed when content or questions changed.
if req.Content != nil || req.Questions != nil {
em, err := s.getEmbeddingModelForKB(kb, tenantID)
if err != nil {
return fmt.Errorf("failed to get embedding model: %w", err)
}
embedInput := content
if len(questions) > 0 {
embedInput = strings.Join(questions, "\n")
}
vecs, _, err := embedTexts(em, []string{doc.Name, embedInput})
if err != nil {
return fmt.Errorf("embedding failed: %w", err)
}
if len(vecs) >= 2 {
vec := weightedVec(vecs[0], vecs[1])
d[fmt.Sprintf("q_%d_vec", len(vec))] = vec

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve existing questions when only content changes.

questions starts empty and is only filled from req.Questions. If a chunk already has questions and the caller patches only content, the re-embed path falls back to raw content while question_kwd remains unchanged in storage. The vector then no longer matches the chunk's persisted question state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1394 - 1438, When re-embedding (in
the block using getEmbeddingModelForKB, embedTexts, weightedVec) preserve
existing persisted questions if req.Questions is nil: read d["question_kwd"] (or
the stored question list) into the local questions variable when req.Questions
== nil so embedInput uses the persisted questions rather than empty content; if
req.Questions is provided keep current behavior and update d["question_kwd"] as
you already do. Ensure questions is populated before building embedInput and
before computing the weighted vector so vectors stay consistent with persisted
question_kwd.

@hunnyboy1217 hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 660805c to 47e915f Compare June 8, 2026 11:02

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/service/chunk.go (1)

1480-1488: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

SwitchChunks does not validate that chunks belong to the specified document.

The documentID parameter is accepted but never used. The update condition at line 1482 contains only "id": chunkID, allowing a user with dataset access to toggle availability of chunks from any document in that dataset—not just the document specified in the URL path.

Add "doc_id": documentID to the condition map so the update is scoped to the correct document:

🔒 Proposed fix
 	for _, chunkID := range chunkIDs {
-		condition := map[string]interface{}{"id": chunkID}
+		condition := map[string]interface{}{"id": chunkID, "doc_id": documentID}
 		update := map[string]interface{}{"available_int": availInt}
 		if err := s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID); err != nil {
 			common.Warn("SwitchChunks: failed to update chunk", zap.String("chunkID", chunkID), zap.Error(err))
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1480 - 1488, SwitchChunks accepts
documentID but never scopes the chunk update to that document, so UpdateChunks
is called with condition {"id": chunkID} and can modify chunks from other
documents; update the condition map in the loop inside SwitchChunks to include
the document ID (e.g., add "doc_id": documentID) so the call to
s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID) only
affects chunks belonging to the specified documentID (ensure the key name
matches the DB field used elsewhere).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1480-1488: SwitchChunks accepts documentID but never scopes the
chunk update to that document, so UpdateChunks is called with condition {"id":
chunkID} and can modify chunks from other documents; update the condition map in
the loop inside SwitchChunks to include the document ID (e.g., add "doc_id":
documentID) so the call to s.docEngine.UpdateChunks(ctx, condition, update,
indexName, datasetID) only affects chunks belonging to the specified documentID
(ensure the key name matches the DB field used elsewhere).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 74ea85a6-36ef-419f-a4ec-ab3d3c1af790

📥 Commits

Reviewing files that changed from the base of the PR and between 660805c and 47e915f.

📒 Files selected for processing (3)
  • internal/handler/chunk.go
  • internal/router/router.go
  • internal/service/chunk.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/router/router.go
  • internal/handler/chunk.go

@hunnyboy1217 hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch 2 times, most recently from c2d8273 to 95d8fa7 Compare June 8, 2026 14:13
@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.16%. Comparing base (93e4f6b) to head (c50554f).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #15670   +/-   ##
=======================================
  Coverage   93.16%   93.16%           
=======================================
  Files          10       10           
  Lines         717      717           
  Branches      118      118           
=======================================
  Hits          668      668           
  Misses         29       29           
  Partials       20       20           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@hunnyboy1217

Copy link
Copy Markdown
Contributor Author

Hi, @Haruko386, @Hz-186,
All conflicts have been resolved, and CI tests have been successfully passed.

@hunnyboy1217 hunnyboy1217 requested a review from Haruko386 June 8, 2026 22:05

@Hz-186 Hz-186 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hunnyboy1217

Thanks for the great effort in porting the chunk management APIs to Go!

I've conducted a thorough end-to-end API functional test between the existing Python API (:9380) and the new Go API (:9384).

I found 3 critical bugs that need to be fixed before merging:

1. Go API fails to resolve valid Embedding Models (AddChunk)
When calling POST /api/v1/datasets/.../chunks, the Python API successfully parses and embeds the chunk using the configured model (e.g. Qwen/Qwen3-Embedding-8B@test@SILICONFLOW).

However, the Go API fails with: failed to get embedding model... record not found. The model resolution logic in getEmbeddingModelForKB (or the underlying tenant LLM lookup) seems broken for complex model names/factories, blocking chunk creation entirely.

2. Silent Success on Non-existent Chunk IDs (SwitchChunks)
When sending PATCH /api/v1/datasets/.../chunks with a fake or invalid chunk_id (e.g., ["FAKE_ID"]):

  • Python API strictly returns code: 102 (Index updating failure).
  • Go API returns code: 0 (success).
    The Go implementation is missing the validation to check if updated_count > 0 after hitting the document engine, giving users a false positive success.

3. Severe Schema Mismatch in ListChunksREST
The doc object returned in GET /api/v1/datasets/.../chunks diverges heavily in Go:

  • Wrong Keys: Go returns kb_id, chunk_num, and token_num. Python relies on dataset_id, chunk_count, and token_count.
  • Missing Metadata: The Go implementation strips out almost all critical document metadata (e.g., parser_config, process_duration, status, size, type). Python includes all of these. If left unfixed, this will cause the frontend console to crash due to missing fields.

Could you please address these logic alignments? Thanks again for your hard work!

Ports four chunk endpoints from Python to Go (infiniflow#15668):
  GET   /api/v1/datasets/:dataset_id/documents/:document_id/chunks            (ListChunksREST)
  POST  /api/v1/datasets/:dataset_id/documents/:document_id/chunks            (AddChunk)
  PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks/:chunk_id  (UpdateChunkREST)
  PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks            (SwitchChunks)

Addresses review feedback (CodeRabbit) and the CI build break:
  - tokenizer.Tokenize returns (string, error); the bogus contentLtks.(string)
    assertion on a non-interface value (a compile error breaking CI) is removed —
    the string is passed straight to FineGrainedTokenize
  - ListChunksREST page/page_size now parse with strconv.Atoi instead of
    json.Number(...).Int64()
  - embedTexts no longer reports len(d.Embedding) (the fixed vector dimension) as
    token usage; token count is derived from the input text via the project
    tokenizer (estimateTokenCount), since the embedding driver exposes no
    provider token usage

Rebased onto latest main, resolving the chunk.go import conflict with upstream's
RetrievalTest refactor (now using service/nlp): merged the import union
(keeping nlp from upstream and xxhash/gorm from this PR).
@hunnyboy1217 hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 95d8fa7 to c50554f Compare June 9, 2026 04:47
@hunnyboy1217

Copy link
Copy Markdown
Contributor Author

HI, @Hz-186 ,
Just updated, please review again. Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Continue Integration size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Subtask]: Implement Chunk APIs in Go

3 participants