feat[Go]: implement document upload, dataset metadata bulk-update APIs by hunnyboy1217 · Pull Request #15672 · infiniflow/ragflow

hunnyboy1217 · 2026-06-04T14:18:12Z

What problem does this PR solve?

Closes #15671 — ports three Document/metadata endpoints from Python (api/apps/restful_apis/document_api.py) to the Go API layer.

Method	Path	Handler	Python source
POST	`/api/v1/documents/upload`	`UploadInfo`	`upload_info`
PATCH	`/api/v1/datasets/:dataset_id/documents/metadatas`	`UpdateDocumentMetadatas`	`update_metadata`
POST	`/api/v1/datasets/:dataset_id/metadata/update`	`MetadataBatchUpdate`	`metadata_batch_update`

Type of change

New Feature (non-breaking change which adds functionality)

Implementation notes

Files changed:

internal/service/file.go      – UploadFromURL helper
internal/service/document.go  – BatchUpdateDocumentMetadata + supporting types
internal/handler/document.go  – three new handlers + shared batchMetadataUpdate
internal/router/router.go     – route registration

Functional parity table:

Concern	Go behaviour
`UploadInfo` — file path	`c.MultipartForm()` → `fileService.UploadFile(tenantID, "", files)` (existing method); single file returns scalar, multiple returns array — same as Python
`UploadInfo` — URL path	New `FileService.UploadFromURL`: validates `http`/`https` scheme, `GET` remote content (100 MB cap), stores to tenant root folder, returns same file-info map as the file path
Both inputs provided	Returns `CodeArgumentError` mirroring Python `get_error_argument_result`
Neither provided	Returns `CodeArgumentError`
`PATCH /metadatas` and `POST /metadata/update`	Identical logic via shared `batchMetadataUpdate(c)` — dataset access check via `datasetService.Accessible`, request body validated, delegated to `DocumentService.BatchUpdateDocumentMetadata`
`document_ids` validation	`documentDAO.GetAllDocIDsByKBIDs([datasetID])` builds allowed set; unknown IDs returned as error — mirrors `KnowledgebaseService.list_documents_by_ids`
`metadata_condition` filter	`metadataSvc.GetFlattedMetaByKBs([datasetID])` + `ApplyMetaFilter(flattedMeta, conditions, logic)` — reuses the same Go function already used by `RetrievalTest`
Selector intersection	When both `document_ids` and `metadata_condition` given, result is their intersection (Python `target_doc_ids & filtered_ids`)
Early exit	When `metadata_condition.conditions` present but nothing matched: returns `{updated:0, matched_docs:0}`
Per-doc update	`SetDocumentMetadata` for `updates`, `DeleteDocumentMetadata` for `deletes`
Response	`{updated: N, matched_docs: M}` matching Python's `get_result(data={...})` shape

Ports three endpoints from Python (document_api.py) to Go, closing infiniflow#15671: POST /api/v1/documents/upload (UploadInfo) PATCH /api/v1/datasets/:dataset_id/documents/metadatas (UpdateDocumentMetadatas) POST /api/v1/datasets/:dataset_id/metadata/update (MetadataBatchUpdate) Changed files: internal/service/file.go – UploadFromURL: fetch remote URL, store in tenant root folder, return file metadata internal/service/document.go – BatchUpdateDocumentMetadata: validate doc IDs belong to dataset; apply metadata_condition via GetFlattedMetaByKBs + ApplyMetaFilter; iterate SetDocumentMetadata / DeleteDocumentMetadata per doc internal/handler/document.go – DocumentHandler gains fileService field; documentServiceIface extended with BatchUpdateDocumentMetadata; three new handlers + shared batchMetadataUpdate helper internal/router/router.go – three new routes registered Functional parity with Python: UploadInfo – accepts multipart files OR ?url=..., not both; file-only path uses existing UploadFile; URL path uses new UploadFromURL (HTTP GET → store → file record) UpdateDocumentMetadatas / MetadataBatchUpdate – identical logic via shared helper: dataset-access check; selector.document_ids validated against dataset; selector.metadata_condition filtered via ApplyMetaFilter (same Go function used by RetrievalTest); updates applied with SetDocumentMetadata; deletes applied with DeleteDocumentMetadata; returns {updated, matched_docs}

coderabbitai · 2026-06-04T14:18:50Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6251ef88-f807-4daa-b920-dd8f97bc35c0

📥 Commits

Reviewing files that changed from the base of the PR and between 449c8ee and 900eb59.

📒 Files selected for processing (3)

internal/handler/document.go
internal/service/document.go
internal/service/file.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/handler/document.go

📝 Walkthrough

Walkthrough

Adds remote document upload handling and bulk metadata update/delete. FileService can fetch and store http/https URLs with SSRF protections and a 100MB cap. DocumentService exposes BatchUpdateDocumentMetadata with selector/updates/deletes types. Handlers and routes wire UploadInfo and two dataset metadata endpoints with validation and standardized JSON responses.

Changes

Document Upload and Batch Metadata Management

Layer / File(s)	Summary
File upload service `internal/service/file.go`	Implements `FileService.UploadFromURL` and helpers: safe HTTP fetch with SSRF protections and 100MB cap, filename sanitization, public-IP checks, store bytes under tenant root, create `entity.File` record and return file response.
Batch metadata update service `internal/service/document.go`	Adds `MetadataSelector`, `MetadataUpdate`, `MetadataDelete`, `BatchUpdateMetadataRequest/Result` and `DocumentService.BatchUpdateDocumentMetadata` to resolve target docs, optionally filter via metadata conditions, apply per-document metadata sets/deletes, and return matched/updated counts.
HTTP handlers and endpoint registration `internal/handler/document.go`, `internal/router/router.go`	`DocumentHandler` adds `fileService` and extends `documentServiceIface`; implements `UploadInfo` (multipart `file` or `?url=` upload) and a shared `batchMetadataUpdate` used by `UpdateDocumentMetadatas` and `MetadataBatchUpdate`; router registers `POST /api/v1/documents/upload`, `POST /api/v1/datasets/:dataset_id/metadata/update`, and `PATCH /api/v1/datasets/:dataset_id/documents/metadatas`.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant UploadInfo as DocumentHandler.UploadInfo
  participant FileService
  participant DocumentService
  participant Storage as ObjectStorage

  Client->>UploadInfo: POST /api/v1/documents/upload (file or url)
  UploadInfo->>FileService: UploadFile(...) or UploadFromURL(userID, url)
  FileService->>Storage: Save file bytes
  FileService->>DocumentService: Create entity.File / persist metadata
  DocumentService-->>UploadInfo: file metadata response
  UploadInfo-->>Client: JSON response (single or array)

  Client->>UploadInfo: PATCH/POST dataset metadata endpoints
  UploadInfo->>DocumentService: BatchUpdateDocumentMetadata(datasetID, request)
  DocumentService->>DocumentService: Resolve/filter IDs, apply updates/deletes
  DocumentService-->>UploadInfo: Matched/Updated counts
  UploadInfo-->>Client: JSON result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 I hopped through bytes and sanitized names,
Fetched distant files and guarded their frames.
Metadata sorted in tidy arrays,
Bulk updates hum through handler relays.
Hooray—uploads and batches, lined up like games!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title directly describes the three main features being implemented: document upload and dataset metadata bulk-update APIs, reflecting the core changes in the changeset.
Description check	✅ Passed	The PR description includes the problem statement, type of change, and detailed implementation notes with functional parity table covering all three endpoints and key behaviors.
Linked Issues check	✅ Passed	The PR implements all three required endpoints (POST /api/v1/documents/upload, PATCH /api/v1/datasets/:dataset_id/documents/metadatas, POST /api/v1/datasets/:dataset_id/metadata/update) with documented functional parity to Python sources and proper validation/access checks.
Out of Scope Changes check	✅ Passed	All changes are scoped to implementing the three required endpoints: file upload/URL fetch support, batch metadata updates, handler logic, routing, and service methods directly address the linked issue objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

internal/handler/document.go (1)
914-918: 💤 Low value

Silently ignoring MultipartForm() error may hide issues.

The error from c.MultipartForm() is discarded. While this allows non-multipart requests (URL-only) to proceed, it also hides legitimate parsing errors (e.g., malformed multipart boundary, exceeding memory limits). Consider distinguishing between "no multipart form" and "invalid multipart form".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/handler/document.go` around lines 914 - 918, The code currently
discards the error from c.MultipartForm() which hides parsing failures; update
the block around c.MultipartForm() (where files := form.File["file"] is set) to
capture the returned error, then distinguish "not a multipart request" from real
parsing errors: if err == nil use form.File["file"]; if err indicates “not
multipart” (e.g., http.ErrNotMultipart or the framework’s equivalent) continue
treating it as a URL-only request; otherwise return/abort with a 400 and log the
parsing error (or propagate the error) so malformed multipart boundaries or size
issues aren’t silently ignored.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/document.go`:
- Around line 1308-1310: In parseMetadataConditions the code only sets mfc.Value
when cond["value"] is a string, so numeric/boolean JSON values are ignored;
update the logic in parseMetadataConditions to handle non-string primitives by
detecting types (e.g., float64 for JSON numbers, bool) or by converting any
non-nil cond["value"] to a string representation (e.g., fmt.Sprintf("%v",
cond["value"])) before assigning to mfc.Value; ensure you still preserve string
values and avoid panics on nil values when reading cond["value"].
- Around line 1265-1288: The loop increments the updated counter even when
DeleteDocumentMetadata fails, causing inconsistent counts versus
SetDocumentMetadata which uses continue on error; modify the
BatchUpdateDocumentMetadata logic that iterates over ids (and checks req.Updates
/ req.Deletes) so that for each docID you perform SetDocumentMetadata and
DeleteDocumentMetadata but only increment updated when all requested operations
for that doc complete successfully; e.g. call SetDocumentMetadata and if it
fails continue, call DeleteDocumentMetadata and if it fails continue (or track
an error flag and only updated++ when no errors), ensuring both
SetDocumentMetadata and DeleteDocumentMetadata outcomes determine whether
updated is incremented.

In `@internal/service/file.go`:
- Around line 1011-1022: The code uses http.Get(rawURL) and
io.LimitReader(resp.Body, 100<<20) which leaves no client timeout and silently
truncates bodies; replace http.Get with a request executed by an http.Client
with a bounded timeout (or use context.WithTimeout) when calling rawURL (same
call site where http.Get(rawURL) is used), and after reading via
io.LimitReader(resp.Body, 100<<20) verify truncation didn't occur by checking
resp.ContentLength against 100<<20 when available or attempting to read one
extra byte from resp.Body (or using io.ReadAll on a reader wrapped to detect
overflow) and return an explicit error if the body exceeded the 100<<20 limit;
keep defers to close resp.Body and propagate errors using the existing
fmt.Errorf pattern.
- Around line 1025-1028: The filename derivation uses filepath.Base(parsed.Path)
and already avoids query-string extraction, but it doesn't sanitize path-derived
names; update the logic around filename (the variable set from
filepath.Base(parsed.Path)) to clean/normalize the base segment by removing or
replacing filesystem-invalid characters (e.g., <>:"/\\|?*), collapse/control
whitespace, and enforce platform-safe constraints (trim trailing dots/spaces,
handle reserved names like CON/PRN/NUL on Windows); also add collision handling
(e.g., append numeric suffix) and fallback to "download" only after sanitization
fails so Save/Write code receives a safe, unique filename. Ensure these changes
are applied where filename is used downstream so no unsanitized name is
persisted.
- Around line 1005-1008: The URL validation in internal/service/file.go (used by
UploadFromURL) only checks scheme and is vulnerable to SSRF; update
UploadFromURL (or the helper that calls url.Parse) to: 1) parse the hostname and
resolve its A/AAAA records up front, 2) reject any resolved IPs in
private/loopback/link-local ranges and known cloud metadata ranges (e.g.,
169.254.169.254/32, 169.254.0.0/16, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16,
::1/128, fc00::/7, fe80::/10), 3) prevent DNS rebinding by pinning the resolved
IP(s) and using a custom http.Transport DialContext that connects only to those
IPs, 4) disable or strictly validate redirects (set CheckRedirect to disallow
redirecting to different host/IPs or re-run the same IP whitelist on each
redirect target), and 5) fail fast with a clear error from UploadFromURL when
any check fails. Ensure these checks are applied before performing the
http.Get/fetch and reference the functions/methods UploadFromURL and any HTTP
client/Transport/CheckRedirect setup in file.go.

---

Nitpick comments:
In `@internal/handler/document.go`:
- Around line 914-918: The code currently discards the error from
c.MultipartForm() which hides parsing failures; update the block around
c.MultipartForm() (where files := form.File["file"] is set) to capture the
returned error, then distinguish "not a multipart request" from real parsing
errors: if err == nil use form.File["file"]; if err indicates “not multipart”
(e.g., http.ErrNotMultipart or the framework’s equivalent) continue treating it
as a URL-only request; otherwise return/abort with a 400 and log the parsing
error (or propagate the error) so malformed multipart boundaries or size issues
aren’t silently ignored.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 16761187-0092-4a38-ab75-03367547b471

📥 Commits

Reviewing files that changed from the base of the PR and between 96a4166 and 449c8ee.

📒 Files selected for processing (4)

internal/handler/document.go
internal/router/router.go
internal/service/document.go
internal/service/file.go

Hz-186

Hi, @hunnyboy1217

This change still has two parity gaps before it can be merged.

First, the Go metadata batch-update path does not parse metadata_condition in the same shape as Python. Python sends conditions as {name, comparison_operator, value}, but parseMetadataConditions() currently only reads {key|field, op, value}. That means valid Python-style requests will not be filtered correctly.

Second, POST /api/v1/documents/upload?url=... is missing the Python SSRF guard. Python validates the remote URL with assert_url_is_safe(url) before fetching it, while Go currently only checks the scheme and then performs http.Get directly.

Hz-186 · 2026-06-05T05:14:06Z

+}
+
+// parseMetadataConditions converts metadata_condition.conditions to MetaFilterConditions.
+func parseMetadataConditions(mc map[string]interface{}) []MetaFilterCondition {


Hz-186 · 2026-06-05T05:14:31Z

+// @Param url query string false "Remote URL to fetch"
+// @Success 200 {object} map[string]interface{}
+// @Router /api/v1/documents/upload [post]
+func (h *DocumentHandler) UploadInfo(c *gin.Context) {


…nfiniflow#15672) Applies the final-review feedback from Hz-186 and CodeRabbit, and adapts to the GetFlattedMetaByKBs → common.MetaData change merged from main (infiniflow#15656). Hz-186 — Python parity gaps: - metadata condition shape: the metadata_condition filter now goes through the canonical common.ParseAndConvert + common.MetaFilter (the Go port of Python convert_conditions + meta_filter). Conditions are read as {name, comparison_operator, value} with the Python operator mapping (is→=, not is→≠, >=→≥, <=→≤, !=→≠). This also fixes a compile break: after infiniflow#15656, GetFlattedMetaByKBs returns common.MetaData, which the old ApplyMetaFilter(map[string]interface{}) call site no longer accepted. The bespoke parseMetadataConditions helper is removed. - SSRF protection: the ?url= upload now mirrors Python assert_url_is_safe — scheme allowlist, host resolved, every resolved IP must be globally routable (loopback/private/link-local/multicast/unspecified/CGNAT rejected), and the validated IP is pinned in the dialer (re-validated on each redirect hop) to defeat DNS rebinding. CodeRabbit: - BatchUpdateDocumentMetadata: the "updated" counter no longer increments when a delete fails — a document is counted only when every requested operation (updates AND deletes) succeeds. - non-string metadata condition values (numbers, booleans) are now honoured via common.ParseAndConvert's interface{} value instead of being dropped. - UploadFromURL HTTP client: added connect (10s) and overall (30s) timeouts via a custom http.Client; the body is read one byte past the 100 MB cap so an oversized file is rejected rather than silently truncated by io.LimitReader. - UploadFromURL filename: derived filename is sanitized — directory components stripped, unsafe/control chars replaced, Windows reserved names rejected, length bounded, "download" fallback. - UploadInfo: a malformed multipart body now returns HTTP 400 instead of being silently ignored; a non-multipart request (http.ErrNotMultipart) stays benign so the ?url= path is unaffected.

hunnyboy1217 · 2026-06-05T05:31:55Z

Hi, @Hz-186 ,
Updated.

dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 4, 2026

Merge branch 'main' into feat/go-document-metadata-upload-apis

449c8ee

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread internal/service/document.go

Comment thread internal/service/document.go Outdated

Comment thread internal/service/file.go

Comment thread internal/service/file.go Outdated

Comment thread internal/service/file.go Outdated

Hz-186 suggested changes Jun 5, 2026

View reviewed changes

dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 5, 2026

hunnyboy1217 requested a review from Hz-186 June 5, 2026 05:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat[Go]: implement document upload, dataset metadata bulk-update APIs#15672

feat[Go]: implement document upload, dataset metadata bulk-update APIs#15672
hunnyboy1217 wants to merge 3 commits into
infiniflow:mainfrom
hunnyboy1217:feat/go-document-metadata-upload-apis

hunnyboy1217 commented Jun 4, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Hz-186 left a comment

Uh oh!

Hz-186 Jun 5, 2026

Uh oh!

Hz-186 Jun 5, 2026

Uh oh!

hunnyboy1217 commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hunnyboy1217 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Type of change

Implementation notes

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Hz-186 left a comment

Choose a reason for hiding this comment

Uh oh!

Hz-186 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Hz-186 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

hunnyboy1217 commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hunnyboy1217 commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading