Skip to content

feat[Go]: implement document upload, dataset metadata bulk-update APIs#15672

Open
hunnyboy1217 wants to merge 3 commits into
infiniflow:mainfrom
hunnyboy1217:feat/go-document-metadata-upload-apis
Open

feat[Go]: implement document upload, dataset metadata bulk-update APIs#15672
hunnyboy1217 wants to merge 3 commits into
infiniflow:mainfrom
hunnyboy1217:feat/go-document-metadata-upload-apis

Conversation

@hunnyboy1217
Copy link
Copy Markdown
Contributor

@hunnyboy1217 hunnyboy1217 commented Jun 4, 2026

What problem does this PR solve?

Closes #15671 — ports three Document/metadata endpoints from Python (api/apps/restful_apis/document_api.py) to the Go API layer.

Method Path Handler Python source
POST /api/v1/documents/upload UploadInfo upload_info
PATCH /api/v1/datasets/:dataset_id/documents/metadatas UpdateDocumentMetadatas update_metadata
POST /api/v1/datasets/:dataset_id/metadata/update MetadataBatchUpdate metadata_batch_update

Type of change

  • New Feature (non-breaking change which adds functionality)

Implementation notes

Files changed:

internal/service/file.go      – UploadFromURL helper
internal/service/document.go  – BatchUpdateDocumentMetadata + supporting types
internal/handler/document.go  – three new handlers + shared batchMetadataUpdate
internal/router/router.go     – route registration

Functional parity table:

Concern Go behaviour
UploadInfo — file path c.MultipartForm()fileService.UploadFile(tenantID, "", files) (existing method); single file returns scalar, multiple returns array — same as Python
UploadInfo — URL path New FileService.UploadFromURL: validates http/https scheme, GET remote content (100 MB cap), stores to tenant root folder, returns same file-info map as the file path
Both inputs provided Returns CodeArgumentError mirroring Python get_error_argument_result
Neither provided Returns CodeArgumentError
PATCH /metadatas and POST /metadata/update Identical logic via shared batchMetadataUpdate(c) — dataset access check via datasetService.Accessible, request body validated, delegated to DocumentService.BatchUpdateDocumentMetadata
document_ids validation documentDAO.GetAllDocIDsByKBIDs([datasetID]) builds allowed set; unknown IDs returned as error — mirrors KnowledgebaseService.list_documents_by_ids
metadata_condition filter metadataSvc.GetFlattedMetaByKBs([datasetID]) + ApplyMetaFilter(flattedMeta, conditions, logic) — reuses the same Go function already used by RetrievalTest
Selector intersection When both document_ids and metadata_condition given, result is their intersection (Python target_doc_ids & filtered_ids)
Early exit When metadata_condition.conditions present but nothing matched: returns {updated:0, matched_docs:0}
Per-doc update SetDocumentMetadata for updates, DeleteDocumentMetadata for deletes
Response {updated: N, matched_docs: M} matching Python's get_result(data={...}) shape

Ports three endpoints from Python (document_api.py) to Go, closing infiniflow#15671:

  POST  /api/v1/documents/upload                             (UploadInfo)
  PATCH /api/v1/datasets/:dataset_id/documents/metadatas     (UpdateDocumentMetadatas)
  POST  /api/v1/datasets/:dataset_id/metadata/update         (MetadataBatchUpdate)

Changed files:
  internal/service/file.go     – UploadFromURL: fetch remote URL, store in tenant
                                  root folder, return file metadata
  internal/service/document.go – BatchUpdateDocumentMetadata: validate doc IDs
                                  belong to dataset; apply metadata_condition via
                                  GetFlattedMetaByKBs + ApplyMetaFilter; iterate
                                  SetDocumentMetadata / DeleteDocumentMetadata per doc
  internal/handler/document.go – DocumentHandler gains fileService field;
                                  documentServiceIface extended with
                                  BatchUpdateDocumentMetadata; three new handlers +
                                  shared batchMetadataUpdate helper
  internal/router/router.go    – three new routes registered

Functional parity with Python:
  UploadInfo          – accepts multipart files OR ?url=..., not both; file-only
                        path uses existing UploadFile; URL path uses new
                        UploadFromURL (HTTP GET → store → file record)
  UpdateDocumentMetadatas / MetadataBatchUpdate – identical logic via shared helper:
                        dataset-access check; selector.document_ids validated against
                        dataset; selector.metadata_condition filtered via ApplyMetaFilter
                        (same Go function used by RetrievalTest); updates applied with
                        SetDocumentMetadata; deletes applied with DeleteDocumentMetadata;
                        returns {updated, matched_docs}
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 4, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6251ef88-f807-4daa-b920-dd8f97bc35c0

📥 Commits

Reviewing files that changed from the base of the PR and between 449c8ee and 900eb59.

📒 Files selected for processing (3)
  • internal/handler/document.go
  • internal/service/document.go
  • internal/service/file.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/handler/document.go

📝 Walkthrough

Walkthrough

Adds remote document upload handling and bulk metadata update/delete. FileService can fetch and store http/https URLs with SSRF protections and a 100MB cap. DocumentService exposes BatchUpdateDocumentMetadata with selector/updates/deletes types. Handlers and routes wire UploadInfo and two dataset metadata endpoints with validation and standardized JSON responses.

Changes

Document Upload and Batch Metadata Management

Layer / File(s) Summary
File upload service
internal/service/file.go
Implements FileService.UploadFromURL and helpers: safe HTTP fetch with SSRF protections and 100MB cap, filename sanitization, public-IP checks, store bytes under tenant root, create entity.File record and return file response.
Batch metadata update service
internal/service/document.go
Adds MetadataSelector, MetadataUpdate, MetadataDelete, BatchUpdateMetadataRequest/Result and DocumentService.BatchUpdateDocumentMetadata to resolve target docs, optionally filter via metadata conditions, apply per-document metadata sets/deletes, and return matched/updated counts.
HTTP handlers and endpoint registration
internal/handler/document.go, internal/router/router.go
DocumentHandler adds fileService and extends documentServiceIface; implements UploadInfo (multipart file or ?url= upload) and a shared batchMetadataUpdate used by UpdateDocumentMetadatas and MetadataBatchUpdate; router registers POST /api/v1/documents/upload, POST /api/v1/datasets/:dataset_id/metadata/update, and PATCH /api/v1/datasets/:dataset_id/documents/metadatas.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant UploadInfo as DocumentHandler.UploadInfo
  participant FileService
  participant DocumentService
  participant Storage as ObjectStorage

  Client->>UploadInfo: POST /api/v1/documents/upload (file or url)
  UploadInfo->>FileService: UploadFile(...) or UploadFromURL(userID, url)
  FileService->>Storage: Save file bytes
  FileService->>DocumentService: Create entity.File / persist metadata
  DocumentService-->>UploadInfo: file metadata response
  UploadInfo-->>Client: JSON response (single or array)

  Client->>UploadInfo: PATCH/POST dataset metadata endpoints
  UploadInfo->>DocumentService: BatchUpdateDocumentMetadata(datasetID, request)
  DocumentService->>DocumentService: Resolve/filter IDs, apply updates/deletes
  DocumentService-->>UploadInfo: Matched/Updated counts
  UploadInfo-->>Client: JSON result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 I hopped through bytes and sanitized names,
Fetched distant files and guarded their frames.
Metadata sorted in tidy arrays,
Bulk updates hum through handler relays.
Hooray—uploads and batches, lined up like games!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title directly describes the three main features being implemented: document upload and dataset metadata bulk-update APIs, reflecting the core changes in the changeset.
Description check ✅ Passed The PR description includes the problem statement, type of change, and detailed implementation notes with functional parity table covering all three endpoints and key behaviors.
Linked Issues check ✅ Passed The PR implements all three required endpoints (POST /api/v1/documents/upload, PATCH /api/v1/datasets/:dataset_id/documents/metadatas, POST /api/v1/datasets/:dataset_id/metadata/update) with documented functional parity to Python sources and proper validation/access checks.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing the three required endpoints: file upload/URL fetch support, batch metadata updates, handler logic, routing, and service methods directly address the linked issue objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
internal/handler/document.go (1)

914-918: 💤 Low value

Silently ignoring MultipartForm() error may hide issues.

The error from c.MultipartForm() is discarded. While this allows non-multipart requests (URL-only) to proceed, it also hides legitimate parsing errors (e.g., malformed multipart boundary, exceeding memory limits). Consider distinguishing between "no multipart form" and "invalid multipart form".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/handler/document.go` around lines 914 - 918, The code currently
discards the error from c.MultipartForm() which hides parsing failures; update
the block around c.MultipartForm() (where files := form.File["file"] is set) to
capture the returned error, then distinguish "not a multipart request" from real
parsing errors: if err == nil use form.File["file"]; if err indicates “not
multipart” (e.g., http.ErrNotMultipart or the framework’s equivalent) continue
treating it as a URL-only request; otherwise return/abort with a 400 and log the
parsing error (or propagate the error) so malformed multipart boundaries or size
issues aren’t silently ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/document.go`:
- Around line 1308-1310: In parseMetadataConditions the code only sets mfc.Value
when cond["value"] is a string, so numeric/boolean JSON values are ignored;
update the logic in parseMetadataConditions to handle non-string primitives by
detecting types (e.g., float64 for JSON numbers, bool) or by converting any
non-nil cond["value"] to a string representation (e.g., fmt.Sprintf("%v",
cond["value"])) before assigning to mfc.Value; ensure you still preserve string
values and avoid panics on nil values when reading cond["value"].
- Around line 1265-1288: The loop increments the updated counter even when
DeleteDocumentMetadata fails, causing inconsistent counts versus
SetDocumentMetadata which uses continue on error; modify the
BatchUpdateDocumentMetadata logic that iterates over ids (and checks req.Updates
/ req.Deletes) so that for each docID you perform SetDocumentMetadata and
DeleteDocumentMetadata but only increment updated when all requested operations
for that doc complete successfully; e.g. call SetDocumentMetadata and if it
fails continue, call DeleteDocumentMetadata and if it fails continue (or track
an error flag and only updated++ when no errors), ensuring both
SetDocumentMetadata and DeleteDocumentMetadata outcomes determine whether
updated is incremented.

In `@internal/service/file.go`:
- Around line 1011-1022: The code uses http.Get(rawURL) and
io.LimitReader(resp.Body, 100<<20) which leaves no client timeout and silently
truncates bodies; replace http.Get with a request executed by an http.Client
with a bounded timeout (or use context.WithTimeout) when calling rawURL (same
call site where http.Get(rawURL) is used), and after reading via
io.LimitReader(resp.Body, 100<<20) verify truncation didn't occur by checking
resp.ContentLength against 100<<20 when available or attempting to read one
extra byte from resp.Body (or using io.ReadAll on a reader wrapped to detect
overflow) and return an explicit error if the body exceeded the 100<<20 limit;
keep defers to close resp.Body and propagate errors using the existing
fmt.Errorf pattern.
- Around line 1025-1028: The filename derivation uses filepath.Base(parsed.Path)
and already avoids query-string extraction, but it doesn't sanitize path-derived
names; update the logic around filename (the variable set from
filepath.Base(parsed.Path)) to clean/normalize the base segment by removing or
replacing filesystem-invalid characters (e.g., <>:"/\\|?*), collapse/control
whitespace, and enforce platform-safe constraints (trim trailing dots/spaces,
handle reserved names like CON/PRN/NUL on Windows); also add collision handling
(e.g., append numeric suffix) and fallback to "download" only after sanitization
fails so Save/Write code receives a safe, unique filename. Ensure these changes
are applied where filename is used downstream so no unsanitized name is
persisted.
- Around line 1005-1008: The URL validation in internal/service/file.go (used by
UploadFromURL) only checks scheme and is vulnerable to SSRF; update
UploadFromURL (or the helper that calls url.Parse) to: 1) parse the hostname and
resolve its A/AAAA records up front, 2) reject any resolved IPs in
private/loopback/link-local ranges and known cloud metadata ranges (e.g.,
169.254.169.254/32, 169.254.0.0/16, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16,
::1/128, fc00::/7, fe80::/10), 3) prevent DNS rebinding by pinning the resolved
IP(s) and using a custom http.Transport DialContext that connects only to those
IPs, 4) disable or strictly validate redirects (set CheckRedirect to disallow
redirecting to different host/IPs or re-run the same IP whitelist on each
redirect target), and 5) fail fast with a clear error from UploadFromURL when
any check fails. Ensure these checks are applied before performing the
http.Get/fetch and reference the functions/methods UploadFromURL and any HTTP
client/Transport/CheckRedirect setup in file.go.

---

Nitpick comments:
In `@internal/handler/document.go`:
- Around line 914-918: The code currently discards the error from
c.MultipartForm() which hides parsing failures; update the block around
c.MultipartForm() (where files := form.File["file"] is set) to capture the
returned error, then distinguish "not a multipart request" from real parsing
errors: if err == nil use form.File["file"]; if err indicates “not multipart”
(e.g., http.ErrNotMultipart or the framework’s equivalent) continue treating it
as a URL-only request; otherwise return/abort with a 400 and log the parsing
error (or propagate the error) so malformed multipart boundaries or size issues
aren’t silently ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 16761187-0092-4a38-ab75-03367547b471

📥 Commits

Reviewing files that changed from the base of the PR and between 96a4166 and 449c8ee.

📒 Files selected for processing (4)
  • internal/handler/document.go
  • internal/router/router.go
  • internal/service/document.go
  • internal/service/file.go

Comment thread internal/service/document.go
Comment thread internal/service/document.go Outdated
Comment thread internal/service/file.go
Comment thread internal/service/file.go Outdated
Comment thread internal/service/file.go Outdated
Copy link
Copy Markdown
Contributor

@Hz-186 Hz-186 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @hunnyboy1217

This change still has two parity gaps before it can be merged.

First, the Go metadata batch-update path does not parse metadata_condition in the same shape as Python. Python sends conditions as {name, comparison_operator, value}, but parseMetadataConditions() currently only reads {key|field, op, value}. That means valid Python-style requests will not be filtered correctly.

Second, POST /api/v1/documents/upload?url=... is missing the Python SSRF guard. Python validates the remote URL with assert_url_is_safe(url) before fetching it, while Go currently only checks the scheme and then performs http.Get directly.

Comment thread internal/service/document.go Outdated
}

// parseMetadataConditions converts metadata_condition.conditions to MetaFilterConditions.
func parseMetadataConditions(mc map[string]interface{}) []MetaFilterCondition {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

// @Param url query string false "Remote URL to fetch"
// @Success 200 {object} map[string]interface{}
// @Router /api/v1/documents/upload [post]
func (h *DocumentHandler) UploadInfo(c *gin.Context) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

…nfiniflow#15672)

Applies the final-review feedback from Hz-186 and CodeRabbit, and adapts to the
GetFlattedMetaByKBs → common.MetaData change merged from main (infiniflow#15656).

Hz-186 — Python parity gaps:
  - metadata condition shape: the metadata_condition filter now goes through the
    canonical common.ParseAndConvert + common.MetaFilter (the Go port of Python
    convert_conditions + meta_filter). Conditions are read as
    {name, comparison_operator, value} with the Python operator mapping
    (is→=, not is→≠, >=→≥, <=→≤, !=→≠). This also fixes a compile break: after
    infiniflow#15656, GetFlattedMetaByKBs returns common.MetaData, which the old
    ApplyMetaFilter(map[string]interface{}) call site no longer accepted. The
    bespoke parseMetadataConditions helper is removed.
  - SSRF protection: the ?url= upload now mirrors Python assert_url_is_safe —
    scheme allowlist, host resolved, every resolved IP must be globally routable
    (loopback/private/link-local/multicast/unspecified/CGNAT rejected), and the
    validated IP is pinned in the dialer (re-validated on each redirect hop) to
    defeat DNS rebinding.

CodeRabbit:
  - BatchUpdateDocumentMetadata: the "updated" counter no longer increments when
    a delete fails — a document is counted only when every requested operation
    (updates AND deletes) succeeds.
  - non-string metadata condition values (numbers, booleans) are now honoured via
    common.ParseAndConvert's interface{} value instead of being dropped.
  - UploadFromURL HTTP client: added connect (10s) and overall (30s) timeouts via
    a custom http.Client; the body is read one byte past the 100 MB cap so an
    oversized file is rejected rather than silently truncated by io.LimitReader.
  - UploadFromURL filename: derived filename is sanitized — directory components
    stripped, unsafe/control chars replaced, Windows reserved names rejected,
    length bounded, "download" fallback.
  - UploadInfo: a malformed multipart body now returns HTTP 400 instead of being
    silently ignored; a non-multipart request (http.ErrNotMultipart) stays benign
    so the ?url= path is unaffected.
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 5, 2026
@hunnyboy1217
Copy link
Copy Markdown
Contributor Author

Hi, @Hz-186 ,
Updated.

@hunnyboy1217 hunnyboy1217 requested a review from Hz-186 June 5, 2026 05:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Subtask]: Implement remaining Document/metadata APIs in Go

2 participants