feat[Go]: implement document upload, dataset metadata bulk-update APIs#15672
feat[Go]: implement document upload, dataset metadata bulk-update APIs#15672hunnyboy1217 wants to merge 3 commits into
Conversation
Ports three endpoints from Python (document_api.py) to Go, closing infiniflow#15671: POST /api/v1/documents/upload (UploadInfo) PATCH /api/v1/datasets/:dataset_id/documents/metadatas (UpdateDocumentMetadatas) POST /api/v1/datasets/:dataset_id/metadata/update (MetadataBatchUpdate) Changed files: internal/service/file.go – UploadFromURL: fetch remote URL, store in tenant root folder, return file metadata internal/service/document.go – BatchUpdateDocumentMetadata: validate doc IDs belong to dataset; apply metadata_condition via GetFlattedMetaByKBs + ApplyMetaFilter; iterate SetDocumentMetadata / DeleteDocumentMetadata per doc internal/handler/document.go – DocumentHandler gains fileService field; documentServiceIface extended with BatchUpdateDocumentMetadata; three new handlers + shared batchMetadataUpdate helper internal/router/router.go – three new routes registered Functional parity with Python: UploadInfo – accepts multipart files OR ?url=..., not both; file-only path uses existing UploadFile; URL path uses new UploadFromURL (HTTP GET → store → file record) UpdateDocumentMetadatas / MetadataBatchUpdate – identical logic via shared helper: dataset-access check; selector.document_ids validated against dataset; selector.metadata_condition filtered via ApplyMetaFilter (same Go function used by RetrievalTest); updates applied with SetDocumentMetadata; deletes applied with DeleteDocumentMetadata; returns {updated, matched_docs}
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds remote document upload handling and bulk metadata update/delete. FileService can fetch and store http/https URLs with SSRF protections and a 100MB cap. DocumentService exposes BatchUpdateDocumentMetadata with selector/updates/deletes types. Handlers and routes wire UploadInfo and two dataset metadata endpoints with validation and standardized JSON responses. ChangesDocument Upload and Batch Metadata Management
Sequence Diagram(s)sequenceDiagram
participant Client
participant UploadInfo as DocumentHandler.UploadInfo
participant FileService
participant DocumentService
participant Storage as ObjectStorage
Client->>UploadInfo: POST /api/v1/documents/upload (file or url)
UploadInfo->>FileService: UploadFile(...) or UploadFromURL(userID, url)
FileService->>Storage: Save file bytes
FileService->>DocumentService: Create entity.File / persist metadata
DocumentService-->>UploadInfo: file metadata response
UploadInfo-->>Client: JSON response (single or array)
Client->>UploadInfo: PATCH/POST dataset metadata endpoints
UploadInfo->>DocumentService: BatchUpdateDocumentMetadata(datasetID, request)
DocumentService->>DocumentService: Resolve/filter IDs, apply updates/deletes
DocumentService-->>UploadInfo: Matched/Updated counts
UploadInfo-->>Client: JSON result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
internal/handler/document.go (1)
914-918: 💤 Low valueSilently ignoring
MultipartForm()error may hide issues.The error from
c.MultipartForm()is discarded. While this allows non-multipart requests (URL-only) to proceed, it also hides legitimate parsing errors (e.g., malformed multipart boundary, exceeding memory limits). Consider distinguishing between "no multipart form" and "invalid multipart form".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/handler/document.go` around lines 914 - 918, The code currently discards the error from c.MultipartForm() which hides parsing failures; update the block around c.MultipartForm() (where files := form.File["file"] is set) to capture the returned error, then distinguish "not a multipart request" from real parsing errors: if err == nil use form.File["file"]; if err indicates “not multipart” (e.g., http.ErrNotMultipart or the framework’s equivalent) continue treating it as a URL-only request; otherwise return/abort with a 400 and log the parsing error (or propagate the error) so malformed multipart boundaries or size issues aren’t silently ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/service/document.go`:
- Around line 1308-1310: In parseMetadataConditions the code only sets mfc.Value
when cond["value"] is a string, so numeric/boolean JSON values are ignored;
update the logic in parseMetadataConditions to handle non-string primitives by
detecting types (e.g., float64 for JSON numbers, bool) or by converting any
non-nil cond["value"] to a string representation (e.g., fmt.Sprintf("%v",
cond["value"])) before assigning to mfc.Value; ensure you still preserve string
values and avoid panics on nil values when reading cond["value"].
- Around line 1265-1288: The loop increments the updated counter even when
DeleteDocumentMetadata fails, causing inconsistent counts versus
SetDocumentMetadata which uses continue on error; modify the
BatchUpdateDocumentMetadata logic that iterates over ids (and checks req.Updates
/ req.Deletes) so that for each docID you perform SetDocumentMetadata and
DeleteDocumentMetadata but only increment updated when all requested operations
for that doc complete successfully; e.g. call SetDocumentMetadata and if it
fails continue, call DeleteDocumentMetadata and if it fails continue (or track
an error flag and only updated++ when no errors), ensuring both
SetDocumentMetadata and DeleteDocumentMetadata outcomes determine whether
updated is incremented.
In `@internal/service/file.go`:
- Around line 1011-1022: The code uses http.Get(rawURL) and
io.LimitReader(resp.Body, 100<<20) which leaves no client timeout and silently
truncates bodies; replace http.Get with a request executed by an http.Client
with a bounded timeout (or use context.WithTimeout) when calling rawURL (same
call site where http.Get(rawURL) is used), and after reading via
io.LimitReader(resp.Body, 100<<20) verify truncation didn't occur by checking
resp.ContentLength against 100<<20 when available or attempting to read one
extra byte from resp.Body (or using io.ReadAll on a reader wrapped to detect
overflow) and return an explicit error if the body exceeded the 100<<20 limit;
keep defers to close resp.Body and propagate errors using the existing
fmt.Errorf pattern.
- Around line 1025-1028: The filename derivation uses filepath.Base(parsed.Path)
and already avoids query-string extraction, but it doesn't sanitize path-derived
names; update the logic around filename (the variable set from
filepath.Base(parsed.Path)) to clean/normalize the base segment by removing or
replacing filesystem-invalid characters (e.g., <>:"/\\|?*), collapse/control
whitespace, and enforce platform-safe constraints (trim trailing dots/spaces,
handle reserved names like CON/PRN/NUL on Windows); also add collision handling
(e.g., append numeric suffix) and fallback to "download" only after sanitization
fails so Save/Write code receives a safe, unique filename. Ensure these changes
are applied where filename is used downstream so no unsanitized name is
persisted.
- Around line 1005-1008: The URL validation in internal/service/file.go (used by
UploadFromURL) only checks scheme and is vulnerable to SSRF; update
UploadFromURL (or the helper that calls url.Parse) to: 1) parse the hostname and
resolve its A/AAAA records up front, 2) reject any resolved IPs in
private/loopback/link-local ranges and known cloud metadata ranges (e.g.,
169.254.169.254/32, 169.254.0.0/16, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16,
::1/128, fc00::/7, fe80::/10), 3) prevent DNS rebinding by pinning the resolved
IP(s) and using a custom http.Transport DialContext that connects only to those
IPs, 4) disable or strictly validate redirects (set CheckRedirect to disallow
redirecting to different host/IPs or re-run the same IP whitelist on each
redirect target), and 5) fail fast with a clear error from UploadFromURL when
any check fails. Ensure these checks are applied before performing the
http.Get/fetch and reference the functions/methods UploadFromURL and any HTTP
client/Transport/CheckRedirect setup in file.go.
---
Nitpick comments:
In `@internal/handler/document.go`:
- Around line 914-918: The code currently discards the error from
c.MultipartForm() which hides parsing failures; update the block around
c.MultipartForm() (where files := form.File["file"] is set) to capture the
returned error, then distinguish "not a multipart request" from real parsing
errors: if err == nil use form.File["file"]; if err indicates “not multipart”
(e.g., http.ErrNotMultipart or the framework’s equivalent) continue treating it
as a URL-only request; otherwise return/abort with a 400 and log the parsing
error (or propagate the error) so malformed multipart boundaries or size issues
aren’t silently ignored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 16761187-0092-4a38-ab75-03367547b471
📒 Files selected for processing (4)
internal/handler/document.gointernal/router/router.gointernal/service/document.gointernal/service/file.go
Hz-186
left a comment
There was a problem hiding this comment.
Hi, @hunnyboy1217
This change still has two parity gaps before it can be merged.
First, the Go metadata batch-update path does not parse metadata_condition in the same shape as Python. Python sends conditions as {name, comparison_operator, value}, but parseMetadataConditions() currently only reads {key|field, op, value}. That means valid Python-style requests will not be filtered correctly.
Second, POST /api/v1/documents/upload?url=... is missing the Python SSRF guard. Python validates the remote URL with assert_url_is_safe(url) before fetching it, while Go currently only checks the scheme and then performs http.Get directly.
| } | ||
|
|
||
| // parseMetadataConditions converts metadata_condition.conditions to MetaFilterConditions. | ||
| func parseMetadataConditions(mc map[string]interface{}) []MetaFilterCondition { |
| // @Param url query string false "Remote URL to fetch" | ||
| // @Success 200 {object} map[string]interface{} | ||
| // @Router /api/v1/documents/upload [post] | ||
| func (h *DocumentHandler) UploadInfo(c *gin.Context) { |
…nfiniflow#15672) Applies the final-review feedback from Hz-186 and CodeRabbit, and adapts to the GetFlattedMetaByKBs → common.MetaData change merged from main (infiniflow#15656). Hz-186 — Python parity gaps: - metadata condition shape: the metadata_condition filter now goes through the canonical common.ParseAndConvert + common.MetaFilter (the Go port of Python convert_conditions + meta_filter). Conditions are read as {name, comparison_operator, value} with the Python operator mapping (is→=, not is→≠, >=→≥, <=→≤, !=→≠). This also fixes a compile break: after infiniflow#15656, GetFlattedMetaByKBs returns common.MetaData, which the old ApplyMetaFilter(map[string]interface{}) call site no longer accepted. The bespoke parseMetadataConditions helper is removed. - SSRF protection: the ?url= upload now mirrors Python assert_url_is_safe — scheme allowlist, host resolved, every resolved IP must be globally routable (loopback/private/link-local/multicast/unspecified/CGNAT rejected), and the validated IP is pinned in the dialer (re-validated on each redirect hop) to defeat DNS rebinding. CodeRabbit: - BatchUpdateDocumentMetadata: the "updated" counter no longer increments when a delete fails — a document is counted only when every requested operation (updates AND deletes) succeeds. - non-string metadata condition values (numbers, booleans) are now honoured via common.ParseAndConvert's interface{} value instead of being dropped. - UploadFromURL HTTP client: added connect (10s) and overall (30s) timeouts via a custom http.Client; the body is read one byte past the 100 MB cap so an oversized file is rejected rather than silently truncated by io.LimitReader. - UploadFromURL filename: derived filename is sanitized — directory components stripped, unsafe/control chars replaced, Windows reserved names rejected, length bounded, "download" fallback. - UploadInfo: a malformed multipart body now returns HTTP 400 instead of being silently ignored; a non-multipart request (http.ErrNotMultipart) stays benign so the ?url= path is unaffected.
|
Hi, @Hz-186 , |
What problem does this PR solve?
Closes #15671 — ports three Document/metadata endpoints from Python (
api/apps/restful_apis/document_api.py) to the Go API layer./api/v1/documents/uploadUploadInfoupload_info/api/v1/datasets/:dataset_id/documents/metadatasUpdateDocumentMetadatasupdate_metadata/api/v1/datasets/:dataset_id/metadata/updateMetadataBatchUpdatemetadata_batch_updateType of change
Implementation notes
Files changed:
Functional parity table:
UploadInfo— file pathc.MultipartForm()→fileService.UploadFile(tenantID, "", files)(existing method); single file returns scalar, multiple returns array — same as PythonUploadInfo— URL pathFileService.UploadFromURL: validateshttp/httpsscheme,GETremote content (100 MB cap), stores to tenant root folder, returns same file-info map as the file pathCodeArgumentErrormirroring Pythonget_error_argument_resultCodeArgumentErrorPATCH /metadatasandPOST /metadata/updatebatchMetadataUpdate(c)— dataset access check viadatasetService.Accessible, request body validated, delegated toDocumentService.BatchUpdateDocumentMetadatadocument_idsvalidationdocumentDAO.GetAllDocIDsByKBIDs([datasetID])builds allowed set; unknown IDs returned as error — mirrorsKnowledgebaseService.list_documents_by_idsmetadata_conditionfiltermetadataSvc.GetFlattedMetaByKBs([datasetID])+ApplyMetaFilter(flattedMeta, conditions, logic)— reuses the same Go function already used byRetrievalTestdocument_idsandmetadata_conditiongiven, result is their intersection (Pythontarget_doc_ids & filtered_ids)metadata_condition.conditionspresent but nothing matched: returns{updated:0, matched_docs:0}SetDocumentMetadataforupdates,DeleteDocumentMetadatafordeletes{updated: N, matched_docs: M}matching Python'sget_result(data={...})shape