GH-49644: [Python] Support converting list of multi-dimensional arrays to FixedShapeTensor#50203
Conversation
|
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds support for constructing fixed-shape tensors and fixed-size lists from lists of multi-dimensional NumPy ndarrays by flattening values in C order (GH-49644).
Changes:
- Add tests covering tensor arrays built from lists of ndarrays (including nulls and shape mismatch).
- Add tests ensuring fixed-size lists accept multi-dimensional ndarray elements (and reject invalid cases).
- Update ndarray-to-list conversion to allow flattening for
FIXED_SIZE_LISTwhile keeping variable-sized lists restricted to 1D.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/pyarrow/tests/test_extension_type.py | Adds coverage for building FixedShapeTensorArray from a list of ndarrays. |
| python/pyarrow/tests/test_array.py | Adds a regression test for fixed-size list conversion from multi-dimensional ndarrays. |
| python/pyarrow/src/arrow/python/python_to_arrow.cc | Implements multi-dimensional ndarray flattening for fixed-size lists during conversion. |
… arrays to FixedShapeTensor
…cover C-order flatten
90c2ac4 to
e695d01
Compare
|
Thank you @aboderinsamuel! Also, am curious how you decided to use |
|
Hi @AlenkaF The macOS-intel failure is an intermittent dependency-install flake. Homebrew lock on aws-sdk-cpp, before the build/tests even run. It's hitting other PRs too (e.g. #50146), and #50195 is actively moving the macOS AWS-SDK dependency, so it looks like known infra rather than my change. Happy to keep re-running until it catches a clean runner. Everything else is green. @rok 🙂 |
| if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) { | ||
| return Status::Invalid("Can only convert 1-dimensional array values"); | ||
| } | ||
| flattened.reset(PyArray_Ravel(ndarray, NPY_CORDER)); |
There was a problem hiding this comment.
Why did you decided to use PyArray_Ravel? Are there any other alternatives?
There was a problem hiding this comment.
PyArray_Ravel(ndarray, NPY_CORDER) returns a 1-D, C-order view, which lets me reuse the existing 1-D append paths unchanged (the typed fast paths and the PySequence fallback). It's zero-copy when the input is already C-contiguous and only copies for non-contiguous/Fortran-order input, where a copy is unavoidable to get C-order data anyway, which also matches the fixed-shape-tensor's row-major storage.
Alternatives: PyArray_Flatten(NPY_CORDER) gives the same result but always copies; a manual NpyIter/buffer loop could avoid even the view allocation but would mean duplicating the typed-append fast paths. Ravel was the smallest change with the best reuse/performance balance.
There was a problem hiding this comment.
PyArray_Ravel over PyArray_Flatten seems ok. If I read the context correctly we're using a builder to append these arrays. When builder finishes it creates a new buffer with a copy of concatenated arrays. So PyArray_Ravel should be ok if the list of arrays is not deleted before appending is complete, which I expect won't happen. @AlenkaF am I reading this right?
There was a problem hiding this comment.
Could you have a look at PyArray_CheckFromAny with flag NPY_ARRAY_C_CONTIGUOUS? If I understand correctly it reshapes if necessary or gives access to the data/view directly. Then we can use PyArray_DATA as in other places in the codebase.
I am not sure which would fit better but am interested in what you think.
There was a problem hiding this comment.
Also to note here - PyArray_Ravel seems to default to C order. So we should document somewhere that no matter the input arrays order we will return C order. (@AlenkaF please doublecheck me here :) )
There was a problem hiding this comment.
Great, I think having more control can be helpful. Thanks for looking into it!
Also, some other flags might be needed, see: https://numpy.org/doc/stable/reference/c-api/array.html#basic-array-flags
There was a problem hiding this comment.
Thank you 🙏, i'm currently taking a look at the reference
There was a problem hiding this comment.
Thanks for the pointer to the flags doc, @AlenkaF, really useful read 🙂
Done: switched to PyArray_CheckFromAny with NPY_ARRAY_C_CONTIGUOUS | NPY_ARRAY_ALIGNED, then view it as 1-D and read in C order. The ALIGNED flag was the piece I'd have missed on my own, so thanks for steering me there! For byte order I kept the explicit ISBYTESWAPPED check (moved up top per @rok) instead of NPY_ARRAY_NOTSWAPPED, and left out WRITEABLE since we only read here.
Really appreciate you and @rok taking the time on this, learning a ton from it.
There was a problem hiding this comment.
Also, the comment now notes the array is always flattened in C order regardless of the input's layout. @rok
|
|
|
Hi all. I'd add two points for further review of current PR changes (thanks @aboderinsamuel !) import numpy as np
import pytest
import pyarrow as pa
np_dtype = np.dtype("int8")
tensor_type = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype), (2, 3))
pa.array([np.arange(6, dtype=np_dtype).reshape(3, 2)], type=tensor_type)
# As (3, 2) has the right number of elements (6), but the wrong shape for a (2, 3) tensor,
# shouldn't we get an error?Currently returns: np_dtype2 = np.dtype("float32")
elements = [
np.arange(6, dtype=np_dtype2).reshape(2, 3),
np.arange(6, 12, dtype=np_dtype2).reshape(2, 3),
]
permuted_type = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype2), (2, 3),
permutation=[1, 0])
pa.array(elements, type=permuted_type)
# For permuted tensor types built from multi-dim ndarrays, doesn't this store the wrong layout?Currently returns: |
…in error, simplify checks
rok
left a comment
There was a problem hiding this comment.
Some comments about comments. I think the one important question left is the PyArray_Ravel question.
|
@rok I'd keep PyArray_Ravel here. It's a zero-copy reshape of the already-C-contiguous array, and it gives the 1-D array that the dtype-mismatch fallback (value_converter_->Extend, just below) needs to iterate element-wise. Reading PyArray_DATA directly works for the typed fast path, but the fallback would still need a 1-D handle either way, so the ravel keeps a single shared path with no extra copy. Happy to switch if you'd prefer, but I'd lean to keeping it as-is. |
|
@aboderinsamuel - did you mean to push changes? |
|
yes, @rok. in a few minutes |
There was a problem hiding this comment.
This looks good and I think we're ready to merge. Will just wait for @AlenkaF to chime in on the PyArray_Ravel comment.
| 400 | ||
| ] | ||
| ] | ||
| """ |
There was a problem hiding this comment.
Maybe we could add an example of using the new constructor here? e.g.:
| Create an extension array from a list of multi-dimensional NumPy arrays. | |
| Each element is flattened in row-major (C) order, its shape must match | |
| the tensor shape. | |
| >>> import numpy as np | |
| >>> pa.array([np.array([[1, 2], [3, 4]], dtype=np.int32), | |
| ... np.array([[10, 20], [30, 40]], dtype=np.int32)], | |
| ... type=tensor_type) | |
| <pyarrow.lib.FixedShapeTensorArray object at ...> | |
| [ | |
| [ | |
| 1, | |
| 2, | |
| 3, | |
| 4 | |
| ], | |
| [ | |
| 10, | |
| 20, | |
| 30, | |
| 40 | |
| ] | |
| ] | |
| """ |
Rationale for this change
Constructing a fixed-shape-tensor array from a list of individual ndarrays only
worked when each element was 1-D; ≥2-D elements failed with
ArrowInvalid: Can only convert 1-dimensional array values. The only workaroundwas stacking the list into a single ndarray and using
FixedShapeTensorArray.from_numpy_ndarray.What changes are included in this PR?
The C++ list converter
PyListConverter::AppendNdarraynow acceptsmulti-dimensional ndarray elements for fixed-size lists (the storage of a
fixed-shape tensor) by flattening them in C order. The fixed-size-list builder
still validates that the flattened length matches the list width, so wrong sizes
error cleanly. Variable-sized lists remain restricted to 1-D values to avoid
ambiguity. As a side benefit, plain
fixed_size_listalso acceptsmulti-dimensional ndarray elements now.
Are these changes tested?
Yes:
test_tensor_array_from_list_of_ndarrays— construction from 2-D and 3-Dndarrays, null handling, storage parity with
from_numpy_ndarray, and thesize-mismatch error, across
int8/int64/float32.test_fixed_size_list_from_multidim_ndarray— plainfixed_size_listfrommulti-dim arrays, plus a check that variable-sized lists still reject 2-D.
Are there any user-facing changes?
Yes —
pa.array([multi-dim ndarrays], type=fixed_shape_tensor(...))(and thesame for
fixed_size_list) now works instead of raising. Existing 1-D behaviorand variable-sized-list behavior are unchanged.
Scoped to construction only; the reverse
to_numpyshape-preservation alsoraised in the issue is intentionally left as a separate follow-up.