Exploring RNTuple network streaming via Apache Arrow Flight

Karan_Singh · May 16, 2026, 3:22pm

Hi everyone,

I am working with the low-level internals of the new RNTuple C++ API and am looking for some advice on memory management.

ROOT Version: 6.36.00
Platform: Ubuntu 24.04 (Docker)
Compiler: GCC / C++17

I have been building a C++ prototype to stream RNTuple columnar data natively as Apache Arrow tables over an Arrow Flight (gRPC) server. The idea is to stream data to Python/remote clients without requiring local file downloads.

The C++ streaming works decently for an initial version (I measured < 1.85x overhead compared to a raw RNTuple loop), but I want to optimize the memory handoff.

My Question:
Currently, my overhead comes from doing a manual memcpy of the values from RNTuple into Arrow’s pre-allocated memory buffers.

Does the RNTupleReader API expose a safe way to get a raw pointer to the underlying uncompressed memory page for a specific column?

I would prefer to “alias” or “borrow” that memory directly into Apache Arrow to achieve true zero-copy, but I am not sure if that memory is strictly hidden behind the REntry / model layer.

Any tips, or a pointer to the right class in the source code, would be hugely appreciated. Thank you!

Reference:
GitHub - KaranSinghDev/RNTuple-Arrow-Gateway

Danilo · May 16, 2026, 5:45pm

Hello Karan,

Welcome to the ROOT Forum!

I am adding @jblomer to the loop.

Best,
Danilo

Karan_Singh · May 16, 2026, 6:10pm

Thank you, let me know if any additional information is required. I am still working on this, so I will update the thread if I make any decent progress on it.

Have a good weekend!
Best,
Karan

jblomer · May 16, 2026, 6:59pm

That is very interesting, thank you for reaching out to us!

For individual elements of simple types, the RNTupleDirectAccessView gives a reference into the page buffer.

In general, however, there is no public API to access the page buffers directly. Even with such an API, memory copies would be necessary where the target buffer does not align with the RNTuple page boundaries. The page boundaries are also not aligned with entry boundaries, so when reading more than one column, the number of elements that could possibly be accessed without an additional copy would probably be small.

Bulk reading may help (RNTupleReader::CreateBulk()). That involves a memory copy from the page buffer but it copies in bulk and not value by value. You can bind an existing buffer into which you bulk read, which can be the one provided by Arrow.
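To make the point about page boundaries and bulk copies concrete, here is a minimal toy sketch. It is not ROOT code: `Page` and `CopyRangeInto` are hypothetical names, and the pages are plain vectors standing in for decompressed page buffers. It shows that a requested element range that crosses page boundaries still needs a copy into the caller's buffer (for example one provided by Arrow), but that one copy per overlapping page is enough rather than one per value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a decompressed page of one column.
struct Page {
   std::size_t firstElement;  // global index of the first element in this page
   std::vector<float> values; // contiguous element storage
};

// Copy elements [first, first + count) into the caller-owned buffer dst,
// issuing one memcpy per page that overlaps the requested range.
// Returns the number of memcpy calls that were needed.
std::size_t CopyRangeInto(const std::vector<Page> &pages, std::size_t first,
                          std::size_t count, float *dst)
{
   assert(dst != nullptr || count == 0);
   std::size_t nCopies = 0;
   const std::size_t last = first + count;
   for (const auto &page : pages) {
      const std::size_t pageBegin = page.firstElement;
      const std::size_t pageEnd = pageBegin + page.values.size();
      const std::size_t lo = std::max(first, pageBegin);
      const std::size_t hi = std::min(last, pageEnd);
      if (lo >= hi)
         continue; // this page does not overlap the requested range
      std::memcpy(dst + (lo - first), page.values.data() + (lo - pageBegin),
                  (hi - lo) * sizeof(float));
      ++nCopies;
   }
   return nCopies;
}
```

With three pages holding 4, 3 and 4 elements, a request for 7 elements starting at element 2 touches all three pages, so the destination is filled with three block copies instead of seven single-value copies.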

Karan_Singh · May 17, 2026, 6:06am

Thank you for clearing that up. Based on my analysis of the RNTuple architecture and one of your papers (arXiv:2204.09043), your suggested CreateBulk() looks like a robust way forward. I estimate around a 15–20% wall-time reduction over my current per-entry loop on compressed data, though that is a projection rather than a measurement at this point.

I will try this and share the results. One question on list columns, though: for a std::vector column, as far as I know, Arrow stores the data as a flat values buffer plus an offsets buffer.

Would the recommended pattern be to bulk-fill the values buffer with CreateBulk() on the value column and read the per-entry collection sizes via a separate access (then compute Arrow’s cumulative offsets myself), or does the framework expose a way to bulk-read both in one call?

Have a good weekend!
Best,
Karan

jblomer · May 17, 2026, 10:01pm

Yes, I think with the current APIs it has to be two bulk read calls plus the offset fixup. Even though RNTuple stores offsets, it hands out collection sizes (RNTupleCardinality) in the API. There is some room for improvement for this use case.
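The offset fixup itself is a running sum over the per-entry sizes. A minimal sketch, assuming the sizes have already been bulk-read into a plain vector and that the target is Arrow's 32-bit list layout (n + 1 offsets, starting at 0); `SizesToOffsets` is a hypothetical helper name, not part of ROOT or Arrow:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Convert per-entry collection sizes into Arrow-style list offsets.
// For n entries the result has n + 1 elements; entry i spans
// values[offsets[i]] up to (but excluding) values[offsets[i + 1]].
std::vector<std::int32_t> SizesToOffsets(const std::vector<std::uint64_t> &sizes)
{
   constexpr std::uint64_t kMax = std::numeric_limits<std::int32_t>::max();
   std::vector<std::int32_t> offsets;
   offsets.reserve(sizes.size() + 1);
   offsets.push_back(0);
   std::uint64_t running = 0; // invariant: running <= kMax
   for (std::uint64_t s : sizes) {
      // 32-bit list offsets cannot address more than INT32_MAX values.
      if (s > kMax - running)
         throw std::overflow_error("list offsets exceed 32-bit range");
      running += s;
      offsets.push_back(static_cast<std::int32_t>(running));
   }
   assert(offsets.size() == sizes.size() + 1);
   return offsets;
}
```

For example, sizes {2, 0, 3} give offsets {0, 2, 2, 5}. When a batch does not start at the first entry, the offsets are still relative to the start of that batch's values buffer, so the sum restarts at 0 for each batch.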

Karan_Singh · May 19, 2026, 3:15pm

Thanks, I implemented the bulk path: CreateBulk + AdoptBuffer for the primitives, and the two-bulk-read pattern (RNTupleCardinality for sizes + inner subfield for values + manual cumulative-offset fixup) for the std::vector list columns. The cardinality-vs-offsets note in particular was very helpful; it would have been easy to assume that the offsets map directly otherwise.

Measured on the same fixtures and host:

C++ ReadAll wall time down 53–58% across 100 MB / 500 MB / 1 GB
Overhead vs raw RNTupleReader loop: 1.43–1.69× → 0.66–0.70×
Python and Flight paths unchanged (verified separately); the speedup is fully from the bulk path
All 7 columns verified column-for-column against uproot

Next, I am planning to try ROOT::EnableImplicitMT() to see what parallel page decompression adds on top.
Thanks again
Best,
Karan