With ImplicitMT enabled, RDataFrame::Snapshot does not preserve the row order. Moreover, this happens silently, with no errors or warnings, so the user ends up with shuffled rows and no indication of it short of checking manually. This strikes me as a feature-breaking bug. Reproducer:
import ROOT as r
r.ROOT.EnableImplicitMT(4)
rdf = r.RDataFrame(10).Define("e", "rdfentry_")
rdf.Snapshot("test", "test.root")
f = r.TFile.Open("test.root")
f.test.Scan()
Hi,
it's not a bug: if that TTree is used as a friend of another tree, we warn users that it was written by a "shuffling" operation.
What's your use case that's broken by this behavior?
Well, anything that depends on the order of the rows, but mainly friend trees. If friends are considered the only use-case for row-ordering, then indeed, this is not a bug.
In many cases rows are independent (as in, independent physical events) and order does not matter. Order relative to other TTrees of course matters, hence the problem with friend trees.
In 6.22 you can't use a "shuffled tree" as a friend unless you manually unset a certain flag in that TTree.
In 6.24 (being released in O(few weeks)) we filled a feature gap and added support for indexed friend trees in RDataFrame, so you can use one of the TTree columns as an index to recover ordered access into the shuffled TTree if needed.
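As a rough illustration of the indexed-friend idea (plain Python standing in for the concept, not the ROOT API): if the shuffled tree keeps a column holding the original entry number, such as the `e` column in the reproducer above, that column can serve as a key to find the matching row regardless of storage order. The row values and the helper name `build_index` are hypothetical.

```python
# Conceptual sketch only: plain Python standing in for an indexed friend tree.
# The rows and the helper name build_index are made up for illustration.

def build_index(rows, key):
    """Map each value of the index column to the row's position in storage."""
    index = {}
    for position, row in enumerate(rows):
        if row[key] in index:
            raise ValueError(f"duplicate index value: {row[key]}")
        index[row[key]] = position
    return index

# Main tree, in its original order.
main_rows = [
    {"e": 0, "pt": 10.0},
    {"e": 1, "pt": 20.0},
    {"e": 2, "pt": 30.0},
    {"e": 3, "pt": 40.0},
]

# Friend written by a multi-threaded job: same entries, shuffled storage order.
friend_rows = [
    {"e": 2, "w": 0.2},
    {"e": 3, "w": 0.3},
    {"e": 0, "w": 0.0},
    {"e": 1, "w": 0.1},
]

friend_index = build_index(friend_rows, "e")

# Reading the friend through the index restores alignment with the main tree.
aligned = [(row["pt"], friend_rows[friend_index[row["e"]]]["w"]) for row in main_rows]
print(aligned)
```

The lookup only works if the index column is unique per entry, which is why the sketch rejects duplicates.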
This seems like strange behavior for an (otherwise) ordered data structure, but friend trees are my only use-case, so this doesn't currently break anything for me.
There are, however, workflows where this could be a problem. For example:
Tree1 created in ROOT
→ Tree2 created from Tree1 using ROOT with ImplicitMT
→ Tree3 created from Tree2 using pandas dataframe
→ Tree1.AddFriend(Tree3)
This involves leaving the ROOT package, and one could see it as a case of outside programs not providing feature support. But since ROOT is designed to work with other storage formats and be somewhat interoperable, I think it would be appropriate for Snapshot to throw a warning when writing out with ImplicitMT turned on.
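Until such a warning exists, a user-side check is cheap if the entry number was stored as a column (as `e` is in the reproducer). A minimal sketch, assuming the column has already been read into a Python list; the helper name `is_entry_ordered` is hypothetical and not part of ROOT or pandas:

```python
def is_entry_ordered(entries):
    """Return True if the stored entry numbers are strictly increasing."""
    return all(a < b for a, b in zip(entries, entries[1:]))

# Entry column as written by a single-threaded job.
ordered = list(range(10))

# Entry column as it might look after a multi-threaded write,
# for example two chunks written in swapped order.
shuffled = [5, 6, 7, 8, 9, 0, 1, 2, 3, 4]

print(is_entry_ordered(ordered), is_entry_ordered(shuffled))
```

Running a check like this on Tree2 before handing it to another tool would catch the reordering at the point where it is still easy to fix, for instance by sorting on the entry column.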
I think a runtime warning is too much (it would affect all programs that currently use RDF+multi-thread+Snapshot, and that's a lot of programs), but we can certainly add a big-letter warning to Snapshot's docs.
It would also be nice to have an option to force Snapshot to maintain entry ordering (at the cost of performance and RAM): that's a stretch goal for this year's plan of work.
Would this operate via a cache for "forward" results that wait until the thread handling earlier chunks finishes and writes out?
By the way, are simultaneous snapshots supported right now? I think I tried this in version 6.20 or 6.18, and trying to snapshot two different nodes at the same time resulted in segfaults (i.e. I split the data into two or four distinct datasets and want to dump them to separate ROOT files with one processing loop). I ended up predicting the largest of them (they are pretty asymmetric), writing that one out immediately and putting the remaining sets into Cache, then snapshotting those one at a time. Of course, this just shifts the performance bottlenecks around, and it was severely limited by the available memory.
That's the only idea I have for now, because it works well with the design of TBufferMerger, which Snapshot uses for multi-thread writes to a ROOT file: there is already a thread-safe queue of buffered "clusters ready to be written out", and we would "just" need to switch from FIFO to ordered processing (much easier said than done, but it's what I have so far). The main problem is that memory usage, in this scenario, might have hard-to-predict, highly undesirable long tails.
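To make the FIFO-versus-ordered trade-off concrete, here is a small plain-Python sketch of a reorder buffer. It is not how TBufferMerger is implemented; the function name `write_in_order` and the chunk labels are made up for illustration. Chunks that finish early are held in a min-heap until every earlier chunk has been written, and the peak heap size shows where the memory tail comes from.

```python
import heapq

def write_in_order(arrivals):
    """Emit chunks in sequence order, buffering those that arrive early.

    arrivals: iterable of (sequence_number, payload) in completion order.
    Returns (written, peak_buffered).
    """
    pending = []   # min-heap keyed on sequence number
    next_seq = 0   # sequence number of the next chunk allowed to be written
    written = []
    peak = 0
    for seq, payload in arrivals:
        heapq.heappush(pending, (seq, payload))
        peak = max(peak, len(pending))
        # Flush every chunk that is now contiguous with what was written.
        while pending and pending[0][0] == next_seq:
            written.append(heapq.heappop(pending)[1])
            next_seq += 1
    return written, peak

# Chunk 0 finishes last, so every other chunk has to wait in memory.
arrivals = [(1, "B"), (2, "C"), (3, "D"), (0, "A")]
written, peak = write_in_order(arrivals)
print(written, peak)
```

A single slow chunk at the front forces all later chunks to stay buffered, which is the long-tail behavior described above; with a FIFO queue each chunk could be written as soon as it arrives.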
Assuming each Snapshot writes to a different file, I think it has always been supported. In other words, please report the bug, ideally with a self-contained reproducer, at Issues · root-project/root · GitHub.