Dear all
I am using RDataFrame.Snapshot() without any specific SnapshotOption.
If the output file stores a single instance of DecayTuple, rootmv works out of the box.
When multiple cycles of DecayTree are stored, for example like this:
When I call rootmv TupleProcess_tmp.root:DecayTuple Target.root
I get the following message:
WARNING: Several versions of 'DecayTuple' are present in 'TupleProcess_tmp.root'. Only the most recent will be considered.
The exit code is 0, but when I look where the new tuple should be, nothing is there.
Funnily enough, if the tuple I pass to rootmv is much smaller and has only 1 key, it works.
I am failing to understand whether the problem arises from the snapshot itself and how the TTree gets written, or whether it’s rootmv that has limited capabilities and fails in some cases.
To bypass the problem I wrote a custom TTree move function.
I do have a question/suggestion for RDataFrame Snapshot.
Since writing only the “last” cycle can be achieved with:
newfile.Write(0, TObject::kOverwrite);
Would it be easy to update RDataFrame Snapshot to always save just the last cycle of the TTree (i.e. the one with the whole set of entries), or to have an RSnapshotOptions setting for it?
Otherwise, is there a ROOT global flag which prevents the cycles before the last one from being saved?
Hi,
thank you for the ticket, I’ll take a look at the pull request asap, thanks a lot!
About your last question, I am not sure I understand. Namecycles are “partial” saves of the metadata of large TTrees: they are there so that if the program crashes mid-execution, most of the data that has been written can still be recovered. Tools dealing with ROOT files should know to always take the last namecycle unless explicitly instructed otherwise. So I am not sure what feature you’re asking for: that Snapshot deletes all previous namecycles when it’s done? Why would this be useful?
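To make the "always take the last namecycle" rule concrete, here is a minimal standalone C++ sketch. It is not ROOT code and does not touch any file: it assumes the keys are available as plain "name;cycle" strings (as they appear in a file listing), and the helper names parseNameCycle and latestCycles are made up for this illustration.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// One key as shown in a file listing, e.g. "DecayTree;2".
struct NameCycle {
    std::string name;
    int cycle;
};

// Split "name;cycle" into its parts. Returns false for strings that do not
// end in ";<digits>" (hypothetical helper, not part of ROOT).
bool parseNameCycle(const std::string& key, NameCycle& out) {
    const std::string::size_type pos = key.rfind(';');
    if (pos == std::string::npos || pos == 0 || pos + 1 >= key.size()) return false;
    const std::string digits = key.substr(pos + 1);
    if (digits.size() > 9) return false;  // keep std::stoi well inside int range
    for (char c : digits) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    }
    out.name = key.substr(0, pos);
    out.cycle = std::stoi(digits);
    return true;
}

// For every object name, remember only the highest cycle number seen.
std::map<std::string, int> latestCycles(const std::vector<std::string>& keys) {
    std::map<std::string, int> latest;
    for (const std::string& key : keys) {
        NameCycle nc;
        if (!parseNameCycle(key, nc)) continue;  // skip malformed entries
        std::map<std::string, int>::iterator it = latest.find(nc.name);
        if (it == latest.end()) {
            latest[nc.name] = nc.cycle;
        } else if (nc.cycle > it->second) {
            it->second = nc.cycle;
        }
    }
    return latest;
}
```

Given the keys "DecayTree;2" and "DecayTree;1", latestCycles maps "DecayTree" to cycle 2, which is the cycle a tool should read unless told otherwise.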
I asked because after running a Snapshot() operation I see in the output TFile:
DecayTree;2
DecayTree;1
while I would like to see only DecayTree;2 when I run Snapshot(), or to have an option inside Snapshot() which forces this to happen. If the file is corrupted and not all entries are available, I have to re-run the jobs anyway; that’s why I asked, and why I would need something like that.
In fact, I have never experienced issues using ROOT “Get(”, but I would just like to be able to run a Snapshot without having multiple cycles visible in the final file.
At the moment none. One issue is that until the next ROOT release I cannot run rootmv or rootcp after doing multiple snapshots in order to merge everything, or only some of the TFiles, into one file (so I was looking for an already existing Snapshot option which avoids multiple cycles being present in the final snapshot).
Also, in my analysis some modules use root_numpy, uproot and other Python approaches to load TTrees, and since some of them are not directly supported and maintained, I just want to be sure that my processed ntuples always have 1 single cycle, so that all entries are always used. This may be unnecessary, but I want to avoid any unwanted behaviour in the other modules we have in the analysis (without having to update our whole code base).
The Snapshot feature would also only be available starting next release, just like the rootmv/rootls fixes (actually, the fixes will be backported to the next patch releases, while the feature would not). Namecycles, for better or worse, have always been part of ROOT’s behavior. Tools like uproot and root_numpy have to support them (and I am quite sure they do).