Rootmv with large ntuple from DataFrame Snapshot fails to move

RENATO_QUAGLIANI · March 31, 2020, 6:11pm

Dear all,
I am using RDataFrame.Snapshot() without any specific snapshot options.
If the output file stores a single instance of DecayTuple, rootmv works out of the box.
The problem appears when multiple cycles of the tree are stored, for example like this:

Attaching file TupleProcess_tmp.root as _file0...
(TFile *) 0x2f168c0
root [1] .ls
TFile**         TupleProcess_tmp.root
 TFile*         TupleProcess_tmp.root
  KEY: TTree    DecayTuple;36   DecayTuple
  KEY: TTree    DecayTuple;35   DecayTuple

When I call rootmv TupleProcess_tmp.root:DecayTuple Target.root
I get the following message:

WARNING: Several versions of 'DecayTuple' are present in 'TupleProcess_tmp.root'. Only the most recent will be considered.

The exit code is 0, but when I look where the new tuple should be, nothing is there.
Funnily enough, if the tuple I pass to rootmv is much smaller and has only one key, it works.
I cannot tell whether the problem arises from the Snapshot itself and how the TTree gets written, or whether rootmv has limited capabilities and fails in some cases.
To bypass the problem I wrote a custom TTree move function.

Thanks for any help
Renato


ROOT Version: 6.18/04
Platform: centos7-gcc8
Compiler: Not Provided

eguiraud · March 31, 2020, 8:40pm

Hi,
this is a problem in rootmv, possibly a duplicate of https://sft.its.cern.ch/jira/browse/ROOT-10599 , but not sure (and in rootmv instead of rootls).

Can you open a jira ticket please?

Cheers,
Enrico

RENATO_QUAGLIANI · April 1, 2020, 11:48am

Opened, and also a pull request.

I do have a question/suggestion for RDataFrame Snapshot.
Since writing only the “last” cycle can be achieved with:

    newfile.Write(0, TObject::kOverwrite);

Would it be easy to update RDataFrame Snapshot so that it always saves only the last cycle of the TTree (i.e. the one with the whole set of entries), or to have a snapshot option for it?
Otherwise, is there a global ROOT flag that prevents the cycles before the last one from being saved?

eguiraud · April 1, 2020, 12:08pm

Hi,
thank you for the ticket, and I’ll take a look at the pull request asap, thanks a lot!

About your last question, I am not sure I understand. Namecycles are “partial” saves of the metadata of large TTrees: they are there so that if the program crashes mid-execution, most of the data that has been written can still be recovered. Tools dealing with ROOT files should know to always take the last namecycle unless explicitly instructed otherwise. So I am not sure what feature you’re asking for: that Snapshot deletes all previous namecycles when it’s done? Why would this be useful?
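To make "always take the last namecycle" concrete, here is a minimal sketch in plain C++ (no ROOT headers). It is not the code used by ROOT or by rootmv. The helper names `parseNameCycle` and `latestCycles` are made up for illustration, and the input is assumed to be key strings of the form `name;cycle` as printed by `.ls`.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: split a key such as "DecayTuple;36" into its object
// name ("DecayTuple") and cycle number (36). Returns false when the string
// does not end in ";<digits>".
bool parseNameCycle(const std::string& key, std::string& name, int& cycle) {
    const std::string::size_type pos = key.rfind(';');
    if (pos == std::string::npos || pos == 0) return false;
    const std::string digits = key.substr(pos + 1);
    // Cap the length so that std::stoi below cannot overflow.
    if (digits.empty() || digits.size() > 5) return false;
    for (char c : digits) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    }
    name = key.substr(0, pos);
    cycle = std::stoi(digits);
    return true;
}

// Hypothetical helper: for every object name, keep only the highest cycle
// number found in the list of keys. Malformed entries are skipped.
std::map<std::string, int> latestCycles(const std::vector<std::string>& keys) {
    std::map<std::string, int> latest;
    for (const std::string& key : keys) {
        std::string name;
        int cycle = 0;
        if (!parseNameCycle(key, name, cycle)) continue;
        assert(cycle >= 0);  // only digits were accepted, so never negative
        std::map<std::string, int>::iterator it = latest.find(name);
        if (it == latest.end() || cycle > it->second) latest[name] = cycle;
    }
    return latest;
}
```

Given the keys `DecayTuple;36` and `DecayTuple;35` from the listing in the first post, `latestCycles` would keep only cycle 36 for `DecayTuple`, which is the one that carries the complete tree.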

Cheers,
Enrico

RENATO_QUAGLIANI · April 1, 2020, 12:25pm

I asked because after running a Snapshot() operation I see in the output TFile:

DecayTree;2
DecayTree;1

whereas I would like to see only DecayTree;2 after running Snapshot(), or to have an option in Snapshot() that forces this. If the file is corrupted and not all entries are available, I have to re-run the jobs anyway, which is why I asked for something like that.
In fact, I have never experienced issues using ROOT’s Get(), but I would just like to be able to run a Snapshot without multiple cycles being visible in the final file.

eguiraud · April 1, 2020, 12:27pm

I see, but what is the issue with having multiple namecycles?

RENATO_QUAGLIANI · April 1, 2020, 12:32pm

At the moment none. One issue is that until the next ROOT release I cannot run rootmv or rootcp after having done multiple snapshots, to merge all or only some of the TFiles into one file (so I was looking for an existing Snapshot option that avoids multiple cycles being present in the final snapshot).
In addition, some modules in my analysis use root_numpy, uproot and other Python approaches to load TTrees. Since some of them are not directly supported and maintained, I just want to be sure my processed ntuples always have a single cycle, so that all entries are always used. This may be unnecessary, but I want to avoid any unwanted behaviour in the other modules of our analysis (without having to update our whole code base).

eguiraud · April 1, 2020, 12:41pm

The Snapshot feature would also only be available starting next release, just like the rootmv/rootls fixes (actually, the fixes will be backported to the next patch releases, while the feature would not). Namecycles, for better or worse, have always been part of ROOT’s behavior. Tools like uproot and root_numpy have to support them (and I am quite sure they do).

eguiraud · April 1, 2020, 1:02pm

P.S. As a workaround, I think rootcp only copies the last namecycle, so you can extract it that way if you ever need to.

system · April 15, 2020, 1:13pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.