Make RDataFrame from TChain where 2 ntuples share 90% of their branches and do a proper snapshot

RENATO_QUAGLIANI · November 3, 2020, 2:40pm

Dear experts,
I have a question concerning RDataFrame snapshotting behaviour from a TChain.

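Let’s say my pseudocode is:

TChain ch("tupleName");
for (const auto &f : files)   // files: assumed to be a std::vector<std::string> of input paths
   ch.AddFile(f.c_str());

ROOT::RDataFrame df(ch);
df.Filter(/* selection */).Snapshot(/* tree name, output file */);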

I see some weird behaviour when file[0] contains a TTree named tupleName with, say, 100 branches and file[1] contains 100 + 4 branches.

For some reason the snapshot contains entries only from file[1].
Is that expected?
What is the procedure to ensure this does not happen, without having to “manually” add aligned branch names for the 2 files?

Thanks
Renato

eguiraud · November 3, 2020, 2:55pm

Hi,
this is not expected, feel free to open a github issue.
We should not silently produce an unexpected output.
Are there no warnings printed?

On the other hand: what should Snapshot do in this case? I think the best it could do is to write all entries but only for the 100 branches that appear in the first file, is that reasonable?

Cheers,
Enrico

RENATO_QUAGLIANI · November 3, 2020, 3:05pm

Hi @eguiraud,

Error in <TTreeReader::SetEntryBase()>: There was an error while notifying the proxies.
Warning in <TTreeReader::SetEntryBase()>: Unexpected error '-6' in TChain::LoadTree

This is the error I get with this reproducer:

test.C (476 Bytes)

I noticed that if I call GetEntries() on the TChain before making the RDataFrame, I can successfully merge the files and the non-overlapping branches are dropped…

RENATO_QUAGLIANI · November 3, 2020, 3:06pm

I am on Mac (but the issue is observed on CentOS 7 as well).

$|=>root --version
ROOT Version: 6.18/04
Built for macosx64 on Aug 03 2020, 15:51:27
From tags/v6-18-04@v6-18-04

eguiraud · November 3, 2020, 3:20pm

That’s a hard Error during the event loop, so I guess the failure is not silent after all.

Now, about whether RDF should behave better here or not: I don’t know to what extent ROOT tries to support TChains made of TTrees with different schemas. @pcanal @Axel should such a use case be supported (namely a TChain in which the first file has N branches and the second has those branches plus some more)?

Cheers,
Enrico

RENATO_QUAGLIANI · November 3, 2020, 3:21pm

Sure, but the Snapshot keeps running, and the code is not failing without making the Snapshot. Basically, all I want is for this to never happen.
Either you can snapshot Tuple1+Tuple2, or you fail.

eguiraud · November 3, 2020, 3:23pm

Ok so you suggest that in this case we should error out “harder” and not just log an Error at the terminal. That’s super reasonable, please open a github issue with the suggestion for improvement!
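
Until RDF itself errors out harder, a user-side guard along these lines could enforce the "snapshot everything or fail" behaviour. This is only a minimal sketch: RequireSameColumns is a hypothetical helper, not part of ROOT, and it assumes the branch names of each file have already been collected into one list per file.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical guard (not ROOT API): throws if any per-file column list
// differs from the first one. The order of the names is ignored.
// The argument is taken by value so the lists can be sorted locally.
void RequireSameColumns(std::vector<std::vector<std::string>> perFileColumns)
{
   for (auto &cols : perFileColumns)
      std::sort(cols.begin(), cols.end());

   for (std::size_t i = 1; i < perFileColumns.size(); ++i) {
      if (perFileColumns[i] != perFileColumns[0])
         throw std::runtime_error("file " + std::to_string(i) +
                                  " has a different set of columns than file 0");
   }
}
```

Calling it before building the RDataFrame would stop the job with an exception instead of leaving a partial output file behind.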

RENATO_QUAGLIANI · November 3, 2020, 3:25pm

The funny thing is that the error doesn’t appear if I call chain.GetEntries() before making the RDataFrame(chain).

eguiraud · November 3, 2020, 3:29pm

Actually in v6.22 the event loop is interrupted. What ROOT version are you on?

~ ./test                                                       (cern-root) 
Error in <TTreeReader::SetEntryBase()>: There was an error while notifying the proxies.
Warning in <TTreeReader::SetEntryBase()>: Unexpected error '-6' in TChain::LoadTree
terminate called after throwing an instance of 'std::runtime_error'
  what():  An error was encountered while processing the data. TTreeReader status code is: 9
fish: “./test” terminated by signal SIGABRT (Abort)

Yes, that’s quirky, I wouldn’t count on that behavior too much. Please assume that TChains with TTrees with changing schemas are not supported by RDF.

pcanal · November 3, 2020, 3:46pm

Yes, since those branches don’t participate in the analysis they should have no effect (neither on the performance nor on the results).

eguiraud · November 3, 2020, 3:47pm

Ok, if those branches don’t participate in the analysis – in this case Snapshot() is ambiguous because it should supposedly write “all branches” to the output file.

eguiraud · November 3, 2020, 3:49pm

@RENATO_QUAGLIANI I just noticed that in test.C the problem is slightly different than what you described in the original post: the first tree has one more branch than the second.

pcanal · November 3, 2020, 3:56pm

Ok, if those branches don’t participate in the analysis – in this case Snapshot() is ambiguous because it should supposedly write “all branches” to the output file.

Right. The usual handling of this is to add the new branch and then backfill it with default values (TBranch::BackFill), however the blocker is … what is the “default” value for that branch and that file.

So I agree it is reasonable for Snapshot to fail here (but we may eventually also want to introduce an interface for the user to say “that’s fine, use this value for the missing entries”).

eguiraud · November 3, 2020, 3:58pm

Indeed, if you pass a list of branches to Snapshot(..., {"branch1", "branch2"}) and those branches are present in all trees, things should work fine (if they don’t, it’s definitely a bug we want to fix asap).

RENATO_QUAGLIANI · November 3, 2020, 4:23pm

Maybe I am missing something now.
Is there a recipe to get the proper merged set of entries snapshotted, regardless of the order in which the TChain is constructed and of any misaligned branches in the TTrees added to it?

RENATO_QUAGLIANI · November 3, 2020, 4:24pm

Do I need to create by hand the GetColumnNames() list of each TTree I add and pass a custom set intersection of the names to Snapshot?
What if I have only the resulting TChain object at hand?

eguiraud · November 3, 2020, 4:36pm

Yes, explicitly pass the column names that need to be considered (and make sure those columns are present in every TTree).

Not necessarily, e.g. it might be enough to check which TTree has the smallest number of branches and take its branch names, if you know that all others will definitely have those plus maybe others.

Again: what does Snapshot() (i.e. “write all columns”) mean if different TTrees have different schemas?
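
For the general case, the set intersection of the column names could be computed along these lines. This is a sketch only: CommonColumns is a hypothetical helper, not part of ROOT, and it assumes one list of column names per input file is already available.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not ROOT API): returns the names that appear in every
// per-file list, in the order in which they appear in the first list.
std::vector<std::string> CommonColumns(const std::vector<std::vector<std::string>> &perFileColumns)
{
   if (perFileColumns.empty())
      return {};

   std::vector<std::string> common;
   for (const auto &name : perFileColumns.front()) {
      // keep `name` only if each of the remaining lists also contains it
      const bool inAll = std::all_of(perFileColumns.begin() + 1, perFileColumns.end(),
                                     [&name](const std::vector<std::string> &cols) {
                                        return std::find(cols.begin(), cols.end(), name) != cols.end();
                                     });
      if (inAll)
         common.push_back(name);
   }
   return common;
}
```

The per-file lists could come, for instance, from GetColumnNames() on an RDataFrame built from each file separately, and the result could then be passed as the column list argument of Snapshot.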

RENATO_QUAGLIANI · November 3, 2020, 4:48pm

For the moment…

	TFile f("fileMerge.root", "RECREATE");   // open the output file first so the copy is created in it
	auto t = ch.CopyTree("");                // CopyTree already returns a TTree*, no cast needed
	t->Write();
	f.Close();

works regardless of the order and does the automatic dropping of branches that I need.
I was looking more for a one-line fix that can be applied to an already constructed TChain before it is passed to the RDataFrame constructor.
The change you suggest is reasonable, but I might need to modify too much of my analysis code to keep track of all files added to the chain and all the branches they contain in order to make a set intersection later on…

eguiraud · November 3, 2020, 4:54pm

That’s interesting. @pcanal how does CopyTree figure out what branches to drop, if the first TTree in the TChain has more branches than the second?

RENATO_QUAGLIANI · November 3, 2020, 5:24pm

The only nice way out I have found so far is to have a global function in my analysis:

#include <algorithm> // std::find, std::remove_if
#include <string>
#include <vector>

std::vector<std::string> DropColumns(std::vector<std::string> &&good_cols)
{
   // your blacklist
   const std::vector<std::string> blacklist = {"dummyB", "dummyC", "crapSTUFF"};
   // a lambda that checks if `s` is in the blacklist
   auto is_blacklisted = [&blacklist](const std::string &s)  { return std::find(blacklist.begin(), blacklist.end(), s) != blacklist.end(); };

   // removing elements from std::vectors is not pretty, see https://en.wikipedia.org/wiki/Erase%E2%80%93remove_idiom
   good_cols.erase(std::remove_if(good_cols.begin(), good_cols.end(), is_blacklisted), good_cols.end());
   
   return good_cols;
}

This is where I list all the problematic branches that I know prevent merging.