How to tell TDF to ignore branches which are not used in any way?

Hi there,

I am using a TDF to reduce several hundred datasets to a single file via Snapshot. The operation itself seems to work as intended. The only problem I face is that some files have more branches than others because of debug branches. When I reduce the files to a branch set that does not include one of these debug branches, the TDF prints the following warning:

Warning in TChain::CopyAddresses: Could not find branch named…

This results in a massive slowdown due to the console output. The pragmatic solution would be to set the log level to errors only. I do not consider this a great solution, because other warnings that might point to a real bug would be hidden as well.
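A possible middle ground between "print everything" and "errors only" is a custom error handler that drops only this one warning. ROOT lets you install a handler with SetErrorHandler (declared in TError.h) and forward everything else to DefaultErrorHandler. The sketch below shows just the filtering logic with stand-ins so that it compiles without ROOT: kWarningLevel, ShouldSuppress, FilteringHandler and gForwarded are hypothetical names, and the location string "TChain::CopyAddresses" is an assumption taken from the warning text quoted above.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Stand-in for ROOT's kWarning severity (2000 in TError.h).
constexpr int kWarningLevel = 2000;

// Hypothetical helper: true if the message should be dropped.
// Only warnings from the given location are dropped; errors, and
// warnings from any other location, still get through.
bool ShouldSuppress(int level, const char *location)
{
   assert(location != nullptr);
   return level == kWarningLevel &&
          std::strcmp(location, "TChain::CopyAddresses") == 0;
}

// Stand-in sink that records forwarded messages. In a real ROOT program
// the handler would call DefaultErrorHandler here instead.
std::vector<std::string> gForwarded;

// Same parameter list as a ROOT error handler (bool stands in for Bool_t).
void FilteringHandler(int level, bool /*abort*/, const char *location, const char *msg)
{
   if (ShouldSuppress(level, location))
      return; // drop only the known-noisy warning
   gForwarded.push_back(std::string(location) + ": " + msg);
}
```

With ROOT available, a function like FilteringHandler (taking Bool_t instead of bool) would be registered once at startup via SetErrorHandler. I have not tested this against 6.12.00.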

I have tried using a TChain with only the branches that should be written activated, and I have tried providing the TDF with default columns, but neither solves the issue.

Thanks for your help

Cheers

Thomas

Hi Thomas,
you can tell Snapshot which branches you want it to copy:

tdf.Snapshot("tree", "outfile.root", {"b1","b2","b3"})

Does that help?
Enrico

Hi Enrico,

That's exactly the syntax I am using, but on the last defined filter instead of the TDF. The only difference is that I am using a vector of strings for the branch names. I would assume that any collection with begin and end iterators would suffice to call this function, right?

By the way, I am currently using ROOT 6.12.00 (I thought I was on 6.12.04 :frowning:). Were any bugs fixed in later versions that could lead to this issue?

Thomas

Hi Thomas,
so if I understand correctly, you are calling Snapshot with a list of branches that you want to save to file (a vector<string> is perfect), but it saves all of your branches instead?

If yes, that is not intended behavior: please provide a small reproducer for me to try/debug. I am not aware we have fixed an issue like this since v6.12.00. Note that I cannot reproduce the problem e.g. with

#include <ROOT/TDataFrame.hxx>
#include <vector>
#include <string>
using namespace ROOT::Experimental;

int main() {
  TDataFrame d(10);
  std::vector<std::string> branches = {"x"};
  d.Define("x", "1").Define("y", "2").Define("z", "3").Snapshot("t", "outfile.root", branches);
  return 0;
}

(and also the tests in $ROOTSYS/tree/treeplayer/test/dataframe seem to Snapshot correctly a subset of the branches)

Hi Enrico,

No, it saves only the branches I marked for writing, but it still prints warnings (to the terminal) for branches that are not activated, because they are present in some input files but not in all. These branches should be irrelevant, because they are not present in the output. I think I could disable the warnings, but then other warnings would be hidden as well. With thousands of input files, this terminal output gets really annoying and slows things down.
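If the main cost is the same warning being repeated for every input file, one option is to let each distinct message through only once. This is a minimal sketch of that bookkeeping, not something ROOT provides: OncePerMessage and FirstTime are hypothetical names, and the mutex is there on the assumption that warnings may arrive from several threads.

```cpp
#include <mutex>
#include <set>
#include <string>
#include <utility>

// Hypothetical helper that remembers which (location, message) pairs
// have already been reported.
class OncePerMessage {
public:
   // Returns true the first time a given pair is seen, false afterwards.
   bool FirstTime(const std::string &location, const std::string &msg)
   {
      std::lock_guard<std::mutex> lock(fMutex); // callers may be on different threads
      return fSeen.emplace(location, msg).second;
   }

private:
   std::mutex fMutex;
   std::set<std::pair<std::string, std::string>> fSeen;
};
```

A custom error handler could call FirstTime and forward the message to the default handler only when it returns true, so each missing branch is reported once instead of once per file.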

Thanks for your help

Ah I see, sorry it took me so long to understand the issue correctly.

Have you tried redirecting standard error to a file to avoid slowdown due to console printing? (e.g. ./main.x 2> err.log)

In any case, we should not print such warnings at all. I would open a bug report but I need a reproducer. I tried to re-create your situation by reading two trees, one with branch x and one with branches x and y, and writing out only branch x, but no warnings are printed out:

#include <ROOT/TDataFrame.hxx>
using namespace ROOT::Experimental;

int main()
{
   // write two files with different branch content
   {
      TDataFrame d(10);
      auto dwithx = d.Define("x", "42");
      dwithx.Define("y", "83").Snapshot("t", "file1.root");
      dwithx.Snapshot("t", "file2.root");
   }

   // read the common branch from both files
   TDataFrame d("t", {"file1.root", "file2.root"});
   d.Snapshot("t", "out.root", "x");

   return 0;
}

If you could let me have even just a couple of entries of your files so that I could reproduce the behavior it would be of great help. Or otherwise let me know what I’m missing to reproduce the problem.

Cheers,
Enrico

Hi,

I was finally able to reproduce a similar issue: test_warnings.C (1.1 KB).
The resulting out.file stores 20 UUIDs, but some warnings are still emitted. Strangely, the warnings now concern the branch that is written.

I hope that this will help and thanks for your support

Cheers
Thomas

The problem is related to ROOT::EnableImplicitMT(); (if you comment it out, it will disappear).
Note that you can also comment out tree->Branch("cal_edep",&cal_edep); and you will still see this problem.

Alright, thank you both, I can definitely reproduce the problem.
It seems to be somehow related to snapshotting a TUUID branch from multiple threads (changing the type of the branch, performing actions other than Snapshot, or removing ROOT::EnableImplicitMT makes the warnings disappear).
Multiple files are not required to reproduce the issue, nor multiple branches.
A minimal reproducer is the following:

#include <ROOT/TDataFrame.hxx>
#include <TUUID.h>
#include <iostream> // for std::cout
using namespace ROOT::Experimental;

int main()
{
   // write a root file containing a TUUID
   TDataFrame(1).Define("info_run", []() { return TUUID(); })
                .Snapshot<TUUID>("t", "test_snap.root", {"info_run"});

   ROOT::EnableImplicitMT(); // commenting eliminates the runtime warnings

   // read and write it again
   std::cout << "now snapshotting" << std::endl;
   TDataFrame("t", "test_snap.root").Snapshot<TUUID>("t", "out_snap.root", {"info_run"});

   return 0;
}

I have no idea what could cause such behavior. It needs proper investigation, so I opened a JIRA issue. You can follow up on the issue there.

Cheers,
Enrico