Most efficient way to slice TTree in one variable

Hi @eguiraud,

I literally just copied the code you sent me except that I used

ROOT::RDataFrame df("FCS_ParametrizationInput", "testFile.root");  

instead of

auto df = ROOT::RDataFrame(10).Define("eta", [] { return 1.f; });

and replaced “eta” with newTTC_IDCaloBoundary_eta or newTTC_IDCaloBoundary_eta[0] (I tried both; this is the name of the actual vector variable stored in the tree).

with this file https://cernbox.cern.ch/index.php/s/b2sThuMBBKCUGvs :slight_smile:

Cheers

I have reduced the reproducer to:

ROOT::RDataFrame df("FCS_ParametrizationInput", "testFile.root");
df.Snapshot("slice", "out.root");

For some weird reason that precise file testFile.root trips up the interpreter – another, simple input file created with a Snapshot does not cause the crash. Investigating.

So, something that is definitely a problem is that Snapshot needs to write type FCS_matchedcellvector, but no dictionaries have been loaded for that class. That can’t go well…

I opened this ticket about these “weird” error messages that don’t really point to the issue.

The error should be that "FCS_matchedcellvector" is not a type known to the ROOT interpreter but Snapshot is trying to write a few branches with that type.

Hi @eguiraud,

thanks, I didn’t think about this. I am now trying to generate the appropriate dictionaries and expect it to work afterwards, but I might get back to you. Thanks for your help!

Hi @eguiraud,

so I think it finally works as I expect it to :wink: Thanks again! Just a quick question: after I trigger the event loop, is there a way to print out the number of events which have been processed, e.g. every 10 000 events?

EDIT: after running over almost the full 1 000 000 events, the program gets Killed (even though the output is fortunately there). Any suggestion to mitigate this?

after I trigger the event loop, is there a way to print out the number of events which have been processed, e.g. every 10 000 events?

There is a feature for this, you can call OnPartialResult on RDF results (tutorial here).
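
To make the idea concrete, here is a minimal stand-in written as a plain loop, not RDataFrame code: a callback is invoked each time another `every` events have been processed. The names `processWithProgress`, `nEvents`, `every` and `onProgress` are hypothetical and chosen only for illustration; in RDataFrame the corresponding hook is the `OnPartialResult` feature mentioned above, and the tutorial remains the reference for its actual usage.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for an event loop with periodic progress reporting.
// Calls onProgress(processedSoFar) each time `every` more events are done.
// Returns the total number of events processed.
std::uint64_t processWithProgress(
    std::uint64_t nEvents, std::uint64_t every,
    const std::function<void(std::uint64_t)>& onProgress) {
    std::uint64_t processed = 0;
    for (std::uint64_t i = 0; i < nEvents; ++i) {
        // ... per-event work would go here ...
        ++processed;
        // every == 0 disables reporting and avoids a modulo by zero
        if (every != 0 && processed % every == 0) {
            onProgress(processed);
        }
    }
    return processed;
}
```

Calling `processWithProgress(1000000, 10000, cb)` with a `cb` that prints its argument would give one line per 10 000 events. With multi-threading enabled, a real callback would additionally have to be thread-safe.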

after running over almost the full 1 000 000 events, the program gets Killed (even though the output is fortunately there). Any suggestion to mitigate this?

That strongly depends on the cause of the crash, I’m afraid. Do you get a stacktrace?

Hi @eguiraud,

unfortunately I do not get any stacktrace, but I noticed if I disable MT then I get

              Bus error (core dumped)

while when I enable MT I get

               Killed

without any additional information (see screenshot). For a few slices it works perfectly but if I have some more it just gets killed, which is a bit frustrating. Maybe it just uses too much CPU? (I am running on lxplus)

More probably RAM than CPU – you can check RAM usage by running your program under /usr/bin/time; look at the maxresident field.

Hi @eguiraud,

now I indeed get some error messages

 Warning in <TTree::CopyEntries>: The export branch and the import branch do not have the same streamer type. (The branch name is m_vector.)
 Error in <TBranchElement::SetAddress>: STL container with fStreamerType: 500

I suspect this is because the tree was generated with a different ROOT version than the one now reading it (which I unfortunately cannot change)? As this seems to be an issue only if I process a large number of events, do you maybe have a workaround for me? Thanks!

Uhm…could it be that in some input files m_vector is a different type or was anyway written differently than in others? If yes, you’ll have to deal with those input files separately. If not, can you isolate the problematic events and make a small reproducer for us?

Hi @eguiraud,

the input files are all generated in the same way, so they should not be written differently. Also, I don’t think there are “problematic events”, as it works just fine if I run with only a few slices but all events (or at least the error messages do not get printed). At this point I am not sure how to debug this… Is it possible, as a very ugly workaround I admit, to just suppress these error messages for the moment? I suspect that this is why the program gets killed in the first place (even though sometimes the errors do not even show up :worried: ). At least if I look at the slices in the end, they seem fine to me (and it wouldn’t hurt too much to lose some events).

I don’t know what this could be.

Do you get the same errors on a different machine or with a different ROOT version on lxplus (e.g. with ROOT nightlies)? Have you tried running the program within /usr/bin/time to check its memory usage?

The best I can do is offer to try and debug if you can make a simple reproducer.

Cheers,
Enrico

Hi @eguiraud,

here is the most minimal reproducer I could come up with:

https://cernbox.cern.ch/index.php/s/l3toLo4cbH59nxI

You can run it with:

  lsetup "root 6.18.04-x86_64-centos7-gcc8-opt"
  root -l
  .L FCS_Cell.h+
  .x sliceTreeInEta.C 

And after the slices are registered you should see the following error:

Error in <TBranchElement::SetAddress>: STL container with fStreamerType: 500
Warning in <TTree::CopyEntries>: The export branch and the import branch do not have the same streamer type. (The branch name is m_vector.)

Nonetheless all events should process and the output files will be saved in output/

Thanks!

EDIT: Just tested with the newest release 6.20.02, same issue.

Hi @eguiraud,

any news on this? :slight_smile:

Hi,
sorry, I had to look at other issues on Thursday and Friday :smile:

I’ll try to jump back on this as soon as possible. In the meantime, what does /usr/bin/time say about the program’s memory usage (compared to the maximum amount of RAM available on the lxplus machines)? Do you also get these errors if you run on a machine other than lxplus, e.g. on a personal computer?

Cheers,
Enrico

Hi @eguiraud,

the output of /usr/bin/time for the reproducer is (not sure how to interpret this)

  795.99 user 156.90 system 10:43.40 elapsed 148% CPU (0 avgtext+0 avgdata 8426276 maxresident)k
  800704 inputs+8 outputs (3336 major+2880290 minor) pagefaults 0 swaps

I will try running it on my local machine tomorrow

Hi,
8426276k maxresident means the program is using ~8GB of RAM – might be too much for lxplus. That’s a different problem than the error messages,

Error in <TBranchElement::SetAddress>: STL container with fStreamerType: 500
Warning in <TTree::CopyEntries>: The export branch and the import branch do not have the same streamer type. (The branch name is m_vector.)

I tried your reproducer, I also get these error messages, will have to investigate. This is independent from the large memory usage which might get your job killed on lxplus.
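
For reference, the ~8GB figure follows from the fact that GNU time reports maxresident in kilobytes. Below is a small sketch of that arithmetic, assuming the summary line has the shape pasted above; the helper names `maxResidentKiB` and `kibToGiB` are made up for this illustration and are not part of any tool.

```cpp
#include <cctype>
#include <string>

// Extract the number that precedes "maxresident" in a /usr/bin/time
// summary line. Returns -1 if the field is not found.
// Accepts both "8426276maxresident" and "8426276 maxresident".
long long maxResidentKiB(const std::string& timeOutput) {
    const std::string key = "maxresident";
    const std::size_t pos = timeOutput.find(key);
    if (pos == std::string::npos) return -1;
    std::size_t end = pos;
    while (end > 0 && timeOutput[end - 1] == ' ') --end;  // skip optional spaces
    std::size_t begin = end;
    while (begin > 0 &&
           std::isdigit(static_cast<unsigned char>(timeOutput[begin - 1]))) {
        --begin;  // walk back over the digits
    }
    if (begin == end) return -1;  // no digits before the key
    return std::stoll(timeOutput.substr(begin, end - begin));
}

// Convert kilobytes (units of 1024 bytes) to GiB.
double kibToGiB(long long kib) {
    return static_cast<double>(kib) / (1024.0 * 1024.0);
}
```

Applied to the line above, `maxResidentKiB` yields 8426276 and `kibToGiB` gives slightly more than 8 GiB, which is the number to compare against the memory available on the machine.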

I don’t know if I’ll manage to work on this today. If not, definitely tomorrow.

Cheers,
Enrico

Hi @eguiraud,

okay, thanks! I just ran locally on macOS Catalina and I get the same problems. With only a few slices (standard reproducer) I get exactly the same output, but again the program runs through fine. If I run over all slices (eta=5.0), then after some time I additionally get

root.exe(34406,0x70000ddc3000) malloc: *** error for object 0x7fc650d91530: pointer being freed was not allocated
root.exe(34406,0x70000ddc3000) malloc: *** set a breakpoint in malloc_error_break to debug 

and the program stops executing. I attached the sample files for the root process executed with some slices and with all slices. Hope that helps!

Some slices: https://cernbox.cern.ch/index.php/s/S0ewFXmpVH4tVCY

All slices: https://cernbox.cern.ch/index.php/s/ZOMm0bFKyl3LAIY

Hi,
what’s the difference between these new files and the original you shared?

I can reproduce the issue, but I don’t have a fix yet. Work in progress.

Cheers,
Enrico