Proof code hangs forever when workers merge with large number of files

pauln · August 18, 2020, 11:01am

Hi all,

I am having an issue running some selector code in PROOF (multiple cores on one machine).

When testing the code with 5 run files (~50 GB raw input, ~100 MB processed output) the script runs without issue, all workers merge successfully, and the expected output is produced.

However, when I run the script on my full data set (~1 TB raw input producing ~3 GB of processed output), it runs fine but fails at the last hurdle, when the workers merge: it stays on "Terminating output objects ... / (1 workers still sending)" indefinitely. Checking htop, one thread seems to be at 100%, but I have experimented with leaving this running for several days, and it never completes.

I have experimented with the number of workers, and this has not made any difference.

I am using a ROOT wrapper maintained by my collaboration, which allows the experiment's data format to be read directly in ROOT. For that reason a minimal working example may be a bit convoluted. Also, since the script works for a smaller output tree, I suspect there is some “administrative” setting I need to change, as opposed to an issue with the code itself, which is quite simple.

Just in case I am handling the writing in some way (perhaps too simply?) which would produce an issue like this, here is the Terminate() method:

  void pi0selector::Terminate()
  {
    // The Terminate() function is the last function to be called during
    // a query. It always runs on the client, it can be used to present
    // the results graphically or save the results to file.

    TFile Out_File(Out_File_Name, "RECREATE");

    // Write every object in the output list to the file. TIter owns the
    // underlying iterator, so it is released when the loop is done.
    TIter next(GetOutputList());
    while (TObject *obj = next()) {
      obj->Write();
    }

    Out_File.Close();
  }

I should note that I also have a pure ROOT script which does exactly what my selector code does and produces the ~3 GB output tree without issue, so again I suspect there is something “under the hood” in PROOF which I need to set to allow it to merge the TTree at the end of processing.

Thanks in advance for any and all suggestions!

Sincerely,
Paul

ROOT Version: 6.14.04
Platform: Centos7
Compiler: g++ (GCC) 4.8.5

etejedor · August 18, 2020, 12:46pm

Hi,

Perhaps @ganis can provide some hint?

pauln · August 26, 2020, 10:17am

@ganis any insight you can provide would be greatly appreciated!

ganis · August 26, 2020, 2:02pm

Hi,
It is likely to be a memory issue.
If you intend to continue using PROOF-Lite, which is legacy code, you should perhaps save the separate files created by the workers instead of trying to merge them (they contain TTrees, right?); you can then use the output as a dataset. To do this, add "ds=dsname" to the option field of gProof->Process(...); afterwards you can process the output with gProof->Process("dsname", selector, ...).

But you should really try to move to using the newer TProcessExecutor or even TThreadExecutor (or RDataFrame, which uses TThreadExecutor under the hood).
See https://root.cern/manual/data_frame/ and examples under ./tutorials/multicore .

G Ganis
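
The idea behind the first suggestion (keep the per-worker files and treat them as one dataset, rather than merging everything at the end) can be sketched without ROOT. The example below is only an illustration using the C++ standard library, not PROOF's API: the function names, the `worker_<n>.txt` file names, and the "multiply by two" selection are all hypothetical stand-ins, and the workers run one after another for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical worker: handles inputs[begin, end) and writes its results to
// its own file, handing back only the file name.
std::string process_chunk(const std::vector<int>& inputs, std::size_t begin,
                          std::size_t end, std::size_t worker_id)
{
    const std::string name = "worker_" + std::to_string(worker_id) + ".txt";
    std::ofstream out(name);
    for (std::size_t i = begin; i < end; ++i) {
        out << inputs[i] * 2 << '\n';  // stand-in for the real selection
    }
    return name;
}

// Splits the inputs into n_workers chunks and collects the output file names.
// The chunks run sequentially here; the point is that no worker output is
// ever merged in memory, only the list of file names is kept.
std::vector<std::string> run_workers(const std::vector<int>& inputs,
                                     std::size_t n_workers)
{
    assert(n_workers > 0);
    const std::size_t chunk = (inputs.size() + n_workers - 1) / n_workers;
    std::vector<std::string> files;
    for (std::size_t w = 0; w < n_workers; ++w) {
        const std::size_t begin = std::min(w * chunk, inputs.size());
        const std::size_t end = std::min(begin + chunk, inputs.size());
        files.push_back(process_chunk(inputs, begin, end, w));
    }
    return files;
}

// Second pass: treats the list of files as one logical dataset and streams
// through it file by file, so memory use does not grow with the output size.
long sum_outputs(const std::vector<std::string>& files)
{
    long total = 0;
    for (const auto& name : files) {
        std::ifstream in(name);
        long value = 0;
        while (in >> value) total += value;
    }
    return total;
}
```

Calling `run_workers(inputs, 3)` returns three file names, and `sum_outputs` then reads them one at a time. The client only ever holds the file list, which is why this pattern avoids the memory pressure of a single large merge.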

pauln · August 31, 2020, 2:53pm

Thanks for your response @ganis, I will indeed look into this!

Cheers,
Paul
