I have an analysis to do with PROOF, and I am trying to understand its basic I/O. From a file “in.root” containing a tree “tree_in”, I want to create a new tree “tree_out” in a new file “out.root”.
I should point out that for the moment I’m using PROOF-Lite, but in the coming weeks I would like to use PROOF on Demand with the IBM LoadLeveler batch scheduler… with lots of nodes and no local hard drives (I mean each node has access to the same shared hard drive). Is the TProofOutputFile object still mandatory, as in root/tutorials/ProofNtuple.C?
So my problem is that I don’t know where to create the new branch.
class MySelector : public TSelector {
public :
   TFile            *fFile;
   TProofOutputFile *fProofFile;
   TTree            *tree_out;
   TTree            *tree_in;
   TBranch          *b_in;
   TBranch          *b_out;
   ...
I was not able to adapt root/tutorials/ProofNtuple.C… I can read my tree_in, but I don’t know where to put the
tree_out = new TTree("mytree","my tree more informations...") ;
tree_out->Branch("mynewobject",&mynewobject,bsize,split) ;
It doesn’t work if I put it in SlaveBegin, the way it’s done with an ntuple in ProofNtuple:
// Now we create the ntuple
fNtp = new TNtuple("ntuple","Demo ntuple","px:py:pz:random:i");
I’ll post a “complete basic non-working example” if I’m not being clear…
For the moment, I can read a tree where each entry is a TClonesArray of TVectorD and write a tree where each entry is a TVectorD… But I have some problems writing a tree where each entry is also a TClonesArray…
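In the meantime, here is a minimal sketch of a SlaveBegin that creates the output tree inside a TProofOutputFile, following the pattern of root/tutorials/ProofNtuple.C. The member names (fProofFile, fFile, tree_out) match the class above; the "M" merge option and the bsize/split values are assumptions to be adapted:

```cpp
// Sketch only, modelled on tutorials/proof/ProofNtuple.C.
// The tree must be created AFTER the worker's output file is opened,
// so that it is attached to that file and merged at the end of the query.
void MySelector::SlaveBegin(TTree * /*tree*/)
{
   // One file per worker; "M" asks PROOF to merge them into out.root
   fProofFile = new TProofOutputFile("out.root", "M");
   fFile = fProofFile->OpenFile("RECREATE");
   if (!fFile || fFile->IsZombie()) {
      Abort("could not open the worker output file", kAbortProcess);
      return;
   }

   // Create the output tree in the directory of the open file
   tree_out = new TTree("mytree", "my tree, more information...");
   tree_out->Branch("mynewobject", &mynewobject, bsize, split);
   tree_out->SetDirectory(fFile);
   tree_out->AutoSave();
}
```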
Indeed with
class MySelector : public TSelector {
public :
   TFile            *fFile;
   TProofOutputFile *fProofFile;
   TTree            *newtree;
   TTree            *tree;
   TBranch          *b_vec;
   TBranch          *b_newvec;
   TClonesArray     *vec;
   TClonesArray     *newvec;
   ...
I can now process a TTree with a TClonesArray and it gives me another TTree with a TClonesArray…
(TFile in.root / TTree tree / TBranch vec => TFile out.root / TTree newtree / TBranch newvec)
I did it with:
void MySelector::SlaveBegin(TTree * /*tree*/) {
   ...
   newtree = new TTree(treeout, "treeout blabla");
   newvec  = new TClonesArray("TVectorD");
   ...
}
Bool_t MySelector::Process(Long64_t entry) {
   if (!newtree) return kTRUE;
   vec = 0;
   // newvec = 0;
   if (tree) {
      Long64_t ent = entry % tree->GetEntries();
      tree->SetBranchAddress(branchin, &vec, &b_vec);
      b_vec->GetEntry(ent);
   } else {
      Abort("no way to get entries in the input tree... Stop processing", kAbortProcess);
      return kTRUE;
   }
   RTensorT *pt = (RTensorT *) vec->At(0);
   new ((*newvec)[0]) RTensorT(*pt);
   newtree->SetBranchAddress(branchout, &newvec, &b_newvec);
   newtree->Fill();
   // delete vec;
   // delete newvec;
   return kTRUE;
}
and the rest is the same as ProofNtuple.C, with a tree replacing the ntuple…
So now my problem is the following: the input tree in fact contains several branches, with different numbers of entries… Is it possible with TProof to merge only some branches, and not the entire tree, without creating new trees containing the problematic branches?
b_newvec->Fill()
instead of
newtree->Fill()
crashed
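For what it’s worth, TBranch::Fill writes only that one branch, so its entry count gets out of step with the rest of the tree, which is a plausible reason for the crash; TTree::Fill fills all branches together. Below is a hedged sketch of a leaner Process, with the address binding hoisted out of the per-entry loop; it assumes b_newvec was created once in SlaveBegin with newtree->Branch(branchout, &newvec), and reuses branchin, vec, tree and RTensorT from the code above:

```cpp
// Sketch only: bind the input branch once per tree, in Init(),
// instead of calling SetBranchAddress on every entry in Process().
void MySelector::Init(TTree *intree)
{
   tree = intree;
   vec  = 0;
   if (tree) tree->SetBranchAddress(branchin, &vec, &b_vec);
}

Bool_t MySelector::Process(Long64_t entry)
{
   if (!newtree || !tree) return kTRUE;

   b_vec->GetEntry(entry);      // read only the branch we need

   newvec->Clear();             // drop the previous entry's objects
   RTensorT *pt = (RTensorT *) vec->At(0);
   new ((*newvec)[0]) RTensorT(*pt);

   newtree->Fill();             // fill ALL output branches consistently
   return kTRUE;
}
```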
And the old problem remains: what’s the best way to use TProof with PROOF on Demand on a big cluster with a single shared hard drive (as opposed to geographically separated clusters, each with its own hard drives)? Do I have to use TProofOutputFile? Do I have to use PAR packages?
Hi,
Sorry for the late reply.
First, your question:
PAR packages and TProofOutputFile have two different purposes, and only the second has to do with outputs, which is your problem.
TProofOutputFile was introduced to handle the case of big outputs, which can otherwise create memory issues.
So the first thing to understand is how big your output trees are going to be. If they can grow (very) large, then you need file support, which can be the common hard drive (distributed file system) that your worker nodes seem to have.
For the problems with the code: you wrote that you would post the full non-working example. Please do, so that I can better understand what you are trying to do and try to propose a solution with your target setup in mind.
My output with TProofOutputFile is working now, so I can keep that solution even if it’s perhaps not optimal for me… A typical output is currently 10 GB, but it could reach 100 GB in the near future… and each node of the cluster has 2 × 8 cores and 32 GB of RAM.
I was asking about PAR packages because I don’t know how to easily provide my compiled shared library to all my worker nodes directly from the common hard drive (I don’t want to rebuild anything, as all my worker nodes have exactly the same architecture as the master),
and it’s too complicated for me to work out in which order I would have to provide all the *.cc and *.h source files (with their interdependencies…).
Sorry, I misunderstood you, I thought you wanted to use the PAR file for the output file.
Yes, to load the required libraries and/or set up include paths you can use the SETUP.C macro of a dedicated PAR file. You can leave the rest of the PAR file empty; only SETUP.C will be executed when calling EnablePackage.
You can even pass an argument to EnablePackage, either a string or a list of objects; it will be passed to the SETUP function.
If you have many things to load, that’s definitely the best way to do it.
The loading order should be the same as in a plain ROOT session.
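To make this concrete, here is a sketch of what the SETUP.C of such an otherwise empty PAR package could look like; the paths and the library name are placeholders for whatever actually lives on the common hard drive:

```cpp
// SETUP.C of a PAR package: run on each worker by EnablePackage("mypar").
// The paths and library name below are examples, not real values.
Int_t SETUP()
{
   // Make the headers visible to the interpreter / ACLiC
   gSystem->AddIncludePath("-I/common/disk/myanalysis/include");

   // Load the pre-built shared library from the shared area
   if (gSystem->Load("/common/disk/myanalysis/lib/libMyAnalysis.so") < 0)
      return -1;  // a negative return value signals failure to PROOF

   return 0;
}
```

If you pass an argument to EnablePackage, define the macro as Int_t SETUP(TList *args) instead and read the argument from the list.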