SubSelectors

Has anyone thought about implementing a subselector? A subselector is similar to an observer in the observer pattern (http://en.wikipedia.org/wiki/Observer_pattern).

For example, suppose you have a selector that does your analysis, but the analysis can be divided into several subanalyses. Typically you would create two different selectors and run them on the same data. The problem is that the bottleneck is usually the I/O, so this is inefficient: the same data has to be read twice.

You can merge the two selectors:

MySelector::SlaveBegin()
{
   SlaveBegin1();
   SlaveBegin2();
}

MySelector::Process()
{
   GetEntry();
   Process1();
   Process2();
   ...
}

The problem is that this is not reusable. Instead, each task can be implemented as an observer class: a SubSelector.

MySelector::SlaveBegin()
{
   register_subselector(new Task1(this));
   register_subselector(new Task2(this));
   SubSelectors_SlaveBegin();
}

MySelector::SubSelectors_SlaveBegin()
{
    for (all subselectors) { subselector->SlaveBegin(); }
}

MySelector::Process()
{
   GetEntry();
   SubSelectors_Process();   
}

And for the subselector (the client, task, or observer):

class SubSelector : public TSelector
{
public:
    // the constructor stores a pointer back to the main selector
    SubSelector(MySelector *parent) : parent_selector(parent) {}
    MySelector* parent_selector;
    virtual void SlaveBegin() = 0;
    ...
};

class Task1 : public SubSelector
{
public:
  void Process()
  {
      // the entry was already loaded by the parent; only read its variables
      if (parent_selector->pt1 > 25000) histo_pt1->Fill(parent_selector->pt1);
  }
};
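The idea above can be tried out without ROOT. The following is only a minimal sketch of the observer structure, under several assumptions: `Event`, `PtCutTask`, `MainSelector`, `Register` and `RunDemo` are hypothetical names, the classes do not derive from TSelector, the event is passed to each task instead of being read through a parent pointer, and a simple counter stands in for a histogram.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the variables the main selector loads once per entry.
struct Event {
    double pt1;
};

// Observer interface: one analysis task driven by the main selector.
class SubSelector {
public:
    virtual ~SubSelector() = default;
    virtual void SlaveBegin() = 0;
    virtual void Process(const Event& event) = 0;
};

// Example task: counts entries above a threshold (stand-in for filling a histogram).
class PtCutTask : public SubSelector {
public:
    explicit PtCutTask(double threshold) : threshold_(threshold) {}
    void SlaveBegin() override { n_selected_ = 0; }
    void Process(const Event& event) override {
        if (event.pt1 > threshold_) ++n_selected_;
    }
    std::size_t selected() const { return n_selected_; }

private:
    double threshold_;
    std::size_t n_selected_ = 0;
};

// Subject: each entry is loaded once and forwarded to every registered task.
class MainSelector {
public:
    // Takes ownership of the task and returns a non-owning pointer to it.
    template <class Task>
    Task* Register(std::unique_ptr<Task> task) {
        assert(task != nullptr);
        Task* raw = task.get();
        tasks_.push_back(std::move(task));
        return raw;
    }
    void SlaveBegin() {
        for (auto& task : tasks_) task->SlaveBegin();
    }
    void Process(const Event& event) {
        for (auto& task : tasks_) task->Process(event);
    }

private:
    std::vector<std::unique_ptr<SubSelector>> tasks_;
};

// Runs two tasks over the same three entries in a single pass
// and returns the number of entries each task selected.
std::vector<std::size_t> RunDemo() {
    MainSelector selector;
    PtCutTask* loose = selector.Register(std::make_unique<PtCutTask>(25000.0));
    PtCutTask* tight = selector.Register(std::make_unique<PtCutTask>(40000.0));
    selector.SlaveBegin();
    const std::vector<Event> entries = {{10000.0}, {30000.0}, {50000.0}};
    for (const Event& entry : entries) selector.Process(entry);
    return {loose->selected(), tight->selected()};
}
```

Both tasks see every entry, but the loop over the data runs only once, which is the point when I/O is the bottleneck.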

It is probably better to declare SubSelector as a template, like:

template<class FatherSelectorType>
class SubSelector : public TSelector
{
public:
    // the parent type is a template parameter, so the base class is reusable
    SubSelector(FatherSelectorType *parent) : parent_selector(parent) {}
    FatherSelectorType* parent_selector;
    virtual void SlaveBegin() = 0;
    ...
};
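To check that the template variant compiles and behaves as intended, here is a small standalone sketch. It is not tied to ROOT: `ToySelector`, `PtCount` and `CountAboveThreshold` are hypothetical names, the base class does not derive from TSelector, and a counter again replaces the histogram.

```cpp
#include <cassert>
#include <cstddef>

// Base class for tasks that read entry data through a pointer to their parent selector.
template <class ParentSelector>
class SubSelector {
public:
    explicit SubSelector(ParentSelector* parent) : parent_selector(parent) {
        assert(parent != nullptr);
    }
    virtual ~SubSelector() = default;
    virtual void SlaveBegin() = 0;
    virtual void Process() = 0;

protected:
    ParentSelector* parent_selector;
};

// Hypothetical parent: exposes only the variable the task needs.
struct ToySelector {
    double pt1 = 0.0;
};

// Task bound to ToySelector; any other parent type with a pt1 member
// could reuse the same base class template.
class PtCount : public SubSelector<ToySelector> {
public:
    using SubSelector<ToySelector>::SubSelector;  // inherit the constructor
    void SlaveBegin() override { count_ = 0; }
    void Process() override {
        if (parent_selector->pt1 > 25000.0) ++count_;
    }
    std::size_t count() const { return count_; }

private:
    std::size_t count_ = 0;
};

// The parent "loads" three entries one after another; the task reads each of them.
std::size_t CountAboveThreshold() {
    ToySelector parent;
    PtCount task(&parent);
    task.SlaveBegin();
    const double values[] = {12000.0, 26000.0, 90000.0};
    for (double pt : values) {
        parent.pt1 = pt;  // stands in for GetEntry() on the parent
        task.Process();
    }
    return task.count();
}
```

With the parent type as a template parameter, the task accesses the parent's members directly and no cast from a generic selector pointer is needed.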

I am aware of several ROOT analysis frameworks, including the one I use, which essentially do this: they have a master selector which loads the data once and then runs several analysis classes on that data before loading the next event. This is just my opinion, but there are enough other complications with doing it (for instance, separating the output of the individual analyses) that a native implementation might not be particularly helpful.

P.S. If you want to see the code I can provide it.

In my opinion it would be good to have an official class that solves some of these problems, for example the merging of objects created by the SubSelectors. At the moment I solve it by adding these objects to the fOutput of the main TSelector.

I think there is an official solution to that issue, which may be faster as well: using TProofOutputFile for both the single and multiple output cases. On each worker you have one or more output files and add whatever objects you want to them. Then all you have to do is add the TProofOutputFile(s) to fOutput; all objects in them go along for the ride and get merged automatically, with no need to add them individually to fOutput. The TProofOutputFile also handles creating the output files, so you do not have to retrieve objects from the output list on the client.

And as I said before, you can use multiple TProofOutputFiles if you want multiple output files from the job. I’ve been using this feature in particular to run over large signal Monte Carlo grids in a single job, with a separate output file with histograms for each mass point in the grid.