Creating histograms while reading entries with dataframes

dudouet · October 3, 2024, 3:09pm

Hello, me again

In my standard analysis with a TSelector, I am reading my nuclei events by Z and N values event/event. If Z and N does not correspond to an already seen nucleus, I create a new histogram and fill it, something like:


std::map<int, map<int, TH1D*>> mapofhistos;

for(int i=0 ; i<tree->GetEntries ; i++) { // to emulate the selector loop
    tree->GetEntry(i);
    if(!mapofhistos.count(Z) || !mapofhistos(Z).count(N)) {
        mapofhistos[Z][N] = new TH1D(...);
    }
    mapofhistos[Z][N]->Fill(E);
}

How can I produce something similar using dataframes ?

Thanks in advance

StephanH · October 3, 2024, 4:06pm

Hello,

you cannot (rather should not) reproduce side effects in RDataFrame. In your example, the loop writes into a “global” map as a side effect of running through the tree. In RDF, this shouldn’t be done, because you could not run this multithreaded.

Although we could write a custom action with locks and everything for this, and hook this into RDataFrame, there is probably an easier solution:
You can write one filter per Z and N, and create a histogram. This would roughly look like this:

std::map<std::pair<int,int>, ROOT::RDF::RResultPtr<double>> mapofhistos;

for (int Z = 10; Z < 20; ++Z) {
  for (int N = Z/2; N < Z + 20; ++N) {
    auto histo = rdf.Filter(<Filter on N and Z>)
                    .Histo1D(....);
    mapofhistos[std::pair{Z,N}] = std::move(histo);
  }
}

It does more work than your original loop, and some of these combinations might not exist, but it can run multithreaded. If you run this a lot of times, you can also create a list of all (Z,N) pairs that you are interested in, and create the entries only for that.

dudouet · October 4, 2024, 7:30am

Thanks for your answer.

Could’t be easier to make a first short loop on Z and A only to fill a TH2, to be then used to create the desired spectra ? In such case, do I need to unset branches as in classical analysis ‘SetBranchStatus(false)’ or is is automatically the case when I do something like:

    auto ZvsA = dfnode.Histo2D({"ZvsA", "ZvsA", 200, 0, 200,100,0,100}, "A","Z");

I have another ver naive question. In my analysis, I need to make gamma-gamma matrices to correlate all gamma-rays from the same event. In Selector, this result in something like:

TH2F *mymatrix = new TH2F("mymatrix","mymatrix",2000,0,2000,2000,0,2000);

for(int i=0 ; i<tree->GetEntries ; i++) { 
    tree->GetEntry(i);
    for(int i=0 ; i<GammaMult ; i++) {
        for(int j=0 ; j<i ; j++) {
            mymatrix->Fill(GammaE[i],GammaE[j]);
            mymatrix->Fill(GammaE[j],GammaE[i]);
        }
    }
}

What is the best way to reproduce something like this using dataframes ?

StephanH · October 4, 2024, 9:27am

In such case, do I need to unset branches as in classical analysis ‘SetBranchStatus(false)’ or is is automatically the case when I do something like

No, RDataFrame does this automatically. It only reads the branches that you are actively asking for. Your solution of doing a double event loop looks very reasonable!

What is the best way to reproduce something like this using dataframes ?

You do it in two steps:

Define the columns that you want to histogram, e.g. like in this tutorial df107. Check the function WithDataframe. It takes a vector of values, and also returns a vector of values. So you can write a loop over all combinations like you did above, and return all the results. This is the example from the tutorial that loops through a collection of inputs, and returns a collection of outputs:

auto CalcPt = [](RVecD &px, RVecD &py, RVecD &E) {
      RVecD v;
      for (auto i=0U;i < px.size(); ++i) {
         if (E[i] > 100) {
            v.emplace_back(sqrt(px[i]*px[i] + py[i]*py[i]));
         }
      }
      return v;
   };
   f.Define("pt", CalcPt, {"px", "py", "E"});

You can do the same, writing all the left-hand values, and in a second define you write all the right-hand values.

And this is a helper from ROOT:VecOps that can give you all indices that you want, so you don’t have to compute them yourself:
https://root.cern/doc/master/group__vecops.html#ga134d68284fca51f3460c8c3c508ad351

auto v_3 = Combinations(v, 3);
(ROOT::VecOps::RVec<ROOT::VecOps::RVec<unsigned long> > &) { { 0, 0, 0, 1 }, { 1, 1, 2, 2 }, { 2, 3, 3, 3 } }

With the columns “GammaE_i” and “GammaE_j” defined, you can simply histogram them. Have a look at the documentation here. You will see that it can histogram both single values, but also collections of those, so you can directly histogram what you defined in step 1.

If you want to do both the combinations (i,j) and (j,i), I recommend to make two histograms, one for (i,j) and one for (j,i), and when done you add those histograms.

There is an alternative to write this all manually as a custom RDataFrame action:
https://root.cern/doc/master/df018__customActions_8C.html

You have to watch out a bit to make it threadsafe if you want to run the RDF multithreaded. If you are comfortable with that, the custom action is a viable option.
Otherwise, define the steps as above, and the multithreading will work out of the box.

dudouet · October 7, 2024, 1:04pm

Here is what I have done:

auto Gamma_i = [](unsigned int, RVecF &E) {
        RVecF vec;
        if(E.size()>1) {
            auto combi = Combinations(E,2);
            vec.emplace_back(E.at(0));
        }
        return vec;
    };
    auto Gamma_j = [](unsigned int, RVecF &E) {
        RVecF vec;
        if(E.size()>1) {
            auto combi = Combinations(E,2);
            vec.emplace_back(E.at(1));
        }
        return vec;
    };
    auto GGHist_df = dfnode.DefineSlot("GGE1", Gamma_i, {"trackEDC"})
                      .DefineSlot("GGE2", Gamma_j, {"trackEDC"});

    auto GGHist = GGHist_df.Histo2D({"GGHist", "GGHist", 2000, 0., 2000., 2000, 0., 2000.}, "GGE1", "GGE2");

This is working, but it’s a bit annoying to be obliged to repeat two times the same function. I have not seen a way to pass arguments to a function given to a Define()

I have tried something like this:

    int gg_idx=0;
    auto Gamma_ij = [&gg_idx](unsigned int, RVecF &E) {
        RVecF vec;
        if(E.size()>1) {
            auto combi = Combinations(E,2);
            for(auto &idx: combi.at(gg_idx)) vec.emplace_back(E.at(idx));
            (gg_idx == 0) ? gg_idx = 1 : gg_idx = 0;
        }
        return vec;
    };

auto GGHist_df = dfnode.DefineSlot("GGE1", Gamma_ij, {"trackEDC"})
                      .DefineSlot("GGE2", Gamma_ij, {"trackEDC"});

But it’s not working, I have dummy values in my matrix

Any idea ?

Thanks

Jérémie

StephanH · October 8, 2024, 3:20pm

Hello Jérémie,

I have a few comments, in random order:

DefineSlot is only necessary if you want to do something that is not thread safe (let’s say in this case not self-contained). That doesn’t seem to be the case, since you are not writing to any external objects. So in the current case, you could probably use Define and leave out the first unsigned int from the arguments.
In your first example, you don’t seem to use the result of auto combi = Combinations(E,2);, so it won’t have any effect. Maybe it’s on older version of the function?
Your second example seems to go through the combinations, but it has one problem: It captures gg_idx by reference. That means that when RDataFrame runs, you are changing a variable that’s outside of the lambda, so the two will interfere in a not-so-nice way.
What you could do instead is to create the lambda with a capture by value, so the lambda carries the value around, and it cannot be changed at run time. An easy way to do it without code duplication is to create it inside another lambda. This could look like (please excuse any typos):

auto makeGamma_ij = [](int gg_idx) {
  return [gg_idx](RVecF const & E) {
    RVecF vec;
    if(E.size()>1) {
      auto combi = Combinations(E.size(), 2);
      for(auto const idx : combi.at(gg_idx)) vec.emplace_back(E.at(idx));
    }
    return vec;
  };
};

auto GGHist_df = dfnode.Define("GGE1", makeGamma_ij(0), {"trackEDC"})
                       .Define("GGE2", makeGamma_ij(1), {"trackEDC"});

Note that I used [gg_idx], not [&gg_idx], to keep everything constant and thread-safe. I also added a few const for the same reason. You didn’t need them, but they might prevent future errors in case you would decide to change the code.

dudouet · October 9, 2024, 8:26am

Thanks a lot Stephan !

I was wondering how we can pass arguments to define functions, that’s exactly what you made

Actually, I never used such imbricated lambdas, that very powerful !

dudouet · October 9, 2024, 10:03am

Just a comment: Combinations(E.size(), 2); is not what I am needed, because I need all combinations between the elements of a vector of size E.size(). I need either to do:

auto makeGamma_ij = [](int gg_idx) {
        return [gg_idx](RVecF const & E) {
            RVecF vec;
            if(E.size()>1) {
                auto combi = ROOT::VecOps::Combinations(E, 2);
                for (auto const idx : combi.at(gg_idx)) vec.emplace_back(E.at(idx)); // loop i<j
                for (auto const idx : combi.at((gg_idx == 0) ? 1 : 0)) vec.emplace_back(E.at(idx)); // to symetrize the matrix
            }
            return vec;
        };
    };

Either:

auto makeGamma_ij = [](int gg_idx) {
        return [gg_idx](RVecF const & E) {
            RVecF vec;
            if(E.size()>1) {
                auto combi = ROOT::VecOps::Combinations(E.size(), E.size());
                for(int i=0 ; i<combi.front().size() ; i++) if(combi.at(0).at(i) != combi.at(1).at(i)) vec.emplace_back(E.at(combi.at(gg_idx).at(i)));
            }
            return vec;
        };
    };

By the way, is it not too CPU time consuming to calculate the combinations for each event ? Can we think to something like having a global variable like:

map<int, RVec< RVec< std::size_t > > > map_of_combinations;

that is passed to the lambda, and we calculate only once the new combinations ? But in this case, we would need to make sure that the function is thread safe. Using DefineSlot ? how this should be used ?

Something like this is ok ?

    map<int, RVec< RVec< std::size_t > > > map_of_combinations;

    // Gamma-Gamma matrix definition
    auto makeGamma_ij = [&map_of_combinations](int gg_idx) {
        return [gg_idx,&map_of_combinations](unsigned int, RVecF const & E) {
            RVecF vec;
            if(E.size()>1) {
                if(!map_of_combinations.count(E.size())) {
                    map_of_combinations.emplace(E.size(),ROOT::VecOps::Combinations(E.size(), E.size()));
                }
                auto combi = map_of_combinations.at(E.size());
                for(int i=0 ; i<combi.front().size() ; i++) if(combi.at(0).at(i) != combi.at(1).at(i)) vec.emplace_back(E.at(combi.at(gg_idx).at(i)));
            }
            return vec;
        };
    };
    auto GGHist_df = dfnode.DefineSlot("trackEDC_1", makeGamma_ij(0), {"trackEDC"}).DefineSlot("trackEDC_2", makeGamma_ij(1), {"trackEDC"});

Or I need to use mutex locks for thread safety ?

Thanks in advance

StephanH · October 10, 2024, 8:29am

Hello,

You can always measure first.

If you want to cache the combinations, you either need a cache per slot, or you protect the map_of_combinations with a lock every time you read or write.

dudouet · October 10, 2024, 8:45am

Can you show me an example of how doing such a cache per slot ?

StephanH · October 11, 2024, 3:10pm

You have to make one map_of_combinations for each slot you will use.
When you write the function that acts on the slot, take the slot index, and retrieve the map_of_combinations for that slot number.

So using your example:

std::vector<map<int, RVec< RVec< std::size_t > > > > map_of_combinations_perSlot;
map_of_combinations_perSlot.resize(???); // This is important. You have to create the correct size *before* you run the event loop for it to be thread safe.
// Fill in the number of threads you will use.

    // Gamma-Gamma matrix definition
    auto makeGamma_ij = [&map_of_combinations](int gg_idx) {
        return [gg_idx,& map_of_combinations_perSlot](unsigned int slotNumber, RVecF const & E) {
            RVecF vec;
            // Get a map_of_combinations that's *only* for this slot:
            auto & map_of_combinations = map_of_combinations_perSlot[slotNumber];

            if(E.size()>1) {
                if(!map_of_combinations.count(E.size())) {
                    map_of_combinations.emplace(E.size(),ROOT::VecOps::Combinations(E.size(), E.size()));
                }
                auto combi = map_of_combinations.at(E.size());
                for(int i=0 ; i<combi.front().size() ; i++) if(combi.at(0).at(i) != combi.at(1).at(i)) vec.emplace_back(E.at(combi.at(gg_idx).at(i)));
            }
            return vec;
        };
    };
    auto GGHist_df = dfnode.DefineSlot("trackEDC_1", makeGamma_ij(0), {"trackEDC"}).DefineSlot("trackEDC_2", makeGamma_ij(1), {"trackEDC"});

You see now that we are doing the work separately for each slot, but it’s thread safe. Alternatively, you make one global supermap, but in that case, you always have to lock when you access the map. The threads would share the information, though.

dudouet · October 14, 2024, 1:13pm

Thanks a lot for the example.

I tried to generalize this to 3D, but the code is crashing, as if the TH3 object size is too large:

    auto GGGHist = GGHist_df.Histo3D({"GGHist", "GGHist", 2000, 0., 2000., 2000, 0., 2000., 2000, 0., 2000.}, "trackEDC_1", "trackEDC_2", "trackEDC_2");

The output error is the following:

 *** Break *** segmentation violation
[/usr/lib/system/libsystem_platform.dylib] _sigtramp (no debug info)
[<unknown binary>] (no debug info)
[/Users/dudouet/Softs/root/install/root-6-32-06/lib/libHist.so] TH3::Fill(double, double, double) (no debug info)
[<unknown binary>] (no debug info)
[<unknown binary>] (no debug info)
[<unknown binary>] (no debug info)
[/Users/dudouet/Softs/root/install/root-6-32-06/lib/libROOTDataFrame.so] ROOT::Detail::RDF::RLoopManager::RunAndCheckFilters(unsigned int, long long) (no debug info)
[/Users/dudouet/Softs/root/install/root-6-32-06/lib/libROOTDataFrame.so] ROOT::Detail::RDF::RLoopManager::RunTreeReader() (no debug info)
[/Users/dudouet/Softs/root/install/root-6-32-06/lib/libROOTDataFrame.so] ROOT::Detail::RDF::RLoopManager::Run(bool) (no debug info)
[/Users/dudouet/Softs/EurolabsSchool/install/bin/AGATA_VAMOS_Analyzer] ROOT::RDF::RResultPtr<TH1D>::TriggerRun() /Users/dudouet/Softs/root/install/root-6-32-06/include/ROOT/RResultPtr.hxx:398
[/Users/dudouet/Softs/EurolabsSchool/install/bin/AGATA_VAMOS_Analyzer] ROOT::RDF::RResultPtr<TH1D>::Get() /Users/dudouet/Softs/root/install/root-6-32-06/include/ROOT/RResultPtr.hxx:0
[/Users/dudouet/Softs/EurolabsSchool/install/bin/AGATA_VAMOS_Analyzer] ROOT::RDF::RResultPtr<TH1D>::operator->() /Users/dudouet/Softs/root/install/root-6-32-06/include/ROOT/RResultPtr.hxx:252

The ROOT limitation in size was 2Go if I remember well, but only for writing objects on disk, not in memory. Is it the reason of such crash ?

Jérémie

StephanH · October 15, 2024, 2:03pm

Segmentation violation looks like something else. It seems to be calling TH3::Fill as we see from the stack trace, so my first guess is that this is being called on an invalid histogram. If you were out of memory, the problem would have shown before when allocating the histograms.

To debug it, you might have to compile the example with debug symbols, that is root macroFile.cxx+g in case you are running a macro, or you have to switch your build system to debug.

dudouet · October 15, 2024, 2:24pm

What is strange is that there is no crash if I reduce the histogram size.

My code is already compiled in debug, to have more details I need to recompile ROOT in debug. Using my debugger, the only information that I can extract is that this is coming from the TH3D::AddBinContent(int) method.

StephanH · October 16, 2024, 8:51am

That’s strange, indeed. Did you monitor the memory consumption while the analysis runs?

dudouet · October 16, 2024, 9:08am

I tried but it crashes so fast that it’s not easy to do, but I actually a TH3D of 2000 per axis means ~60Go, so this cannot be handled by my laptop for sure

dudouet · October 24, 2024, 2:48pm

Sorry, I have another question for you. We are loosing a bit the initial subject but I don’t want to flood the forum with so simple questions…

I would like to fill an histogram of “N” from a TTree using dataframe, with two different branches “N1” and “N2”, namely:

TH1F *h = new TH1F("name","title"...);
h->Fill(N1);
h->Fill(N2);

Is there a way to do it simple, or do I need to redefine a new column, “N”, which is the merge of “N1” and “N2”, and then initialize the TH1 on this new variable ?

Thanks in advance

Jérémie

system · November 7, 2024, 2:48pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.