Extract information outside of columns provided in RDataFrame Define

Hi all,

I have two questions related to the RDataFrame usage.
Both matter for whether I can continue using RDataFrame in my work.
I provide here only very simplified examples that should give you an idea of what I want.

  1. Is there a way to get a branch name from the user-custom function?
import ROOT
ROOT.EnableImplicitMT()
ROOT.gInterpreter.Declare('#include "functions.h"')

df = ROOT.RDataFrame(10)\
         .Define("x", "rdfentry_")\
         .Define("y", "x*x")\
         .Define("z", "my_func(x, y)")
//functions.h
auto my_func(int x, int y){
    //FIXME: make sure that `y` data comes from the "y" named column!
    return x*x+y*y;
}
  2. Is there a way to get the event data if I know the column’s name?
import ROOT
ROOT.EnableImplicitMT()
ROOT.gInterpreter.Declare('#include "functions.h"')

df = ROOT.RDataFrame(10)\
         .Define("person", "gRandom->Integer(3)")\
         .Define("present_for_frodo", "rdfentry_")\
         .Define("present_for_santa", "rdfentry_")\
         .Define("present_for_anakin", "rdfentry_")\
         .Define("presents", "get_presents(person)")
//functions.h
int get_presents(int person){
    std::string column_name = "";
    if (person == 0) column_name = "present_for_frodo";
    else if (person == 1) column_name = "present_for_santa";
    else if (person == 2) column_name = "present_for_anakin";
    //FIXME I know the column_name but I don't know how to get the data
    return -1;
}

Hello,

Thanks for the questions: they are not trivial at all!
Let me try to provide two initial answers, then we can maybe iterate:

About your item 1: by using the string "my_func(x, y)" you are already guaranteed that the columns x and y are used: RDF looks up those names in the list of available columns. The same holds if you use the full C++ syntax, e.g. Define("r", func, {"x","y"}). Once the value has been passed to the function (or functor), it is not easy to get back to the name of the column in the RDF tree.

About your item 2: Once inside a C++ function (or functor), there is no easy way to access the RDF tree to get the value of a column by name, unless of course you pass the possible values via the signature and then pick the right one.
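A minimal sketch of the "pass the possible values via the signature" idea from item 2, written as plain C++ so it can be tried outside ROOT. The extra parameters, their order, and their integer types are assumptions for illustration. In the Python example the call would then presumably read .Define("presents", "get_presents(person, present_for_frodo, present_for_santa, present_for_anakin)").

```cpp
// functions.h (sketch): every candidate column is passed in explicitly
// and the function picks one of them based on `person`.
long long get_presents(unsigned int person,
                       unsigned long long present_for_frodo,
                       unsigned long long present_for_santa,
                       unsigned long long present_for_anakin) {
    switch (person) {
        case 0: return static_cast<long long>(present_for_frodo);
        case 1: return static_cast<long long>(present_for_santa);
        case 2: return static_cast<long long>(present_for_anakin);
        default: return -1;  // unknown person: same fallback as the original example
    }
}
```

The column names still have to be listed by hand at the call site, so this does not remove the hardcoding; it only moves the selection into the function.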

Thank you for distilling these examples into very clear questions, and apologies for the very generic initial answers. If the answers are not what you were expecting, would it perhaps help to discuss in a bit more detail the context and the use case you are trying to address?

Cheers,
Danilo

I am working with simulated files in the EDM4hep event data format, which is based on podio.
I know that, for example, FCCAnalysis tries to combine EDM4hep and RDataFrame, and so do I, because I like RDataFrame’s performance.

The TTree structure of an EDM4hep file looks something like this.

The main trouble comes when working with branches that relate to other branches, like in my example, where the branch “person” indicates from which branch to take the “present”.

In my TTree:
Ecal*, Hcal*, LCAL, LHCAL, MUON are collections of CalorimeterHits from different subdetectors in the event (RVec<edm4hep::CalorimeterHitData>).

PandoraClusters is the collection of clusters in the event (RVec<edm4hep::ClusterData>).
Each cluster has associated CalorimeterHits from the abovementioned collections.
_PandoraClusters_hits (RVec<podio::ObjectID>) stores the information about these cluster hits: a collectionID, which maps one-to-one to the name of a collection, and the index of the cluster hit in that collection.

I would like to collect all the hits related to each PandoraCluster in a single collection.

I managed to get what I wanted, but it is very ugly, because a) all collection names are hardcoded; b) the position of all input arguments is hardcoded:

// Assumes collection_id2name (a lookup from collectionID to collection name) is defined elsewhere.
auto get_cluster_hits(const edm4hep::ClusterData& cluster, const ROOT::VecOps::RVec<podio::ObjectID>& hit_ids_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& lhcal_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& lcal_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& muon_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& ecal_barrel_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& ecal_barrel_gap_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& ecal_endcap_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& ecal_endcap_gap_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& ecal_endcap_ring_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& hcal_barrel_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& hcal_endcap_hits_col,
                        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>& hcal_endcap_ring_hits_col){
    // Return the calorimeter hits associated with the given cluster.
    ROOT::VecOps::RVec<edm4hep::CalorimeterHitData> result;

    auto n_hits = cluster.hits_end - cluster.hits_begin;
    if (n_hits == 0) return result;

    for(auto i=cluster.hits_begin; i != cluster.hits_end; i++){
        const auto& objID = hit_ids_col[i];
        const auto& col_name = collection_id2name[objID.collectionID];
        // Point at the matching collection instead of copying it for every hit.
        const ROOT::VecOps::RVec<edm4hep::CalorimeterHitData>* hits_col = nullptr;
        if (col_name == "LHCAL") hits_col = &lhcal_hits_col;
        else if (col_name == "LCAL") hits_col = &lcal_hits_col;
        else if (col_name == "MUON") hits_col = &muon_hits_col;
        else if (col_name == "EcalBarrelCollectionRec") hits_col = &ecal_barrel_hits_col;
        else if (col_name == "EcalBarrelCollectionGapHits") hits_col = &ecal_barrel_gap_hits_col;
        else if (col_name == "EcalEndcapsCollectionRec") hits_col = &ecal_endcap_hits_col;
        else if (col_name == "EcalEndcapsCollectionGapHits") hits_col = &ecal_endcap_gap_hits_col;
        else if (col_name == "EcalEndcapRingCollectionRec") hits_col = &ecal_endcap_ring_hits_col;
        else if (col_name == "HcalBarrelCollectionRec") hits_col = &hcal_barrel_hits_col;
        else if (col_name == "HcalEndcapsCollectionRec") hits_col = &hcal_endcap_hits_col;
        else if (col_name == "HcalEndcapRingCollectionRec") hits_col = &hcal_endcap_ring_hits_col;
        // Unknown collection name: skip the hit rather than index into nothing.
        if (!hits_col) continue;
        result.push_back((*hits_col)[objID.index]);
    }
    return result;
}

In EDM4hep this is a common way to link related information:

  • Calorimeter Cluster ↔ Calorimeter Cluster Hits
  • Track ↔ Track Hits
  • Reconstructed Particle ↔ its tracks/clusters
  • MCParticle ↔ ReconstructedParticle/Track/Cluster
  • SimHits ↔ RecoHits

So, currently, using RDataFrame with EDM4hep relations is very inconvenient and requires a lot of hardcoding…

I am wondering if there are already existing tools in ROOT, which I am not aware of, that could improve my ugly hardcoded example above.
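For illustration only, here is one way the if/else chain could be replaced by a lookup table keyed on collectionID. This is a sketch using plain standard-library stand-ins, not an existing ROOT, podio, or EDM4hep tool: `Hit`, `ObjectID`, `Cluster`, `HitTable`, and `collect_cluster_hits` are hypothetical names, and the structs model only the fields used in the function above.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for edm4hep::CalorimeterHitData, podio::ObjectID
// and edm4hep::ClusterData; only the fields used here are modelled.
struct Hit { float energy; };
struct ObjectID { int index; unsigned int collectionID; };
struct Cluster { unsigned int hits_begin; unsigned int hits_end; };

// Maps a collectionID to the hit collection it refers to (non-owning pointers).
using HitTable = std::unordered_map<unsigned int, const std::vector<Hit>*>;

// Collect the hits referenced by hit_ids[hits_begin, hits_end) of one cluster.
std::vector<Hit> collect_cluster_hits(const Cluster& cluster,
                                      const std::vector<ObjectID>& hit_ids,
                                      const HitTable& table) {
    std::vector<Hit> result;
    for (unsigned int i = cluster.hits_begin; i != cluster.hits_end; ++i) {
        const ObjectID& id = hit_ids[i];
        auto it = table.find(id.collectionID);
        if (it == table.end()) continue;  // unknown collection: skip the hit
        result.push_back((*it->second)[static_cast<std::size_t>(id.index)]);
    }
    return result;
}
```

The table would still have to be filled once per event from the collections passed to Define, so every collection is still named at the call site; the sketch only removes the per-hit string comparisons and the dependence on argument order inside the function body.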
