Problem with filtering an RDataFrame

mdgalati · October 10, 2023, 9:52am

I am experiencing an unexpected issue while working with DataFrame filtering using the ROOT library in Python. Below is a simplified version of my code:

... df.Define("shift", "getShift()") ...

where:

getShift = """
#include <chrono>
#include "TRandom3.h"

float getShift() {
    // smear vertices with a gaussian (interaction region of 2*sigma_beam = 126 mm)
    auto now = std::chrono::system_clock::now();
    auto timeSeed = now.time_since_epoch().count();
    auto rnd = TRandom3(timeSeed);
    auto shift = rnd.Gaus(0, 63.);
    return shift;
}
"""
R.gInterpreter.Declare(getShift)

…

df_Fil = df.Filter("nHits > 0 && PZ_pip0 > 0 && PZ_pip1 > 0 && PZ_pim0 > 0 && \
std::isnan(theta_pvtv) == 0 && std::isnan(theta_fh) == 0 && std::isnan(theta_TRUE) == 0")

nBhits = (df_Fil.Filter('nHits_mother > 0')).Count().GetValue()
print(f"stage1: {df_Fil.Count().GetValue()}")
noBhits = (df_Fil.Filter('nHits_mother == 0')).Count().GetValue()
print(f"stage2: {df_Fil.Count().GetValue()}")
ntauhits = (df_Fil.Filter('nHits_daughter > 0')).Count().GetValue()
print(f"stage3: {df_Fil.Count().GetValue()}")
notauhits = (df_Fil.Filter('nHits_daughter == 0')).Count().GetValue()
print(f"stage4: {df_Fil.Count().GetValue()}")
notauhits = (df_Fil.Filter('nHits_daughter == 0')).Count().GetValue()
print(f"stage5: {df_Fil.Count().GetValue()}")
nr_Fil = df_Fil.Count().GetValue()
eff = nr_Fil / 10000000
print(f"nrFil = {nr_Fil}, nBhits = {nBhits}, ntauhits = {ntauhits}, noBhits = {noBhits}, notauhits = {notauhits}")

However, the output is not as expected:

stage1: 484
stage2: 510
stage3: 499
stage4: 483
stage5: 500
nrFil = 498, nBhits = 36, ntauhits = 493, noBhits = 465, notauhits = 18

(Check: it should be nBhits + ntauhits = nrFil, notauhits = nBhits, noBhits = ntauhits.)

The issue is that the counts of the DataFrame df_Fil seem to change unexpectedly after applying filters. I expected the count to remain consistent, but it appears to fluctuate after each filter is applied.

I am quite sure the problem comes from the random number “shift”, which is regenerated every time I use the Filter method and therefore changes the count in nrFil (it depends on the exact shifts).

Can anyone help me understand how to avoid this issue?

I’ve tried using TRandom3(42) for initialization and the counts at every stage are consistent, but it’s because the shift value remains constant.

mczurylo · October 10, 2023, 12:25pm

Hi @mdgalati,

thanks for your question!

Could you please add a full reproducer of your problem (or a simplified version), so I can run it and also test the solution? (If you prefer not to share it here, you can also send it to me via email).

Cheers,
Marta

mdgalati · October 10, 2023, 2:15pm

Hi Marta,
my code reads a ROOT file: d = R.RDataFrame("DecayTree", inputTree_path)
and I then add a variable called shift, to shift the generated primary vertices. The reason is that RapidSim always generates PVs at (0,0,0), whereas in reality at LHCb they are distributed with sigma = 63 mm; that’s why I defined the C++ function getShift().
I think that’s all you need: add a shift variable to any ROOT file and filter with something like df.Filter("shift > 0"), and you will see that the result of Count().GetValue() changes every time you run it.
Let me know, thanks!

mczurylo · October 10, 2023, 3:17pm

Hi @mdgalati,

OK, thanks for the clarification. Try booking all your lazy operations first, i.e. all the filters with their counts but without calling .GetValue(), and only retrieve the values once everything is booked. So something like:

nBhits = (df_Fil.Filter('nHits_mother > 0')).Count()
noBhits = (df_Fil.Filter('nHits_mother == 0')).Count()
ntauhits = (df_Fil.Filter('nHits_daughter > 0')).Count()
notauhits = (df_Fil.Filter('nHits_daughter == 0')).Count()
notauhits = (df_Fil.Filter('nHits_daughter == 0')).Count()
nr_Fil = df_Fil.Count()
print(f"nrFil = {nr_Fil.GetValue()}, nBhits = {nBhits.GetValue()}, ntauhits = {ntauhits.GetValue()}, noBhits = {noBhits.GetValue()}, notauhits = {notauhits.GetValue()}")

The way RDataFrame works is that it runs the event loop over the computation graph every time you ask for a result that has not been computed yet, here by calling “GetValue”. Each of those runs re-evaluates the Define and therefore generates a new random shift for every entry.
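
To make the difference concrete, here is a toy model in plain Python. It is only an analogy and not how RDataFrame is implemented: ToyFrame, count and get_value are made-up names, and the seeded random.Random stands in for the random shift. It shows that fetching each value right after booking it costs one loop per result, each with fresh random numbers, while booking everything first fills all results in one loop.

```python
import random


class ToyFrame:
    """Toy stand-in for a lazy dataframe (not ROOT code)."""

    def __init__(self, n_entries, seed):
        self.n_entries = n_entries
        self.rng = random.Random(seed)
        self.booked = []   # (predicate, result) pairs waiting for the next loop
        self.n_runs = 0    # number of event loops run so far

    def count(self, predicate):
        """Book a count lazily; nothing is computed yet."""
        result = {"value": None}
        self.booked.append((predicate, result))
        return result

    def get_value(self, result):
        """Return a booked result, running one event loop first if it is still empty."""
        if result["value"] is None:
            self._run()
        return result["value"]

    def _run(self):
        """One event loop: draw `shift` once per entry and feed it to every booked count."""
        self.n_runs += 1
        for _, result in self.booked:
            result["value"] = 0
        for _ in range(self.n_entries):
            shift = self.rng.gauss(0.0, 63.0)  # new random value in every loop
            for predicate, result in self.booked:
                if predicate(shift):
                    result["value"] += 1
        self.booked = []


# Pattern 1: fetch each value right after booking it -> one event loop per result.
eager = ToyFrame(1000, seed=1)
eager_pos = eager.get_value(eager.count(lambda shift: shift > 0))
eager_neg = eager.get_value(eager.count(lambda shift: shift <= 0))

# Pattern 2: book everything first, then fetch -> a single event loop fills all results.
lazy = ToyFrame(1000, seed=1)
booked_pos = lazy.count(lambda shift: shift > 0)
booked_neg = lazy.count(lambda shift: shift <= 0)
lazy_pos = lazy.get_value(booked_pos)
lazy_neg = lazy.get_value(booked_neg)

print(f"eager: {eager.n_runs} loops, pos + neg = {eager_pos + eager_neg}")
print(f"lazy:  {lazy.n_runs} loop,  pos + neg = {lazy_pos + lazy_neg}")
```

In the second pattern the two counts always add up to the number of entries, because both see the same shift for each entry. In the first pattern they come from two different sets of random numbers, so the sum is not guaranteed to match.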

Cheers,
Marta

mdgalati · October 11, 2023, 8:46am

Many thanks, Marta, this worked!
However, I still have an issue: when I print nr_Fil as you suggested I get 536, but when I save the dataframe to a ROOT file the tree has a different number of entries. So probably something similar also happens when using df_Fil.Snapshot?

In “ROOT: ROOT::RDataFrame Class Reference” I see "Snapshot can be made lazy setting the appropriate flag in the snapshot options.", but I can’t seem to find how to do that in PyROOT.
What I’m doing now is just dataframe.Snapshot("DecayTree", outFile, branchList)

mczurylo · October 11, 2023, 9:48am

Hi @mdgalati,

yes, Snapshot is an instant action and yes you are right, it can be made lazy:

lazy_options = ROOT.RDF.RSnapshotOptions()
lazy_options.fLazy = True
dataframe.Snapshot("DecayTree", outFile, branchList, lazy_options)

Cheers,
Marta

mdgalati · October 11, 2023, 9:56am

The tree still has a different number of entries from the one printed in the script…

df_Fil = df.Filter("nHits > 0 && PZ_pip0 > 0 && PZ_pip1 > 0 && PZ_pim0 > 0 && \
                   std::isnan(theta_pvtv) == 0 && std::isnan(theta_fh) == 0 && std::isnan(theta_TRUE) == 0")

nBhits = (df_Fil.Filter('nHits_mother > 0')).Count()
noBhits = (df_Fil.Filter('nHits_mother == 0')).Count()
ntauhits = (df_Fil.Filter('nHits_daughter > 0 && nHits_mother == 0')).Count()
notauhits = (df_Fil.Filter('nHits_daughter == 0 && nHits_mother > 0')).Count()
nr_Fil = df_Fil.Count()

print(f"nrFil = {nr_Fil.GetValue()}, nBhits = {nBhits.GetValue()}, ntauhits = {ntauhits.GetValue()}, \
       noBhits = {noBhits.GetValue()}, notauhits = {notauhits.GetValue()}")

eff = nr_Fil.GetValue()/10000000

print(f"Efficiency: {nr_Fil.GetValue()}/10000000 = {eff}")

branchList = R.vector('string')()
for branchName in ["m_3pi", "m_3pi_TRUE", "IP_3pi", "IP_3pi_TRUE", \
                "theta_fh", "mcorr_fh", "mcorr_B_TRUE", "mcorr_tau_TRUE", \
                "m_pip0_pim", "m_pip0_pim_TRUE", "m_pip1_pim", "m_pip1_pim_TRUE", \
                "shift", "PV", "TV", "FH", "PV_TRUE", "SV_TRUE", "EV_TRUE", "TV_TRUE", \
                "Pvec_3pi", "Pvec_3pi_TRUE", "P_3pi", "PT_3pi",\
                "theta_pvtv", "theta_TRUE", "full_FD", "full_FD_T",\
                "eta_mother", "eta_mother_TRUE", \
                "nHits", "nHits_mother", "nHits_father", "nHits_daughter"]:
    branchList.push_back(branchName)

    

filtered_path = f"/dcache/bfys/mgalati/RapidSim/{decay}/FIL"
outFile = filtered_path + f"/{decay}_{jobID}_FILhits.root"
lazy_options = R.RDF.RSnapshotOptions()
lazy_options.fLazy = True
print(f"Writing filtered TTree in '{outFile}'")
df_Fil.Snapshot("DecayTree", outFile, branchList, lazy_options)

Here’s the output:

nrFil = 519, nBhits = 34, ntauhits = 485,        noBhits = 485, notauhits = 18
Efficiency: 519/10000000 = 5.19e-05

Writing filtered TTree in '/dcache/bfys/mgalati/RapidSim/Bcp2taup2pipipi/FIL/Bcp2taup2pipipi_14621394_FILhits.root'
bash-5.2$ root -l /dcache/bfys/mgalati/RapidSim/Bcp2taup2pipipi/FIL/Bcp2taup2pipipi_14621394_FILhits.root
root [0] 
Attaching file /dcache/bfys/mgalati/RapidSim/Bcp2taup2pipipi/FIL/Bcp2taup2pipipi_14621394_FILhits.root as _file0...
(TFile *) 0x556423529f70
root [1] DecayTree->GetEntries()
(long long) 529

mczurylo · October 11, 2023, 10:05am

Hi,

you are still first triggering the computation graph using GetValue for printing. What if you do the snapshot first and then use the “GetValue” and all the printing at the end?

Cheers,
Marta

mdgalati · October 11, 2023, 1:51pm

It worked with this order

df_Fil.Snapshot("DecayTree", outFile, branchList)

print(f"nrFil = {nr_Fil.GetValue()}, nBhits = {nBhits.GetValue()}, ntauhits = {ntauhits.GetValue()}, \
       noBhits = {noBhits.GetValue()}, notauhits = {notauhits.GetValue()}")

eff = nr_Fil.GetValue()/10000000
print(f"Efficiency: {nr_Fil.GetValue()}/10000000 = {eff}")

and without the lazy option. If I use it, the file is not saved.

Many, many thanks for your help, Marta! Have a wonderful day.

mczurylo · October 11, 2023, 3:34pm

Hi @mdgalati,

this makes sense, although the reason is rather subtle. As with Count() (that is by definition lazy), the lazy version of Snapshot is not executed at the time you book it. However, for the Count()-ed example you used a variable, for example nr_Fil, so that the RDF remembers it exists and can execute the counting when you start the event loop by executing GetValue().

Now, for the lazy snapshot, if you don’t save it in a variable first, it will be created and forgotten directly, before the execution of the event loop with GetValue. So for you to have a working lazy snapshot, you need to simply save it as a variable and the snapshot action will then be remembered by the RDF. It won’t be executed at the time of booking, but will only be triggered while you run the next line with the GetValue() (for the Count() operation here, the snapshot action doesn’t need an additional call). I hope this is a bit clearer now. Sorry for not bringing it up earlier.
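
Put together, the pattern could be sketched as follows. This is only a sketch: book_and_run is a made-up helper name, df_node is assumed to be an RDataFrame node such as your df_Fil, and lazy_options is assumed to be a ROOT.RDF.RSnapshotOptions object with fLazy = True.

```python
def book_and_run(df_node, tree_name, out_file, columns, lazy_options):
    """Book a count and a lazy Snapshot on the same node, then run one event loop.

    Assumes `df_node` is an RDataFrame node (e.g. the result of a Filter) and
    `lazy_options` is a ROOT.RDF.RSnapshotOptions with fLazy set to True.
    """
    n_selected = df_node.Count()  # lazy: only booked
    # Keep the returned handle in a variable so the booked snapshot is not forgotten.
    snapshot = df_node.Snapshot(tree_name, out_file, columns, lazy_options)
    # The first GetValue() starts the event loop; the snapshot is filled in that same loop.
    n_entries = n_selected.GetValue()
    return n_entries, snapshot
```

With your variables this would be called as n_entries, snapshot = book_and_run(df_Fil, "DecayTree", outFile, branchList, lazy_options). Since the count and the tree are then filled in the same pass over the data, the printed number and the number of entries in the file should agree.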

Anyways, thank you for your question as well - it will help us improve the documentation.

If you have more RDF or other ROOT questions, we are here to help!

Cheers,
Marta
