Better way to determine most common difference of two columns of an RDataFrame

KAM · September 26, 2021, 4:10am

I have an RDataFrame, which consists of three columns (A, B, and run). Here’s what it might look like in tabular form:

+-----+-----+-----+
| run |  A  |  B  |
+-----+-----+-----+
| 001 |  35 |   5 |
| 001 |  40 |  10 |
| 001 |  77 |  60 |
|     |     |     |
| ... | ... | ... |
|     |     |     |
| 002 |  42 |  40 |
| 002 |  30 |  28 |
| 002 |  50 |   1 |
| ... | ... | ... |
+-----+-----+-----+

(where the … dots indicate continuation)

For each run, I want to determine the most common difference of A and B (i.e., the mode of the set of entry-wise differences of A and B for each run). For example, for run 002 we have 42 - 40 = 2, 30 - 28 = 2, and 50 - 1 = 49. The mode of 2, 2, and 49 is 2, so the result for run 002 is 2.

A couple of important notes: A-B is guaranteed to be a positive integer for all entries, and there is guaranteed to be a single unique mode for each run.

Currently, what I’m doing is this:

// Some code above

map<int, map<int, int>> offset;

// Capture the outer map by reference so the lambda can fill it
df.Foreach([&offset](int A, int B, int run){
    ++offset[run][A-B];
},{"A","B","run"});

// Some code that, for each map in the map, determines the key of the 
// max value (i.e. the mode), and stores this in a vector

So, we have a map of maps. The “outer” map maps run numbers to “interior” maps. “Interior” maps map offset (A-B) to frequency. I then, for each “interior” map, get the key of the largest value (i.e. the mode), and store this in a new map, which maps run numbers to mode of A-B for that run.
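
The mode-extraction step described above could look roughly like the following. This is a minimal sketch in standard C++ (C++17), not the exact code I run; the helper names `ModeOf` and `ModePerRun` are made up for illustration, and it stores the result in a map from run number to mode.

```cpp
#include <algorithm>
#include <cassert>
#include <map>

// Hypothetical helper: returns the key with the largest count in one
// "interior" map (offset A-B -> frequency).
int ModeOf(const std::map<int, int> &counts)
{
    assert(!counts.empty()); // every run is assumed to have at least one entry
    auto best = std::max_element(
        counts.begin(), counts.end(),
        [](const auto &a, const auto &b) { return a.second < b.second; });
    return best->first;
}

// Hypothetical helper: maps each run number to the mode of A-B for that run.
std::map<int, int> ModePerRun(const std::map<int, std::map<int, int>> &offset)
{
    std::map<int, int> modes;
    for (const auto &[run, counts] : offset)
        modes[run] = ModeOf(counts);
    return modes;
}
```

Calling `ModePerRun(offset)` after the `Foreach` has finished gives one entry per run. This step only loops over the distinct offsets, so it should be cheap compared to the loop over all entries.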

This works, but it’s quite slow (impractically so for my purposes). My RDataFrame contains tens of millions of entries or more.

I imagine there must be a better way. I’ve spent some time with the RDataFrame docs and, unfortunately, I can’t seem to piece together anything measurably more efficient.

Is there a better way? Please, let me know if the issue at hand isn’t clear. Thanks in advance!

dastudillo · September 26, 2021, 5:58am

I see you posted this problem before, and it looks like the ROOT suggestions were not good or fast enough, so you should probably search for general C++ solutions. Perhaps something like this can help:
https://www.codeproject.com/articles/866996/fast-implementations-of-maps-with-integer-keys-in
or you may be able to devise your own solution, which might be more efficient for your specific case.
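
As one illustration of a solution tailored to this case: since A-B is always a positive integer, the inner map could be replaced by a plain vector indexed by the difference. The sketch below uses only the standard library; `OffsetCounter` and its members are hypothetical names, and it assumes the differences are small enough that a dense vector per run fits comfortably in memory. Whether it is actually faster for your data would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical counter: run -> vector of counts, indexed by the difference A-B.
struct OffsetCounter {
    std::unordered_map<int, std::vector<long long>> counts;

    void Fill(int run, int A, int B)
    {
        assert(A > B); // A-B is assumed to be a positive integer
        const auto diff = static_cast<std::size_t>(A - B);
        auto &c = counts[run];
        if (c.size() <= diff)
            c.resize(diff + 1, 0); // grow on demand, new slots start at zero
        ++c[diff];
    }

    // Index of the largest count, i.e. the most common difference for this run.
    int Mode(int run) const
    {
        const auto &c = counts.at(run);
        return static_cast<int>(std::max_element(c.begin(), c.end()) - c.begin());
    }
};
```

`Fill(run, A, B)` would be called once per entry, and `Mode(run)` once per run at the end. The vector lookup avoids the tree traversal of a `std::map` for the inner counts, at the cost of memory when the differences are large and sparse.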

eguiraud · June 22, 2022, 10:43am

Hi,

I just came across this thread that somehow I had completely missed at the time. Sorry about that!

@KAM if you are still around I would be curious to hear how you would code an algorithm that does that independently of RDataFrame – then we can think how to fit that algorithm in RDF’s API.

Cheers,
Enrico

KAM · June 24, 2022, 9:09pm

Sure thing. I’ll try to find and post the code I wrote soon.