# Better way to determine most common difference of two columns of an RDataFrame

I have an `RDataFrame`, which consists of three columns (`A`, `B`, and `run`). Here’s what it might look like in tabular form:

``````
+-----+-----+-----+
| run |  A  |  B  |
+-----+-----+-----+
| 001 |  35 |   5 |
| 001 |  40 |  10 |
| 001 |  77 |  60 |
|     |     |     |
| ... | ... | ... |
|     |     |     |
| 002 |  42 |  40 |
| 002 |  30 |  28 |
| 002 |  50 |   1 |
| ... | ... | ... |
+-----+-----+-----+
``````

(The `...` rows indicate that the table continues.)

For each run, I want to determine the most common difference of `A` and `B` (i.e., the mode of the set of entry-wise differences `A - B` for that run). For example, for run `002` we have `42 - 40 = 2`, `30 - 28 = 2`, and `50 - 1 = 49`. The mode of `2`, `2`, and `49` is `2`, so the result for run `002` is `2`.

A couple of important notes: `A - B` is guaranteed to be a positive integer for all entries, and each run is guaranteed to have a single, unique mode.

Currently, what I’m doing is this:

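``````
// Some code above

// Outer key: run number; inner map: offset (A - B) -> frequency
std::map<int, std::map<int, int>> offset;

// Capture `offset` by reference so the lambda can fill it
df.Foreach([&offset](int A, int B, int run) {
    ++offset[run][A - B];
}, {"A", "B", "run"});

// Some code that, for each map in the map, determines the key of the
// max value (i.e. the mode), and stores this in a vector

``````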

So, we have a map of maps. The “outer” map maps run numbers to “interior” maps, and each “interior” map maps an offset (`A-B`) to its frequency. For each “interior” map, I then get the key with the largest value (i.e. the mode) and store it in a new map, which maps run numbers to the mode of `A-B` for that run.
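The mode-extraction step is only described in words above. A minimal sketch of it, assuming the same map-of-maps layout as `offset`, might look like the following. The function name `modePerRun` is made up for illustration and is not part of ROOT.

```cpp
#include <algorithm>
#include <map>
#include <utility>

// Hypothetical helper: for each run, return the difference (A - B)
// that has the highest frequency in that run's interior map.
std::map<int, int> modePerRun(const std::map<int, std::map<int, int>>& offset) {
    std::map<int, int> modes;
    for (const auto& entry : offset) {
        const std::map<int, int>& counts = entry.second;
        if (counts.empty()) continue;  // nothing recorded for this run
        // Compare by frequency (the mapped value), not by key
        auto best = std::max_element(
            counts.begin(), counts.end(),
            [](const std::pair<const int, int>& lhs,
               const std::pair<const int, int>& rhs) {
                return lhs.second < rhs.second;
            });
        modes[entry.first] = best->first;
    }
    return modes;
}
```

This pass only touches one entry per distinct difference per run, so it is likely cheap compared with the `Foreach` loop over all entries.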

This works, but it’s quite slow (impractically so for my purposes). My `RDataFrame` contains on the order of tens of millions of entries or more.

I imagine there must be a better way. I’ve spent some time with the `RDataFrame` docs and, unfortunately, I can’t seem to piece together anything measurably more efficient.

Is there a better way? Please, let me know if the issue at hand isn’t clear. Thanks in advance!

I see you posted this problem before, and it looks like the ROOT suggestions were not good or fast enough, so you should probably search for general C++ solutions. Perhaps something like this can help:
https://www.codeproject.com/articles/866996/fast-implementations-of-maps-with-integer-keys-in
Or you may be able to devise your own solution, which might be more efficient for your specific case.
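As a rough idea of what a case-specific solution could look like (a sketch only, not measured against the nested `std::map` version): since `A - B` is a positive integer, each run can keep a dense vector of counts indexed by the difference and track the current mode while filling. This assumes the differences stay small enough that a dense vector is affordable. The names `RunHistogram`, `RunHistograms`, and `fillEntry` are hypothetical.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical per-run accumulator: counts[d] is how often A - B == d was seen.
struct RunHistogram {
    std::vector<int> counts;
    int mode = 0;       // difference with the highest count so far
    int modeCount = 0;  // that highest count

    void fill(int diff) {
        // Assumption: diff is non-negative (the post guarantees A - B > 0)
        const std::size_t idx = static_cast<std::size_t>(diff);
        if (idx >= counts.size()) counts.resize(idx + 1, 0);
        const int c = ++counts[idx];
        // Track the mode on the fly, so no second pass is needed
        if (c > modeCount) {
            modeCount = c;
            mode = diff;
        }
    }
};

// Hypothetical container: run number -> histogram of differences
using RunHistograms = std::unordered_map<int, RunHistogram>;

// Record one entry (one row of the table)
inline void fillEntry(RunHistograms& histos, int run, int A, int B) {
    histos[run].fill(A - B);
}
```

Inside the `Foreach` lambda this would be called once per entry, e.g. `fillEntry(histos, run, A, B)` with `histos` captured by reference. Like the original snippet, it is not thread-safe, so with implicit multithreading each thread would need its own accumulator, merged at the end.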

Hi,

I just came across this thread that somehow I had completely missed at the time. Sorry about that!

@KAM, if you are still around, I would be curious to hear how you would code an algorithm that does this independently of RDataFrame. Then we can think about how to fit that algorithm into RDF’s API.

Cheers,
Enrico

Sure thing. I’ll try to find the code I wrote and post it soon.