I have an RDataFrame which consists of three columns (A, B, and run). Here’s what it might look like in tabular form:
+-----+-----+-----+
| run | A | B |
+-----+-----+-----+
| 001 | 35 | 5 |
| 001 | 40 | 10 |
| 001 | 77 | 60 |
| | | |
| ... | ... | ... |
| | | |
| 002 | 42 | 40 |
| 002 | 30 | 28 |
| 002 | 50 | 1 |
| ... | ... | ... |
+-----+-----+-----+
(where the … dots indicate continuation)
For each run, I want to determine the most common difference of A and B (i.e., the mode of the set of entry-wise differences A-B within that run). For example, for run 002 we have 42-40=2, 30-28=2, and 50-1=49; the mode of 2, 2, and 49 is 2, so the result for run 002 is 2.
A couple of important notes: A-B is guaranteed to be a positive integer for all entries, and there is guaranteed to be a single unique mode for each run.
Currently, what I’m doing is this:
// Some code above
std::map<int, std::map<int, int>> offset; // run -> (A-B -> count)
df.Foreach([&offset](int A, int B, int run){
    // Count how often each difference occurs within each run
    ++offset[run][A-B];
},{"A","B","run"});
// Some code that, for each map in the map, determines the key of the
// max value (i.e. the mode), and stores this in a vector
So, we have a map of maps. The “outer” map maps run numbers to “interior” maps, and each “interior” map maps an offset (A-B) to its frequency. For each “interior” map, I then get the key of the largest value (i.e. the mode) and store it in a new map, which maps run numbers to the mode of A-B for that run.
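A minimal sketch of that mode-extraction step, in plain C++ with no ROOT dependency, might look like the following. The function name modePerRun is a placeholder chosen for illustration, and the sketch assumes the counts have already been filled into the map of maps described above.

```cpp
#include <map>

// Hypothetical helper (name chosen for illustration): for each run,
// return the offset (A-B) that has the highest count in that run's
// interior map.
std::map<int, int> modePerRun(const std::map<int, std::map<int, int>>& offset) {
    std::map<int, int> modes; // run -> mode of A-B
    for (const auto& runEntry : offset) {
        int bestOffset = 0;
        int bestCount = 0;
        for (const auto& bin : runEntry.second) {
            // bin.first is the offset, bin.second is its frequency
            if (bin.second > bestCount) {
                bestCount = bin.second;
                bestOffset = bin.first;
            }
        }
        modes[runEntry.first] = bestOffset;
    }
    return modes;
}
```

This pass only touches one entry per distinct (run, offset) pair, so it should be cheap compared with the loop over all entries that fills the counts.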
This works, but it’s quite slow (impractically so for my purposes). My RDataFrame contains on the order of tens of millions of entries or more.
I imagine there must be a better way. I’ve spent some time with the RDataFrame docs and, unfortunately, I can’t seem to piece together anything measurably more efficient.
Is there a better way? Please, let me know if the issue at hand isn’t clear. Thanks in advance!