I have an RDataFrame, which consists of three columns (A, B, and run). Here’s what it might look like in tabular form:
+-----+-----+-----+
| run | A | B |
+-----+-----+-----+
| 001 | 35 | 5 |
| 001 | 40 | 10 |
| 001 | 77 | 60 |
| | | |
| ... | ... | ... |
| | | |
| 002 | 42 | 40 |
| 002 | 30 | 28 |
| 002 | 50 | 1 |
| ... | ... | ... |
+-----+-----+-----+
(where the … dots indicate continuation)
For each run, I want to determine the most common difference of A and B (i.e., the mode of the set of entry-wise differences A-B within that run). For example, for run 002 we have 42-40=2, 30-28=2, and 50-1=49. The mode of 2, 2, and 49 is 2, so the result for run 002 is 2.
A couple of important notes: A-B is guaranteed to be a positive integer for all entries, and there is guaranteed to be a single unique mode for each run.
Currently, what I’m doing is this:
// Some code above
std::map<int, std::map<int, int>> offset;
// Capture the outer map by reference so the lambda can fill it
df.Foreach([&offset](int A, int B, int run) {
    ++offset[run][A - B];
}, {"A", "B", "run"});
// Some code that, for each map in the map, determines the key of the
// max value (i.e. the mode), and stores this in a vector
So, we have a map of maps. The “outer” map maps run numbers to “interior” maps. “Interior” maps map offset (A-B) to frequency. I then, for each “interior” map, get the key of the largest value (i.e. the mode), and store this in a new map, which maps run numbers to mode of A-B for that run.
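To make that last step concrete, here is a minimal sketch of the reduction from the map of maps to one mode per run. It uses only the standard library, and the names Histogram, ModeOf, and ModesPerRun are placeholders chosen for illustration, not part of ROOT or of the code above. It assumes every run has at least one entry and a unique mode, as noted earlier.

```cpp
#include <cassert>
#include <map>

// Frequency table for one run: offset (A - B) -> number of occurrences.
// "Histogram" is an illustrative alias, not a ROOT type.
using Histogram = std::map<int, int>;

// Return the key with the largest count in a single frequency table.
// Assumes the table is non-empty and has a unique mode.
int ModeOf(const Histogram &hist) {
    assert(!hist.empty());
    auto best = hist.begin();
    for (auto it = hist.begin(); it != hist.end(); ++it) {
        if (it->second > best->second) {
            best = it;
        }
    }
    return best->first;
}

// Reduce the map of maps (run -> frequency table) to run -> mode of A - B.
std::map<int, int> ModesPerRun(const std::map<int, Histogram> &offset) {
    std::map<int, int> modes;
    for (const auto &entry : offset) {
        modes[entry.first] = ModeOf(entry.second);
    }
    return modes;
}
```

Calling ModesPerRun(offset) after the Foreach loop would give the run-to-mode map described above. This step only visits each distinct offset once per run, so it should be cheap compared with the loop over all entries.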
This works, but it’s impractically slow for my purposes: my RDataFrame contains on the order of tens of millions of entries or more.
I imagine there must be a better way. I’ve spent some time with the RDataFrame docs and, unfortunately, I can’t seem to piece together anything measurably more efficient.
Is there a better way? Please, let me know if the issue at hand isn’t clear. Thanks in advance!