RVecOps ArgMax weird behavior


Hello ROOT team, I am trying to do the following in RDataFrame,

auto df_higgs = df_leps.Define("HCandidate", "MyVar > 0")
        .Define("Score1", "Branch1[HCandidate]")
        .Define("Score2", "Branch2[HCandidate]")
        .Define("HScore", "Score1 / Score2")
        .Define("HighestHScoreIdx", "ArgMax(HScore)")
        .Define("HighestHScore", "HScore[HighestHScoreIdx]")
        .Filter("HighestHScore > 0", "Higgs candidate exists");

It compiles and runs fine but when I Display the df after these selections, I see

+------+-------------+------------------+---------------+
| Row  | HScore      | HighestHScoreIdx | HighestHScore | 
+------+-------------+------------------+---------------+
| 54   | 0.947388f   | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 60   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 62   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 63   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 67   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 69   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 75   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 91   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 99   |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 105  |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 106  |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 112  |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 118  |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 120  |             | 0                | 0.947388f     | 
+------+-------------+------------------+---------------+
| 226  | 0.00148309f | 0                | 0.00148309f   | 
+------+-------------+------------------+---------------+
| 229  |             | 0                | 0.00148309f   | 
+------+-------------+------------------+---------------+
| 237  |             | 0                | 0.00148309f   | 
+------+-------------+------------------+---------------+
| 242  |             | 0                | 0.00148309f   | 
+------+-------------+------------------+---------------+
| 244  |             | 0                | 0.00148309f   | 
+------+-------------+------------------+---------------+
| 246  |             | 0                | 0.00148309f   | 
+------+-------------+------------------+---------------+
| 251  |             | 0                | 0.00148309f   | 
+------+-------------+------------------+---------------+
| 270  | 0.00133200f | 0                | 0.00133200f   | 
+------+-------------+------------------+---------------+
| 277  |             | 0                | 0.00133200f   | 
+------+-------------+------------------+---------------+

It seems ArgMax falls back to 0 if the RVec has size zero. After looking at RVec.cxx I can see why it does this, but I am not sure it is ideal, because there is no way to tell whether the maximum is the element at index 0 or whether the vector is empty.

Also, if HScore has size zero, then HighestHScore is automatically set to the maximum of the last non-empty vector.

Using .Define("HighestHScore", "Max(HScore)") leads to the same behavior.

Can you suggest a way of circumventing this issue? Perhaps a way to check that the vector size is non-zero? Thank you.

Update: I was able to apply a quick fix by defining

// Returns an empty RVec if the input is empty, otherwise a one-element RVec
// holding the index of the maximum.
RVec<int> MyArgMax(const RVec<float> &v){
    RVec<int> idx = {};
    if (v.size() != 0){
        idx.push_back(std::distance(v.begin(), std::max_element(v.begin(), v.end())));
    }
    return idx;
}

and then picking v[0], with something similar for MyMax. It works for now.
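
For reference, the MyMax counterpart could look along these lines (a sketch following the same pattern, assuming the same float RVec input):

RVec<float> MyMax(const RVec<float> &v){
    // Empty RVec for empty input, otherwise one element holding the maximum.
    RVec<float> m = {};
    if (v.size() != 0){
        m.push_back(*std::max_element(v.begin(), v.end()));
    }
    return m;
}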

ROOT Version: 6.30/04
Platform: linuxx8664gcc / installed through conda-forge


Welcome to the ROOT forum,

I guess @vpadulan can help you.

Dear @aaarora ,

Thanks for reaching out to the forum!

Sorry if I may be naive here, but wouldn’t it be possible to embed the check on the size of the vector inline and decide on some convention for the return value (e.g. a flag value or something like std::optional)? For example via

df.Define("argmax", "x.size() !=0 ? ArgMax(x) : 1234")
df.Define("argmaxopt", "x.size() !=0 ? std::optional<std::size_t>(ArgMax(x)) : std::optional<std::size_t>()");

Depending on what you decide then you will need to act downstream accordingly (for example in case you choose std::optional check if the optional has a value or not).
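
For example, in the std::optional case the downstream check could look like this (a minimal sketch, assuming the "argmaxopt" column defined above):

auto df_checked = df.Filter("argmaxopt.has_value()", "input vector was non-empty")
                    .Define("argmax_checked", "argmaxopt.value()");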

Cheers,
Vincenzo


This works, thank you!

But the issue persists if I don't enforce a check on the return value of ArgMax:

Also, if HScore has size zero, then HighestHScore is automatically set to the maximum of the last non-empty vector.

Taking the max of a zero-size vector returns the max of the last non-zero-size vector.
Is this supposed to happen?

This is just one of the many things that could happen when the vector you are processing at that particular event has size zero. The implementation of ROOT::VecOps::Max is a simple redirect to std::max_element (see root/math/vecops/inc/ROOT/RVec.hxx at b599578942cbabd8c871b746d9fa90847d267099 in root-project/root on GitHub).

In particular, for a vector of size zero, vector.begin() == vector.end(), which means that std::max_element returns the past-the-end iterator. Dereferencing this iterator is undefined behaviour: anything could happen in the program. Just to give an example, on my machine with my particular compiler/ROOT installation I get zero (a value that also makes no sense, because an empty vector has no maximum):

root [0] ROOT::RVecF v{};
root [1] ROOT::VecOps::Max(v)
(float) 0.00000f
root [2] *std::max_element(v.begin(), v.end())
(float) 0.00000f

Bottom line: make sure your input vector has size greater than zero in order for the operation to make sense.
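
For example, the check could be enforced with a Filter upstream of the definition (a sketch, assuming a dataframe df that already has the HScore column):

auto df_safe = df.Filter("HScore.size() > 0", "non-empty HScore")
                 .Define("HighestHScore", "Max(HScore)");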

Cheers,
Vincenzo

In addition to the variables from my initial message, I defined the size and a check variable as follows:

.Define("HScore", "col1 / col2")
.Define("HScore_size", "HScore.size()")
.Define("check", "return *std::max_element(HScore.begin(), HScore.end())")
.Define("HighestHScore", "Max(HScore)")

Here is the output of Display,

+-----+-------------+-------------+-------------+---------------+
| Row | HScore      | HScore_size | check       | HighestHScore | 
+-----+-------------+-------------+-------------+---------------+
| 54  | 0.00542168f | 1           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 57  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 62  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 64  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 68  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 69  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 70  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 75  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 76  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+
| 82  |             | 0           | 0.00542168f | 0.00542168f   | 
+-----+-------------+-------------+-------------+---------------+

Seems like something weird is going on.

Thanks,
Aashay

Dear @aaarora ,

I cannot reproduce the situation you report with synthetic benchmarks. Can you make a snapshot of a few entries and send me the file so I can try to debug?

Cheers,
Vincenzo

Also, HScore seems to be just a float according to the output of Display. Can you confirm? Can you also print df.GetColumnType("HScore")?

Hi Vincenzo,

The column type is listed as RVec,

print([(i, df.GetColumnType(i)) for i in df.GetColumnNames()])

# output
('HScore', 'ROOT::VecOps::RVec<float>')

I have sent you a snapshot of the df and a test script with some selections in a private message.

Thanks,
Aashay

Dear @aaarora ,

I have given some more thought to your case. As I mentioned before, ROOT::VecOps::Max just forwards the input to std::max_element. This function returns an iterator; specifically, if the vector is empty then vector.begin() == vector.end(), which means that std::max_element returns the past-the-end iterator. Dereferencing this iterator (i.e. what ROOT::VecOps::Max does) is undefined behaviour. In practice, the values you are seeing are "random" (the program might do anything, it could also crash). I suggest you add the same check you used before to this other column definition as well, something like

"HighestHScore", "HScore.size() > 0 ? Max(HScore) : 0"

Or something similar.

As a separate discussion, we could talk about whether ROOT::VecOps::Max should provide this extra guard automatically for the user, and throw an exception if the input vector size is 0. But keep in mind that this would involve a performance cost.
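
As an illustration, such a guard could also be added on the user side with a small checked wrapper (a hypothetical helper, not an existing ROOT API):

#include <stdexcept>
#include "ROOT/RVec.hxx"

// Hypothetical user-side wrapper: throws instead of triggering undefined
// behaviour when the input RVec is empty.
template <typename T>
T CheckedMax(const ROOT::RVec<T> &v) {
    if (v.empty())
        throw std::runtime_error("CheckedMax: input RVec is empty");
    return ROOT::VecOps::Max(v);
}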

Cheers,
Vincenzo
