Problem with large number of entries inside RDataFrame

I have a large RDataFrame df. When I retrieve the values of a column like this

 auto values = df.Take<double>("column");

and do a simple operation on them (I actually need to do something else**; the sum is only an illustration, and I know RDF has a dedicated mechanism for it):

double sum = 0;
for (const auto v : *values) sum += v;

Then, I get the following error:

RDataFrame::Run: event loop was interrupted
Error in <TRint::HandleTermInput()>: std::bad_alloc caught: std::bad_alloc

If I reduce the number of entries inside df using df.Range(0,200000000), it runs smoothly. My data frame has more than 1500M entries in total.

That means about 12 GB of memory is required (1500M entries at 8 bytes per double). Do you think it is a memory issue?
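
The estimate above can be checked with a short sketch. The helper names `BytesForDoubles` and `ToGB` are hypothetical and used only for illustration; the sketch assumes 8-byte doubles and decimal gigabytes. Note that a std::vector filled incrementally may briefly hold both its old and its new buffer while it reallocates, so the peak footprint can be higher than this figure.

```cpp
#include <cstdint>

// Bytes needed to store `entries` values of type double contiguously.
constexpr std::uint64_t BytesForDoubles(std::uint64_t entries) {
    return entries * sizeof(double);
}

// Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes).
constexpr double ToGB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1e9;
}
```

For example, `ToGB(BytesForDoubles(1500000000ULL))` gives the size of a fully materialized column with 1500M entries.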

Thank you for your response!

In that case, is there a smarter way to do something with the column data?

**What I want to do is create a vector of unique elements, i.e. remove all repeated elements and get the unique ones back as a std::vector.

I guess @vpadulan or @mczurylo can help.

I found a workaround using the following:

    auto GetUniqueElements = [](const std::vector<double>& vec) {
        std::set<double> uniqueSet(vec.begin(), vec.end());
        return std::vector<double>(uniqueSet.begin(), uniqueSet.end());
    };

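    // Process the data set in chunks of fSplitEntries entries so that only
    // one chunk of the column is held in memory at a time.
    for( size_t n = 0; n < 1 + fDataSet.GetEntries()/fSplitEntries; n++ )
    {
        auto parValues = fDataSet.Range(n*fSplitEntries,(n+1)*fSplitEntries).Take<double>(fParameter);
        std::vector<double> uniqueVec = GetUniqueElements(*parValues);
        // Note: values are unique within each chunk only; vs can still
        // contain a value that appears in more than one chunk.
        vs.insert(vs.end(), uniqueVec.begin(), uniqueVec.end());
    }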

In my case it seems to be enough to split the data frame into 3 parts by setting fSplitEntries=600000000.

But it would be good to understand how to do it in a single pass.
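
One possible single-pass approach, sketched here under assumptions and not confirmed anywhere in this thread, is to keep one running set of distinct values and feed it entry by entry (or chunk by chunk) so that the full column is never held in memory. The class name `UniqueCollector` and its methods are hypothetical. The sketch assumes that the number of distinct values fits comfortably in memory and that the column contains no NaN values, which would break the ordering of std::set.

```cpp
#include <set>
#include <vector>

// Accumulates distinct values incrementally, so memory use scales with the
// number of distinct values and not with the total number of entries.
class UniqueCollector {
public:
    // Record a single value; repeated values are ignored by the set.
    void Add(double v) { fSeen.insert(v); }

    // Record a whole chunk, e.g. the vector obtained from one
    // Range(...).Take<double>(...) call.
    void AddChunk(const std::vector<double>& chunk) {
        fSeen.insert(chunk.begin(), chunk.end());
    }

    // Sorted vector of the distinct values seen so far.
    std::vector<double> Values() const {
        return std::vector<double>(fSeen.begin(), fSeen.end());
    }

private:
    std::set<double> fSeen;  // one set shared across all chunks
};
```

With RDataFrame, such a collector could presumably be filled in one event loop with something like `df.Foreach([&](double v) { collector.Add(v); }, {"column"})`, assuming implicit multi-threading is disabled, since std::set is not thread-safe. Because a single set is shared across all chunks, values repeated in different chunks are also removed.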

Hi @Javier_Galan,

thanks for your post and sorry for a bit of a delay in replying. Before going further, it would be great if you could share a fully working reproducer of your problem (including the data) so we could also test it ourselves. You can also share it via email if you feel more comfortable with that.

Cheers,
Marta

I could share the 60 GB ROOT file containing the RDataFrame. But how?

Hi @Javier_Galan,

do you have access to CERN EOS?

Cheers,
Marta

Hi @mczurylo, I have lxplus access. My user name is jgalan. I guess I could scp the file to any location there where I would have enough space.

Hi @Javier_Galan,

I replied to you in a private message.

Cheers,
Marta
