How to get unique values of an RDataFrame column

marco_giammei · January 10, 2024, 12:10pm

Hi,
I’m doing an analysis with RDataFrame and I want to get all the unique values of a dataframe column; for example, if I have a column with the elements (1,2,1,2,3,6,6), I would like to get only (1,2,3,6) (i.e. only the set of elements it contains). The only way I know to do this is to go through NumPy, but I would prefer a solution that does not load the whole column into memory.

Danilo · January 10, 2024, 12:32pm

Hi Marco,

Welcome to the ROOT community!

This is an interesting question. Let me ask for some additional context. I understand that you have an RDF and you want to extract one single column of individual numbers (one per entry) and eliminate the duplicates. Is that correct?
Or do you mean you have a column with numbers and you want to skip the processing of entries in case the same number in a column is encountered more than once?

Cheers,
D

marco_giammei · January 10, 2024, 12:54pm

Hi Danilo,
I would like to know how many different values are in a column, to be able to extract a subdataframe for each of the values. For example if I have a dataset of the form :
col0 col1
1 5.4
2 6.8
1 2.5
5 7.9
I want to split it into different dataframes, one for each distinct value of col0 (whose content I don’t know a priori, so I need to extract this information), giving:
dataset1:
col0 col1
1 5.4
1 2.5

dataset2:
col0 col1
2 6.8

dataset3:
col0 col1
5 7.9

Thank you,
Marco

Danilo · January 11, 2024, 3:01pm

Hi Marco,

Thanks, now I understand.
I will try to give you an option, then we can iterate, if needed, to improve. Good news: it’s simple, if I did not miss any important detail.

The first step is to extract “col1” and check what the unique values are. Then, based on those values, we can build a set of RDataFrame nodes resulting from filtering the events based on those unique values. Those nodes will be what you called “subdataframes for each of the values”.

In code, it can look a bit like this (I start from a csv file, but it works for any source!):

import ROOT

filename = 'data.csv'
df = ROOT.RDF.FromCSV(filename)

col1ValsRP = df.Take["Long64_t"]("col1")
uniqueCol1Vals = list(set(col1ValsRP.GetValue()))

subdfs = [df.Filter('col1 == %s' %uniqueColVal) for uniqueColVal in uniqueCol1Vals]

# do something with the subdataframes
evtsPerCol1ValsRP = [subdf.Count() for subdf in subdfs]

for uniqueCol1Val, evtsPerCol1ValRP in zip(uniqueCol1Vals, evtsPerCol1ValsRP):
    print ("The number of events for col1 value %s is %s." %(uniqueCol1Val, evtsPerCol1ValRP.GetValue()))

and the csv

col1,col2
1,5.4
2,6.8
1,2.5
5,7.9
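
As a cross-check that does not need ROOT, the counts expected for this small file can be reproduced with the Python standard library. This is only a sketch for the four-row example (the CSV content is inlined as a string here); it reads everything into memory, so it is not meant for large inputs.

```python
import csv
import io
from collections import Counter

# Same content as the example csv, inlined for a self-contained check
csv_text = """col1,col2
1,5.4
2,6.8
1,2.5
5,7.9
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Count how many rows carry each distinct col1 value
counts = Counter(int(row["col1"]) for row in rows)

for value in sorted(counts):
    print(f"The number of events for col1 value {value} is {counts[value]}.")
```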

This should get you started. Let us know how it goes and whether there are more questions!

Cheers,
D

marco_giammei · January 11, 2024, 5:57pm

Thank you Danilo,
the problem is that I have something like 10^8 entries per column and I really need ROOT dataframes to manage them (very useful for my purpose).
I did some tests to be (almost) sure, but from the first look I suspected that the line with the definition of uniqueCol1Vals loads, when GetValue() is called, all entries of the column in memory and I can’t afford it.

My real case, referring to the previous example dataset, is like the following:
run_number | Channel | Voltage …
1 1 0.34
1 2 0.25
1 1 0.10
… (10^8 times)

Choosing the separator (run_number or Channel) I am able to plot the Voltage (or other variables) filtering on the corresponding column: my problem is that a priori I don’t know which different values of run_number or Channel are inside the file (and I don’t want to know to be as general as possible).
All this has already been done; the only thing I am missing is how to collect the set of distinct values without first loading the whole column into memory. After that I can select and manage the subdataframes in various ways (including yours, which is surely the most straightforward one I have).

I can define or redefine dataframe columns for this purpose if needed (although expensive in time) but I’m limited in saving the whole column in memory: I fear that the line with uniqueCol1Vals is my whole problem.
Thank you,
Marco

Danilo · January 11, 2024, 8:52pm

Hi Marco,

Indeed, you are right. Assuming those are integer numbers in your case, you would need 400 MB, which is not a lot but definitely not in the noise.
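
The 400 MB figure corresponds to 10^8 values of 4 bytes each; with 8-byte Long64_t values, as requested by Take["Long64_t"] in the earlier snippet, the same column would take about twice as much. A quick back-of-the-envelope check (raw payload only, ignoring any container overhead):

```python
n_entries = 10**8      # entries per column, as quoted in the thread

bytes_int32 = 4        # assumption: plain 32-bit integers
bytes_int64 = 8        # Long64_t, the type used in the Take call

mb_int32 = n_entries * bytes_int32 / 1e6
mb_int64 = n_entries * bytes_int64 / 1e6

print(f"32-bit integers: {mb_int32:.0f} MB")
print(f"64-bit integers: {mb_int64:.0f} MB")
```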

If you want to avoid this large allocation, this is the way to go:

import ROOT

filename = 'data.csv'
df = ROOT.RDF.FromCSV(filename)

# This extracts the unique values
col1UniqueValsRP = df.Filter("static set<Long64_t> uvals; return uvals.emplace(col1).second;")\
                     .Take["Long64_t"]("col1")

uniqueCol1Vals = col1UniqueValsRP.GetValue()

subdfs = [df.Filter('col1 == %s' %uniqueColVal) for uniqueColVal in uniqueCol1Vals]

# do something with the subdataframes
evtsPerCol1ValsRP = [subdf.Count() for subdf in subdfs]

for uniqueCol1Val, evtsPerCol1ValRP in zip(uniqueCol1Vals, evtsPerCol1ValsRP):
    print ("The number of events for col1 value %s is %s." %(uniqueCol1Val, evtsPerCol1ValRP.GetValue()))

As you can see, the filter mixes C++ and Python. We plan to make this kind of interoperation smoother in the coming months, but it works and it’s a very localised change.
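
For readers less familiar with the C++ idiom: emplace on a std::set returns a pair whose second member is true only if the value was actually inserted, i.e. only the first time it is seen. A plain-Python analogue of that "keep first occurrences only" filter (an illustration of the idea, not what ROOT runs; first_occurrences is a name made up for this sketch):

```python
def first_occurrences(values):
    """Yield each value only the first time it is encountered."""
    seen = set()  # grows with the number of distinct values, not with the number of entries
    for value in values:
        if value not in seen:  # same role as uvals.emplace(value).second being true
            seen.add(value)
            yield value

col1 = [1, 2, 1, 2, 3, 6, 6]
unique_vals = list(first_occurrences(col1))
print(unique_vals)  # prints [1, 2, 3, 6]
```

The memory needed is driven by the number of distinct values, which is the point of moving the deduplication into the filter.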

I hope it helps.

Cheers,
D

marco_giammei · January 12, 2024, 3:20pm

Thanks Danilo,
I tried your code and it works for the first iteration.
But this code is meant to be executed multiple times on the same dataset with different separator columns; when I execute these lines twice

col1UniqueValsRP = df.Filter("static set<Long64_t> uvals; return uvals.emplace(col1).second;")\
                     .Take["Long64_t"]("col1")

uniqueCol1Vals = col1UniqueValsRP.GetValue()

in the second iteration uniqueCol1Vals is empty. This is fixed if I change the name of uvals at every iteration, but that is not a fix I can implement in my code (because the code does not keep track of which iteration it is at).
How can I delete or reset uvals each time after these two lines?

Danilo · January 12, 2024, 5:28pm

Hi Marco,

I think I do not understand exactly what is happening, but I am confident we can make the system work.
Could you post a minimal reproducer of your code so that I can see what is going on?

Cheers,
Danilo

marco_giammei · January 12, 2024, 8:34pm

Hi Danilo,
Here is an example file:
example.py (1.1 KB)

In addition to the comments inside, my output is the following:
{ 1 , 2 }
{}

It seems to me that uvals, declared inside the Filter() expression, is somehow not deleted, so it is still there the next time I need to apply that filter on the same df.
This behavior is not compatible with my program, because this part sits inside a function of my graphical tool: I work in Jupyter notebooks and if, for example, I want to make a small correction to the style of a histogram produced by my tool, I now have to shut down and rerun the notebook (and some of them are huge).
All other parts of my tool work: I tested them with the original solution you sent, which loads the column into memory to compute uniqueCol1Vals (fine for this small example RDataFrame), so I have isolated the problem to that line, and this is my best guess.
I hope my problem is clear; I’m not 100% sure about my guess on the cause, but in any case I can’t solve it myself if that is what is happening.
Thank you again,
Marco

Danilo · January 13, 2024, 9:59am

Hi Marco,

On purpose, I have hidden the static std::set inside a function to make it “survive” across entries but without making it accessible. We can declare it outside through the interpreter, no problems with that.
However, I do not understand why this would be needed. This is not a technical question, but maybe a clarification about your process.
You loop over the dataset changing the “separator columns name”. For this, you modify the filter string, which is very smart.

What is stopping you from changing the name of the std::set variable as well, in order to declare a new set per loop?

uniqueCol1Vals = df.Filter(f"static set<{sep_field_type}> uvals_{sep_field}; return uvals_{sep_field}.emplace({sep_field}).second;").Take[sep_field_type](f"{sep_field}").GetValue()
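
To keep the set name and the column name in sync, the expression string can be built in one place. A minimal sketch, where make_unique_filter is a hypothetical helper name (not part of ROOT) and std::set is spelled out explicitly; it only builds the string and does not need ROOT to run:

```python
def make_unique_filter(sep_field: str, sep_field_type: str) -> str:
    """Build the C++ filter expression for one separator column.

    Hypothetical helper: the set is named after the column, so that
    different separator columns get different static sets.
    """
    set_name = f"uvals_{sep_field}"
    return (
        f"static std::set<{sep_field_type}> {set_name}; "
        f"return {set_name}.emplace({sep_field}).second;"
    )

expr = make_unique_filter("col1", "Long64_t")
print(expr)
```

The resulting string would then be passed to df.Filter(...) as in the line above.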

Cheers,
D

marco_giammei · January 15, 2024, 10:47am

Hi Danilo,
To answer your last question: nothing! In fact, before your answer I tried using uvals_{i}, where i is the output of the function GetNRuns(), in order to differentiate it for each execution of the single Jupyter notebook cell.
I also tried to combine my idea with yours, using uvals_{i}_{sep_field}, and now it works for multiple iterations.
But this can’t be the permanent solution (in Italian this is “una pezza”, a patch) inside my tools. I have a work environment in which the modules are as general as possible, so the names of the variables are always the same (df, hist, …): when redefining a dataframe with the same name df without killing and restarting the notebook, no combination of complicated names for uvals is enough to keep them different from the previous definition, and I get the output {}.
I know it’s a very complicated case, but the real solution to my problem would be to declare uvals outside, through the interpreter, and then delete it right after uniqueCol1Vals is computed: I don’t know how to do it.
Thank you,
Marco

Danilo · January 15, 2024, 11:29am

Hi Marco,

Thanks for the patience and additional clarifications: I did not appreciate the sophistication at the start of this thread.
Here you go with another proposal, which now creates a global set that all jitted filters can access. I do not think it’s a ‘pezza’: it’s how C++ works, and it is about the only way I can think of with the current data frame interface (right now; please correct me if I am wrong).

import ROOT
import numpy as np

N = 100
a = []
b = []

for i in range(N):
    ap = 1 if i<50 else 2
    a.append(ap)
    b.append(i*i +1)
    
a = np.array(a)
b = np.array(b)

# I create an example RDataFrame with 2 columns, 
# col1 is made up by 50 times 1 and 50 times 2
# col2 is an incremental index i square + 1 (arbitrary choice just to use something)

df = ROOT.RDF.FromNumpy({'col1':a,"col2":b})

# now I define the sep_field (name of the column to be used as data separator) and get its type
sep_field="col1"
sep_field_type=df.GetColumnType(sep_field)

# Create set, encapsulated in an appropriate namespace
ROOT.gInterpreter.Declare(f"namespace Internal::GlobalContainers {{ set<{sep_field_type}> gUvals;}} ")

# now I repeat your line (compact way but it works)
uniqueCol1Vals = df.Filter(f"return Internal::GlobalContainers::gUvals.emplace({sep_field}).second;").Take[sep_field_type](f"{sep_field}").GetValue()

# printing
print(uniqueCol1Vals)

# clear set
ROOT.Internal.GlobalContainers.gUvals.clear()

# now I repeat again your line (compact way but it works)
uniqueCol1Vals = df.Filter(f"return Internal::GlobalContainers::gUvals.emplace({sep_field}).second;").Take[sep_field_type](f"{sep_field}").GetValue()

# printing again: the set was cleared, so the unique values are found again
print(uniqueCol1Vals)
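
The role of the clear() call can be mimicked in plain Python, with a module-level set standing in for gUvals. This is only an analogy under that assumption (the names seen and is_first_occurrence are made up for the sketch): without clearing, a second pass finds nothing new; after clearing, the unique values come back.

```python
# Module-level set, playing the role of the global gUvals
seen = set()

def is_first_occurrence(value):
    """Return True only the first time `value` is seen since the last clear."""
    if value in seen:
        return False
    seen.add(value)
    return True

col1 = [1] * 50 + [2] * 50  # same content as col1 in the example above

first_pass = [v for v in col1 if is_first_occurrence(v)]
second_pass = [v for v in col1 if is_first_occurrence(v)]  # set not cleared: nothing passes

seen.clear()  # counterpart of gUvals.clear()
third_pass = [v for v in col1 if is_first_occurrence(v)]

print(first_pass, second_pass, third_pass)
```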

I hope it helps.

Cheers,
D

marco_giammei · January 15, 2024, 1:20pm

Hi Danilo,
As always, you are right: I don’t know why I convinced myself that this was a “pezza” and not the way C++ works, sorry for that moment of confusion.
Anyway, in order to avoid possible conflicts in my tools (very contorted, I know), the best solution is the one calling the gInterpreter.Declare() function; thanks a lot, you have opened up a whole new way of working with ROOT for me and my colleagues.
I also have to think about how to handle conflicting declarations of gUvals if different sep_field_type values come up in the future: the solution I use is the following

## dummy set just to create the namespace
ROOT.gInterpreter.Declare(f"namespace Internal::GlobalContainers {{ set<Long64_t> gAaaBbbCcc;}} ")

# a lot of other code

if hasattr(ROOT.Internal.GlobalContainers,f"gUvals_{sep_field_type}"):
    getattr(ROOT.Internal.GlobalContainers,f"gUvals_{sep_field_type}").clear()
else:
    ROOT.gInterpreter.Declare(f"namespace Internal::GlobalContainers {{ set<{sep_field_type}> gUvals_{sep_field_type};}} ")
                
uniqueCol1Vals = df_2.Filter(f"return Internal::GlobalContainers::gUvals_{sep_field_type}.emplace({sep_field}).second;").Take[sep_field_type](f"{sep_field}").GetValue()

exploiting the Python functions hasattr and getattr. Any feedback from you on this would be helpful.
Thank you very much for your help: I had heard stories about how well this forum is run, but experiencing it first-hand for the first time is wonderful.
To the next issue (if any) and good work,
Marco

Danilo · January 15, 2024, 1:31pm

Hi Marco,

I think that if you want to be very generic, the way you show is adequate.
No worries at all about not settling for anything short of a good solution by your standards. Changing the name of the variable was indeed a kludge. It was not clear to me at first, though, given that I initially failed to grasp the context around your nice minimal reproducer!
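
One possible refinement, offered as a sketch rather than a requirement: a type name such as "unsigned int" or "std::string" is not a valid C++ identifier, so gUvals_{sep_field_type} would not compile for such types. A small helper (set_name_for is a hypothetical name, not a ROOT function) can map the type name to a safe suffix first:

```python
import re

def set_name_for(type_name: str) -> str:
    """Turn a C++ type name into an identifier-safe name for the global set.

    Runs of characters that are not letters, digits or underscores are
    collapsed into a single underscore. Distinct types could in principle
    map to the same name, which this sketch does not guard against.
    """
    return "gUvals_" + re.sub(r"\W+", "_", type_name)

for type_name in ("Long64_t", "unsigned int", "std::string"):
    print(type_name, "->", set_name_for(type_name))
```

The returned name would replace gUvals_{sep_field_type} in the hasattr, getattr and Declare calls, while the template argument of the set keeps the original type name.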

Don’t hesitate to come back with more questions or feedback or issues.

Cheers,
D

system · January 29, 2024, 1:32pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.