Accessing attributes of a distributed RDataFrame instance after Filter and Define are called

Hello experts,

I am trying to make use of some of the new distributed RDataFrame features, specifically the dask backend. I am using the nightly build from /cvmfs/sft.cern.ch/lcg/views/dev3/latest/x86_64-el9-gcc11-opt.

When using the non-distributed RDataFrame, I can do things like

df_new = df_old.Define("xsquared", "x*x")
cols = df_new.GetColumnNames()

However, when I construct an RDataFrame with RDF.Experimental.Distributed.Dask.RDataFrame and run something like the above, the call to GetColumnNames() raises an error like the one below:

Traceback (most recent call last):
  File "/cvmfs/sft-nightlies.cern.ch/lcg/views/dev3/Mon/x86_64-el9-gcc11-opt/lib/DistRDF/Proxy.py", line 237, in __getattr__
    return getattr(self.proxied_node, attr)
AttributeError: 'Node' object has no attribute 'GetColumnNames'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cvmfs/sft-nightlies.cern.ch/lcg/views/dev3/Mon/x86_64-el9-gcc11-opt/lib/DistRDF/Proxy.py", line 248, in __getattr__
    raise AttributeError(msg)
AttributeError: 'Define' object has no attribute 'GetColumnNames'
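
For readers following the traceback: the two chained exceptions are consistent with a proxy whose `__getattr__` first forwards the lookup to the wrapped node and, when that fails, raises its own AttributeError naming the operation. The block below is a minimal pure-Python sketch of that pattern, not the actual DistRDF code. The class names `Node` and `NodeProxy`, the `operation_name` attribute, the helper `describe_failure`, and the message format are all illustrative assumptions.

```python
class Node:
    """Hypothetical stand-in for a computation-graph node (not the DistRDF class)."""

    def __init__(self, operation_name):
        self.operation_name = operation_name


class NodeProxy:
    """Hypothetical proxy that forwards unknown attributes to the wrapped node."""

    def __init__(self, node):
        self.proxied_node = node

    def __getattr__(self, attr):
        # __getattr__ is only invoked when normal lookup on the proxy fails.
        try:
            return getattr(self.proxied_node, attr)
        except AttributeError:
            # Raising inside the except block chains the new error to the
            # original one, which produces the "During handling of the above
            # exception, another exception occurred" section of a traceback.
            msg = "'{}' object has no attribute '{}'".format(
                self.proxied_node.operation_name, attr
            )
            raise AttributeError(msg)


def describe_failure(proxy, attr):
    """Return (message, type name of the chained exception) for a failed lookup."""
    try:
        getattr(proxy, attr)
    except AttributeError as exc:
        return str(exc), type(exc.__context__).__name__
    return None
```

Calling `describe_failure(NodeProxy(Node("Define")), "GetColumnNames")` in this sketch yields a message of the same shape as the second error in the traceback, with the first AttributeError attached as its context.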

Perhaps I’m missing something obvious, but is there an example of how to access the attributes of a distributed RDataFrame instance after defining or filtering columns?


ROOT Version: 6.33/01 From heads/master@v6-31-01-1852-gdb8f2ef07c
Platform: x86_64-el9-gcc11-opt
Compiler: g++ (GCC) 11.3.0


Dear @gwmyers,

I am afraid this is a bug and a limitation of the current implementation. I have opened an issue for it: "Distributed RDataFrame does not see all defined column names" (Issue #15442 in root-project/root on GitHub).

Cheers,
Vincenzo

Hi @vpadulan,

Thanks for the quick reply! I made an attempt at propagating the defined column names and retrieving them. It seems to work in some very simple local tests that I’ve done, and I’ve opened a PR: "Distributed RDataFrame: add support for NodeProxy.GetColumnNames" (Pull Request #15476 in root-project/root on GitHub). Please let me know if this is on the right track or if there is a better implementation in mind. Any advice is appreciated!
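
To illustrate the general idea of propagating defined column names, here is a rough pure-Python sketch. It is not the code in the PR and does not use any ROOT or DistRDF API. The classes `HeadNode` and `OperationNode`, the function `get_column_names`, and the ordering of the returned names are hypothetical choices made only for this example.

```python
class HeadNode:
    """Hypothetical root of the graph; knows the columns present in the dataset."""

    def __init__(self, dataset_columns):
        self.parent = None
        self.dataset_columns = list(dataset_columns)


class OperationNode:
    """Hypothetical node recording one lazy operation such as Define or Filter."""

    def __init__(self, parent, name, args):
        self.parent = parent
        self.name = name
        self.args = tuple(args)


def get_column_names(node):
    """Collect dataset columns plus every column defined on the path to `node`."""
    defined = []
    current = node
    # Walk from the requested node up to the head node.
    while current.parent is not None:
        if current.name == "Define":
            # By assumption, the first argument of Define is the new column name.
            defined.append(current.args[0])
        current = current.parent
    # `current` is now the head node; reverse so definitions appear in call order.
    return current.dataset_columns + defined[::-1]


head = HeadNode(["x", "y"])
with_square = OperationNode(head, "Define", ("xsquared", "x*x"))
filtered = OperationNode(with_square, "Filter", ("xsquared > 1",))
with_sum = OperationNode(filtered, "Define", ("z", "x + y"))
columns = get_column_names(with_sum)
```

Walking up the parent chain keeps each node lightweight, at the cost of a traversal on every query. Caching the accumulated names on each node when it is created would be an alternative under the same assumptions.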

Thanks!
Greg