AsNumpy error from dataframe with array columns

lilina · September 14, 2023, 3:19am

Hi!

I am trying to build a dictionary of NumPy arrays from an RDataFrame as follows:

df_nosmear = ROOT.RDataFrame("dvcs", "./pseudo_KM15_BKM10_hallA_t2_nosmear_5pct.root")
np_nosmear = df_nosmear.AsNumpy()

and I get the following error:

/home/lily/opt/root_master/root_install/lib/ROOT/_pythonization/_rdataframe.py:290: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
  tmp = numpy.empty(len(cpp_reference), dtype=numpy.object)
Traceback (most recent call last):
  File "/media/lily/Data/GPDs/HallA/plots/drawpseudo.py", line 7, in <module>
    np_nosmear = df_nosmear.AsNumpy()
  File "/home/lily/opt/root_master/root_install/lib/ROOT/_pythonization/_rdataframe.py", line 239, in RDataFrameAsNumpy
    return result.GetValue()
  File "/home/lily/opt/root_master/root_install/lib/ROOT/_pythonization/_rdataframe.py", line 290, in GetValue
    tmp = numpy.empty(len(cpp_reference), dtype=numpy.object)
  File "/home/lily/.local/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. 
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'object_'?
>>>

I traced the problem to the fact that the .root file contains array branches. If I request only the branches that are not arrays, for example np_nosmear = df_nosmear.AsNumpy(columns=["kinematics.k"]), it works.

This is how the root file tree was declared:

struct kin_t {
      Double_t k;
      Double_t QQ;
      Double_t xB;
      Double_t t;
   };
   kin_t kin;
   Int_t set;
   Int_t npoints;
   Double_t phi[kMaxNumOfDataPts];
   Double_t F_BH[kMaxNumOfDataPts], F_I[kMaxNumOfDataPts], F[kMaxNumOfDataPts], errF[kMaxNumOfDataPts];
   Double_t varF;
   Double_t gdvcs, dvcs, e_dvcs;
   Double_t gReH, gReE, gReHtilde, gReEtilde, ReH, ReE, ReHtilde;
   Double_t e_ReH, e_ReE, e_ReHtilde;
   Double_t gImH, gImE, gImHtilde, gImEtilde;

   TTree *t3 = new TTree("dvcs","generated dvcs");
   t3->Branch("set",&set,"set/I");
   t3->Branch("kinematics",&kin.k,"k/D:QQ:xB:t");
   t3->Branch("npoints",&npoints,"npoints/I");
   t3->Branch("phi",phi,"phi[npoints]/D");
   t3->Branch("F",F,"F[npoints]/D");
   t3->Branch("F_BH",F_BH,"F_BH[npoints]/D");
   t3->Branch("F_I",F_I,"F_I[npoints]/D");
   t3->Branch("gdvcs",&gdvcs,"gdvcs/D");
   t3->Branch("errF",errF,"errF[npoints]/D");
   t3->Branch("gReH",&gReH,"gReH/D");
   t3->Branch("gReE",&gReE,"gReE/D");
   t3->Branch("gReHtilde",&gReHtilde,"gReHtilde/D");
   t3->Branch("gReEtilde",&gReEtilde,"gReEtilde/D");
   t3->Branch("gImH",&gImH,"gImH/D");
   t3->Branch("gImE",&gImE,"gImE/D");
   t3->Branch("gImHtilde",&gImHtilde,"gImHtilde/D");
   t3->Branch("gImEtilde",&gImEtilde,"gImEtilde/D");

I have attached the .root file. The ROOT version I am using is ROOT 6.27/01.

I appreciate your help!

pseudo_KM15_BKM10_hallA_t2_nosmear_5pct.root (206.1 KB)

vpadulan · September 18, 2023, 7:48am

Dear @lilina ,

The error is AttributeError: module 'numpy' has no attribute 'object'. You are using a recent version of numpy in which the deprecated alias numpy.object has been removed (as the message says, it had been deprecated since NumPy 1.20). More recent ROOT versions have adapted to this, and we don’t use numpy.object in RDF code anymore. Can you please try updating your ROOT version and let me know if it helps? See the "Installing ROOT" page for installation instructions.
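As a minimal sketch of what the error message recommends (this is not the actual ROOT source; n_events is a hypothetical stand-in for len(cpp_reference) in the traceback), the builtin object can be passed as the dtype in place of the removed alias:

```python
import numpy as np

n_events = 4  # hypothetical number of entries, for illustration only

# The builtin `object` is accepted as a dtype by old and new NumPy releases alike,
# unlike the removed alias `np.object`.
tmp = np.empty(n_events, dtype=object)

# `np.object_` is the NumPy scalar type hinted at by the error message
# ("Did you mean: 'object_'?"); it describes the same dtype.
same_dtype = np.dtype(object) == np.dtype(np.object_)

print(np.__version__, tmp.dtype, same_dtype)
```

Printing np.__version__ this way is also a quick check of which NumPy release the Python environment actually picks up.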

Cheers,
Vincenzo

lilina · September 18, 2023, 4:29pm

Thanks @vpadulan!

I updated the ROOT version to v6.28/06 and now I can use AsNumpy() with the array columns of the dataframe. But I still cannot use them, since they seem to be read as RVec. When I try to plot F vs phi, which are both arrays, for example:

import ROOT
import numpy as np
import matplotlib.pyplot as plt

df_nosmear = ROOT.RDataFrame("dvcs", "./pseudo_KM15_BKM10_hallA_t2_nosmear_5pct.root")
np_nosmear = df_nosmear.AsNumpy(columns=["set","F", "phi"])
plt.scatter(np_nosmear['F'][np_nosmear['set'] == 1], np_nosmear['phi'][np_nosmear['set'] == 1])
plt.show()

I get the following error:

TypeError: float() argument must be a string or a real number, not 'RVec<double>'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/lily/Data/GPDs/HallA/plots/drawpseudo.py", line 8, in <module>
    plt.scatter(np_nosmear['F'][np_nosmear['set'] == 1], np_nosmear['phi'][np_nosmear['set'] == 1])
  File "/home/lily/.local/lib/python3.10/site-packages/matplotlib/pyplot.py", line 2862, in scatter
    __ret = gca().scatter(
  File "/home/lily/.local/lib/python3.10/site-packages/matplotlib/__init__.py", line 1446, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/home/lily/.local/lib/python3.10/site-packages/matplotlib/axes/_axes.py", line 4667, in scatter
    collection = mcoll.PathCollection(
  File "/home/lily/.local/lib/python3.10/site-packages/matplotlib/collections.py", line 994, in __init__
    super().__init__(**kwargs)
  File "/home/lily/.local/lib/python3.10/site-packages/matplotlib/_api/deprecation.py", line 454, in wrapper
    return func(*args, **kwargs)
  File "/home/lily/.local/lib/python3.10/site-packages/matplotlib/collections.py", line 192, in __init__
    offsets = np.asanyarray(offsets, float)
ValueError: setting an array element with a sequence.
>>>

What should I do in order to correctly plot these 2 arrays?

Thank you so much!

vpadulan · September 20, 2023, 2:19pm

Dear @lilina ,

The context you are giving is a bit thin, so I will try to make some guesses here. I suppose you have a dataset with jagged arrays, i.e. for every event you have columns that contain a collection of values, e.g.

+-----+-------+----------+
| Row | nMuon | Muon_pt  | 
+-----+-------+----------+
| 0   | 2     | 10.7637f | 
|     |       | 15.7365f | 
+-----+-------+----------+
| 1   | 2     | 10.5385f | 
|     |       | 16.3271f | 
+-----+-------+----------+
| 2   | 1     | 3.27533f | 
+-----+-------+----------+
| 3   | 4     | 11.4292f | 
|     |       | 17.6340f | 
|     |       | 9.62473f | 
|     |       | 3.50223f | 
+-----+-------+----------+
| 4   | 4     | 3.28344f | 
|     |       | 3.64401f | 
|     |       | 32.9112f | 
|     |       | 23.7218f | 
+-----+-------+----------+

Consequently, the column Muon_pt of this example cannot be represented as a regular numpy.array: numpy only supports arrays of fixed dimensions (1 or more), where every row has the same length. Thus, when calling df.AsNumpy, the output dictionary will contain a numpy array with dtype=object where each element is an RVec, a vector whose length depends on how many particles were involved in that event.

>>> npys = df.AsNumpy(columns=["nMuon", "Muon_pt"])
>>> npys
{'nMuon': ndarray([2, 2, 1, 4, 4, 3, 2, 2, 2, 2], dtype=uint32), 'Muon_pt': ndarray([<cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad52490>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad524d0>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad52510>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad52550>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad52590>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad525d0>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad52610>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad52650>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad52690>,
         <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x5640aad526d0>],
        dtype=object)}
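The same situation can be reproduced without ROOT. The sketch below is only an illustration: the values are copied from the first rows of the example table, and plain float32 NumPy arrays stand in for the RVec<float> objects (an assumption made so that the snippet runs with NumPy alone). It builds an object-dtype array with one variable-length entry per event and shows that NumPy refuses to turn it into a rectangular float array, which is what matplotlib attempted in the traceback above.

```python
import numpy as np

# Hypothetical per-event values taken from the first rows of the example table;
# float32 arrays stand in for the RVec<float> objects returned by AsNumpy.
events = [[10.7637, 15.7365], [10.5385, 16.3271], [3.27533]]

# One slot per event, each holding a sequence of arbitrary length.
jagged = np.empty(len(events), dtype=object)
for i, values in enumerate(events):
    jagged[i] = np.asarray(values, dtype=np.float32)

lengths = [len(v) for v in jagged]

# Asking for a float array fails because the rows have different lengths.
try:
    np.asarray(jagged, dtype=float)
    conversion_failed = False
except (TypeError, ValueError):
    conversion_failed = True

print(jagged.dtype, lengths, conversion_failed)
```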

There is no “direct translation” of this jagged array to a numpy array, and this does not depend on ROOT. For this use case, you probably want to use the integration between RDataFrame and the Awkward Array library, which allows manipulation of jagged arrays in Python; see "How to convert to/from ROOT RDataFrame" in the Awkward Array 2.4.3 documentation.

Cheers,
Vincenzo

lilina · September 20, 2023, 4:19pm

Thank you so much for the detailed explanation, I really appreciate it. I will try the awkward array.

Best,

Lily

vpadulan · September 20, 2023, 7:52pm

Dear @lilina ,

Meanwhile, I noticed there had been a somewhat related discussion in another forum post. In particular, you can always manually call numpy.asarray on any RVec for a fast, zero-copy conversion as discussed here. In that case the user wanted all the floats from all the events concatenated in a single flattened numpy array, which is not your case as I understand it. But, if it makes the workflow easier, you could think about applying the mask np_nosmear["set"]==1 to the outer numpy array, then calling numpy.asarray on each of the resulting RVecs and passing the result to plt.scatter.
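A rough sketch of that workflow, runnable without ROOT: the column names follow the thread, but the values are made up, and Python lists stand in for the RVecs that AsNumpy would return (both are assumptions for illustration). Flattening with numpy.concatenate is one possible way to obtain matching x and y arrays for plt.scatter; it copies the data.

```python
import numpy as np

# Made-up stand-ins for the dictionary returned by AsNumpy:
# "set" is a regular integer column, "phi" and "F" hold one sequence per event.
set_col = np.array([1, 2, 1])
phi_rows = [[0.0, 90.0], [0.0, 180.0, 270.0], [45.0]]
f_rows = [[0.1, 0.2], [0.3, 0.4, 0.5], [0.6]]

phi_col = np.empty(len(phi_rows), dtype=object)
f_col = np.empty(len(f_rows), dtype=object)
for i in range(len(phi_rows)):
    phi_col[i] = phi_rows[i]
    f_col[i] = f_rows[i]

# 1) Apply the mask to the outer (per-event) arrays.
mask = set_col == 1

# 2) Convert each selected entry with np.asarray, then flatten into 1-D arrays.
phi_flat = np.concatenate([np.asarray(v) for v in phi_col[mask]])
f_flat = np.concatenate([np.asarray(v) for v in f_col[mask]])

# phi_flat and f_flat have the same length, so they could be passed
# to plt.scatter(phi_flat, f_flat).
print(phi_flat, f_flat)
```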

Let me know if this helps,
Cheers,
Vincenzo

lilina · September 20, 2023, 8:17pm

Thank you so much! Applying the mask np_nosmear["set"]==1 and then calling numpy.asarray on the resulting RVecs allowed me to plot the arrays.

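import ROOT
import numpy as np
import matplotlib.pyplot as plt

df_nosmear = ROOT.RDataFrame("dvcs", "./pseudo_KM15_BKM10_hallA_t2_nosmear_5pct.root")
np_nosmear = df_nosmear.AsNumpy(columns=["set", "F", "phi"])  # F and phi are jagged arrays
mask = np_nosmear['set'] == 1
# Convert each selected RVec with np.asarray, then flatten into one 1-D array per column
np_F = np.concatenate([np.asarray(e) for e in np_nosmear['F'][mask]])
np_phi = np.concatenate([np.asarray(e) for e in np_nosmear['phi'][mask]])
plt.scatter(np_phi, np_F)  # F vs phi
plt.show()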

Thanks a lot!

system · October 4, 2023, 8:18pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.