If the commented line is uncommented, I get the NotImplementedError on it. However, initialising another vector directly from PyROOT seems to fix this error, and the += operation on the vector created in C++ starts to work. Is it a bug, or am I missing something here?
I just want to store numpy arrays in vectors in a C++ class, thus I need += operator working on those vectors.
ROOT Version: 6.22.06 Platform: Fedora 33 Compiler: Not Provided
Thanks! I know the preferred pythonic way, however since Python classes can’t be nicely serialised to a TTree, the “ProcessLine” version is sometimes a must… I do it in another way now, and here is probably another bug, or rather a missing feature.
Both v += np.array and ROOT.vector(“double”)(np.array) are very slow. I suspect there may be a Python loop involved. When I pass the numpy array as double * into my class constructor and use vector::assign() inside the constructor, it is much faster. I didn’t measure by how much, but probably 10 or even 100 times.
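For reference, a minimal sketch of that constructor approach, with a hypothetical holder class (the class and member names are illustrative, not the original code): the constructor takes a double* plus a length and fills its vector with a single vector::assign() call, so there is one contiguous copy instead of a per-element Python loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical holder class (names are illustrative, not the original code).
// The constructor receives a raw buffer (e.g. the data pointer of a numpy
// array handed over by PyROOT) and copies it in one bulk call via assign().
class ArrayHolder {
public:
    ArrayHolder(const double* data, std::size_t n) {
        values_.assign(data, data + n);  // single contiguous copy
    }
    const std::vector<double>& values() const { return values_; }

private:
    std::vector<double> values_;
};
```

On the Python side one would then pass the numpy array directly to the constructor (PyROOT can pass a contiguous float64 array where a double* is expected), e.g. something like ROOT.ArrayHolder(a, a.size).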
Thanks for reporting, @LeWhoo. It would be great if you could attach the code that you are using in the second case. We will look into this as soon as @etejedor is back.
Actually, I found out that the slow-down was caused mainly by another issue - I was filling a vector from HDF5 dataset, which is very slow. If I put the dataset inside np.array() it becomes much faster. Even if it is ROOT issue, I am not sure if it is worth investigating.
After fixing that, the difference between assign and the other methods is much smaller, but it depends on the benchmarking environment. Here is the code:
import ROOT
import numpy as np
import time
a = np.random.rand(10000)
print(a.dtype, a.size)
ts = time.process_time()
v = ROOT.vector("double")(a)
print(time.process_time()-ts)
ts = time.process_time()
v1 = ROOT.vector("double")()
v1+=a
print(time.process_time()-ts)
ts = time.process_time()
v2 = ROOT.vector("double")()
v2.assign(a)
print(time.process_time()-ts)
print(a[0], v[0], v1[0], v2[0])
The result on my command line is:
0.08343676900000008
0.031200824999999988
0.013790605000000067
So () init of vector is ~6 times slower than assign, and += is ~2 times slower.
However, if I run the same through jupyter notebooks, I get:
0.04385410700000003
0.05220824099999999
0.02533038100000007
Differences are smaller and += is slower than (). Not sure why…
Still, perhaps at least () init should default to .assign().
After running your code excerpt, I can confirm that assign() is the fastest (actually, on my machine, it is one order of magnitude faster than passing the numpy array to the constructor).
Maybe @etejedor finds it interesting to investigate these differences when he is back.
You’re looking at noise. First, use perf_counter instead of process_time; second, jack up the size of a by 100x. Then the results are repeatable and make sense (to me anyway).
The results were repeatable here, but you are right. I’ve increased the array size 1000 times, and now the += method is the slowest.
With process_time:
0.127018226
7.6567942460000005
0.05619094699999927
With perf_counter:
0.20569942897418514
8.372410651005339
0.05795212701195851
I’m new to perf_counter vs process_time, but from what I was able to find, process_time was recommended for benchmarking, as it excludes sleep and slowdowns of the process caused by other system activity.
Sure, but the code in your example isn’t sleeping, whereas the other difference is that process_time accumulates the time of all threads. Thus, if you just import ROOT, then (unless things have changed that I’m not aware of) you also have the graphics thread doing whatever, and all its cycles are accounted for in the total, too. That will vary by quite a bit from run to run, and (again, unless things have changed) it has startup work to do, hitting the first loop harder than the others.
(For that matter, given that cppyy and Cling have lazy initialization running on the first call, such as creating the wrappers and deserializing all necessary IR from the PCH or PCMs, you may want to run a warmup round regardless.)
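For what it’s worth, the behaviour of the two clocks can be demonstrated without ROOT at all: process_time ignores time spent sleeping but adds up CPU time across all threads of the process, whereas perf_counter measures wall-clock time. A small self-contained check:

```python
import time
import threading

def burn(n):
    # Busy loop: consumes CPU time in a worker thread.
    s = 0
    for i in range(n):
        s += i
    return s

# 1) Sleeping: visible to perf_counter, invisible to process_time.
t_wall, t_cpu = time.perf_counter(), time.process_time()
time.sleep(0.2)
wall_sleep = time.perf_counter() - t_wall
cpu_sleep = time.process_time() - t_cpu
print(f"sleep:   wall={wall_sleep:.3f}s cpu={cpu_sleep:.3f}s")  # cpu ~0

# 2) Worker threads: their CPU time is added to process_time,
#    even though the main thread only waits on join().
t_wall, t_cpu = time.perf_counter(), time.process_time()
workers = [threading.Thread(target=burn, args=(500_000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
wall_threads = time.perf_counter() - t_wall
cpu_threads = time.process_time() - t_cpu
print(f"threads: wall={wall_threads:.3f}s cpu={cpu_threads:.3f}s")
```

(In pure Python the GIL serialises the workers, so the accumulation effect is much more dramatic with C++ threads, as in the script below; the sleep case still shows the difference clearly.)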
Anyway, see the script below as an example of what I mean: perf_counter is constant with the number of threads, process_time accumulates.
import cppyy
import time
cppyy.cppdef("""\
#include <cmath>
#include <thread>
#include <vector>

double calc(size_t sz) {
    double res = 0.;
    for (size_t i = 0; i < sz; ++i)
        res *= std::atan(i);
    return res;
}

void multi(int n, size_t sz) {
    std::vector<std::thread> workers(n);
    for (int i = 0; i < n; i++)
        workers[i] = std::thread(calc, sz);
    for (auto& w: workers)
        w.join();
}""")
N = 4
SZ = 100000000
for i in range(N):
    ts = time.perf_counter()
    cppyy.gbl.multi(i, SZ)
    ts = time.perf_counter() - ts
    print("perf:", i, ts)

for i in range(N):
    ts = time.process_time()
    cppyy.gbl.multi(i, SZ)
    ts = time.process_time() - ts
    print("proc:", i, ts)