How to read a ROOT file in parallel?

Is there a way to convert

“for event in chain”

into a parallel loop, for example using multiprocessing in Python?

Hi,

Yes, there is, and it even uses threads, which makes much more efficient use of resources: TDataFrame.
You can have a look at some examples here, too.

Cheers,
D

And if you are adventurous, you can even use the nice goroutines from Go (i.e. “green threads”), using Go-HEP + rootio:

func main() {
	f, err := rootio.Open("testdata/small-flat-tree.root")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	obj, err := f.Get("tree")
	if err != nil {
		log.Fatal(err)
	}

	tree := obj.(rootio.Tree)

	// bind the scanner with a data type
	sc, err := rootio.NewTreeScanner(tree, &Data{})
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

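	var wg sync.WaitGroup // needs the "sync" package imported
	for sc.Next() {
		var data Data
		err := sc.Scan(&data)
		if err != nil {
			log.Fatal(err)
		}

		wg.Add(1)
		go func(d Data) { // process the data in a separate goroutine
			defer wg.Done()
			process(d)
		}(data)
	}
	wg.Wait() // wait for all goroutines, otherwise main may return before they finish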

	if err := sc.Err(); err != nil && err != io.EOF {
		log.Fatal(err)
	}
}

func process(data Data) {
	// ...
}

type Data struct {
	I64    int64       `rootio:"Int64"`
	F64    float64     `rootio:"Float64"`
	Str    string      `rootio:"Str"`
	ArrF64 [10]float64 `rootio:"ArrayFloat64"`
	N      int32       `rootio:"N"`
	SliF64 []float64   `rootio:"SliceFloat64"`
}

I’ve found the concurrency primitives of Go to be easier to reason about than the rather low-level ones of C++0x (such as mutexes, locks and std::thread) and more performant than, say, std::future/std::promise.

I am not the only one:

Hi,
I totally agree that end users should absolutely not have to think in terms of low-level threading primitives!

Here’s the basic parallel “for event in chain” in TDataFrame:

ROOT::EnableImplicitMT(); // enable multi-threading
TDataFrame d("tree", {"f1.root", "f2.root"});
d.Filter(selectionFun, {"branch1", "branch2"}).Foreach(doWorkFun, {"b2", "b3"});

C++ doesn’t have to be low-level :smile:

EDIT: almost the same code also works in Python; see the tutorials linked by @dpiparo.

Cheers,
Enrico


sure :slight_smile:

I also plan to have something like that to work with ROOT.
I have an API in terms of n-tuples that accepts SQL statements + closures:

(I “just” need to connect this API with ROOT :P)

But with Go and channels + goroutines, you can leverage all the features of the language to achieve what you want without being “constrained” by a library that may be opinionated (sometimes for good reasons).

everything is a tradeoff.

Thanks!

This seems to be what I want, and I have read the examples. But I have a problem: how do I call ROOT::EnableImplicitMT() in Python?

When I try from ROOT import EnableImplicitMT, an error is raised:

----> 1 from ROOT import EnableImplicitMT

ImportError: cannot import name EnableImplicitMT

Best,
Li

Hi,

Thanks a lot! I have read the examples, but I want to use Python. Do you know how to call ROOT::EnableImplicitMT() in Python?

Best,
Li

Hi Li,

ROOT.ROOT.EnableImplicitMT()
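
The import in the question fails because the Python module ROOT corresponds to the C++ global namespace, while EnableImplicitMT lives in the C++ namespace ROOT, hence the doubled name. A minimal sketch of how this could be wrapped, assuming PyROOT may or may not be installed (the helper name enable_implicit_mt is made up for illustration):

```python
def enable_implicit_mt():
    """Hypothetical helper: try to enable ROOT's implicit multi-threading.

    Returns True if ROOT reports implicit MT as enabled afterwards,
    and False if PyROOT is not importable or MT could not be enabled.
    """
    try:
        import ROOT  # PyROOT; assumed to be possibly absent on this machine
    except ImportError:
        return False
    # The C++ function ROOT::EnableImplicitMT() is reached as an attribute
    # of the namespace object ROOT.ROOT, not of the top-level module.
    ROOT.ROOT.EnableImplicitMT()
    return bool(ROOT.ROOT.IsImplicitMTEnabled())


mt_enabled = enable_implicit_mt()
print("implicit MT enabled:", mt_enabled)
```

If this prints False on a machine where ROOT is installed, the ROOT build may lack implicit multi-threading support.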

Cheers,
D

Hi Dpiparo,

Thank you.
I find this approach limiting. It requires a function with a TTreeReader parameter, and I haven’t used C++ for a long time; I have always treated the ROOT file (TChain) as an iterator…

I have another question. I tried a very simple way to parallelize it, like:

    def run(chain, start, stop):
        for idx, event in enumerate(chain):
            if start <= idx < stop:
                process(event)  # do something with the event

and ran this function “run” in parallel (a different idx range for each run). It works, but the problem is that it doesn’t save any time at all. This confuses me.

I am also really confused by how a ROOT file works. I thought it was a directory, and therefore the same kind of thing as a “dataloader”. If so, why is it so hard to access the information in parallel?

Best,
Li

Hi Li,

I find this approach limiting. It requires a function with a TTreeReader parameter, and I haven’t used C++ for a long time; I have always treated the ROOT file (TChain) as an iterator…

I do not understand this comment. There is no trace of TTreeReader when using TDataFrame. In addition, the point of TDataFrame is to express analyses and, more generally, dataset manipulations in a declarative way, without dealing with the event loop.

and ran this function “run” in parallel (a different idx range for each run). It works, but the problem is that it doesn’t save any time at all. This confuses me.

I am not sure how your parallelisation works (multiprocessing? how do you serialise the results back?) or what work is performed, so in short I cannot comment.
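
That said, one common cause of this symptom can be illustrated without ROOT. If every worker loops over the whole chain and merely skips the entries outside its range, each worker still reads every entry, so the total reading work grows with the number of workers instead of being divided among them. The sketch below is only a simulation under that assumption: fake_chain is a made-up stand-in for a TChain that counts reads, and run_direct stands in for fetching entries by index (for example with something like chain.GetEntry(idx)).

```python
def fake_chain(n_entries, counter):
    """Hypothetical stand-in for a TChain: yields entries and counts each read."""
    for i in range(n_entries):
        counter[0] += 1  # every yielded entry costs one read
        yield i


def run_filtering(n_entries, start, stop):
    """Pattern from the question: iterate over everything, skip what is out of range."""
    reads = [0]
    processed = 0
    for idx, event in enumerate(fake_chain(n_entries, reads)):
        if start <= idx < stop:
            processed += 1  # "do something" with the event
    return reads[0], processed


def run_direct(n_entries, start, stop):
    """Alternative: touch only the entries in [start, stop)."""
    reads = 0
    processed = 0
    for idx in range(start, min(stop, n_entries)):
        reads += 1  # stands in for fetching one entry by index
        processed += 1
    return reads, processed


n_entries, n_workers = 1000, 4
step = n_entries // n_workers
ranges = [(w * step, (w + 1) * step) for w in range(n_workers)]

# Number of entries each (simulated) worker has to read.
filtering_reads = [run_filtering(n_entries, a, b)[0] for a, b in ranges]
direct_reads = [run_direct(n_entries, a, b)[0] for a, b in ranges]

print(filtering_reads)  # [1000, 1000, 1000, 1000]
print(direct_reads)     # [250, 250, 250, 250]
```

In the filtering version every worker pays for all 1000 reads, so if reading dominates the cost, running four of them side by side cannot finish sooner than a single sequential loop.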

I am also really confused by how a ROOT file works. I thought it was a directory, and therefore the same kind of thing as a “dataloader”. If so, why is it so hard to access the information in parallel?

I am not sure about what you find difficult: could you elaborate?

Cheers,
D

Hi Dpiparo,

Sorry for the late reply; it is taking me some time to understand how TDataFrame works. I will give more details once I fully understand it.

Best,
Li

Hi Li,

Do not hesitate to ask if you have any questions.

Cheers,
D
