How to read a ROOT file in parallel?

Li_Huang · April 4, 2018, 8:36pm

Is there a way to convert:

“for event in chain”

to a parallel loop, for example using multiprocessing in Python?

Danilo · April 4, 2018, 8:57pm

Hi,

Yes, there is, even using threads, which makes much more efficient use of resources: TDataFrame.
Here you can have a look at some examples too.

Cheers,
D

sbinet · April 5, 2018, 7:45am

and if you are adventurous, you can even use the nice goroutines from Go (ie: “green threads”), using Go-HEP + rootio:

package main

import (
	"io"
	"log"
	"sync"

	"go-hep.org/x/hep/rootio"
)

func main() {
	f, err := rootio.Open("testdata/small-flat-tree.root")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	obj, err := f.Get("tree")
	if err != nil {
		log.Fatal(err)
	}

	tree := obj.(rootio.Tree)

	// bind the scanner with a data type
	sc, err := rootio.NewTreeScanner(tree, &Data{})
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	var wg sync.WaitGroup // tracks the goroutines still running
	for sc.Next() {
		var data Data
		err := sc.Scan(&data)
		if err != nil {
			log.Fatal(err)
		}

		wg.Add(1)
		go func() { // process the data in a separate goroutine
			defer wg.Done()
			process(data)
		}()
	}

	if err := sc.Err(); err != nil && err != io.EOF {
		log.Fatal(err)
	}

	wg.Wait() // do not return from main before all goroutines are done
}

func process(data Data) {
	// ...
}

type Data struct {
	I64    int64       `rootio:"Int64"`
	F64    float64     `rootio:"Float64"`
	Str    string      `rootio:"Str"`
	ArrF64 [10]float64 `rootio:"ArrayFloat64"`
	N      int32       `rootio:"N"`
	SliF64 []float64   `rootio:"SliceFloat64"`
}

I’ve found the concurrency primitives of Go to be easier to reason about than the rather low-level ones of C++0x (such as mutexes, locks and std::thread) and more performant than, say, std::future/std::promise.

I am not the only one:

eguiraud · April 5, 2018, 9:13am

Hi,
I totally agree that end users should absolutely not have to think in terms of low-level threading primitives!

Here’s the basic parallel “for event in chain” in TDataFrame:

ROOT::EnableImplicitMT(); // enable multi-threading
TDataFrame d("tree", {"f1.root", "f2.root"});
d.Filter(selectionFun, {"branch1", "branch2"}).Foreach(doWorkFun, {"b2", "b3"});

C++ doesn’t have to be low-level

EDIT: almost the same code also works in python, see tutorials linked by @Danilo

Cheers,
Enrico

sbinet · April 5, 2018, 9:43am

sure

I also plan to have something like that to work with ROOT.
I have an API in terms of n-tuples that accepts SQL statements + closures:

https://godoc.org/go-hep.org/x/hep/hbook/ntup#Ntuple.Scan

(I “just” need to connect this API with ROOT :P)

but, with Go and channels+goroutines, you can leverage all the features of the language to achieve what you want to do w/o being “constrained” by a library that may be opinionated (sometimes for good reasons.)

everything is a tradeoff.

Li_Huang · April 8, 2018, 2:05am

Thanks!

This seems to be what I want, and I have read the examples. But I have a problem: how do I call ROOT::EnableImplicitMT() in Python?

When I try from ROOT import EnableImplicitMT, an error is raised:

----> 1 from ROOT import EnableImplicitMT

ImportError: cannot import name EnableImplicitMT

Best,
Li

Li_Huang · April 8, 2018, 2:06am

Hi,

Thanks a lot! I read the examples but I want to use Python. Do you know how to apply ROOT::EnableImplicitMT() in Python?

Best,
Li

Danilo · April 8, 2018, 5:48am

Hi Li,

ROOT.ROOT.EnableImplicitMT()

Cheers,
D

Li_Huang · April 8, 2018, 6:05am

Hi Dpiparo,

Thank you.
I find this approach limiting. It requires a function with a TTreeReader parameter. I haven’t used C++ for a long time, and I have always thought of the ROOT file ( TChain ) as an iterator…

I have another question. I tried a very simple way to parallelize it, like:

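    def run(chain, start, stop):
        # handle only the entries whose index falls in [start, stop)
        for idx, event in enumerate(chain):
            if start <= idx < stop:
                process(event)  # placeholder for "do something"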

and I run this function “run” in parallel ( a different idx range for each run ). It works, but the problem is that it doesn’t save any time at all. This confuses me.
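One possible explanation, assuming each worker is a separate process that loops over the full chain as in `run` above: every worker still loads every entry and only skips the work, so the reading cost is paid in full by each of them. The sketch below illustrates that with a stand-in for the event read. `read_entry`, `run_filtered`, `run_sliced`, `split` and `N_ENTRIES` are hypothetical names made up for this illustration, not ROOT APIs, and no ROOT file is actually opened.

```python
from multiprocessing import Pool

N_ENTRIES = 1000  # hypothetical number of entries in the chain


def read_entry(idx):
    """Stand-in for loading one event from the chain (the costly part)."""
    return idx * 2


def run_filtered(bounds):
    """Pattern from the post: walk the whole chain, act only on [start, stop)."""
    start, stop = bounds
    reads, total = 0, 0
    for idx in range(N_ENTRIES):
        value = read_entry(idx)  # every entry is loaded, even if skipped below
        reads += 1
        if start <= idx < stop:
            total += value
    return reads, total


def run_sliced(bounds):
    """Alternative: visit only the entries in [start, stop)."""
    start, stop = bounds
    reads, total = 0, 0
    for idx in range(start, stop):
        total += read_entry(idx)
        reads += 1
    return reads, total


def split(n_entries, n_workers):
    """Cut [0, n_entries) into contiguous half-open ranges, one per worker."""
    step = max(1, -(-n_entries // n_workers))  # ceiling division
    return [(lo, min(lo + step, n_entries)) for lo in range(0, n_entries, step)]


if __name__ == "__main__":
    ranges = split(N_ENTRIES, 4)
    with Pool(4) as pool:
        filtered = pool.map(run_filtered, ranges)
        sliced = pool.map(run_sliced, ranges)
    print("reads per worker (filtered):", [reads for reads, _ in filtered])
    print("reads per worker (sliced):  ", [reads for reads, _ in sliced])
```

Both variants compute the same totals, but the filtered one performs the full number of reads in every worker. With a real TChain, the sliced variant would presumably correspond to each process opening its own chain and calling `chain.GetEntry(i)` only for the indices in its range; whether that pays off depends on how heavy the per-event work is compared with the read.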

I am also really confused by the mechanism of the ROOT file. I thought it was a directory, and therefore the same kind of thing as a “dataloader”. If so, why is it so hard to access the information in parallel?

Best,
Li

Danilo · April 8, 2018, 7:22am

Hi Li,

I find this approach limiting. It requires a function with a TTreeReader parameter. I haven’t used C++ for a long time, and I have always thought of the ROOT file ( TChain ) as an iterator…

I do not understand this comment. There is no trace of TTreeReaders when using TDataFrame. In addition, the point of TDataFrame is to express analyses and, more generally, dataset manipulations in a declarative way, without dealing with the event loop.

and I run this function “run” in parallel ( a different idx range for each run ). It works, but the problem is that it doesn’t save any time at all. This confuses me.

I am not sure how your parallelisation works (multiprocess? how do you serialise back the results?) or what work is performed, so in short I cannot comment.

I am also really confused by the mechanism of the ROOT file. I thought it was a directory, and therefore the same kind of thing as a “dataloader”. If so, why is it so hard to access the information in parallel?

I am not sure what you find difficult: could you elaborate?

Cheers,
D

Li_Huang · April 9, 2018, 3:56am

Hi Dpiparo,

Sorry for the late reply; it is really taking me some time to understand the mechanism of TDataFrame. I will give more details once I fully understand it.

Best,
Li

Danilo · April 9, 2018, 7:06am

Hi Li,

Do not hesitate to ask if you have any questions.

Cheers,
D
