TChain parallel reading - more than 1 branch

Hi ROOT experts,
I would like to process a TChain using ROOT::TProcessExecutor. I just read the multicore example https://root.cern.ch/doc/v608/mp102__readNtuplesFillHistosAndFit_8C.html and it works. But what if I need to process more than one branch in order to build several histos?

Thanks
Enrico


ROOT Version: 6.22.02
Platform: linux
Compiler: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0


Hi @ecatanzani,
and welcome to the ROOT forum.

I believe returning a TList of histograms should work in that case.

However for simple plotting needs I would suggest taking a look at RDataFrame, which is often friendlier.

For example, the two histograms below are filled in the same event loop, in parallel across multiple threads, with these lines:

ROOT::EnableImplicitMT(); // Tell ROOT you want to go parallel
ROOT::RDataFrame d("myTree", "file_*.root"); // Interface to TTree and TChain
auto h1 = d.Histo1D("Branch_A");
auto h2 = d.Filter("Branch_B > 0").Histo1D("Branch_C");
h1->Draw();
h2->Draw();

Cheers,
Enrico

Great, thanks, that is a very interesting library!
I’m building the code using a makefile and I get these errors when I add the data frame header:

#include <ROOT/RDataFrame.hxx>

out.txt (2.6 MB)

This is the makefile:

MKDIR_P       ?=mkdir -p
COMPILER      ?=g++
TESTS         ?=true
TOP           ?=$(shell pwd)
HAVE_TERM     :=$(shell echo $$TERM)
#undef to none (linux)
ifndef HAVE_TERM
	HAVE_TERM = none
endif
#dumb to none (macOS)
ifeq ($(HAVE_TERM),dumb)
	HAVE_TERM = none
endif

#dependencies
DIPS_INCLUDE = $(shell root-config --incdir)
DIPS_LIBS = $(shell root-config --ldflags) $(shell root-config --libs) 

S_DIR  = $(TOP)/source/
S_INC  = $(TOP)/include/

DEBUG_DIR    = Debug/obj
RELEASE_DIR  = Release/obj
DEBUG_PROG   = Debug/Kompressor
RELEASE_PROG = Release/Kompressor

# Don't compile specific files
FILTER := 

# Subdirs
SUB_DIRS := $(wildcard $(S_DIR)/**/.)\
            $(wildcard $(S_DIR)/**/**/.)
SUB_DIRS := $(subst $(S_DIR)/,,$(SUB_DIRS))

####################################################
# C FLAGS
C_FLAGS = -fPIC -D_FORCE_INLINES
# CPP FLAGS
CC_FLAGS = -std=c++14 -I$(S_INC)
# RELEASE_FLAGS
RELEASE_FLAGS = -O3
# DEBUG_FLAGS
DEBUG_FLAGS = -g -D_DEBUG -Wall -Wno-unknown-pragmas
# Linker
LDFLAGS :=
#add dips
ifneq ($(DIPS_INCLUDE),)
	CC_FLAGS += -I$(DIPS_INCLUDE)
	LDFLAGS += $(DIPS_LIBS)
endif

####################################################
# Flags by OS
ifeq ($(shell uname -s),Linux) # LINUX
# threads
C_FLAGS += -pthread
LDFLAGS += -lpthread
endif

####################################################
# All source files
ALL_SOURCE_FILES := $(wildcard $(S_DIR)/*.cpp)\
					$(wildcard $(S_DIR)/**/*.cpp)\
					$(wildcard $(S_DIR)/**/**/*.cpp)
					
####################################################
# Object files
SOURCE_FILES = $(filter-out $(FILTER), $(ALL_SOURCE_FILES))
SOURCE_DEBUG_OBJS = $(subst $(S_DIR),$(DEBUG_DIR),$(subst .cpp,.o,$(SOURCE_FILES)))
SOURCE_RELEASE_OBJS = $(subst $(S_DIR),$(RELEASE_DIR),$(subst .cpp,.o,$(SOURCE_FILES)))

SOURCE_DEPENDENCY_FILES = $(filter-out $(FILTER), $(ANYOPT_SRC))
SOURCE_DEBUG_DEPENDENCY_OBJS = $(subst $(S_DIR),$(DEBUG_DIR),$(subst .cpp,.o,$(SOURCE_DEPENDENCY_FILES)))
SOURCE_RELEASE_DEPENDENCY_OBJS = $(subst $(S_DIR),$(RELEASE_DIR),$(subst .cpp,.o,$(SOURCE_DEPENDENCY_FILES)))

####################################################
# Output dirs
O_DEBUG_DIR    = $(TOP)/$(DEBUG_DIR)
O_RELEASE_DIR  = $(TOP)/$(RELEASE_DIR)
O_DEBUG_PROG   = $(TOP)/$(DEBUG_PROG)
O_RELEASE_PROG = $(TOP)/$(RELEASE_PROG)

##
# Support function for colored output
# Args:
#     - $(1) = Color Type
#     - $(2) = String to print
ifneq ($(HAVE_TERM),none)
define colorecho
	@tput setaf $(1)
	@echo $(2)
	@tput sgr0
endef
else
define colorecho
	@echo $(2)
endef
endif

# Color Types
COLOR_BLACK = 0
COLOR_RED = 1
COLOR_GREEN = 2
COLOR_YELLOW = 3
COLOR_BLUE = 4
COLOR_MAGENTA = 5
COLOR_CYAN = 6
COLOR_WHITE = 7

all: directories show_debug_flags debug release clean

directories: debug_make_dirs release_make_dirs

rebuild: clean directories debug release

rebuild_debug: clean_all_debug debug

rebuild_release: clean_all_release release

debug: directories show_debug_flags $(SOURCE_DEBUG_OBJS) $(SOURCE_DEBUG_DEPENDENCY_OBJS) $(TEST_SOURCE_DEBUG_OBJS)
	$(COMPILER) $(C_FLAGS) $(CC_FLAGS) $(SOURCE_DEBUG_OBJS) $(SOURCE_DEBUG_DEPENDENCY_OBJS) $(LDFLAGS) -o $(O_DEBUG_PROG)

release: directories show_release_flags $(SOURCE_RELEASE_OBJS) $(SOURCE_RELEASE_DEPENDENCY_OBJS) $(TEST_SOURCE_RELEASE_OBJS)
	$(COMPILER) $(C_FLAGS) $(CC_FLAGS) $(SOURCE_RELEASE_OBJS) $(SOURCE_RELEASE_DEPENDENCY_OBJS) $(LDFLAGS) -o $(O_RELEASE_PROG)

# makedir
debug_make_dirs:
	@${MKDIR_P} $(DEBUG_DIR)
	@for dir in $(SUB_DIRS); do \
		${MKDIR_P} $(DEBUG_DIR)/$$dir; \
	done

# makedir
release_make_dirs:
	@${MKDIR_P} $(RELEASE_DIR)
	@for dir in $(SUB_DIRS); do \
		${MKDIR_P} $(RELEASE_DIR)/$$dir; \
	done

show_debug_flags:
	$(call colorecho,$(COLOR_YELLOW),"[ Debug flags: $(C_FLAGS) $(CC_FLAGS) $(DEBUG_FLAGS) ]")

show_release_flags:
	$(call colorecho,$(COLOR_YELLOW),"[ Release flags: $(C_FLAGS) $(CC_FLAGS) $(RELEASE_FLAGS) ]")

##################################################################################################################
# DEBUG
$(SOURCE_DEBUG_OBJS):
	$(call colorecho,$(COLOR_GREEN),"[ Make debug object: $(subst $(DEBUG_DIR),,$(@:.o=.cpp)) => $(subst $(TOP)/,,$(@)) ]")
	@$(COMPILER) $(C_FLAGS) $(CC_FLAGS) $(DEBUG_FLAGS) -c $(subst $(DEBUG_DIR),$(S_DIR),$(@:.o=.cpp)) -o $@

##################################################################################################################
# RELEASE
$(SOURCE_RELEASE_OBJS):
	$(call colorecho,$(COLOR_GREEN),"[ Make release object: $(subst $(RELEASE_DIR),,$(@:.o=.cpp)) => $(subst $(TOP)/,,$(@)) ]")
	@$(COMPILER) $(C_FLAGS) $(CC_FLAGS) $(RELEASE_FLAGS) -c $(subst $(RELEASE_DIR),$(S_DIR),$(@:.o=.cpp)) -o $@

# Clean
clean: clean_debug clean_release
clean_all: clean_all_debug clean_all_release

clean_debug:
	$(call colorecho,$(COLOR_MAGENTA),"[ Delete debug obj files ]")
	@rm -f -R $(O_DEBUG_DIR)
	@rm -f $(ANYOPT_INC)/anyoption.o

clean_release:
	$(call colorecho,$(COLOR_MAGENTA),"[ Delete release obj files ]")
	@rm -f -R $(O_RELEASE_DIR)
	@rm -f $(ANYOPT_INC)/anyoption.o

clean_all_debug:
	$(call colorecho,$(COLOR_MAGENTA),"[ Delete debug obj files ]")
	@rm -f -R $(O_DEBUG_DIR)
	@rm -f $(ANYOPT_INC)/anyoption.o
	$(call colorecho,$(COLOR_MAGENTA),"[ Delete debug executable files ]")
	@rm -f -R $(O_DEBUG_PROG)

clean_all_release:
	$(call colorecho,$(COLOR_MAGENTA),"[ Delete release obj files ]")
	@rm -f -R $(O_RELEASE_DIR)
	@rm -f $(ANYOPT_INC)/anyoption.o
	$(call colorecho,$(COLOR_MAGENTA),"[ Delete release executable files ]")
	@rm -f -R $(O_RELEASE_PROG)

while the code is very simple:

#include "main.h"

int main(int argc, char** argv)
{
	std::cout << "\nHello World\n";
	return 0;
}

#ifndef MAIN_H
#define MAIN_H

#include <iostream>
#include <ROOT/RDataFrame.hxx>

#pragma once

#endif

I don’t understand what is going on.
The same code builds correctly on Mac…

I think you are passing -std=c++14 but your ROOT was compiled for C++11 (to verify, you can e.g. check the output of root-config --cflags).

You can either switch the makefile to -std=c++11 or use a ROOT build compiled with C++14.
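One way to see which standard a translation unit is actually compiled with is to inspect the __cplusplus macro. The following is a minimal sketch, not part of ROOT; the helper names cxx_standard_name and this_build_standard are made up for this illustration.

```cpp
#include <string>

// Hypothetical helper (not a ROOT function): maps a value of the
// __cplusplus macro to a readable standard name.
std::string cxx_standard_name(long value)
{
    if (value >= 202002L) return "C++20 or later";
    if (value >= 201703L) return "C++17";
    if (value >= 201402L) return "C++14";
    if (value >= 201103L) return "C++11";
    return "pre-C++11";
}

// Standard used for the translation unit that contains this code,
// i.e. whatever -std= flag the makefile passed to the compiler.
std::string this_build_standard()
{
    return cxx_standard_name(__cplusplus);
}
```

Printing this_build_standard() from the program and comparing it with the -std= flag reported by root-config --cflags shows whether the two agree.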

Yes, it works ! Thanks a lot for all your precious help
Best

Enrico

Just another simple question, if I may…
What if I need to fill a vector of histos using RDF?

That’s my idea:
each histogram refers to a certain energy bin, so I need a separate histo for each energy bin.

auto energy_nbins = (int)energy_binning.size() - 1;
std::vector<ROOT::RDF::RResultPtr<TH1D>> h_sumRms_bin(energy_nbins);
int bin_idx = 1;
auto bin_filter = [=](int energy_bin) -> bool { return energy_bin == bin_idx; };
for (; bin_idx <= energy_nbins; ++bin_idx)
{
    h_sumRms_bin[bin_idx - 1] = _fr_bgo_analysis.Filter(bin_filter, {"energy_bin"})
                                    .Histo1D({(std::string("h_sumRms_bin_") + std::to_string(bin_idx)).c_str(),
                                              (std::string("sumRms - bin ") + std::to_string(bin_idx)).c_str(),
                                              1000, 0, 3000},
                                             "sumRms", "simu_energy_w_pathc");
}

These are just a few lines of the code, but all my histos are identical… can you please explain why?

Manually selecting the filter outside the for loop works…
Thanks,

Enrico

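In [=](int energy_bin), the = means “copy external variables by value”. The copy is made when the lambda is created, which here happens once, before the loop, while bin_idx is 1. So bin_idx is always 1 in that lambda.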

You probably want:


for (int bin_idx = 1; bin_idx <= energy_nbins; ++bin_idx)
{
    auto bin_filter = [=](int energy_bin) -> bool { return energy_bin == bin_idx; };
    h_sumRms_bin[bin_idx - 1] = _fr_bgo_analysis.Filter(bin_filter, ...)....
}
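The difference can be reproduced without ROOT. The sketch below is a plain C++ illustration under the assumption that each filter is just a callable taking the bin number; the function names make_filters_captured_once and make_filters_per_iteration are made up for this example.

```cpp
#include <functional>
#include <vector>

// Problematic pattern: the lambda is created once, before the loop.
// [=] copies bin_idx at that moment, when it is still 1.
std::vector<std::function<bool(int)>> make_filters_captured_once(int nbins)
{
    std::vector<std::function<bool(int)>> filters;
    int bin_idx = 1;
    auto bin_filter = [=](int energy_bin) -> bool { return energy_bin == bin_idx; };
    for (; bin_idx <= nbins; ++bin_idx)
        filters.push_back(bin_filter); // every entry still compares against 1
    return filters;
}

// Corrected pattern: a new lambda is created in each iteration,
// so each one copies the current value of bin_idx.
std::vector<std::function<bool(int)>> make_filters_per_iteration(int nbins)
{
    std::vector<std::function<bool(int)>> filters;
    for (int bin_idx = 1; bin_idx <= nbins; ++bin_idx)
        filters.push_back([=](int energy_bin) -> bool { return energy_bin == bin_idx; });
    return filters;
}
```

With three bins, the third filter from the first function accepts only bin 1, while the third filter from the second function accepts only bin 3. Capturing by reference with [&] would not help in the original code either: the filters run later, when the event loop is triggered, and by then bin_idx holds its final value.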

Sure ! My bad, sorry for the stupid question.
Thanks,
Best
Enrico

Using RDF I am now able to read big trees in a few seconds, using multithreading.
In my code I have ROOT::RDF::RResultPtr<TH1D> objects, or vectors of them, and I need to write them to disk. I would like to ask your opinion on the correct way to do that. Using the simple Write() method is really slow: the whole script takes much longer to complete compared to the case where the histos are not saved to disk.

What is the correct way to save RDF histos? Do I need to write them in parallel?
Thanks

RDF histograms are not special: a RResultPtr<TH1D> is just a pointer to a normal TH1D object (that triggers the RDF event loop upon first access).

How many histograms are we talking about, and how long does it take to write them to disk? Are you sure you are not counting the time of the event loop too?

Cheers,
Enrico

I’m timing the whole script, with and without saving the histos, and the difference is huge.
See below for how I create the histos and how they are saved to disk.

Not saving the histos:

real	0m8.749s
user	0m3.532s
sys	0m0.687s

Saving the histos: I stopped the process after 10 minutes and it was still working.

Here is how the histos are written to disk (at the end of the script):
reader.cpp (33.0 KB)

If you don’t save the histograms, are you sure the RDF event loop is started? E.g. do you print their number of entries, or otherwise access their contents? If you never access the contents of a RResultPtr, the event loop is never started.
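To make the timing effect concrete, here is a toy model of "compute on first access" in plain C++. It is only an analogy and not how ROOT implements RResultPtr; LazyResult, ToyHisto and count_event_loops are hypothetical names introduced for this sketch.

```cpp
#include <functional>
#include <memory>
#include <utility>

// Toy stand-in for a lazily evaluated result. This is NOT ROOT's
// RResultPtr; it only mimics the "compute on first access" behaviour.
template <typename T>
class LazyResult {
public:
    explicit LazyResult(std::function<T()> compute) : compute_(std::move(compute)) {}

    // The first access runs the computation; later accesses reuse the value.
    T *operator->()
    {
        if (!value_)
            value_.reset(new T(compute_()));
        return value_.get();
    }

private:
    std::function<T()> compute_;
    std::unique_ptr<T> value_;
};

// Hypothetical histogram-like type, used only for this illustration.
struct ToyHisto {
    long entries;
};

// Returns how many times the pretend event loop ran.
int count_event_loops(bool access_result)
{
    int loops_run = 0;
    LazyResult<ToyHisto> h([&loops_run]() -> ToyHisto {
        ++loops_run; // stands in for the expensive loop over all events
        return ToyHisto{30000};
    });
    if (access_result) {
        long n = h->entries; // first access: the loop runs here
        n += h->entries;     // second access: cached, no new loop
        (void)n;
    }
    return loops_run;
}
```

In this model count_event_loops(false) runs no loop at all, which is the situation of a script that books histograms but never touches them; count_event_loops(true) runs it exactly once. A fair timing comparison therefore needs the results to be accessed in both versions of the script.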

How many histograms are we talking about? And where are they written out (local disk vs EOS might make a big difference)?

How long does it take to write the same number of histograms with roughly similar binning from a separate program that does just that? E.g. this runs in seconds on my laptop:

auto outfile = TFile::Open("f.root", "recreate");
for (int i = 0; i < 10000; ++i) {
   TH1D h(("h" + std::to_string(i)).c_str(), "h", 100, 0, 1);
   h.Write();
}
outfile->Close();

Cheers,
Enrico

Hi, thanks for all the valuable advice.
I had misunderstood the lazy operations.

I would like to have your opinion on this:

This is a part of the TTree that I access with RDF:

*Br   66 :NUD_ADC   : vector<double>                                         *
*Entries :    30000 : Total  Size=    1385597 bytes  File Size  =      88398 *
*Baskets :       48 : Basket Size=      32000 bytes  Compression=  15.66     *
*............................................................................*
*Br   67 :NUD_total_ADC : nud_total_adc/D                                    *
*Entries :    30000 : Total  Size=     241229 bytes  File Size  =       7339 *
*Baskets :        8 : Basket Size=      32000 bytes  Compression=  32.80     *
*............................................................................*
*Br   68 :NUD_max_ADC : nud_max_adc/D                                        *
*Entries :    30000 : Total  Size=     241205 bytes  File Size  =       7301 *
*Baskets :        8 : Basket Size=      32000 bytes  Compression=  32.97     *
*............................................................................*
*Br   69 :NUD_max_channel_ID : nud_max_channel_id/I                          *
*Entries :    30000 : Total  Size=     120909 bytes  File Size  =       4520 *
*Baskets :        4 : Basket Size=      32000 bytes  Compression=  26.63     *
*............................................................................*

Why is a column not recognised?

❯ root -l simu_result_9.root
root [0] 
Attaching file simu_result_9.root as _file0...
(TFile *) 0x5558a4b3a420
root [1] TTree *mytree = (TTree*)_file0->Get("DmpMCEvtNtup")
(TTree *) 0x5558a524c230
root [2] ROOT::RDataFrame _f(*mytree)

(ROOT::RDataFrame &) A data frame built on top of the DmpMCEvtNtup dataset.
root [3] 
root [3] auto histo = _f.Histo1D("NUD_total_ADC")
Error in <TRint::HandleTermInput()>: std::runtime_error caught: Unknown column: NUD_total_ADC

As you can see this is present in the Tree…
Thanks

Enrico

It’s probably because the name of the branch (“NUD_total_ADC”) is different from the name of the leaf (“nud_total_adc”). Could you share the file with me so I can debug and/or figure out a workaround?

Cheers,
Enrico

Sure, thanks a lot!

That’s the dropbox link (too big to attach here…) https://www.dropbox.com/s/pfl4e590idrr9l7/simu_result_998.root?dl=0

Thanks! Will take a look as soon as possible.

Hey,
so, quick workaround: _f.GetColumnNames() shows that "NUD_total_ADC" is not something that RDF recognizes, but "NUD_total_ADC.nud_total_adc" is.

~ root -l simu_result_998.root
root [0] ROOT::RDataFrame(*DmpMCEvtNtup).Histo1D("NUD_total_ADC.nud_total_adc")->DrawClone()

gets me the following plot:

looks reasonable?
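When a branch name is rejected, the list returned by GetColumnNames() can be searched for the full name RDF expects. The helper below is a hypothetical sketch (candidate_columns is not a ROOT function). It assumes, based on the one case seen in this thread, that the accepted name is either the branch name itself or the branch name followed by a dot and the leaf name.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of ROOT): from a list of column names,
// return those equal to `branch` or of the form "branch.<something>".
std::vector<std::string> candidate_columns(const std::vector<std::string> &columns,
                                           const std::string &branch)
{
    std::vector<std::string> matches;
    const std::string prefix = branch + ".";
    for (const auto &name : columns) {
        // compare(0, n, prefix) == 0 means `name` starts with `prefix`
        if (name == branch || name.compare(0, prefix.size(), prefix) == 0)
            matches.push_back(name);
    }
    return matches;
}
```

In the session above it could be called as candidate_columns(_f.GetColumnNames(), "NUD_total_ADC"), and the returned name passed to Histo1D.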

Yes, it works.
Thanks a lot for your workaround.
So it isn’t a good idea to use underscores in leaf names, right?
Anyway, thanks a lot for all your precious help.
Really enjoy using RDF!

Thanks,
Best

Enrico

Underscores are completely fine (any valid C++ variable name is a good leaf or branch name).
A branch named “NUD_total_ADC” whose leaf is named “nud_total_adc”, however, triggers the issue. I think this issue is an instance of https://sft.its.cern.ch/jira/browse/ROOT-9558.

Let us know if you encounter any further problems!
Cheers,
Enrico

Oh, I see!
Thanks, now it is clear.

Cheers,
Enrico