RDataFrame opens multiple root files in random order even with DisableImplicitMT()

ROOT Version: 6.24/06
Platform: CentOS Linux release 7.9.2009
Compiler: gcc 4.8.5 20150623


Hello,

I’m using RDataFrame in two Python scripts. The first loads multiple ROOT files, calculates weights, and saves them in a .pkl file. The second opens the same ROOT files together with the .pkl file and applies the weights to the data stored in the ROOT files.
To test that everything works as expected, I check whether the weights and the data in the RDataFrame are in the same order. This is how I noticed that most of the time they are not. I also experimented with some print statements and confirmed that the files are loaded in a different order every time I run the script. I read that this can be caused by multithreading, so I disabled implicit multithreading, but the problem persists. If I load only one file, the issue does not appear.
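The kind of order check described above could look roughly like the following. This is only a sketch, not the code from the original scripts: the helper names `save_weights` and `load_weights`, the file names, and the weight values are all hypothetical. The idea is to store the ordered file list next to the weights, so that the second script can refuse to apply them if the files are given in a different order.

```python
import os
import pickle
import tempfile


def save_weights(path, files, weights):
    """Store the weights together with the ordered file list they belong to."""
    with open(path, "wb") as fh:
        pickle.dump({"files": list(files), "weights": list(weights)}, fh)


def load_weights(path, files, n_entries):
    """Load the weights, failing loudly if file order or entry count differs."""
    with open(path, "rb") as fh:
        payload = pickle.load(fh)
    if payload["files"] != list(files):
        raise ValueError("file order differs from the one used for the weights")
    if len(payload["weights"]) != n_entries:
        raise ValueError("number of weights does not match number of entries")
    return payload["weights"]


# Hypothetical file names and weights, for illustration only.
files = ["file1.root", "file2.root", "file3.root"]
weights = [0.5, 1.0, 1.5, 2.0]

pkl_path = os.path.join(tempfile.mkdtemp(), "weights.pkl")
save_weights(pkl_path, files, weights)

# Same file order as when the weights were computed: the weights are returned.
same_order = load_weights(pkl_path, files, n_entries=4)

# Different file order: the mismatch is reported instead of silently applied.
try:
    load_weights(pkl_path, list(reversed(files)), n_entries=4)
    order_mismatch_detected = False
except ValueError:
    order_mismatch_detected = True
```

A check like this only compares file names and entry counts; it does not prove that individual events line up, so it complements rather than replaces a comparison on the data itself.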

For reference, this is how I load the files into the RDataFrame:
branches = ['var1', 'var2', 'var3']
DisableImplicitMT()
df = RDataFrame(input_tree, {'file1.root', 'file2.root', 'file3.root'})
numpy_dict = df.AsNumpy(branches)

Is there some way to define the order in which the ROOT files are loaded into the RDataFrame? Or do I need to find another way to work around this issue?

Best regards,
Eva

Dear @eva ,

If you do not run with implicit multithreading, the files will be loaded exactly in the order that you specify in the input file list. DisableImplicitMT is not needed. Could you provide us with a full reproducer so that we can better understand your situation?

Cheers,
Vincenzo

Thanks for the quick answer!

Apparently I can’t attach a file to this post because I’m a new user, so I pasted the code directly into this reply; I hope that’s alright. This is a minimal version of the code that loads the files into the RDataFrame, reads them into a pandas DataFrame, and prints it. If I run this code twice, it prints the entries of the dataframe in a different order:

import string
import tensorflow as tf
print('Tensorflow', tf.__version__)
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import plot_model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
import os
import sys
import numpy as np
import pandas as pd
import copy
import pickle
from joblib import dump, load
from numpy.lib import recfunctions as rfn
from root_numpy import tree2array, array2root
from sklearn.metrics import confusion_matrix
from ROOT import RDataFrame
import matplotlib.pyplot as plt
import seaborn as sns
from ROOT import TH1F
import ROOT
from ROOT import DisableImplicitMT, EnableImplicitMT

branches = ['fjet_truth_dRmatched_particle_flavor', 'EventInfo_mcEventWeight']
input_tree = 'FlatSubstructureJetTree;1'
DisableImplicitMT()
df = RDataFrame(input_tree, {'801661.Flattener_v1_VanillaSD_tree.35488162._000001.tree.root',
    '801859.Flattener_v1_VanillaSD_tree.35442753._000001.tree.root',
    '364702.Flattener_v1_VanillaSD_tree.35488179._000001.tree.root',
    '364703.Flattener_v1_VanillaSD_tree.35488182._000001.tree.root',
    '364704.Flattener_v1_VanillaSD_tree.35488188._000001.tree.root',
    '364705.Flattener_v1_VanillaSD_tree.35488196._000001.tree.root',
    '364706.Flattener_v1_VanillaSD_tree.35488214._000001.tree.root',
    '364707.Flattener_v1_VanillaSD_tree.35488233._000001.tree.root',
    '364708.Flattener_v1_VanillaSD_tree.35488253._000001.tree.root',
    '364709.Flattener_v1_VanillaSD_tree.35488277._000001.tree.root',
    '364710.Flattener_v1_VanillaSD_tree.35488300._000001.tree.root',
    '364711.Flattener_v1_VanillaSD_tree.35488319._000001.tree.root',
    '364712.Flattener_v1_VanillaSD_tree.35488336._000001.tree.root'})
np_dict = df.AsNumpy(branches)
df_aMC = pd.DataFrame(np_dict)

print(df_aMC)

Cheers,
Eva

Dear @eva ,

Thanks, the code snippet you present doesn’t immediately show what the problem could be. And again, you should not write DisableImplicitMT() as it’s not needed.

For a full reproducer I would also need access to the files and see if I encounter the same problem. Would it be possible for you to share these files even privately with me?

Cheers,
Vincenzo

Dear @vpadulan ,

They are quite large, but sure. How can I share them with you?

Cheers,
Eva

Dear @Eva ,

For example, you could add them to your CERNBox quota and then share them with me. There are instructions on how to do so in the CERNBox Docs, under "Sharing and permissions".

Cheers,
Vincenzo

Dear @vpadulan ,

I shared the folder with my files and the Python script with you. Did you receive it?

Cheers,
Eva

Dear @eva ,

I can see the files. I will let you know about the insights I can gather by running your application.

Cheers,
Vincenzo

Dear @eva,

After a second look at your reproducer script, I might have found the culprit. I did not notice at first glance, but you are passing the input files to the RDataFrame constructor as a Python set (a set literal, written with curly brackets). This is probably the source of the unpredictable file order, since a Python set does not preserve insertion order. You can see this with a very simple snippet that prints the collection of files; repeated runs of the snippet will give a different file order:

files = {'35442753_000001.tree.root',
'35488179_000001.tree.root',
'35488182_000001.tree.root',
'35488188_000001.tree.root',
'35488196_000001.tree.root',
'35488214_000001.tree.root',
'35488233_000001.tree.root',
'35488253_000001.tree.root',
'35488277_000001.tree.root',
'35488300_000001.tree.root',
'35488319_000001.tree.root',
'35488336_000001.tree.root'}

print(files)

Try substituting the set with a list or tuple (i.e. with square brackets or parentheses), and let me know if that brings things back to normal. Apologies for noticing only now and making you go through the extra step of sharing the files.
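To make the difference between a set and a list visible without running a script by hand several times, the sketch below starts fresh interpreters with different values of the `PYTHONHASHSEED` environment variable. In Python 3, string hashes are randomized per process by default, which is why the iteration order of a set of strings can change from run to run, while a list always keeps the order in which it was written. The file names and the helper `iteration_order` are hypothetical and only serve the illustration.

```python
import os
import subprocess
import sys

# Hypothetical file names, for illustration only.
NAMES = ["a.root", "b.root", "c.root", "d.root",
         "e.root", "f.root", "g.root", "h.root"]

# Code run in the child interpreter: build the container, print its order.
SNIPPET = "names = {names!r}; print(','.join({container}(names)))"


def iteration_order(container, seed):
    """Return the iteration order seen by a fresh interpreter with this hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = SNIPPET.format(names=NAMES, container=container)
    result = subprocess.run([sys.executable, "-c", code], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


seeds = range(1, 11)
list_orders = {iteration_order("list", seed) for seed in seeds}
set_orders = {iteration_order("set", seed) for seed in seeds}

print("distinct orders seen with a list:", len(list_orders))
print("distinct orders seen with a set: ", len(set_orders))
```

Each seed stands in for one separate run of a script. The list yields a single order across all of them, whereas the set typically yields several.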

Cheers,
Vincenzo


Dear @vpadulan ,

yes, using square brackets has solved the issue. Thank you very much!
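For anyone who wants to guard against the same mistake, a small helper along these lines could be called before constructing the RDataFrame. It is a sketch under assumptions, not part of ROOT: the name `ordered_file_list` is made up, and dropping duplicate entries is an optional extra that the thread does not ask for.

```python
from collections.abc import Sequence


def ordered_file_list(files):
    """Return the file names as a list, rejecting unordered containers."""
    if isinstance(files, (set, frozenset)):
        raise TypeError("pass a list or tuple; a set does not preserve file order")
    if isinstance(files, str) or not isinstance(files, Sequence):
        raise TypeError("expected a list or tuple of file names")
    # dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
    return list(dict.fromkeys(files))


# Hypothetical file names, for illustration only.
files = ordered_file_list(["file1.root", "file2.root", "file1.root", "file3.root"])
print(files)

# The result could then be passed on, e.g. RDataFrame(input_tree, files).

try:
    ordered_file_list({"file1.root", "file2.root"})
    set_rejected = False
except TypeError:
    set_rejected = True
```

Raising an error for a set, rather than silently converting it with `list(...)`, is deliberate: the conversion would succeed but would keep whatever arbitrary order the set happened to have.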

Cheers,
Eva

