Crashing when running with large samples

Hello everyone,

I am currently working on an analysis and intend to use a BDT for it. However, when I run the script it gets stuck while building the dataset. It is quite a simple script, with a structure based on the tutorial that ships with ROOT. The code has the following structure:

auto signalTree  = input->Get<TTree>("Signal");
auto background  = input->Get<TTree>("Background");
auto outputFile = TFile::Open( outfileName, "RECREATE" );
auto factory = new TMVA::Factory( "TMVAClassification", outputFile,"!V:!Silent:Color:DrawProgressBar:Transformations=I;D;P;G,D:AnalysisType=Classification" );
auto dataloader=new TMVA::DataLoader("dataset");
dataloader->AddVariable( var, var_type);
dataloader->AddSignalTree    (signalTree);
dataloader->AddBackgroundTree(background);
dataloader->SetWeightExpression("weight");
dataloader->PrepareTrainingAndTestTree("", "nTrain_Signal=3000:nTrain_Background=3000:SplitMode=Random:!V"); 
factory->BookMethod( dataloader, TMVA::Types::kBDT, "BDT","!H:!V:NTrees=850:MinNodeSize=2.5%:MaxDepth=3:BoostType=AdaBoost:AdaBoostBeta=0.5:UseBaggedBoost:BaggedSampleFraction=0.1:SeparationType=GiniIndex:nCuts=20" );
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();
outputFile->Close();

I would appreciate any suggestions on how to train the BDT on this large sample of ~48M events.

Thank you.

Welcome to the ROOT Forum! I’m sure @moneta can help you with this.

Hi,

The code looks fine to me. Are you using just a single variable? With only one variable I don’t think a BDT is useful; you can simply apply a cut on that variable.
Also, you are calling dataloader->PrepareTrainingAndTestTree with only 3000 events per class for training. If your dataset contains 48M events, this means that all the remaining events are used for testing. I don’t think this is what you want. You should also specify the number of testing events, e.g. nTest_Signal=3000:nTest_Background=3000.
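As a minimal sketch of the suggestion above, the option string can be assembled with explicit training and testing sizes. The helper name makeSplitOptions is hypothetical (it is not part of TMVA); the option keys are the ones already used in this thread, and the value 3000 is only the example figure quoted above.

```cpp
#include <cassert>
#include <string>

// Build the option string for DataLoader::PrepareTrainingAndTestTree with
// explicit training AND testing sizes, so that the remaining events are not
// all assigned to the test sample.
// makeSplitOptions is a hypothetical helper, not part of TMVA.
std::string makeSplitOptions(long nTrainSig, long nTrainBkg,
                             long nTestSig, long nTestBkg)
{
    // Negative event counts make no sense here.
    assert(nTrainSig >= 0 && nTrainBkg >= 0 && nTestSig >= 0 && nTestBkg >= 0);

    return "nTrain_Signal=" + std::to_string(nTrainSig) +
           ":nTrain_Background=" + std::to_string(nTrainBkg) +
           ":nTest_Signal=" + std::to_string(nTestSig) +
           ":nTest_Background=" + std::to_string(nTestBkg) +
           ":SplitMode=Random:!V";
}

// Intended use inside the macro (needs ROOT, so shown only as a comment):
//   dataloader->PrepareTrainingAndTestTree(
//       "", makeSplitOptions(3000, 3000, 3000, 3000).c_str());
```

Keeping the four numbers as function arguments makes it easy to scale the training and test samples up or down together while checking how much memory the job needs.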

Then, if your script is crashing, please add at least the output log with the error you are getting. If that does not make the problem clear, we would need a full reproducer that we can run.

Cheers

Lorenzo

Hi,
Indeed, I am using more than one variable: 13, to be precise, plus the weight. The main issue is that it doesn’t give me an error or a log; it just stops running.

root [0] 
Processing BDTVH.C...
--- TMVAClassification       : Using file: Samples/SampleZ.root
DataSetInfo              : [dataset] : Added class "Signal"
                         : Add Tree Signal of type Signal with 39865 events
DataSetInfo              : [dataset] : Added class "Background"
                         : Add Tree Background of type Background with 48941269 events
Error
                         : Dataset[dataset] : Class index : 0  name : Signal
                         : Dataset[dataset] : Class index : 1  name : Background
Factory                  : Booking method: BDT
                         : 
                         : Rebuilding Dataset dataset
                         : Building event vectors for type 2 Signal
                         : Dataset[dataset] :  create input formulas for tree Signal
                         : Building event vectors for type 2 Background
                         : Dataset[dataset] :  create input formulas for tree Background
DataSetFactory           : [dataset] : Number of events in input trees
                         : 
                         : 

After that it just stops running and returns to the normal console. I also tried specifying the test numbers, and the output was the same.

I also forgot to mention that, while monitoring my computer’s resources, I found that the crash happens at the same moment the RAM fills up.

You probably cannot fit all your data in memory. How much RAM do you have on that machine? Try reducing the number of training and test events until they all fit in memory.

Lorenzo

I have 14 GB, as the integrated graphics take 2 GB. What I was trying to find is a way for the BDT to read the file sequentially rather than all at once, to prevent this from happening. Also, it only works when the data amount to about 12k events between the test and training samples, which is odd: Python seems to be able to load the whole file, so I would expect a C macro to be able to handle it too.

In TMVA, during the initialisation phase all the input data are copied into an internal data structure in memory. With ~50M events and 14 features, that should take less than 14 GB.
I would anyway try using fewer events (~10M) to see whether it works.
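The estimate above can be checked with some back-of-the-envelope arithmetic. The sketch below assumes each value is held as a 4-byte float and uses the event counts from the log (39865 signal + 48941269 background); under that assumption the raw values alone come to roughly 2.7 GB. The per-event overhead argument is a hypothetical placeholder for object and container bookkeeping, not a measured TMVA figure, so the result is only a rough lower bound on the real footprint.

```cpp
#include <cassert>
#include <cstdint>

// Raw size of the input values alone: one float per variable per event
// (assumption: values are held as 4-byte floats).
std::uint64_t rawPayloadBytes(std::uint64_t nEvents, std::uint64_t nVariables)
{
    return nEvents * nVariables * sizeof(float);
}

// Same estimate plus a fixed extra cost per event.
// overheadPerEvent is a hypothetical placeholder, not a measured TMVA number.
std::uint64_t estimatedBytes(std::uint64_t nEvents, std::uint64_t nVariables,
                             std::uint64_t overheadPerEvent)
{
    return rawPayloadBytes(nEvents, nVariables) + nEvents * overheadPerEvent;
}

// Largest number of events whose estimate still fits in a given memory budget.
std::uint64_t maxEventsForBudget(std::uint64_t budgetBytes,
                                 std::uint64_t nVariables,
                                 std::uint64_t overheadPerEvent)
{
    const std::uint64_t perEvent = nVariables * sizeof(float) + overheadPerEvent;
    assert(perEvent > 0); // avoid dividing by zero for an empty event layout
    return budgetBytes / perEvent;
}
```

Comparing such an estimate with the memory actually observed for a smaller run (for example the ~10M events suggested above) gives a feeling for how large the real per-event overhead is before scaling up.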

Lorenzo