Properly obtaining and comparing results from HyperParameterOptimisation

Dear Experts,

I am training a BDT with TMVA and would like to perform a hyperparameter optimisation. I would like to use the HyperParameterOptimisation class, which is not yet documented.
For the training I use ~1.2 million signal and ~1 million background events, so my initial aim was to optimise on 10% of those, since even using 1% already takes quite long (with the additional optimised parameters that I describe further below, it hasn’t finished after more than a day). For now, while I am setting things up, I stick to 1% in the examples below. The data is split 50/50 into train and test samples.
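For reference, the dataloader setup I refer to below is the standard one, roughly like the following minimal sketch (the tree and variable names are placeholders, and the event counts correspond to the 1% subset split 50/50):

        // sigTree and bkgTree are the input TTrees (placeholders here)
        TMVA::DataLoader *dataloader = new TMVA::DataLoader("dataset");
        dataloader->AddVariable("var1", 'F');   // placeholder variables
        dataloader->AddVariable("var2", 'F');
        dataloader->AddSignalTree(sigTree, 1.0);
        dataloader->AddBackgroundTree(bkgTree, 1.0);
        // ~1% of ~1.2M signal and ~1M background, split 50/50 into train and test
        dataloader->PrepareTrainingAndTestTree("",
            "nTrain_Signal=6000:nTest_Signal=6000:"
            "nTrain_Background=5000:nTest_Background=5000:"
            "SplitMode=Random:NormMode=NumEvents:!V");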

What I did initially (after setting up the dataloader in the usual way, as sketched above):

        TMVA::HyperParameterOptimisation * HPO = new TMVA::HyperParameterOptimisation(dataloader);
        // like in a snippet I saw somewhere
        HPO->BookMethod(TMVA::Types::kBDT, "BDT", "");
        
        std::cout << "Info: calling TMVA::HyperParameterOptimisation::Evaluate" << std::endl;
        HPO->Evaluate();
        TMVA::HyperParameterOptimisationResult HPOResult = HPO->GetResults();
        HPOResult.Print();

        HPO->SaveAs(Form("HPO_%s.root",datasetDirName.data())); 

        std::cout << "GetROCAverage: " << HPOResult.GetROCAverage();

        std::cout << "\nGetEff01Values:\n";
          for (auto i = HPOResult.GetEff01Values().begin(); i != HPOResult.GetEff01Values().end(); ++i)
            std::cout << ' ' << *i;

        // ... same printouts for GetEff30Values, GetEffAreaValues, GetROCValues etc.

        TFile *MyFile = new TFile(Form("HPOResult_%s.root", datasetDirName.data()),"RECREATE");
        TMultiGraph * t =   HPOResult.GetROCCurves();
        t->Write();
        MyFile->Close();
        delete MyFile;

Which gives me:

<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 1
                         : AdaBoostBeta     0.6
                         : MaxDepth     3
                         : MinNodeSize     15.5
                         : NTrees     207.533
<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 2
                         : AdaBoostBeta     0.6
                         : MaxDepth     2.08005
                         : MinNodeSize     15.5
                         : NTrees     514.316
<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 3
                         : AdaBoostBeta     0.596308
                         : MaxDepth     2.39675
                         : MinNodeSize     15.5
                         : NTrees     505
<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 4
                         : AdaBoostBeta     0.6
                         : MaxDepth     3.82036
                         : MinNodeSize     15.5
                         : NTrees     529.937
<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 5
                         : AdaBoostBeta     0.41429
                         : MaxDepth     2.75713
                         : MinNodeSize     15.5
                         : NTrees     505

As a next step, I added:

        // like in my actual training
        HPO->BookMethod("BDT", "BDTG", "H:V:NTrees=1000:BoostType=Grad:Shrinkage=0.20:UseBaggedBoost:BaggedSampleFraction=0.4:SeparationType=GiniIndex:nCuts=500:PruneMethod=NoPruning:MaxDepth=5"); 
        HPO->SetNumFolds(3);
        HPO->SetFitter("Minuit");
        HPO->SetFOMType("Separation");

And I redefined two functions of TMVA::MethodBDT to tune more parameters:

void TMVA::MethodBDT::SetTuneParameters(std::map<TString,Double_t> tuneParameters)
{
   std::map<TString,Double_t>::iterator it;
   for(it=tuneParameters.begin(); it!= tuneParameters.end(); it++){
      Log() << kWARNING << it->first << " = " << it->second << Endl;
      if (it->first ==  "MaxDepth"       ) SetMaxDepth        ((Int_t)it->second);
      else if (it->first ==  "nCuts"       ) {this->fNCuts = (Int_t)it->second;}
      else if (it->first ==  "MinNodeSize"    ) SetMinNodeSize     (it->second);
      else if (it->first ==  "NTrees"         ) SetNTrees          ((Int_t)it->second);
      else if (it->first ==  "NodePurityLimit") SetNodePurityLimit (it->second);
      else if (it->first ==  "AdaBoostBeta"   ) SetAdaBoostBeta    (it->second);
      else if (it->first ==  "Shrinkage"      ) SetShrinkage       (it->second);
      else if (it->first ==  "UseNvars"       ) SetUseNvars        ((Int_t)it->second);
      else if (it->first ==  "BaggedSampleFraction" ) SetBaggedSampleFraction (it->second);
      else Log() << kFATAL << " SetParameter for " << it->first << " not yet implemented " <<Endl;
   }
}

// see: https://root.cern.ch/root/html/src/TMVA__MethodBDT.cxx.html#v6inl%3E
std::map<TString,Double_t>  TMVA::MethodBDT::OptimizeTuningParameters(TString fomType, TString fitType)
{
   std::cout << "call the Optimzier with the set of paremeters and ranges that are meant to be tuned.\n";

   // fill all the tuning parameters that should be optimized into a map:
   std::map<TString,TMVA::Interval*> tuneParameters;
   std::map<TString, Double_t> tunedParameters;

   // note: the 3rd parameter of the Interval is the "number of bins", NOT the step size!
   //       the actual VALUES (at least for the scan, and presumably also for the GA) are
   //       always read from the middle of the bins. Hence the choice of intervals, e.g.
   //       for MaxDepth, so as to get nice integer values!

   // fill in reasonable ranges for the parameters to be optimised
   // (remember: the 3rd Interval argument is the number of bins, not the step size)
   tuneParameters.insert(std::pair<TString,Interval*>("NTrees",              new Interval(10, 1000, 5)));
   tuneParameters.insert(std::pair<TString,Interval*>("MaxDepth",            new Interval(2, 4, 3)));
   tuneParameters.insert(std::pair<TString,Interval*>("MinNodeSize",         new LogInterval(1, 30, 30)));
   tuneParameters.insert(std::pair<TString,Interval*>("NodePurityLimit",     new Interval(0.4, 0.6, 3)));
   tuneParameters.insert(std::pair<TString,Interval*>("BaggedSampleFraction",new Interval(0.4, 0.9, 6)));
   tuneParameters.insert(std::pair<TString,Interval*>("nCuts",               new Interval(20, 700, 10)));
   // method-specific parameters
   tuneParameters.insert(std::pair<TString,Interval*>("Shrinkage",           new Interval(0.05, 0.50, 5)));

   std::cout << " the following BDT parameters will be tuned on the respective *grid*\n";
   std::map<TString,TMVA::Interval*>::iterator it;
   for(it=tuneParameters.begin(); it!= tuneParameters.end(); it++)
   {
      std::cout <<  it->first << " ";
      (it->second)->Print(std::cout);
      std::cout << "\n";
   }

   OptimizeConfigParameters optimize(this, tuneParameters, fomType, fitType);
   tunedParameters = optimize.optimize();

   return tunedParameters;
}

The respective output looked like this:

<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 1
                         : BaggedSampleFraction     0.65
                         : MaxDepth     3
                         : MinNodeSize     15.5002
                         : NTrees     505
                         : NodePurityLimit     0.5
                         : Shrinkage     0.275
<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 2
                         : BaggedSampleFraction     0.65
                         : MaxDepth     2.99998
                         : MinNodeSize     15.5
                         : NTrees     393.523
                         : NodePurityLimit     0.5
                         : Shrinkage     0.275
<HEADER> HyperParameterOptimisa...: ===========================================================
                         : Optimisation for BDT fold 3
                         : BaggedSampleFraction     0.649979
                         : MaxDepth     3
                         : MinNodeSize     15.5
                         : NTrees     505
                         : NodePurityLimit     0.5
                         : Shrinkage     0.275

GetROCAverage: 0
GetEff01Values:
GetEff10Values:
GetEff30Values:
...

So I am a bit puzzled about how to interpret what I got:

1. Since I don’t actually use cross-validation, is it even correct to use the HPO module for tuning?
2. Why don’t I get any Get*Values printed, and why does the created root file contain no TMultiGraph? If this is expected, how do I compare the improvement with respect to the nominal options I use? I would like to see proof that the proposed values are actually better, but I don’t see the module saving the responses of each training, only the parameters (see L.102).
3. Should the average across folds correspond to the best parameters?
4. Can I expect the best parameters found on 1%/10% of the statistics to also be optimal on the full statistics? Are there any parameters I should freeze to the currently found values to make the optimisation of the tricky parameters faster on larger statistics (and which would those be, then)?

I would highly appreciate your input on any of those questions.

Best regards, Olena


HPO and CV are two different concepts, it’s completely OK to use the one without the other. What you should keep in mind though, is that for HPO you should keep three datasets: training, validation and testing. In TMVA we provide mechanisms to split into 2 sets: For HPO that would be training and validation. Once you have found your optimal parameters you would then have to apply those to an independent test set (you can use the old data for training though!).
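In code, the idea could look roughly like this (only a sketch with placeholder names; evtNum stands for any event-number branch you can cut on):

// Reserve an independent test sample up front, and let the dataloader's
// train/test split play the role of training/validation during the HPO step.
TCut hpoCut("evtNum % 5 != 0");   // ~80% of events: used for the HPO (train + valid)
TCut testCut("evtNum % 5 == 0");  // ~20% of events: kept aside for the final, independent test

dataloader->PrepareTrainingAndTestTree(hpoCut, "SplitMode=Random:NormMode=NumEvents:!V");
// run the HPO on this dataloader; testCut is only used later, for the final evaluation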

Additionally the TMVA HPO is designed for uncertainty estimation. If you want “simple” HPO, run it with only one fold! The results will then be the best parameter set for your data.

The missing outputs are a problem! What you need is the figure of merit for each fold.

I think this is a bit of a tricky question and I don’t have a definite answer. The easy approach would be to just go with the best one (and maybe just use 1 fold). The average would work I think, but can be a bad estimator in general (if e.g. there are 2 optimal HPO sets, the average could then end up in a relatively bad spot).

This is definitely an approach. The trade-off is that you might miss the global optimum. I am not aware of any general guidelines here, and I expect it to vary case-by-case.

For additional insight, maybe @moneta could help?

Cheers,
Kim

Dear @kialbert , thank you for your detailed reply.

HPO and CV are two different concepts, it’s completely OK to use the one without the other. What you should keep in mind though, is that for HPO you should keep three datasets: training, validation and testing. In TMVA we provide mechanisms to split into 2 sets: For HPO that would be training and validation. Once you have found your optimal parameters you would then have to apply those to an independent test set (you can use the old data for training though!).

Additionally the TMVA HPO is designed for uncertainty estimation. If you want “simple” HPO, run it with only one fold! The results will then be the best parameter set for your data.

I am a bit puzzled, though, about what you mean by the “old data for training” and how to split the data. I also found that running the HPO is not possible with SetNumFolds(1), so should I manually create a subset of the data on which I perform the HPO with 2 folds, and then check on the remaining data which of the 2 parameter sets gives the best ROC by running the training manually?

As you might have noticed in the example with 3 folds, the second fold found a very different value for the NTrees parameter. Does that mean I can consider the parameters that agree across folds to be the best ones, determined with quite low uncertainty, and therefore leave only NTrees floating in the next optimisation to find its best value? I am also wondering whether it made sense on my side to add a search for the optimal nCuts, since I just found in the TMVA manual that “a truly optimal cut, given the training sample, is determined by setting” nCuts=-1. But in that case, would the HPO still be valid, or would setting nCuts=-1 interfere with the grid search?

The missing outputs are a problem! What you need is the figure of merit for each fold.

Do you mean that the missing output is a problem in general, because I do need a figure of merit (e.g. a ROC curve, as the name suggests), or that the TMVA module was actually supposed to return something and the fact that it didn’t means there is a problem in TMVA? I don’t see the HyperParameterOptimisationResult fields being filled at any point.

I think this is a bit of a tricky question and I don’t have a definite answer. The easy approach would be to just go with the best one (and maybe just use 1 fold). The average would work I think, but can be a bad estimator in general (if e.g. there are 2 optimal HPO sets, the average could then end up in a relatively bad spot).

Without a plot, though, one cannot yet judge which set is the best.

Cheers, Olena

I was trying to say: split the data into 3 parts: a train set, a validation set and a test set. For the HPO, use train + valid. For the final evaluation of performance, use (train + valid) as a new train set and evaluate on the test set.

This is currently tricky to set up inside of TMVA, so I would recommend a 2-step procedure where you first find the best parameters and then evaluate them on an independent test set (see below).

Looking a bit deeper into the TMVA HyperParameterOptimisation class, I cannot currently recommend its use (because of the missing outputs and other reasons). However, some of the functionality can be replicated without it, so I would recommend trying the following:

TString factoryOptions = "AnalysisType=Classification";
TMVA::Factory factory{"<name of factory>", factoryOptions};
TMVA::DataLoader dataloader{"dataset"};

// Set up dataloader ...

TString methodOptions = "BoostType=Grad:<your other options here>";
auto method = factory.BookMethod(&dataloader, TMVA::Types::kBDT,
                                 "<classifier name>", methodOptions);
method->OptimizeTuningParameters();

This will print the best parameters found. You can then use these parameters in a separate script to evaluate the actual performance of this parameter set. Just make sure that the data in this test set is completely independent from the data used in the HPO.
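For the second step, something along these lines could work (again only a sketch; the file name and the tuned parameter values are placeholders):

// Retrain with the tuned options and evaluate on the independent test set.
TFile *outputFile = TFile::Open("TMVA_tuned.root", "RECREATE");
TMVA::Factory factory2{"<name of factory>", outputFile, "AnalysisType=Classification"};
TMVA::DataLoader evalLoader{"dataset_eval"};

// Set up evalLoader so that its training events are the ones used in the HPO
// and its test events are the held-out, independent ones ...

TString tunedOptions = "BoostType=Grad:NTrees=500:MaxDepth=3:MinNodeSize=15%:Shrinkage=0.275";
factory2.BookMethod(&evalLoader, TMVA::Types::kBDT, "BDT_tuned", tunedOptions);
factory2.TrainAllMethods();
factory2.TestAllMethods();
factory2.EvaluateAllMethods();   // prints the ROC integral etc. for the tuned parameter set
outputFile->Close();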

Currently the grid search does not include the nCuts parameter.

Cheers,
Kim

Dear Kim,

Currently the grid search does not include the nCuts parameter.

Yes, but one can redefine the SetTuneParameters and OptimizeTuningParameters methods (as in my first post) to include other parameters to tune. And do you know whether the grid search is implemented in such a way that each minimisation step depends on the already calculated points, or whether the best fit is rather estimated after all grid points have been calculated? I am wondering if there is an option to parallelise this at least a bit.

So far I have observed that, when I run the optimiser, the tuned parameters it finds perform worse than my initial ones, e.g. I got a ROC of 0.957 while the initial parameters gave 0.963. This time I ran with FitGA and ROCIntegral (since FitGA is the default, as far as I can see, and the ROC is actually what I am interested in). Could it be related to ROCIntegral being calculated differently when optimising? (To be honest, I read that somewhere on the forum but am not sure it is still the case.)
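For reference, this is roughly how I compare the two option sets (a sketch only; it assumes both the nominal and the tuned method are booked in the same factory, and that Factory::GetROCIntegral is available in this ROOT version):

        factory->BookMethod(dataloader, TMVA::Types::kBDT, "BDT_nominal", nominalOptions);
        factory->BookMethod(dataloader, TMVA::Types::kBDT, "BDT_tuned",   tunedOptions);
        factory->TrainAllMethods();
        factory->TestAllMethods();
        factory->EvaluateAllMethods();
        std::cout << "ROC nominal: " << factory->GetROCIntegral(dataloader, "BDT_nominal")
                  << "  ROC tuned: " << factory->GetROCIntegral(dataloader, "BDT_tuned") << std::endl;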

In the second training that I am tuning, the optimiser ended up suggesting the edge values of the intervals (and again this gave a worse ROC value). Does that mean the best values might actually lie outside the intervals?

Cheers, Olena

Ah, sorry. Missed this!

I think this is either a random-point-based search (fitType="Scan"), a genetic algorithm (fitType="FitGA"), or Minuit. In the scan the evaluations should be independent; in the GA they should be dependent, and for Minuit I do not know. Also, I do not know of a way to easily parallelise the optimisation process (with TMVA) :confused:
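For reference, the figure of merit and the fit type can be chosen directly in the call, using the same strings as above (just a sketch):

// "Scan" evaluates the grid points independently of each other;
// "FitGA" (the genetic algorithm) depends on earlier evaluations.
method->OptimizeTuningParameters("ROCIntegral", "Scan");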

This probably means the search did not try your initial parameters. The discrepancies in the ROC calculations should have been fixed since at least a year or so :slight_smile:

It indicates this could be the case, but it could also indicate undersampling of the original region. Of note: larger NTrees is probably correlated with better performance over a large range of the parameter. If this is limited to, say, 500, I would not at all be surprised to find the “optimal value” at the edge of the search space.

Cheers,
Kim