Memory management with RooWorkspace

Hi everybody,
I have a few general questions about memory management when working with RooWorkspaces. I am currently working with RooWorkspaces but cannot seem to really understand the memory consumption, or how to free the occupied memory properly. I would really like to understand the memory consumption better and keep it to the necessary minimum.
Maybe it is best if I describe what I would like to do in simple words: I have roughly 150 ROOT files, each about 33 MB in size (combined ~5 GB). Each of these files contains one RooWorkspace, which in turn contains 10 roughly equally large indexed RooDataSets, so each of these 10 RooDataSets is roughly 3.3 MB. Now I loop over the indices of the RooDataSets. For each index I open all 150 files, extract the 150 RooDataSets, add them together, do some manipulation on them, store the result, free up the memory, and then move on to the next index. Ideally, the memory consumption should never go much higher than roughly 150 × 3.3 MB, or ~500 MB. However, I seem to have trouble freeing the memory: my memory consumption keeps increasing with each index instead of resetting.

I tried looking for similar questions but didn’t find anything that helps me. Can you please help me? Below is a simplified version of my script.

The function to extract all the RooDataSets of a given index (please note that this is a simplified version, only to show what I do):

void AddDataToWorkspace(RooWorkspace * MetaWorkSpace, int index) {
  // I omit here how I get the name of the first file and a vector containing the names of all other files
  
  // Initialising data with the first file
  TFile * StandardFile = new TFile(first_file.c_str());  // "first_file" is the very first file out of the 150
  RooWorkspace * w_first = (RooWorkspace*) StandardFile->Get("w");
  RooDataSet * Data = (RooDataSet *) w_first->data(("Data_" + to_string(index)).c_str());
  delete StandardFile;

  // Adding all consecutive files
  TFile * ToBeAddedFile;
  RooWorkspace * w_to_be_added;
  RooDataSet * Data_ToBeAdded;
  for (std::vector<std::string>::iterator t = consecutive_files.begin();  // "consecutive_files" are the remaining 149 files
       t != consecutive_files.end(); ++t) {
    ToBeAddedFile = new TFile(t->c_str());
    w_to_be_added = (RooWorkspace *) ToBeAddedFile->Get("w");
    Data_ToBeAdded = (RooDataSet *) w_to_be_added->data(("Data_" + to_string(index)).c_str());
    Data->append(*Data_ToBeAdded);
  }
  // delete Data_ToBeAdded;
  delete w_to_be_added;  // can not delete an empty workspace! this will delete the container
  delete ToBeAddedFile;

  // Putting the histogram into the meta-workspace and returning it
  MetaWorkSpace->import(*Data);
  return;
}

As you can see, my idea is to have a “meta” workspace which will contain the combined dataset. I take the first file, initialise the RooDataSet from the corresponding dataset in there, and then loop through all the other files, appending the corresponding datasets. In the process I try to free the memory with:

delete StandardFile;
// and
delete w_to_be_added;
delete ToBeAddedFile;

Is this the correct way? I already have some other questions here as well. When I call “delete w_to_be_added”, does this free the memory of all constituents of the workspace? I realised, for example, that I cannot call “delete Data_ToBeAdded” first and then delete the workspace. Strangely, I also lose the RooDataSet “Data_ToBeAdded” when I delete the RooWorkspace “w_to_be_added”. So I am a bit confused: which object holds the memory exactly? The RooWorkspace? The constituents of the workspace? The file? And in what order should I delete them to free the memory?

Anyway, on to my main problem: in my naive understanding, after the call to “AddDataToWorkspace” returns, only the “MetaWorkspace” should hold anything, and all other memory occupied by a file, workspace or dataset from one of the 150 files should be free again. My whole script looks something like this (again a simplified version, for understanding only):

void Main() {
  // Create a MetaWorkspace to manage the project.
  RooWorkspace *wks = new RooWorkspace("myWS");
  
  // Initialise some container for the results and other stuff...
  
  // Loop through the index (of the RooDataSets)
  for (int nr_index = 0; nr_index <= 9; nr_index++) {
    // Add the combined RooDataSet to the MetaWorkspace via the previous function
    AddDataToWorkspace(wks, nr_index);
    RooDataSet * Data_temp = (RooDataSet *) wks->data(("Data_" + to_string(nr_index)).c_str());
    
    // Make a histogram out of the combined RooDataSet
    TH1D *Histo_temp = (TH1D *) Data_temp->createHistogram(Data_temp->GetTitle(), variable, Binning("something"));

    // Now I only need the histogram, so I would like to free up the memory of the combined RooDataSet
    // wks->Print("v");
    wks->RecursiveRemove(Data_temp);
    // wks->Print("v");
    
    // Do other stuff with the histogram, which should not take up a lot of memory...
    // Specifically, I create some RooAbsPdfs, build a model and do some fits and plots
    // -> eventually end this step in the loop
  }
}

So, I actually try to remove the whole RooDataSet again after I have made a histogram out of it:

// wks->Print("v");
wks->RecursiveRemove(Data_temp);
// wks->Print("v");

Is this the right way to free the memory of the RooDataSet? I only want to delete the dataset here, as I want to keep the MetaWorkspace. Calling “wks->Print(“v”)” confirms that “RecursiveRemove” removes the RooDataSet from my MetaWorkspace. So even though it seems like I always delete the largest memory consumer, my memory consumption keeps increasing with each iteration of the for loop in the main function. What am I doing wrong? Am I somehow deleting objects in the wrong way, or is there a hidden dependence in the RooWorkspace that keeps a copy?

Any help is much appreciated! Thanks already in advance :slight_smile:

Hi @astauffe, thanks for your question!

There are several things that you need to consider when implementing this correctly:

  • if you Get a RooWorkspace from a file, you have to delete it yourself afterwards (which you already did correctly)
  • if you import anything into a RooWorkspace, the object is cloned and the RooWorkspace owns the clone
  • if you remove an object from a workspace with RooWorkspace::RecursiveRemove, the object is not owned anymore by the workspace but it still exists! So you have the responsibility of deleting it after calling RecursiveRemove.

I have created a small standalone example that has no memory leaks and re-implements what you showed in your original post. The object names are just placeholders and it only uses one file, so you would have to adapt the code to fit your use case.

void addDataToWorkspace(RooWorkspace & metaWorkspace, int /*index*/) {

  // The filenames, workspace names and dataset names are just
  // placeholders in this example.
  

  // Pointer to the combined data, we use a smart pointer here so we
  // don't have to remember to call `delete`.
  std::unique_ptr<RooDataSet> data;

  for (int i = 0; i < 10; ++i) {

    // The TFile can be created on the stack
    TFile file("workspace.root");

    // We have to delete the RooWorkspace that is retrieved from the file,
    // so let's wrap it into a unique_ptr such that this happens
    // automatically.
    std::unique_ptr<RooWorkspace> workspace{
        file.Get<RooWorkspace>("workspace")
    };

    // Get a non-owning pointer to the dataset (don't delete this,
    // it is owned by the workspace).
    auto * dataToAdd = static_cast<RooDataSet*>(workspace->data("data"));

    if(i == 0) {
      // If this is the first file, we create our combined dataset by
      // cloning the dataset in the first file.
      data.reset(static_cast<RooDataSet*>(dataToAdd->Clone()));
    } else {
      // Otherwise, just append the data.
      data->append(*dataToAdd);
    }
  }

  // Putting the histogram into the meta-workspace.
  metaWorkspace.import(*data);
}

int main() {

  RooMsgService::instance().setGlobalKillBelow(RooFit::WARNING);

  RooWorkspace metaWorkspace("metaWorkspace");

  for (int i = 0; i < 1000; i++) {

    // Get the combined data into the meta-workspace.
    addDataToWorkspace(metaWorkspace, i);

    // Get a pointer to the combined data, for now the RooWorkspace
    // still owns the object.
    auto * data = static_cast<RooDataSet*>(metaWorkspace.data("data"));

    // Removing the object from the workspace doesn't mean the object
    // is deleted! It is only un-registered from the workspace.
    // This means deleting `data` is now our responsibility.
    metaWorkspace.RecursiveRemove(data);

    // Just to show you that the object still exists.
    data->Print();

    // Since we own `data` now, we have to delete it.
    delete data;
  }

  return 0;
}

By the way, why are you using this “meta workspace”? Would it not be sufficient if addDataToWorkspace just returned the unique_ptr to the merged data, and then you do whatever you like with it? That way, you would also not have to deal with the ownership subtleties of RooWorkspace. But okay, maybe it’s important for your use case that the data is in a RooWorkspace :slight_smile:
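For concreteness, that alternative could look something like the following sketch. It reuses the same placeholder names as the example above (“workspace.root”, “workspace”, “data”) and assumes the ROOT/RooFit headers are available; it is an illustration of the idea, not tested code.

```cpp
// Sketch: return the merged dataset instead of importing it into a
// workspace. The caller owns the result via unique_ptr, so no manual
// delete or RecursiveRemove bookkeeping is needed.
std::unique_ptr<RooDataSet> loadMergedData(int /*index*/) {
   std::unique_ptr<RooDataSet> data;

   for (int i = 0; i < 10; ++i) {
      // Placeholder file/object names, as in the example above.
      TFile file("workspace.root");
      std::unique_ptr<RooWorkspace> workspace{
          file.Get<RooWorkspace>("workspace")};

      // Non-owning pointer; the workspace owns this dataset.
      auto * dataToAdd = static_cast<RooDataSet*>(workspace->data("data"));

      if (!data) {
         // First file: start the combined dataset from a clone.
         data.reset(static_cast<RooDataSet*>(dataToAdd->Clone()));
      } else {
         data->append(*dataToAdd);
      }
   }

   // Freed automatically when the caller's unique_ptr goes out of scope.
   return data;
}
```

The caller can then histogram the data and simply let the `unique_ptr` go out of scope at the end of each loop iteration, or `import` it into a workspace in a separate line when that is actually wanted.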

I hope this helps! Have a nice week,
Jonas


Hi @jonas

Thank you very much for your detailed answer. Very much appreciated :slight_smile:

Now I understand the subtleties of RooWorkspaces and their memory management much better! Your answer covers all my questions really well. I implemented your code and it works like a charm.

And regarding the “meta workspace”: when I started writing this script I didn’t really know what the end product would be. I just knew that I would put information into a meta workspace along the way and in the end probably save the meta workspace to a single file. This could contain pdfs, histograms, datasets, etc. (of course, right now I just delete the datasets after importing, but later I might want to keep them in the meta workspace). As such, it just seemed nice to have a function that adds the combined dataset to the meta workspace, which in the main function is then just one line (but I wrongly assumed that I could also delete it again with one line :wink: ). Of course I could also return the unique_ptr as you suggested and then add it to the workspace in a second line if needed. Maybe that is more general, I don’t know. So that was the reasoning… But I think it was important to learn these subtleties anyway. Thanks again!

And have a nice week too,
Alex
