RooDataSet with many columns and many entries crashes when being saved

Dear experts,

I am creating a RooDataSet from an RDataFrame with the RooDataSetHelper.

I am encountering the following error when writing the workspace to disk:

Fatal in <TBufferFile::WriteFastArray>: Not enough space left in the buffer (1GB limit). 4952997 elements is greater than the max left of 956626
aborting
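
For scale, here is a rough back-of-the-envelope estimate in plain Python (no ROOT needed). It is only a sketch under assumptions that are not confirmed anywhere in this thread: that each column is written as one contiguous array of 8-byte doubles, that the limit is exactly 2**30 bytes, and that bookkeeping overhead can be ignored. The element count is taken from the error message; the column count of 80 is an illustrative assumption, and the function names are made up for this example.

```python
# Rough estimate only: assumes 8-byte doubles, one array per column and a
# 2**30-byte buffer limit. ROOT's own bookkeeping overhead is ignored.
BUFFER_LIMIT_BYTES = 2**30
BYTES_PER_DOUBLE = 8


def columns_that_fit(n_entries, limit=BUFFER_LIMIT_BYTES):
    """Number of complete double columns of n_entries that fit in the buffer."""
    return limit // (n_entries * BYTES_PER_DOUBLE)


def total_size_bytes(n_entries, n_columns):
    """Raw payload size of n_columns double columns with n_entries each."""
    return n_entries * n_columns * BYTES_PER_DOUBLE


n_entries = 4952997  # element count quoted in the error message
n_columns = 80       # illustrative assumption

print("columns that fit:", columns_that_fit(n_entries))
print("payload / limit :", total_size_bytes(n_entries, n_columns) / BUFFER_LIMIT_BYTES)
```

Under these assumptions only a few dozen columns of that length fit into a single buffer, so a dataset with many more columns would be expected to overflow it.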

I saw a few threads with similar issues, but I was wondering whether there is a workaround.
For example, can one change the default storage type of RooDataSet to make it work?

Thanks in advance

Renato

Dear @rquaglia ,

The 1GB limit is currently a hard limit for TKeys in a TFile. I am not sure how the RooDataSet runs into this limitation; I could imagine trying to split it into multiple separate datasets. Maybe you could send an updated reproducer; meanwhile @jonas might have further ideas on how to work around that limitation.
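
A minimal sketch of the splitting idea, not tested against the actual dataset. `chunk_ranges` and `split_dataset` are hypothetical helper names. The sketch assumes that `RooAbsData.reduce` with `RooFit.EventRange(start, stop)` treats the upper bound as exclusive, which is worth verifying by checking that the `numEntries()` of the parts add up to the original.

```python
def chunk_ranges(n_entries, max_chunk):
    """Return half-open (start, stop) ranges that cover 0..n_entries."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [(start, min(start + max_chunk, n_entries))
            for start in range(0, n_entries, max_chunk)]


def split_dataset(dataset, max_chunk):
    """Split a RooDataSet into smaller datasets, one per event range."""
    import ROOT  # imported here so that chunk_ranges works without ROOT

    parts = []
    for start, stop in chunk_ranges(dataset.numEntries(), max_chunk):
        # Assumption: EventRange selects entries start <= i < stop
        part = dataset.reduce(ROOT.RooFit.EventRange(start, stop))
        part.SetName(f"{dataset.GetName()}_{start}_{stop}")
        parts.append(part)
    return parts


print(chunk_ranges(10, 4))
```

Each part would presumably have to be written under its own key (for example in its own workspace) for the split to help, since putting all parts back into one workspace would likely hit the same limit.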

Cheers,
Vincenzo

Dear @vpadulan , thanks for the reply,

I spent some time trying to figure out what was going on, and it turned out that it was sufficient to change the default storage type of RooAbsData to

    ROOT.RooAbsData.setDefaultStorageType(ROOT.RooAbsData.Tree)

However, when I do

    ws = ROOT.RooWorkspace("myws")
    ws.Import(dataset)
    ws.writeToFile("myfile.root", True)  # second argument: recreate the file

I am a bit puzzled as to why the final TFile contains both a TTree and a RooWorkspace.
Are the workspace and the TTree somehow ‘linked’ in the final storage? I.e., can the saved workspace be read regardless of whether the TTree added when saving is present?

As for the reproducer, I think the issue is that RooWorkspace.writeToFile saves a normal TKey when the storage type defaults to a vector, whereas the TTree data store type does not have the same limitation. Though it is still a bit unclear to me why both a TTree and a RooWorkspace are ultimately saved.

Here is an example that creates and saves a dataset and loads it back. When the dataset has too many entries and too many columns, the error shows up.

    import ROOT


    def save():
        # Small value for a quick test; the error only shows up with many
        # more entries.
        n_entries = 1000
        ROOT.EnableImplicitMT()
        df = ROOT.RDataFrame(n_entries)

        # Define many constant double columns in the RDataFrame
        ncols = 80
        for i in range(ncols):
            df = df.Define(f"double_col_{i}", "1.5")
        ROOT.RDF.Experimental.AddProgressBar(ROOT.RDF.AsRNode(df))

        # Use the TTree-backed data store instead of the default vector store
        ROOT.RooAbsData.setDefaultStorageType(ROOT.RooAbsData.Tree)

        # One RooRealVar per RDataFrame column
        vars_list = []
        vars_name = []
        for c in df.GetColumnNames():
            v = ROOT.RooRealVar(str(c), str(c), 0)
            v.setConstant(False)
            vars_list.append(v)
            vars_name.append(str(c))

        helper = ROOT.RooDataSetHelper("dataset", "Title of dataset", ROOT.RooArgSet(*vars_list))

        # Booking is lazy; the event loop runs on first access to the result
        roo_data_set_result = df.Book(ROOT.std.move(helper), vars_name)
        roo_data_set_result.Print()

        ws = ROOT.RooWorkspace("space", "space")
        ws.Import(roo_data_set_result.GetValue())
        ws.writeToFile("test.root", True)


    def load():
        # ROOT.RooAbsData.setDefaultStorageType(ROOT.RooAbsData.Tree)
        f = ROOT.TFile("test.root")
        ws = f.Get("space")
        # Return the file as well so that it stays open while the workspace is used
        return ws, f


    if __name__ == "__main__":
        save()
        wspace, filein = load()
        wspace.Print()
        print(wspace["dataset"].sumEntries())
        wspace["double_col_0"].Print()

        # Plot one column of the reloaded dataset
        v0 = wspace["double_col_0"]
        ds = wspace["dataset"]
        frame = v0.frame(ROOT.RooFit.Bins(10), ROOT.RooFit.Range(0, 4))
        ds.plotOn(frame)
        cc = ROOT.TCanvas()
        frame.Draw()
        cc.SaveAs("test.pdf")