How can we mock data?

Hello,

Sometimes we need to patch our data. For example, we have a simulated sample for 2016 but not one for 2017. If we want to do a quick study and the 2016 simulation is a good enough stand-in for 2017, we would like to do something like:

df = df.mock(branch='year', original = '2016', new = '2017')

_process_data(df=df)

which would treat all the entries where the data is for 2016 as if it were for 2017. This would “fake” extending/patching the dataset. Is this possible? As far as I can see, the only way we can do this is:

import ROOT

# Duplicate the 2016 entries, relabelled as 2017, by writing them to disk
df_17 = df.Filter('year == 2016')
df_17 = df_17.Redefine('year', '2017')
df_17.Snapshot('tree', '2017.root')

# Load the original file together with the duplicated entries
df = ROOT.RDataFrame('tree', ['2017.root', 'original_file.root'])

In other words, duplicating the 2016 dataset by actually writing a ROOT file. But this is very inefficient, especially if the dataframe has thousands of columns.

ROOT Version: 6.32
Platform: linux
Compiler: gnu


I’m not sure I understand what you’re asking for, but maybe @vpadulan can take a look.

For instance, in a dataframe with 1000 entries we have a column called year, and there are 100 entries for which year == 2016. I need those entries to be duplicated with year changed to 2017, and the extra entries appended, so that I end up with a dataframe with 1100 entries.
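In plain Python terms, the intended transformation is the following (a toy illustration only; the value 2018 just stands for any other year):

years = [2016] * 100 + [2018] * 900         # 1000 entries, 100 of them 2016
extra = [2017 for y in years if y == 2016]  # the 100 duplicates, relabelled
years = years + extra                       # 1100 entries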

But writing an actual copy of the data should not be needed, because the fake 2017 data is already in the dataframe as 2016 entries; what is needed is some sort of mocking of these 2016 entries. The only way I can do this now is to filter the 2016 entries, redefine the year column, and save to disk, then load the file again and append it to the rest of the dataset, which is very inefficient.

OK, then let’s see if @vpadulan has a better solution for this.

Dear @curious_goose,

Thanks for reaching out to the forum!

You definitely do not need to Snapshot the data to disk for this case. But I don’t yet understand the use case: why not just process the 1000 entries first, and then in a second pass only the 100 entries with a Filter(…).Redefine(…)?
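A minimal sketch of that two-pass approach, reusing the tree and file names from the snippet earlier in the thread:

import ROOT

df = ROOT.RDataFrame('tree', 'original_file.root')

# First pass: the full dataset as recorded
_process_data(df=df)

# Second pass: only the 2016 entries, relabelled as 2017
df_mock = df.Filter('year == 2016').Redefine('year', '2017')
_process_data(df=df_mock)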

Cheers,

Vincenzo


@vpadulan

Processing one dataframe, then getting another and processing it again, works if you are using a small script to make one plot. When you develop a codebase that passes a dataframe all over the place to do different things, then for a specific dataset you no longer just call:

df = _get_data()
_do_something(df)

but

df = _get_data()
_do_something(df)

df = _get_data()
df = df.Filter('year == 2016')
df = df.Redefine('year', '2017')
_do_something(df)
This approach has two problems:
  1. You have to change the code in dozens of places.
  2. We cannot _do_something with only the mocked 2017 data; _do_something needs the full dataset at once, so the second call does not even make sense here.

Updating the codebase so that _get_data provides the extended dataframe is possible, and we did it, but it took hundreds of lines of code, partly because of friend trees that needed to be taken into account in the extension. It would have been much easier to just do:

df = df.mock(branch='year', original = '2016', new = '2017')

That would be a few lines of code injected somewhere before retrieving the dataframe. These dataframes do not live in little scripts; they live in large codebases that took years to build and that we want to touch as little as possible.
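For illustration, a hypothetical mock helper along those lines could be built as a round-trip through numpy with AsNumpy and ROOT.RDF.FromNumpy (available since ROOT 6.28). This is only a sketch, not part of the RDataFrame API; it assumes the year values are integers, that all columns hold fundamental types, and that the dataset fits in memory, which may not hold with thousands of columns:

import numpy as np
import ROOT

# Hypothetical helper, not an RDataFrame method: append a copy of the
# entries where `branch == original`, relabelled as `new`
def mock(df, branch, original, new):
    data = df.AsNumpy()  # {column name: numpy array}
    sel = data[branch] == original
    extra = {name: arr[sel] for name, arr in data.items()}  # boolean indexing copies
    extra[branch][:] = new
    return ROOT.RDF.FromNumpy(
        {name: np.concatenate([data[name], extra[name]]) for name in data})

df = _get_data()
df = mock(df, branch='year', original=2016, new=2017)
_do_something(df)

Note that the FromNumpy round-trip produces a new in-memory dataframe, so anything attached to the original trees (e.g. friend trees) would have to be included in the columns read by AsNumpy.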