Retroactively handling multiple Class versions in ROOT files

marc1uk · April 1, 2019, 10:55am

Hi ROOTTalk.
I’m trying to process a number of sets of ROOT files in a processing framework. The problem I face is that the different sets of files contain custom classes of various different versions. Unfortunately, the ClassDef specifications of these classes were not maintained.

I have the source code used to generate all file sets, and therefore can generate appropriate libraries to read them. However, currently I need to check out the appropriate headers and build the appropriate library before attempting to read a given file set. Wherever my processing tools have calls to methods defined only in newer versions of the files, that source code also needs to change and be rebuilt accordingly. This is of course very cumbersome, so I’m looking for a way around it.
I’ve looked into the manual schema evolution section of the ROOT guide, but am not sure whether what’s there can help, or if so, how.

Since I have access to all the old source code, what I could do would be to define in the library multiple classes, one for each version. That would leave two difficulties:

1. finding which of the class definitions in memory match the class definitions in file
2. getting ROOT to use these classes, despite their different name.

e.g., if a file was written with version 2 of MyClass, I would have in my library a MyClass_v2 with a matching definition, and would like ROOT to recognise that the TTree contains a "MyClass_v2", despite it having been called "MyClass" at time of writing.

Can anyone make any suggestions on these points? Point 1 is feasibly something I could specify manually via a config file (not ideal, as it requires the user to know the correct version, but possible). But on point 2 I am completely stuck.

Or perhaps there is an alternative way of achieving this goal?

Many thanks in advance for any suggestions

ROOT Version: 6.06/08
Platform: linux
Compiler: 4.9.2

pcanal · April 1, 2019, 4:43pm

That is ‘easy’: add to the linkdef file:

#pragma read sourceClass="MyClass" versions="2" targetClass="MyClass_v2";

but if I understood correctly the problem is that you have files which are marked with the same version for the same class but have different layouts … in this case, the above won’t help that much …

How many different class layout/versions do you have? How much existing data (how many files) do you have (i.e. is a conversion step feasible)?

Cheers,
Philippe.

marc1uk · April 1, 2019, 5:17pm

Hi Philippe, thanks for the assistance.
You’re right that the linkdef is close but not sufficient. I would need something that would achieve this at run/read time, rather than at compile time, i.e. I need to be able to select whether MyClass should map to MyClass_v2, or MyClass_v3 etc. at runtime.

I have considered conversion as a possibility, each file set worth keeping is several thousand files with perhaps a thousand events each. It would not be a trivial amount of work to perform conversion of all the necessary files. I wondered if perhaps there was a more straightforward solution. Just to tell ROOT: fill the object! Don’t worry about the class name!

In the event that there isn’t, would it be possible to convert the files without involving a class renaming process? It seems I would need to read the files with an old definition of MyClass in memory, but could not then construct objects of the updated MyClass type.

Wile_E_Coyote · April 1, 2019, 5:26pm

TFile::MakeProject

marc1uk · April 1, 2019, 5:40pm

Hi Wile_E_Coyote,
Thanks for the suggestion, this is also something i’ve seen before but decided didn’t meet my needs. The trouble is while this creates the necessities to read the file members, it does not load the class methods defined in the original class - e.g., MyClass has a GetTrigger method, but this isn’t present when the library is generated by MakeProject

pcanal · April 1, 2019, 5:47pm

Both are actually options. For example you could have:

#pragma read sourceClass="MyClass" versions="2" targetClass="MyClass_v2_2015";
#pragma read sourceClass="MyClass" versions="2" targetClass="MyClass_v2_2016";
#pragma read sourceClass="MyClass" versions="2" targetClass="MyClass_v2_2017";

Corrected to add necessary trailing semi-colon

In which case, you would be able to do (for example):

MyClass_v2_2015 *obj_2015 = nullptr;
MyClass_v2_2016 *obj_2016 = nullptr;
MyClass_v2_2017 *obj_2017 = nullptr;

if ( is_2015(file) ) {
   tree->SetBranchAddress(branchname, &obj_2015);
} else if ( is_2016(file) ) {
   tree->SetBranchAddress(branchname, &obj_2016);
} else if ( is_2017(file) ) {
   tree->SetBranchAddress(branchname, &obj_2017);
}

The conversion could actually be relatively simple. If all you need is automatic conversion and you don’t have any custom objects (that don’t have a merge operation defined), you could simply do the following with your current release.

for file in *.root
do
   hadd -O -f refreshed_$file $file
done

The -O is to prevent the use of ‘fast cloning’ and instead read the objects into memory and rewrite them (in the new format) … [Hmm, depending on the type of change in the layout, you might actually have to write a small script to do it properly (i.e. for example I am not sure this will properly add new members)]

Cheers,
Philippe.

marc1uk · April 1, 2019, 6:23pm

Hi Philippe,
I hadn’t realised that it would be possible to perform multiple mappings of the same class using the linkdef option: that certainly makes that a possibility now.

The icing on the cake, in that case, would be whether one could determine whether the file contains MyClass_v2_2015, MyClass_v2_2016 or MyClass_v2_2017 objects without needing to specify them manually. It seems like this should be possible - if one tries to use a different class with the same name, ROOT is definitely able to report it. It seems like a trial-and-error method would be one way, although perhaps not the most elegant.

Before getting too excited though, I gave this a try a moment ago (so not yet thoroughly investigated) and it seems like the target (and presumably therefore source, code etc.) members of the linkdef rule are mandatory, despite the line essentially representing a simple renaming of the class. In that case, effectively a copy constructor would need to be written in the code line, which could get rather extended when nested custom classes are also changed - i.e. the code to convert MyClass to MyClass_updated would also need to convert the member variables MySubClass to MySubClass_updated etc…

pcanal · April 1, 2019, 6:45pm

I am surprised that they are. See (roottest/root/io/datamodelevolution/stl/readFile.C at be467f624cb2ba613baf78d67475fe7b898bae56 · root-project/roottest · GitHub).

And if they are, they can be left as an ‘empty’ string.

pcanal · April 1, 2019, 6:47pm

You can detect the layout by looking at:

auto si = (TStreamerInfo*)file->GetStreamerInfoCache()->FindObject("MyClass");
if ( si->GetCheckSum() == value_for_2015) {
    is_2015 = true;
...

The CheckSum is a unique identifier of the layout.

marc1uk · April 1, 2019, 8:07pm

Ah ok, if i specify a version in the rule (which isn’t necessary for me) I got the following output:

WARNING: IO rule for class MyClass_new - required parameter is missing: target
The following rule has been omitted:
   read sourceClass="MyClass" versions="1" targetClass="MyClass_new"

Omitting the version works, however. Great!

I gave the checksum comparison a try; I generated a file with the current library, then rebuilt a new library using the exact same files, only I changed the name of MyClass to MyClass_new, and i added the additional mapping line to the linkdef file.

However, it seems that when i do
gROOT->GetClass("MyClass_new")->GetCheckSum()
and compare with

TStreamerInfo* sis=(TStreamerInfo*)_file0->GetStreamerInfoCache()->FindObject("MyClass");
sis->GetCheckSum()

the results are different, even though the content (members and methods) have not changed?

pcanal · April 1, 2019, 8:47pm

Right … we are indeed not supporting ‘partial’ renaming …

only I changed the name

right … the name of the class is part of the checksum … but good news this is not what you need.

Instead you simply need to open an old file (for each of the old layouts) and get the checksum for ‘MyClass’ there.
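
Once those checksums have been collected, turning them into a layout decision could look roughly like the sketch below. It is plain C++ so that it stands alone; the Layout tags and the checksum constants are placeholders rather than real values, and the real keys would be whatever GetCheckSum() reports for "MyClass" in one representative file of each set.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical tags for the on-file layouts that all carry the name "MyClass".
enum class Layout { Unknown, v2_2015, v2_2016, v2_2017 };

// Map a StreamerInfo checksum to a layout tag.
// The keys are placeholders: replace them with the values obtained from
// one old file per layout.
inline Layout LayoutFromCheckSum(std::uint32_t checksum) {
    static const std::unordered_map<std::uint32_t, Layout> known = {
        {0x11111111u, Layout::v2_2015},
        {0x22222222u, Layout::v2_2016},
        {0x33333333u, Layout::v2_2017},
    };
    const auto it = known.find(checksum);
    // An unrecognised checksum is reported rather than guessed at.
    return it == known.end() ? Layout::Unknown : it->second;
}
```

The is_2015(file)-style tests used when calling SetBranchAddress could then be reduced to comparing the returned tag, with Layout::Unknown treated as an error.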

marc1uk · April 1, 2019, 9:43pm

Of course, that would work.
Brilliant, thanks so much for the help Philippe

marc1uk · April 9, 2019, 6:16pm

This took some time to implement, for a variety of reasons - making a class hierarchy, wrappers to handle the missing methods etc.
Unfortunately now that i’ve tried it, i’ve run into the same issue as this thread: Renamed class and error "trying to read an emulated class"
but only for classes which contain TObjArrays of other renamed classes.
I have the corresponding lines for all renamed classes in my linkdef file, and everything works without any problems for renamed classes that don’t contain TObjArrays.
But for renamed classes that contain TObjArrays of another class, when I try to retrieve an entry from the branch, I get messages like the following:

Error in <TBufferFile::ReadObject>: trying to read an emulated class (MyOtherClass) to store in a compiled pointer (TObject)

To clarify, the situation is like this:

class MyClassA{
private:
TObjArray* theEvents;
...
public:
MyClassB* GetEvent(int i){ return (MyClassB*)theEvents->At(i); }
};

class MyClassB{
...
};

now I have renamed MyClassA to MyClassA_ver0 and MyClassB to MyClassB_ver0.
I can instantiate and use both new classes, and I can set the address of a tree branch holding a MyClassA object to the address of a pointer to a MyClassA_ver0 object. So far so good. But on trying to do retrieval, i receive ReadObject errors about MyClassB.

Is there something extra that needs to be done to support TObjArrays?

marc1uk · April 15, 2019, 9:50am

Keeping the thread alive: any ideas on this?

pcanal · April 17, 2019, 6:40pm

If having the class MyClassB still around and with a dictionary is not a good option (and it sounds like it isn’t in your case) then the next ‘best’ solution is to hijack TObjArray, i.e. create a new class TObjArraySilentStreaming that inherits from TObjArray and has a custom streamer. In the custom streamer you would copy-paste the content of TObjArray::Streamer and replace the call to

obj = (TObject*) b.ReadObjectAny(TObject::Class());

with

static TClassRef MyClassB_ver0_cl("MyClassB_ver0");
obj = (TObject*) b.ReadObjectAny(MyClassB_ver0_cl);

and add an alias/renaming rule from TObjArray to TObjArraySilentStreaming.

If you need access to the TFile to determine which version to use, you can use b.GetParent()

Cheers,
Philippe.

marc1uk · April 17, 2019, 7:07pm

Thanks for the suggestion Philippe. Before I commit further work to implement this, I’d like to double check if this will work. Sadly the situation gets more complicated. Without wanting to keep moving the goalposts, there are multiple layers of nested classes here. So I should have been more thorough in explaining that I have, for example:

MyClassA that contains a TObjArray of MyClassB
MyClassB which itself contains five TClonesArrays of different classes (MyClassC, MyClassD… each of which is at least simple).

So my implementation would then need to make a custom MyObjArrayClassA_v0 class, with a streamer that specifies the objects contained in the TObjArray member of MyClassA_v0 are of type MyClassB_v0.
Of course, I would also have files where the contained objects are of version MyClassB_ver1, etc. So I could have several such MyObjArrayClassA_vx classes, each with a
#pragma read sourceClass="TObjArray" targetClass="MyObjArrayClassA_vx";
line in the linkdef.

Similarly for the nested classes within MyClassB, I would have a MyTClonesArrayB_v0 class with a streamer that specified the TClonesArray in MyClassB_v0 should contain objects of type MyClassC_v0.

And, through the magic of ROOT, the appropriate versions in each case would be used? At the top level, for example, I set the address of a branch containing a MyClassA object to the address of a pointer to a MyClassA_v0 object. The class MyClassA is itself not defined, but there are multiple #pragma lines that link it to the various MyClassA_v0, MyClassA_v1 classes, and somehow the correct streamer is used to read the object from disk.
Can i suppose this same magic is possible with the TObjArray/TClonesArrays? I hope very much so, otherwise I fear this may have been a long winded exercise in futility!

pcanal · April 17, 2019, 7:15pm

I don’t think that is necessary: a single one (per class, supporting multiple versions) should do, relying on b.GetParent() to detect which file (version) you are reading and then passing the ‘right’ TClass.
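
As a rough sketch of the "passing the right TClass" step, a single streamer could build the name of the renamed class from the version it has detected and look the class up by that name. The helper below is hypothetical, and the "_ver<N>" suffix is only the naming convention assumed here.

```cpp
#include <string>

// Build the name of the renamed in-memory class for a given layout version,
// e.g. ("MyClassB", 0) gives "MyClassB_ver0".
// The "_ver<N>" suffix is an assumed naming convention, not a ROOT rule.
inline std::string RenamedClassName(const std::string& onFileName, int version) {
    return onFileName + "_ver" + std::to_string(version);
}
```

Inside the custom Streamer the resulting string could be handed to TClass::GetClass(...), and that TClass passed to b.ReadObjectAny(...) in place of a fixed TClassRef, with the version number derived from the file returned by b.GetParent().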

Overloading TClonesArray may not be necessary, depending on whether they are split or not.

If they are split, then setting the TClonesArray inner type after the creation of the outer-object but before calling SetBranchAddress might be enough.

If they are not split, then a single derived TClonesArray should be enough, since the TClonesArray knows its content type and can then use the right conversion/alternate.

Cheers,
Philippe.

marc1uk · April 17, 2019, 7:51pm

OK, this is sounding much more positive. the TClonesArrays are not split. So in this case I would only need:

a single MyObjArrayMyClassA class, with an overridden streamer. The overridden streamer would then need to determine which specific class was contained in the TObjArray, and pass an appropriate TClassRef. For determining the class contained, it could use b.GetParent() to obtain a pointer to the TObject … which you say gives access to the TFile… I can’t see how I can go from TObject to TFile, but if the returned TObject is a MyClassA_vx object, then yes I can use that to determine the correct type.
a single overridden MyTClonesArray class? I’m not sure what you mean on this; if the TClonesArray knows its content type, what derived class is needed? I presume if I need one derived class, I need to override the streamer; does it also need to determine its contained class via its parent TObject?

pcanal · April 17, 2019, 7:59pm

I can’t see how i can go from TObject to TFile ,

dynamic_cast<TFile*>( b.GetParent() );

The TClonesArray’s Streamer reads the ‘content’ type from the file and writes that into the in-memory TClonesArray (i.e. it will always go back to being ‘MyClassB’), so you would need a class derived from TClonesArray that overrides the ::Streamer and changes that behavior.

Now, there is likely another alternative … maybe … instead of using a class derived from TClonesArray, it might be enough to replace the type of the data members (that are currently TClonesArray) by std::vector<MyClassB_ver0>.
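
For illustration only, the in-memory side of that idea might look like the sketch below. The member names, the AddB helper and the stand-in MyClassB_ver0 are assumptions made up for this example; the ROOT dictionary pieces (ClassDef, linkdef entries) are left out, and whether the I/O layer fills the vector from an on-file TClonesArray is exactly the ‘maybe’ above, so it is not shown or tested here.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the renamed element class; the real one would carry the
// original data members and methods.
struct MyClassB_ver0 {
    int fTrigger = 0;  // hypothetical member, for illustration only
};

class MyClassA_ver0 {
public:
    // Number of stored elements.
    std::size_t GetNumB() const { return fBs.size(); }

    // Bounds-checked access: returns nullptr instead of reading out of range.
    MyClassB_ver0* GetB(std::size_t i) {
        return i < fBs.size() ? &fBs[i] : nullptr;
    }

    // Only needed when filling the object by hand (e.g. in a unit test).
    void AddB(const MyClassB_ver0& b) { fBs.push_back(b); }

private:
    std::vector<MyClassB_ver0> fBs;  // would take the place of the TClonesArray member
};
```

Because the vector holds typed elements, the getter no longer needs the cast from TObject* that a TClonesArray-based getter relies on.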

Cheers,
Philippe.

marc1uk · April 17, 2019, 8:46pm

Ah so the returned Parent is a TFile pointer, that works too.

If the TClonesArray streamer is able to read its contents as the appropriate contained class, it seems like that could work without any extra effort? After all, when retrieving elements from the TClonesArray the parent class is casting them to the correct pointer type anyway.
i.e. in MyClassA_v1 there is a Getter which is doing:
MyClassB_v1* GetB(Int_t i){ return (MyClassB_v1*)fClonesArray[i]; }

So the TClonesArray::Streamer populates fClonesArray with ‘MyClassB’ objects, but of course ‘MyClassB’ is in fact ‘MyClassB_v1’ (simply by a different name) so the cast will work fine. So the question is simply whether ROOT will complain about creating a TClonesArray of objects of a class with no in-memory definition…

If that isn’t okay and an overridden Streamer is needed… I suppose I could have a Streamer which modifies the line TClass *cl = TClass::GetClass(classv) and instead determines what type of object it should be reading, in the same manner as the modified TObjArray.

If I understand correctly the difference here is that TObjArray only allows access to its owning parent file to determine the contained class, while knowing nothing about what class type it contains. The TClonesArray knows what type it should be storing (presumably classv is the name of the class).
I don’t see how this leads to needing one derived TObjArray per class, and only one TClonesArray, though. In both cases I’d be bypassing the TClass reference derivation and determining the type myself. And especially in the case of TObjArray, the array itself has no information about what type of object it contains - so why would I define multiple TObjArray overrides?

The vector solution seems much simpler, so i’ll certainly give that a try first.