Automatic schema evolution and change of member names

Nicola_Mori · October 7, 2020, 3:48pm

Can the ROOT automatic schema evolution handle a change in the name of a member variable? For example, if I have:

class C{
  int wrongName;
  ClassDef(C,1)
};

I write an object to a data file, and then realize that I’d be better off with:

class C{
  int rightName;
  ClassDef(C,2)
};

will I be able to read the correct values from the data file written using version 1 of C? I did some quick tests and it seems that the answer is no, but I’d like to hear the words of an expert (maybe @pcanal?).
Thanks.

pcanal · October 7, 2020, 4:09pm

Well, the “automatic” part is missing information, i.e. is “rightName” a brand new, unrelated data member or a rename? To add the missing information you can add the following line to your linkdef file (it can also be done in a selection .xml file):

#pragma read sourceClass="C" targetClass="C" source="int wrongName" target="rightName";

Nicola_Mori · October 7, 2020, 4:27pm

It doesn’t seem to work for me. I’m trying to understand if I’m doing something wrong, but the pragma looks strange to me:

source="int wrongName" target="rightName"

Is it correct to specify int just for wrongName?

Nicola_Mori · October 7, 2020, 4:47pm

So it works both with:

source="int wrongName" target="rightName"

and with:

source="int wrongName" target="rightName"

but I had to remove the trailing + from:

#pragma link C++ class C+;

Is this expected? Does it have side effects? I found here that adding the +

    in ROOT version 1 and 2 tells rootcling to generate a Streamer with extra byte count information. 
This adds an integer to each object in the output buffer, but it allows for powerful error correction in 
case a Streamer method is out of sync with data in the file. The + option is mutual exclusive with both 
the - and ! options.

IMPORTANT NOTE: In ROOT Version 3 and later, a “+” after the class name tells rootcling to use the 
new I/O system. The byte count check is always added. The new I/O system has many advantages 
including support automatic schema evolution, full support for STL collections and better run-time 
performance. We strongly recommend using it.

but I don’t really understand what it means, apart from the fact that I will have to manually evolve the schema for C from now on.

pcanal · October 7, 2020, 5:25pm

Yes. At the time of generating the dictionary, we do not know the type of the old data member (the information is only available from the .root file, which of course is not available at that stage), while we do know the type of the current data member (because it is in the header file).

pcanal · October 7, 2020, 5:26pm

The ‘+’ is essential. The rule will be ignored without it.

pcanal · October 7, 2020, 5:26pm

So why did you have to remove it?

Nicola_Mori · October 7, 2020, 7:55pm

Because as far as I can tell it does not work with the trailing +. Here is a small program I use to investigate the automatic schema evolution:
ClassDef.tar.gz (1.5 KB)
It provides several versions of a TestClass; version 3 differs from version 2 just in the names of the data members.

Providing the -DCLASSVERSION=2 flag to the invocation of cmake, a library with dictionary and an executable for version 2 of TestClass are produced by compiling the code. By executing write2 a TestClass version 2 object is written on the output_2.root file, with values 20 and 30 for the fields f and d. The content can be verified by using the dictionary of version 2:

$ root
root [0] .L class2/libTestClassDef2.so
root [1] TFile *_file = TFile::Open("output_2.root")
(TFile *) 0x56388a5a57d0
root [2] TestClass *tc = (TestClass*)(_file->Get("tc"))
(TestClass *) 0x56388a812620
root [3] tc->f
(float) 20.0000f
root [4] tc->d
(double) 30.000000

Reconfiguring the build for version 3 with the cmake flag -DCLASSVERSION=3 and then building, a dictionary library for version 3 is produced. The linkdef for version 3 provides the pragmas defining the correspondence between old and new variable names. Reading the same file with this dictionary gives:

$ root
root [0] .L class3/libTestClassDef3.so
root [1] TFile *_file = TFile::Open("output_2.root")
(TFile *) 0x55ab3fc5ff70
root [2] TestClass *tc = (TestClass*)(_file->Get("tc"))
(TestClass *) 0x55ab40424b80
root [3] tc->newf
(float) 2.00000f
root [4] tc->newd
(double) 3.0000000

As you can see the printed values are not those stored in the file but the ones set by the default constructor. All of the above is with:

#pragma link C++ class TestClass+;

Removing the trailing + from the linkdef pragma above and recompiling, the file is read out correctly using version 3:

$ root
root [0] .L class3/libTestClassDef3.so
root [1] TFile *_file = TFile::Open("output_2.root")
(TFile *) 0x55e2d58023e0
root [2] TestClass *tc = (TestClass*)(_file->Get("tc"))
(TestClass *) 0x55e2d61a9080
root [3] tc->newf
(float) 20.0000f
root [4] tc->newd
(double) 30.000000

Supposing that my code is not bugged, I’d say that the trailing + must not be present for the mechanism to work.

pcanal · October 7, 2020, 9:42pm

Hmm … There is indeed a problem … I am investigating.

The result without the ‘+’ is an arbitrary “lucky” result. For example, if you shuffle the data members in the new version, you will notice that the wrong data is filled into each of the members.
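
To illustrate, here is a toy model in plain C++. It is not ROOT's actual I/O code, and Record, readByPosition and readByName are made-up names. It assumes that a reader which ignores the rule simply consumes the stored values in declaration order, so the renamed members get the right values only as long as the order is unchanged:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an on-file record: (old member name, value) pairs in the
// order the writer declared its members. Not ROOT's real buffer format.
using Record = std::vector<std::pair<std::string, double>>;

// Positional reader: the i-th declared member receives the i-th stored
// value, regardless of the names involved.
std::map<std::string, double> readByPosition(const Record& onfile,
                                             const std::vector<std::string>& members) {
    std::map<std::string, double> obj;
    for (std::size_t i = 0; i < members.size() && i < onfile.size(); ++i)
        obj[members[i]] = onfile[i].second;
    return obj;
}

// Name-based reader with explicit rename rules (new name -> old name).
std::map<std::string, double> readByName(const Record& onfile,
                                         const std::vector<std::string>& members,
                                         const std::map<std::string, std::string>& renames) {
    std::map<std::string, double> obj;
    for (const auto& member : members) {
        const auto rule = renames.find(member);
        const std::string& oldName = (rule != renames.end()) ? rule->second : member;
        for (const auto& entry : onfile)
            if (entry.first == oldName) obj[member] = entry.second;
    }
    return obj;
}
```

With a record {f=20, d=30}, readByPosition gives the expected values for the member order (newf, newd) but swaps them for (newd, newf), while readByName with the rules newf->f and newd->d is independent of the order.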

pcanal · October 7, 2020, 11:27pm

We have not yet put into production the code to support the variable renaming I was implicitly relying on. Instead, for now you have to spell out the “transformation” explicitly:

#pragma read sourceClass="TestClass" targetClass="TestClass" source="float f" target="newf" code="{ newf = onfile.f; }";
#pragma read sourceClass="TestClass" targetClass="TestClass" source="double d" target="newd" code="{ newd = onfile.d; }";

Cheers,
Philippe.

Note that the ClassDef is not necessary for this to work. Its main benefit in this case is that it gives a specific number to the class layout version (there is also a default identifier called “CheckSum” that could be used instead). The class version (or CheckSum) can be used to restrict when the rules should apply.

Nicola_Mori · October 8, 2020, 6:50am

Thank you, this last version works with the trailing + and also when shuffling the variables. I’d be interested to better understand some points:

1. Keeping the ClassDef, would everything work without any additional rule if I make other modifications to the class layout in version 3, for example introducing a new data member or removing newf? I can try this in my test code, but I fear I might not trigger some corner-case behavior that will pop up in my production code…
2. As you have mentioned it, how can the application of the rule be limited to given class versions?
3. Is there an estimated ROOT version in which the missing code for making the “compact” version of the rule work will be released? With that in place I suppose that the code part of the rule can be removed, right?

Edit: point 2 seems to be quite important since with the readout rule for reading version 2 files using version 3 code I get a segfault when reading version 3 files with version 3 code…

Edit2: I found here how to specify the version of the on-file class for which the readout rule should be applied and it seems to work.
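
For anyone following along, a rule restricted to files written with version 2 would look roughly like this. It is a sketch that combines the rules posted above with a version attribute; the rest of the LinkDef is omitted, and the exact lines in my test code may differ:

```cpp
#ifdef __CLING__

// The trailing '+' is needed, otherwise the read rules are ignored.
#pragma link C++ class TestClass+;

// Apply the renaming only when the on-file class version is 2, so that
// version 3 files are read without going through these rules.
#pragma read sourceClass="TestClass" version="[2]" targetClass="TestClass" source="float f" target="newf" code="{ newf = onfile.f; }";
#pragma read sourceClass="TestClass" version="[2]" targetClass="TestClass" source="double d" target="newd" code="{ newd = onfile.d; }";

#endif
```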

pcanal · October 8, 2020, 7:15pm

As you saw, you can specify either a version range:
version="[4-5,7,9,12-]"
to which the rule should be applied (when reading those versions), or a set of checksums:
checksum="[12345,123456]"
where the checksum is the value returned by TClass::GetCheckSum for the current version, or by TStreamerInfo::GetCheckSum for a version you get from a file.

You can specify a version number either via a ClassDef or by adding an ‘options’ clause to the pragma line:

#pragma link C++ options=version(3) class MyClass+;
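
To make the version="[4-5,7,9,12-]" syntax above concrete, here is a small standalone sketch of how such a specification can be read: "4-5" is a closed range, "7" and "9" are single versions, and "12-" means 12 and anything later. matchesVersionSpec is a hypothetical helper written for this illustration, not a ROOT function, and treating a missing lower bound as open is an assumption of the sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper (not part of ROOT): returns true if `version`
// falls inside a specification such as "[4-5,7,9,12-]".
bool matchesVersionSpec(const std::string& spec, int version) {
    // Strip the surrounding brackets, if present.
    std::string body = spec;
    if (!body.empty() && body.front() == '[') body.erase(0, 1);
    if (!body.empty() && body.back() == ']') body.pop_back();

    std::stringstream items(body);
    std::string item;
    while (std::getline(items, item, ',')) {
        if (item.empty()) continue;
        const std::size_t dash = item.find('-');
        if (dash == std::string::npos) {
            // Single version, e.g. "7".
            if (std::stoi(item) == version) return true;
            continue;
        }
        // Range, e.g. "4-5"; an empty side ("12-") is treated as open.
        const std::string low = item.substr(0, dash);
        const std::string high = item.substr(dash + 1);
        const bool aboveLow = low.empty() || version >= std::stoi(low);
        const bool belowHigh = high.empty() || version <= std::stoi(high);
        if (aboveLow && belowHigh) return true;
    }
    return false;
}
```

With the specification above, versions 4, 5, 7, 9 and 12 onwards match, while 3, 6, 8, 10 and 11 do not.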

Nicola_Mori · October 9, 2020, 7:06am

Thank you very much, again. I didn’t know that so many options could be defined in the LinkDef, especially the class version. Usually I don’t make my classes inherit from TObject and use TFile::WriteObjectAny to write them to file, but I still use ClassDef to enable the automatic schema evolution. So if I remove the ClassDef from the C++ code and rely on #pragma link C++ options=version(3) class MyClass+; in LinkDef to define the class version, which functionalities would I eventually lose?
By the way, it’s not easy to find documentation for these “advanced” features, at least for me: is this my fault or maybe this documentation is missing/hidden?

pcanal · October 9, 2020, 9:18pm

The documentation indeed needs to be spruced up with that information.

Not having the TObject inheritance means that you can no longer use the ROOT collections, most notably TClonesArray (which has a slight edge thanks to a reduced number of memory allocations and constructor calls).

Not having the ClassDef means that you miss out on the IsA member function, and in general the I/O performance for those objects will be slightly lower (due to the extra effort needed to find the type in some circumstances).

In addition, once you set a class version, you always need to increment it when you change the (persistent) layout of the class. It tends to be easier to remember if the version is visible in the header file (via the ClassDef).

Cheers,
Philippe.
