RNTuple - data reader specs?

sbinet · June 23, 2022, 7:59am

hi there,

I am trying to implement the RNTuple reader from the specs:

root/specifications.md at master · root-project/root · GitHub

I’ve been able to properly parse the header/footer envelopes:

header:
        {vers:1 minv:1 flags:[0] release:1 name:Staff descr: library:ROOT v6.26/04 fields:[{vers:0 typv:0 pfid:0 role:0 flag:0 nrep:0 fname:Category tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:1 role:0 flag:0 nrep:0 fname:Flag tname:std::uint32_t alias: descr:} {vers:0 typv:0 pfid:2 role:0 flag:0 nrep:0 fname:Age tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:3 role:0 flag:0 nrep:0 fname:Service tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:4 role:0 flag:0 nrep:0 fname:Children tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:5 role:0 flag:0 nrep:0 fname:Grade tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:6 role:0 flag:0 nrep:0 fname:Step tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:7 role:0 flag:0 nrep:0 fname:Hrweek tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:8 role:0 flag:0 nrep:0 fname:Cost tname:std::int32_t alias: descr:} {vers:0 typv:0 pfid:9 role:0 flag:0 nrep:0 fname:Division tname:std::string alias: descr:} {vers:0 typv:0 pfid:10 role:0 flag:0 nrep:0 fname:Nation tname:std::string alias: descr:}] cols:[{kind:11 bits:32 fieldID:0 flags:0} {kind:11 bits:32 fieldID:1 flags:0} {kind:11 bits:32 fieldID:2 flags:0} {kind:11 bits:32 fieldID:3 flags:0} {kind:11 bits:32 fieldID:4 flags:0} {kind:11 bits:32 fieldID:5 flags:0} {kind:11 bits:32 fieldID:6 flags:0} {kind:11 bits:32 fieldID:7 flags:0} {kind:11 bits:32 fieldID:8 flags:0} {kind:2 bits:32 fieldID:9 flags:5} {kind:5 bits:8 fieldID:9 flags:0} {kind:2 bits:32 fieldID:10 flags:5} {kind:5 bits:8 fieldID:10 flags:0}] aliases:[] extra:[] crc32:403897527},
footer:
        {vers:1 minv:1 flags:[0] hdr:403897527 xhdrs:[] colGroups:[] clInfos:[{firstEntry:0 nentries:3354 colGrpID:-1}] clGroups:[{n:1 pages:{size:492 locator:{pos:72208 storage:207 url:}}}] mdBlocks:[] crc32:3437551349}

(this is for the ntpl001_staff.root file from the tutorials)
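One consistency check is visible in the dumps above: the footer's hdr value (403897527) equals the header's crc32. Below is a minimal Go sketch of that check. It assumes the checksum is a little-endian IEEE CRC32 stored in the last 4 bytes of an envelope and computed over all preceding bytes; that layout is an assumption to verify against the spec. The names envelopeCRC and footerMatchesHeader are hypothetical, and the envelope in main is synthetic, not read from the file.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// envelopeCRC returns the checksum stored in the last 4 bytes of env and
// the checksum recomputed over everything before it.
// Assumption: little-endian storage, IEEE polynomial.
func envelopeCRC(env []byte) (stored, computed uint32, err error) {
	if len(env) < 4 {
		return 0, 0, fmt.Errorf("envelope too short: %d bytes", len(env))
	}
	body := env[:len(env)-4]
	stored = binary.LittleEndian.Uint32(env[len(env)-4:])
	computed = crc32.ChecksumIEEE(body)
	return stored, computed, nil
}

// footerMatchesHeader reports whether the header checksum recorded in the
// footer is the checksum of the header envelope.
func footerMatchesHeader(headerCRC, footerHdrCRC uint32) bool {
	return headerCRC == footerHdrCRC
}

func main() {
	// Synthetic envelope: payload followed by its own CRC32.
	body := []byte("synthetic envelope payload")
	env := make([]byte, len(body)+4)
	copy(env, body)
	binary.LittleEndian.PutUint32(env[len(body):], crc32.ChecksumIEEE(body))

	stored, computed, err := envelopeCRC(env)
	if err != nil {
		panic(err)
	}
	fmt.Println("envelope checksum ok:", stored == computed)

	// Values taken from the header and footer dumps above.
	fmt.Println("footer links to header:", footerMatchesHeader(403897527, 403897527))
}
```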

but then, the specs get a bit blurrier about how the data is organized in the data pages and how that data is extracted by the columns (also, the indexing and the split-encoding are only mentioned “in passing”).

could these be clarified? (@jblomer I guess)

thanks,
-s

jblomer · June 24, 2022, 11:03am

Hi Sebastien,

Some details of the specification are not yet merged, such as split encoding and 64bit index columns. So the data files are slightly behind the specification (which is why the RNTuples in ROOT files are still marked “release candidate”).

Regarding the specs themselves, can you point out where exactly they become unclear? Perhaps we can stick to the ntpl001_staff.root file as an example.

Cheers,
Jakob

jblomer · June 24, 2022, 1:08pm

(Since the ntpl001_staff.root example has no collections, we can even park the details on index columns for the time being.)

sbinet · June 24, 2022, 2:20pm

thanks for the reply.
I had a bug in the decoding of the compressed payload. (that led me astray)
and I worried this was because of missing bits about the split-encoding.

the meaning of the “32bit compression settings” (in the page list inner frame) is a bit opaque.
I assumed it’s the same as the “usual” ROOT compression settings (algorithm and level packed into one integer):

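// Kind identifies a compression algorithm (ZLIB, LZMA, LZ4, ZSTD, ...).
type Kind int

// rootCompressAlgLvl splits a ROOT compression setting, encoded as
// algorithm*100 + level, into its algorithm and its level.
func rootCompressAlgLvl(v uint32) (Kind, int) {
        var (
                alg = Kind(v / 100)
                lvl = int(v % 100)
        )

        return alg, lvl
}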
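To make the assumed encoding concrete, here is a small self-contained sketch of the same split. The Algorithm type, its constants and splitSettings are names introduced only for this illustration; the numeric values follow ROOT's usual algorithm identifiers (1 = ZLIB, 2 = LZMA, 4 = LZ4, 5 = ZSTD). Whether the RNTuple page list uses exactly this encoding is the assumption made in the post, not something the sketch proves.

```go
package main

import "fmt"

// Algorithm is a hypothetical name for ROOT's compression algorithm ID.
type Algorithm int

const (
	ZLIB Algorithm = 1
	LZMA Algorithm = 2
	LZ4  Algorithm = 4
	ZSTD Algorithm = 5
)

// String returns a readable name for the known algorithm IDs.
func (a Algorithm) String() string {
	switch a {
	case ZLIB:
		return "ZLIB"
	case LZMA:
		return "LZMA"
	case LZ4:
		return "LZ4"
	case ZSTD:
		return "ZSTD"
	default:
		return fmt.Sprintf("Algorithm(%d)", int(a))
	}
}

// splitSettings decodes a setting stored as algorithm*100 + level.
func splitSettings(v uint32) (Algorithm, int) {
	return Algorithm(v / 100), int(v % 100)
}

func main() {
	// Example settings only; they are not read from ntpl001_staff.root.
	for _, v := range []uint32{101, 207, 404, 505} {
		alg, lvl := splitSettings(v)
		fmt.Printf("%d -> %v level %d\n", v, alg, lvl)
	}
}
```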

anyway, I got it working for the ntpl001_staff.root file:

        cluster[0,0,0]: Category
        
        00000000  ca 00 00 00 12 02 00 00  3c 01 00 00 69 01 00 00  |........<...i...|
        00000010  2e 01 00 00 2f 01 00 00  2e 01 00 00 69 01 00 00  |..../.......i...|
        00000020  54 01 00 00 69 01 00 00  69 01 00 00 2f 01 00 00  |T...i...i.../...|
        00000030  2e 01 00 00 2c 01 00 00  69 01 00 00 69 01 00 00  |....,...i...i...|
        00000040  3c 01 00 00 2f 01 00 00  69 01 00 00 69 01 00 00  |<.../...i...i...|
        00000050  a3 01 00 00 ca 00 00 00  30 01 00 00 cc 00 00 00  |........0.......|
        00000060  cc 00 00 00 30 01 00 00  30 01 00 00 ca 00 00 00  |....0...0.......|
        00000070  cc 00 00 00 ca 00 00 00  ca 00 00 00 2e 01 00 00  |................|
        
        cluster[0,1,0]: Flag
        
        00000000  0f 00 00 00 0f 00 00 00  0f 00 00 00 0f 00 00 00  |................|
        00000010  0f 00 00 00 0f 00 00 00  0f 00 00 00 0f 00 00 00  |................|
        00000020  0f 00 00 00 0f 00 00 00  0f 00 00 00 0f 00 00 00  |................|
        00000030  0f 00 00 00 0f 00 00 00  0f 00 00 00 0f 00 00 00  |................|
        00000040  0b 00 00 00 0f 00 00 00  0f 00 00 00 0f 00 00 00  |................|
        00000050  0d 00 00 00 0f 00 00 00  0f 00 00 00 0f 00 00 00  |................|
        00000060  0f 00 00 00 0f 00 00 00  0f 00 00 00 0f 00 00 00  |................|
        00000070  0f 00 00 00 0b 00 00 00  0f 00 00 00 0d 00 00 00  |................|
        
[...]
        cluster[0,12,0]: Nation
        
        00000000  44 45 43 48 46 52 46 52  44 45 49 54 43 48 49 54  |DECHFRFRDEITCHIT|
        00000010  44 45 46 52 46 52 43 48  43 48 43 48 44 45 46 52  |DEFRFRCHCHCHDEFR|
        00000020  43 48 46 52 46 52 46 52  46 52 44 45 4e 4c 44 45  |CHFRFRFRFRDENLDE|
        00000030  47 42 46 52 46 52 46 52  46 52 49 54 49 54 44 45  |GBFRFRFRFRITITDE|
        00000040  4e 4c 43 48 46 52 49 54  47 42 47 42 43 48 43 48  |NLCHFRITGBGBCHCH|
        00000050  44 45 49 54 43 48 46 52  43 48 46 52 49 54 46 52  |DEITCHFRCHFRITFR|
        00000060  49 54 41 54 43 48 4e 4c  43 48 42 45 43 48 46 52  |ITATCHNLCHBECHFR|
        00000070  43 48 46 52 47 42 41 54  4e 4f 46 52 41 54 43 48  |CHFRGBATNOFRATCH|
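For reference, the 32-bit integer pages dumped above read as packed little-endian values once decompressed (the first Category word, ca 00 00 00, is 202). A minimal Go sketch of that decoding step follows. The function name decodeInt32s is hypothetical, and the sketch assumes an already uncompressed page without split encoding, as in this file.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeInt32s interprets an uncompressed page as packed little-endian
// int32 values. Assumption: no split encoding is applied to the page.
func decodeInt32s(page []byte) ([]int32, error) {
	if len(page)%4 != 0 {
		return nil, fmt.Errorf("page size %d is not a multiple of 4", len(page))
	}
	out := make([]int32, len(page)/4)
	for i := range out {
		out[i] = int32(binary.LittleEndian.Uint32(page[4*i:]))
	}
	return out, nil
}

func main() {
	// First 16 bytes of the Category dump above.
	page := []byte{
		0xca, 0x00, 0x00, 0x00, 0x12, 0x02, 0x00, 0x00,
		0x3c, 0x01, 0x00, 0x00, 0x69, 0x01, 0x00, 0x00,
	}
	vals, err := decodeInt32s(page)
	if err != nil {
		panic(err)
	}
	fmt.Println(vals) // prints [202 530 316 361]
}
```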

what are the PRs (if any) that add the split-encoding documentation stanzas?
(feel free to mention me (@sbinet on github) on such documentation PRs)

thanks again.

PS: the specs as a whole are really nice to read. I wish I had something like that for TTree.

jblomer · June 27, 2022, 8:09pm

Cool that you managed to parse the format!

I added a clarification on the compression settings in a PR.

The split encoding (and more encodings) are in a separate branch. The code was branched off an older version of RNTuple and needs a bit of cleanup before the PRs. That includes documentation. There was a longer discussion on Mattermost on the details. The code in the ntuple-split branch allowed us to look into the improvements we can get from “encoding before compression”, which are summarized in a Google Sheet.

Cheers,
Jakob
