Reading from http very slow

I am not sure it is a cernbox issue; I am just guessing. Yes, we should try with a bigger file. I am not sure I have the right to copy that file to the root server.

With a 2.7GB file on the root server:

root -l https://root.cern.ch/files/lhcb2.root -e 'E->Draw("m_version")' -q  1.33s user 0.38s system 62% cpu 2.771 total

Not sure what you mean by “your server”. The file is on cernbox, can’t you download it?
By the way, I put it on an S3 bucket. It is faster, but it still takes 11 seconds. So it seems to be mainly a cernbox problem.

time root -b -l -q -e "TFile::Open(\"http://rgw.fisica.unimi.it/test-ruggero/test_ntuples_200123.root?AWSAccessKeyId=M06HBTUGIKXVXYH1RES6&Signature=hpX%2FNzIKINZd825AWEGw%2FuVQ4nU%3D&Expires=1693581796\"); Electrons_All->Draw(\"pt__NOSYS\")"

Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1

________________________________________________________
Executed in   11.22 secs      fish           external
   usr time  426.22 millis    0.00 micros  426.22 millis
   sys time  149.82 millis  875.00 micros  148.94 millis

When reading from cernbox, strace tells me that most of the time is spent in futex. This is not the case when reading from my disk or from rgw.fisica.unimi.it.

So, there is something very wrong with the cooperation between ROOT and CERNBox.

@wiso I confirm that opening the test file from your “rgw” server takes 1.4 s and the drawing 13.6 s (which is still ten times longer than it should be, as it shouldn’t be longer than the opening time for such a small file).

Well, it seems that the problem sits in the ROOT C++ code … maybe @linev also has some ideas.

I tried the “jsroot” and the plot comes quite fast (after 1 s I get the “jsroot” window and then after some 3 s I get the plot):

https://jsroot.gsi.de/dev/?file=https://cernbox.cern.ch/remote.php/dav/public-files/1Cy1HIf03Ca76Dm/test_ntuples_200123.root&item=Electrons_All;9/pt__NOSYS&opt=

Of course, ROOT THttpServer is not used by cernbox.

Cernbox directly provides a way to open ROOT files with JSROOT: just clicking on a file opens a new tab with the JSROOT browser. But tree drawing is performed on the client side, which means all the necessary data has to be loaded to the client. Therefore performance depends on the connection speed between the client and the cernbox servers.

@linev Is a 1 Gb/s ethernet connection too slow for you? Note that both trials (native ROOT and “jsroot”) use exactly the same cernbox link, and the difference is 3 minutes versus 3 seconds of drawing time (for a file which is 4.5 MB in size with 68k events in the tree).

3 min with normal ROOT C++ TTree::Draw? Really strange.

I tried uproot.

  1. From disk
time python -c "import uproot; uproot.open('test_ntuples_200123.root').get('Electrons_All').arrays('pt__NOSYS')"

________________________________________________________
Executed in  344.48 millis    fish           external
   usr time  368.35 millis  639.00 micros  367.71 millis
   sys time  629.68 millis   88.00 micros  629.59 millis
  2. From rgw.fisica.unimi.it
time python -c "import uproot; uproot.open('http://rgw.fisica.unimi.it/test-ruggero/test_ntuples_200123.root?AWSAccessKeyId=M06HBTUGIKXVXYH1RES6&Signature=hpX%2FNzIKINZd825AWEGw%2FuVQ4nU%3D&Expires=1693581796').get('Electrons_All').arrays('pt__NOSYS')"

________________________________________________________
Executed in  763.77 millis    fish           external
   usr time  444.30 millis  643.00 micros  443.65 millis
   sys time  669.86 millis   96.00 micros  669.76 millis
  3. From cernbox
time python -c "import uproot; uproot.open('https://cernbox.cern.ch/remote.php/dav/public-files/1Cy1HIf03Ca76Dm/test_ntuples_200123.root').get('Electrons_All').arrays('pt__NOSYS')"

It crashes:

    raise uproot.deserialization.DeserializationError(
uproot.deserialization.DeserializationError: while reading

    TBasket version None as uproot.models.TBasket.Model_TBasket (? bytes)
        fNbytes: 218759168
        fObjlen: 65798144
        fDatime: 293105760
        fKeylen: 32314
        fCycle: 85
Members for TBasket: fNbytes?, fObjlen?, fDatime?, fKeylen?, fCycle?

attempting to get bytes 38380:38398
outside expected range 6085:8333 for this Chunk
in file https://cernbox.cern.ch/remote.php/dav/public-files/1Cy1HIf03Ca76Dm/test_ntuples_200123.root

I think I have an idea.

cernbox does not honor Range headers in requests and always returns the full file content, even when it declares Accept-Ranges in the response headers.
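A quick way to check this behaviour for any server is to send a small Range request and look at what comes back. The following is only a minimal sketch using the Python standard library; the function names `classify_range_response` and `probe` are made up for illustration, and the classification labels are my own wording. A server that honors ranges should answer with status 206 and a Content-Range header; a server that ignores them answers with 200 and the whole file.

```python
import urllib.request


def classify_range_response(status, headers, requested_len, body_len):
    """Classify the reply to a 'Range: bytes=0-(requested_len-1)' request.

    `headers` is any mapping of header names to values (names are
    compared case-insensitively). Hypothetical helper, not a ROOT API.
    """
    h = {k.lower(): v for k, v in headers.items()}
    if status == 206 and "content-range" in h and body_len <= requested_len:
        return "honors ranges"
    if status == 200:
        # Full content was sent; check whether ranges were advertised anyway.
        if h.get("accept-ranges", "").strip().lower() == "bytes":
            return "advertises ranges but sends full content"
        return "ignores ranges"
    return "unexpected"


def probe(url, nbytes=64):
    """Request the first `nbytes` bytes of `url` and classify the reply."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Read at most one byte more than requested, so that a server
        # sending the whole file is detected without downloading it all.
        body = resp.read(nbytes + 1)
        return classify_range_response(resp.status, resp.headers, nbytes, len(body))
```

Calling `probe(...)` with the cernbox public link and with the rgw link from this thread should show whether the two servers really differ in this respect.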

JSROOT has a workaround: it requests the complete file content once and then reuses it. Of course, this does not work for large files.

ROOT has no such workaround, so for each small request it gets the full content again and again.
Therefore it may take a very long time to process such a file.

Therefore I would not recommend using cernbox for such applications until the problem is fixed.

I submitted an issue via the cernbox feedback form; let's wait for their response.

If you are right, then it’s not just about CERNBox but maybe about all similar ownCloud-based servers?

A possible workaround is to use xrootd rather than http(s) for cernbox, which is usually possible.

I am not sure how @wiso's URL in particular translates to an xrootd path, but in general for a URL such as https://cernbox.cern.ch/files/spaces/eos/project/r/root-eos/public/hsimple.root the equivalent xrootd URL is root://eosproject.cern.ch//eos/project/r/root-eos/public/hsimple.root (requires valid credentials, e.g. an active Kerberos ticket).

Other locations on cernbox will use eospublic.cern.ch, eosuser.cern.ch or similar rather than eosproject.
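For illustration, the rewriting in the example above could be sketched as below. This is only a guess generalized from that single example: the function name `cernbox_to_xrootd` is hypothetical, and the rule "host is `eos` plus the directory name right under `/eos/`" is an assumption that may not hold for every cernbox location. Public-link URLs (`remote.php/dav/public-files/...`) do not contain an EOS path and are rejected.

```python
from urllib.parse import urlparse

# Path prefix seen in the example URL above (assumed, not documented here).
CERNBOX_PREFIX = "/files/spaces"


def cernbox_to_xrootd(url):
    """Translate a cernbox 'files/spaces/eos/...' URL to an xrootd URL.

    Assumption: the EOS instance host is 'eos<name>.cern.ch', where <name>
    is the first directory below /eos/ (e.g. 'project' -> eosproject).
    """
    path = urlparse(url).path
    if not path.startswith(CERNBOX_PREFIX + "/eos/"):
        raise ValueError(f"no EOS path found in {url!r}")
    eos_path = path[len(CERNBOX_PREFIX):]   # e.g. "/eos/project/r/..."
    instance = eos_path.split("/")[2]       # e.g. "project"
    if not instance:
        raise ValueError(f"cannot determine the EOS instance from {url!r}")
    # The double slash after the host is part of the xrootd URL syntax.
    return f"root://eos{instance}.cern.ch/{eos_path}"


print(cernbox_to_xrootd(
    "https://cernbox.cern.ch/files/spaces/eos/project/r/root-eos/public/hsimple.root"
))
```

The resulting URL can then be passed to `TFile::Open` or `uproot.open`, provided xrootd support and valid credentials are available.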

I hope this helps,
Enrico

For production, EOS is a solution, but not for inspecting ROOT files on the cernbox website with JSROOT.

Whether some particular file from some specific site can be accessed via xrootd (e.g., from EOS) instead of https (e.g., from CERNBox) is irrelevant to this discussion (though you could be interested in this thread: “Frequent failure to update ROOTfiles at /eos/”).

What @linev reports is a serious issue that, to his knowledge, users face with many different https servers.

Apparently, https servers either explicitly say they DO NOT Accept-Ranges or, worse, they say they DO but then send the whole file upon every request.

So, I think the relevant ROOT C++ code should be protected against such cases.
On the first request, it should automatically detect that it got the whole file (regardless of whether the server claimed it accepted Ranges) and then reuse the provided file (just like “jsroot” does now).
If the first request returned “partial content”, ROOT should use the Ranges feature.
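To make the proposed logic concrete, here is a minimal Python sketch of it (not ROOT's actual C++ code). The class `RangeReader` and the helper `http_fetch` are hypothetical names. The reader decides purely on the HTTP status of each reply: 206 means the server sent only the requested range, 200 means it sent the whole file, which is then cached and reused for all later reads. As noted for JSROOT, holding the whole file in memory is only reasonable for small files.

```python
import urllib.request


class RangeReader:
    """Read byte ranges, falling back to a cached full download."""

    def __init__(self, fetch):
        # fetch(start, stop) -> (http_status, body_bytes); transport hook
        # injected by the caller so the logic can be tried without a network.
        self._fetch = fetch
        self._cache = None   # whole file content, once a server has sent it
        self.requests = 0    # number of requests actually issued

    def read(self, start, stop):
        """Return bytes [start, stop) of the remote file."""
        if self._cache is not None:
            return self._cache[start:stop]
        self.requests += 1
        status, body = self._fetch(start, stop)
        if status == 206:    # partial content: the server honored the range
            return body
        if status == 200:    # whole file received: keep it and slice locally
            self._cache = body
            return body[start:stop]
        raise IOError(f"unexpected HTTP status {status}")


def http_fetch(url):
    """Build a fetch hook that sends real HTTP Range requests to `url`."""
    def fetch(start, stop):
        req = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{stop - 1}"})
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status, resp.read()
    return fetch
```

With a server that ignores ranges, `RangeReader(http_fetch(url))` issues a single request no matter how many reads follow; with a well-behaved server it issues one small request per read.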

@wiso Maybe you could ping uproot developers about it.

I disagree that pointing out a possible workaround for the user’s problem (or for other users that end up here with a similar problem) is irrelevant to the discussion, but I completely agree that the underlying issue needs attention. I raised the point in ROOT’s I/O Mattermost channel yesterday.

FYI I’ve opened a ticket on the CERN Service Portal.

Record not found. Is that expected?

Updated the link. Still, it will only be visible with a CERN account :-/ It’s about potentially re-writing cernbox URLs to xrootd URLs.