Using datasets with PROOF

krasznaa · November 11, 2009, 1:12pm

Dear All,

I’m stuck with the following issue:

Since I want to use quite a lot of files in my analysis, I’m trying to update my code to use the concept of datasets instead of running over individual files. As far as I can see, I can create datasets correctly. I always call TProof::ValidateDataset(…) on the new datasets, and it succeeds in finding all the files, getting their sizes, the number of entries in them, etc.

In my analysis code, just before submitting the job with TProof::Process(…), I also check the TFileCollection object returned by TProof::GetDataSet(…). This seems to be okay. It prints the correct number of input files, the correct number of entries in these input files, etc.

But after calling TProof::Process(…), the job stops right away. It finishes gracefully, but it doesn’t process any events. I get something like this on the master node:

TXProofServ::Se… : starting query: 1
( INFO ) TPacketizerAdap… : Setting max number of workers per node to 9999999
( INFO ) TPacketizerAdap… : no valid or non-empty file found: setting invalid
( ERROR ) TProofPlayerRem… : instantiated packetizer object ‘TPacketizerAdaptive’ is invalid
( ERROR ) TProofPlayerRem… : cannot init the packetizer

First I thought that it could be because the files are stored on an NFS volume. But that NFS volume is actually on the master node, and is visible to all the slave nodes. Since the location is also exported by xrootd, I tried creating the dataset with file names such as “root://me@proof.server//location/of/file.root”. But this didn’t work at all: for some reason PROOF was not able to find the files this way. I had a look at the TFileCollection code (which I fill, and give to TProof to create the dataset), and it seems to assume that the input files are local.
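For illustration, here is a minimal sketch of how a local absolute path maps onto an xrootd URL of the form quoted above. The helper name makeXrootdUrl is hypothetical and not part of ROOT. It only assembles the string and does not check that the server actually exports the file.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ROOT): build an xrootd URL of the form
// root://user@host//absolute/path from an absolute local path.
std::string makeXrootdUrl(const std::string& user,
                          const std::string& host,
                          const std::string& localPath) {
    // Only absolute paths are accepted, so that the result has the "//" form.
    if (localPath.empty() || localPath[0] != '/') {
        throw std::invalid_argument("expected an absolute path: " + localPath);
    }
    std::string url = "root://";
    if (!user.empty()) {
        url += user + "@";  // the user part is optional
    }
    // One '/' separates host and path; the absolute path supplies the second.
    url += host + "/" + localPath;
    return url;
}
```

With the values from the post, makeXrootdUrl("me", "proof.server", "/location/of/file.root") gives "root://me@proof.server//location/of/file.root".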

I tried understanding how the code works exactly, but now I’m giving up. Hopefully somebody can help me with understanding what I’m doing wrong. I can send more information if anybody’s interested.

Cheers,
Attila

ganis · November 23, 2009, 3:37pm

Hi Attila,

Sorry for the late reply.

I guess that TProof::ShowDataSets() correctly shows the datasets and the information about their contents.

You are using the method TProof::Process(<dataset_name>, <selector>, …), right?

The messages just indicate that PROOF thinks that none of the files in the list is valid. This typically happens when validation is not OK and the information is not rechecked.
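To make the condition concrete, here is a toy model of a check one could run before submitting a query. It is only a sketch: FileEntry, isUsable and countUsable are hypothetical names, not ROOT types or functions, and the rule "found by validation and at least one entry" is an assumption read off the log message "no valid or non-empty file found", not the packetizer's actual code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for one dataset entry; this is NOT a ROOT type.
struct FileEntry {
    std::string url;
    bool staged;        // true if the last validation found the file
    long long entries;  // entries recorded for the tree of interest
};

// Assumed rule: a file is usable if it was found and is non-empty.
bool isUsable(const FileEntry& file) {
    return file.staged && file.entries > 0;
}

// Count the usable files. A result of zero corresponds to the situation
// the log message describes: nothing valid to hand out to the workers.
std::size_t countUsable(const std::vector<FileEntry>& files) {
    std::size_t usable = 0;
    for (const FileEntry& file : files) {
        if (isUsable(file)) {
            ++usable;
        }
    }
    return usable;
}
```

If countUsable returns zero for a dataset that was expected to be fine, the stored validation information is the first thing to recheck.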

Could you please post the output of TProof::ShowDataSets()?

Gerri

krasznaa · November 23, 2009, 4:01pm

Hi Gerri,

Yes, I was using that function. Unfortunately I modified my code since then, so I can’t really run the tests out of the box again.

This is what I get:

I should note at this point that if TProof::ShowDataSets() were smart enough to “overflow” into the other columns, it would show the number of events correctly as well. The dataset contains ~486k events, and when asking PROOF about it directly, it tells me the correct number. So it’s just a printout issue that I see here… I know this because when I ask for more verbose information using TProof::ShowDataSet(…), I get this:

TFileCollection user09.AkiraShibata.TopLightD3PD_140506.mc08.105200.T1_McAtNlo_Jimmy.e357_s462_r541.cloud.004 - File collection for making a data set contains: 392 files with a size of 4472890439 bytes, 100.0 % staged - default tree name: '/CollectionTree'
The files contain the following trees:
Tree /CollectionTree: 486143 events
Tree /Trigger0: 486143 events
Tree /FullReco0: 486143 events
Tree /Truth0: 486143 events
Tree /TruthAna0: 486143 events
Tree /Cluster0: 486143 events
Tree /Track0: 486143 events

The even more verbose information still looks okay to me, but it gives way too much information for a post here…

I have to admit that I’m trying a different angle in the meantime. Since I want to be able to run my code the same way both when running in PROOF and when running in a completely standalone way, I started saving “cache information” about the input files myself. My code actually needs some information about these files for the event weighting as well, which also takes some time to collect. So I started saving TFileCollection objects into my own “cache files”, which seems to work beautifully. I also noticed that if I give a TDSet object to TProof::Process(…) which is already “validated”, then PROOF doesn’t validate the dataset itself. So now I’m playing with doing all the caching myself. This way I’m still able to run my code without PROOF for debugging purposes, and then switch to using PROOF with just the change of a job configuration option.

But if you’d like to investigate my previous problem further, I’m willing to give more information on that.

Cheers,
Attila

ganis · November 24, 2009, 7:49am

Hi Attila,

krasznaa wrote:
Since I want to be able to run my code the same way both when running in PROOF and when running in a completely standalone way, I started saving “cache information” about the input files myself. My code actually needs some information about these files for the event weighting as well, which also takes some time to collect. So I started saving TFileCollection objects into my own “cache files”, which seems to work beautifully. I also noticed that if I give a TDSet object to TProof::Process(…) which is already “validated”, then PROOF doesn’t validate the dataset itself. So now I’m playing with doing all the caching myself. This way I’m still able to run my code without PROOF for debugging purposes, and then switch to using PROOF with just the change of a job configuration option.

I do not know at which point you are with this, but you may want to have a look at TDataSetManagerFile which, although developed initially for PROOF, is completely independent of it. It should do exactly that, i.e. provide a way to organize your datasets in the form of TFileCollections, using a ROOT-file based database, located locally or remotely.

I did not see anything wrong in the listings that you posted, so it would be interesting to understand why it did not work; it may hide other problems.
So, at some point perhaps we should investigate it.

Cheers,
Gerri