Events Skipped

rabidec · July 28, 2008, 7:13pm

Hi,

I’ve been trying to get PROOF running on a few computers on our cluster here, and I have managed to get it to run a basic analysis and make some histograms.

However, I still have a problem: most of the events get processed, then PROOF stops processing events but still seems to be doing something. After a few seconds it stops and reports that it is done but that it skipped events, and when I look at the histogram produced I see that it is indeed missing entries. Has anyone seen this before, or does anyone have an idea for a solution?

I attached a tarball with my config files, some output and the analysis code. It’s just missing the data. The analysis code with the data is here.
The analysis code is from a tutorial I found somewhere but can’t locate again. The tutorial explained a way of copying over the data using xrdcp, which is what I am using for that code, but I have also tried just using a path (i.e. /raid/…/data.root) and I get the same problem. I have also tried a few different sets of data and different analysis code, so it’s probably not that.

Thanks for any help,
Charles Rabideau
Proof_Test.tar.gz (138 KB)

anna · July 29, 2008, 9:34am

Hi Charles,

Is there anything suspicious in the logs? You can get them by pressing the “ShowLogs” button on the progress dialogue, or by calling TProof::Mgr("username@master.url")->GetSessionLogs() from the command line. Did you try to process these files locally? I mean, give the remote URLs for the files, but do not call TChain::SetProof(kTRUE) before processing.

Cheers,
Anna

rabidec · July 29, 2008, 2:35pm

I did call SetProof, and I used chain->Add("root://gaia040:1093//tmp/pool/proofpool/"+filename);
to add the files. I used xrdcp $filename root://gaia040:1093//tmp/pool/proofpool/$filename to copy over the files (gaia040 is the master).
Also, when I add more workers it runs faster, and when I take some out it runs slower, so I imagine it must be doing something.
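
Since the same URL has to appear both as the xrdcp destination and in the chain->Add call, it can help to build it in one place. The following is only a minimal sketch: `poolUrl` is a hypothetical helper (not part of ROOT or xrootd), and `data.root` in the usage note is a made-up file name.

```cpp
#include <cassert>
#include <string>

// Builds "root://<host>:<port>/<absDir>/<file>". Because absDir starts with
// '/', the result has the double slash after the port used in this thread.
// Hypothetical helper for illustration only.
std::string poolUrl(const std::string& host, int port,
                    const std::string& absDir, const std::string& file) {
    assert(!absDir.empty() && absDir.front() == '/');  // absolute path expected
    std::string url = "root://" + host + ":" + std::to_string(port) + "/" + absDir;
    if (url.back() != '/') url += '/';  // avoid a missing or doubled separator
    return url + file;
}
```

For example, poolUrl("gaia040", 1093, "/tmp/pool/proofpool", "data.root") gives root://gaia040:1093//tmp/pool/proofpool/data.root, which could then be passed both to xrdcp and, via c_str(), to chain->Add.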

The logs indicate that the nodes seem to be segmentation faulting. They look like this:
// --------- Start of element log -----------------
// Ordinal: 0.0 (role: worker)
// Path: rabidec@gaia041.beowulf.com:1093//tmp/pool/proofbox/rabidec/session-gaia040-1217015185-9949/worker-0.0-gaia041-1217015186-25936.log
// # of retrieved lines: 135
// ------------------------------------------------
15:46:42 25936 Wrk-0.0 | Warning in TClass::TClass: no dictionary for class AttributeListLayout is available
15:46:42 25936 Wrk-0.0 | Warning in TClass::TClass: no dictionary for class pair<string,string> is available
15:46:48 25936 Wrk-0.0 | *** Break ***: segmentation violation
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Attaching to program: /proc/25936/exe, process 25936
(no debugging symbols found)…done.
…
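
With many workers, it may be quicker to scan the retrieved logs for the break marker than to read them by eye. The following is a rough sketch under one assumption: that every worker line follows the "HH:MM:SS PID Wrk-<ordinal> | message" layout seen in the excerpt above. `crashedWorkers` is a hypothetical helper, not a ROOT function.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Returns the ordinals (e.g. "0.0") of workers whose log lines report a
// segmentation violation. Lines that do not match the assumed layout
// "HH:MM:SS PID Wrk-<ordinal> | message" are skipped.
std::set<std::string> crashedWorkers(const std::string& logText) {
    const std::string tag = "Wrk-";
    const std::string sep = " | ";
    const std::string marker = "*** Break ***: segmentation violation";
    std::set<std::string> ordinals;
    std::istringstream in(logText);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t tagPos = line.find(tag);
        if (tagPos == std::string::npos) continue;
        const std::size_t start = tagPos + tag.size();
        const std::size_t sepPos = line.find(sep, start);
        if (sepPos == std::string::npos) continue;
        assert(sepPos >= start);  // find() never returns a position before 'start'
        if (line.find(marker, sepPos) == std::string::npos) continue;
        ordinals.insert(line.substr(start, sepPos - start));
    }
    return ordinals;
}
```

Feeding it the text of all worker logs gives the set of ordinals that crashed, which can be compared with the list of workers in the session.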

However, I am getting a histogram with some entries, just not all of them.

Just now I tried running xrdcp $filename root://gaia0xx:1093//tmp/pool/proofpool/$filename for each node, and that seems to work sometimes, although I did not change the chain->Add("root://gaia040:1093//tmp/pool/proofpool/"+filename); call.
The weird thing is that the first copy of a file takes much longer than the rest of the copies, although I am running the client session on just another node of the cluster. Also, the nodes still seem to be seg faulting.

However, this is not really much of a solution, since we don’t want to have to copy the entire data set to each node (I don’t think we even have enough space for that).

As I mentioned before, I also tried to directly access the files on a shared raid array that all the nodes have access to. For this I used chain->Add(/raid/…path to the file…/filename); and that was as bad as the first method.

Charles

anna · July 29, 2008, 3:26pm

Hi Charles,

I meant: don’t call SetProof, but still specify the files as "root://master//file.root". This way, you can check whether the problem is with the file or with the code. If the file is corrupted, you won’t be able to process it locally.

By the way, is it possible that your original data is corrupted? Could you check that?

For more hints, you could also try adding some per-event debug info in the Process() function of your selector. The format is like in printf: Info("Process", "Some words and numbers %s %d", word, number); This way you’ll at least see at which events it fails, and whether it’s always failing at the same events.
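
Once the entry numbers printed by such Info lines have been collected for two runs, checking whether the same events fail comes down to comparing the missing entries. The following is a minimal sketch that assumes the entry numbers have already been extracted from the logs into sets; `missingEntries` and `skippedInBoth` are hypothetical helper names, not ROOT functions.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

using EntrySet = std::set<long long>;

// Entries in [0, total) that never showed up in the per-event debug output.
EntrySet missingEntries(const EntrySet& seen, long long total) {
    assert(total >= 0);
    EntrySet missing;
    for (long long entry = 0; entry < total; ++entry) {
        if (seen.count(entry) == 0) missing.insert(entry);
    }
    return missing;
}

// Entries skipped in both runs. A large overlap hints at specific bad events;
// little or no overlap hints at a failure that does not depend on the event.
EntrySet skippedInBoth(const EntrySet& missingA, const EntrySet& missingB) {
    EntrySet both;
    std::set_intersection(missingA.begin(), missingA.end(),
                          missingB.begin(), missingB.end(),
                          std::inserter(both, both.begin()));
    return both;
}
```

If skippedInBoth comes back empty or nearly empty across several runs, the failures are probably not tied to particular events.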

Cheers,
Anna

cristi · July 31, 2008, 5:18am

Hi,

I have the same problem running a different analysis on a different data set. I do get some entries in the histograms (around 10-15% of the correct number), but this number seems random within these limits, so it’s not failing at the same event.

Cristian

anna · July 31, 2008, 10:02am

Hi,

I’m sorry, are we still talking about the same installation? What do you see in the logs? If you run this analysis locally with remote files, does it work or does it produce the same error?

Anna

cristi · July 31, 2008, 10:22am

Hi,

I’m using ROOT 5.20 at a different location (infrastructure) than Charles.
The logs look very similar: I get that “no dictionary for class *** is available” warning and the segmentation violation after that.
If I run this analysis locally with remote files it does work.

Cristian

rabidec · August 1, 2008, 7:17pm

Hi,

Sorry for the delay; I was occupied with some other stuff for the last couple of days.

I tried without calling SetProof, while still specifying the files as "root://master//file.root", and it runs just fine locally.

When I copy the files over with xrdcp file root://master//path/file, I noticed that they actually end up on a specific node. So then I tried an md5 checksum and it said they are OK.
However, when I added an Info line as you suggested, I noticed that only the nodes that got a file copied to them were actually doing anything. So does that mean that I have to copy all the files to all the nodes? Doesn’t that kinda defeat the purpose of xrootd?

Also, it processes a different number of events each time I rerun the analysis and once in a while it works, so I don’t think it’s always the same events.

Charles

ganis · August 6, 2008, 6:54pm

Dear Charles,

I have tried the code that you posted with the trunk and I did not get any problem.
I will now try to run it with 5.20.

However, I would like to ask you two (Charles and Cristian) to post the full log that you get on the workers (I mean the trace back of the seg violation from all threads).
Often this gives useful indications of where the problem is.

Gerri Ganis

rabidec · August 6, 2008, 7:36pm

Here is a log with the full seg fault. Here it seems to have processed some events before seg faulting. I just put a few lines of that at the start then the full seg fault.

However, you should note that it seems to seg fault pretty often even when it works, that is, when it manages to process all the events. Also, it seg faulted with another, completely unrelated analysis I was trying earlier.

Also, one of the workers that this was running on didn’t seg fault, however the other 15 did and they all processed at least some events. A cursory look didn’t reveal any difference between the outputs for the seg faults on any of the workers that did seg fault. In total they managed to process 13591 events out of 20000 in 2 files.
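
To put numbers on the loss quoted above: 13591 processed out of 20000 leaves 6409 skipped events, i.e. about 68% processed. The following is a minimal sketch of that bookkeeping; `EventTally` and `tallyEvents` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Simple bookkeeping of processed versus skipped events for one query.
struct EventTally {
    long long processed;
    long long skipped;
    double fractionProcessed;
};

// Assumes 0 <= processed <= total and total > 0.
EventTally tallyEvents(long long processed, long long total) {
    assert(total > 0 && processed >= 0 && processed <= total);
    return {processed, total - processed,
            static_cast<double>(processed) / static_cast<double>(total)};
}
```

Comparing the skipped count between reruns of the same query is a quick way to confirm that the number of lost events changes from run to run.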

Charles
(the log is in the attachment since it’s 276 lines)
log.txt (7.25 KB)

ganis · August 11, 2008, 8:59am

Dear Charles,

I was able to reproduce your problem with 5.19.04, the version that you used for the log that you posted.
There was indeed a problem in deleting the selector object, introduced just before that development version, which has been fixed in 5.20.00.
Could you please try with 5.20 and let me know?

Cristian is already using 5.20, so he must be suffering from another problem.
For that I need to see the full trace back at the seg violation and possibly to have the simplest setup possible to reproduce the problem.

Gerri Ganis

rabidec · August 11, 2008, 3:58pm

Gerri Ganis,

I had forgotten to change my proof config files when we installed the new version of root, so I updated those. However, I couldn’t get libXrdOfs.so working with 5.20 so I’m still using the 5.19 version of that.

The seg faults are gone and it seems not to drop events anymore, so version 5.20 seems to be working for me, except for libXrdOfs.so.

Thanks,
Charles