PROOF crash on re-opening session

Hello

I’m using a C++ framework which internally opens a PROOF session. The code can be found at http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/groups/sframe/SFrame/core/src/SCycleController.cxx?revision=1.6.2.7&view=markup&pathrev=SFrame-PROOF-branch
Right now I’m using PROOF-Lite only.

In SCycleController::InitProof it opens a PROOF connection through TProof::Open and runs the analysis on a given set of files. The SCycleController manages many instances of such an analysis, so it can keep using the PROOF session that was opened first.

Over time, though (say after 5 hours of continuous running), the process accumulates a large amount of memory. This is possibly due to my code, but maybe due to PROOF, so I’d like to close the session whenever it is reasonable to do so.

The shutdown is performed like this:

[code]TProofMgr* mgr = m_proof->GetManager();
delete m_proof;
delete mgr;[/code]

which is OK, but reopening after the shutdown, again through TProof::Open, crashes with the following:

[code]TUnixSystem::Di… : bus error
/Users/akira/Analysis/SFrame/dilepton/config/90186: No such file or directory.
Attaching to process 90186.
Reading symbols for shared libraries . done
Reading symbols for shared libraries … done
0x91e72189 in wait4 ()

========== STACKS OF ALL THREADS ==========

Thread 1 (process 90186 thread 0x10b):
#0 0x91e72189 in wait4 ()
#1 0x91e6fcd4 in system$UNIX2003 ()
#2 0x008c5ee1 in TUnixSystem::StackTrace ()
#3 0x008c9865 in TUnixSystem::DispatchSignals ()
#4 0x008c99d8 in SigHandler ()
#5
#6 0x02be10d0 in TProofMgr::GetListOfManagers ()
#7 0x02be2a2d in TProofMgr::Create ()
#8 0x02bb05ab in TProof::Open ()
#9 0x0015e335 in SCycleController::InitProof (this=0xbfffef64, server=@0xbfffecfc) at src/SCycleController.cxx:649
[/code]

I suspected that a plain delete is not enough to shut down PROOF and release everything, but after trying combinations of the following commands I still see the crash:

[code]m_proof->Close();
TProofMgr* mgr = m_proof->GetManager();
mgr->ShutdownSession(m_proof);
mgr->Reset();[/code]

If I quit the execution of SCycleController (at which point m_proof is deleted) and then start it again immediately, I don’t see this problem, but I would like to have the ability to run continuously.

Could you help me with this?

Thanks
Akira

Dear Akira,

There are a couple of problems with list registration which I am currently fixing.
As a workaround, this should work

You will get one or two spurious error messages from GetServiceByName, which I will also fix.

Let me know if this helps.

Gerri

Hi Gerri,

I tried your suggestion (since I don’t know too much of the inner workings of PROOF, I just used your lines literally), but it didn’t really help…

Actually, the situation is even a bit more complicated. Akira told me a while ago that he was seeing this crash. But so far I couldn’t reproduce it with our code. For me the SFrame code is always successful in reopening PROOF-lite sessions. It is even successful when I ask it to first run something on a “real” PROOF cluster, then close that connection, and open a PROOF-lite connection for some more processing.

However when I want to reopen the connection to a real PROOF cluster, just after closing it, the connection hangs indefinitely. I noticed that if I let my code sleep for 2 seconds (using sleep(2)) between closing the connection and trying to re-open it again, then it works. But if I don’t, then the re-opening just hangs, waiting for something.
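The pause described above can be enforced in one place instead of scattering sleep() calls through the code. The following is only a minimal sketch: the class ReopenGuard and its methods are hypothetical names, not part of ROOT or SFrame, and a fixed pause is an observed workaround from this thread, not a confirmed fix for the hang.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of ROOT or SFrame): remembers when the last
// session was closed and blocks a reopen until a minimum pause has elapsed.
class ReopenGuard {
public:
    using Clock = std::chrono::steady_clock;

    explicit ReopenGuard(std::chrono::milliseconds minPause)
        : m_minPause(minPause), m_hasClosed(false) {
        assert(minPause.count() >= 0);
    }

    // Call right after the session has been shut down.
    void MarkClosed() {
        m_lastClose = Clock::now();
        m_hasClosed = true;
    }

    // Call right before reopening. Sleeps only for the time still missing
    // from the minimum pause; returns true if it actually had to wait.
    bool WaitUntilReady() const {
        if (!m_hasClosed) return false;
        const auto ready = m_lastClose + m_minPause;
        if (Clock::now() >= ready) return false;
        std::this_thread::sleep_until(ready);
        return true;
    }

private:
    std::chrono::milliseconds m_minPause;
    Clock::time_point m_lastClose;
    bool m_hasClosed;
};
```

In a controller this would mean calling MarkClosed() at the end of the shutdown code and WaitUntilReady() at the top of the function that opens the session, with the pause set to the 2 seconds that were enough here.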

So I tried your lines, writing this to close the connection completely:

[code]TProofMgr* mgr = m_proof->GetManager();
mgr->DetachSession( 1, "S" );
gROOT->GetListOfProofs()->Clear();
delete m_proof;
delete mgr;

m_proof = 0;[/code]

But now my code crashes violently when trying to close the connection. The backtrace is pretty complicated (for some reason my program seems to run 5 threads at the time of the crash), but I see that the crash is initiated by the “delete m_proof;” call.

Do I really not need to delete the TProof object? Is there maybe something assumed in the TProof destructor which is not met when executing these two extra lines?

All in all, I seem to be able to run my code relatively fine on my SL5 machine, using GCC 4.3. But I’d like to understand why Akira is seeing his crash.

Cheers,
Attila

Hi Attila,

The situation with closing PROOF sessions was messy because of bad handling of the internal lists where we need to register the objects.
These problems should have been fixed in the trunk following Akira’s report, and we can certainly consider including the fixes in the 5-22-00-patches so that they are available in 5-22-00c, if that helps.

However, the sequence that I was proposing works for me with vanilla 5-22-00, both for a real cluster and for PROOF-Lite. Last week I stress-tested it a hundred times within valgrind without problems. The only possible annoyance is the message about the missing rootd service (which depends on the local setup in /etc/services).

As for your implementation of my workaround, you should not delete the TProof object ‘m_proof’, because this is already done by the call to ShutdownSession.
So if you replace ‘delete m_proof’ with ‘m_proof = 0’ you should get the correct behaviour.
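The ownership rule behind this advice can be illustrated without ROOT. The sketch below uses mock classes (MockSession, MockManager and CloseSession are invented for the illustration and are not the ROOT API); it only assumes what is stated above, namely that the manager destroys the session object when the session is shut down, so the client must reset its pointer rather than delete it.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Mock stand-ins, NOT the ROOT classes: they only model who owns the session.
struct MockSession {
    explicit MockSession(int* alive) : m_alive(alive) { ++*m_alive; }
    ~MockSession() { --*m_alive; }
    int* m_alive;  // external counter of live session objects
};

class MockManager {
public:
    // The manager creates the session and keeps ownership of it.
    MockSession* Open(int* alive) {
        m_sessions.push_back(std::unique_ptr<MockSession>(new MockSession(alive)));
        return m_sessions.back().get();
    }

    // Models the behaviour described above: shutting a session down also
    // destroys the session object.
    void ShutdownSession(MockSession* s) {
        for (auto it = m_sessions.begin(); it != m_sessions.end(); ++it) {
            if (it->get() == s) {
                m_sessions.erase(it);
                return;
            }
        }
    }

private:
    std::vector<std::unique_ptr<MockSession>> m_sessions;
};

// Client-side shutdown: hand the session back, then drop the stale pointer.
inline void CloseSession(MockManager& mgr, MockSession*& session) {
    if (!session) return;
    mgr.ShutdownSession(session);
    session = nullptr;  // no 'delete session': the manager already destroyed it
}
```

Adding a 'delete session' after ShutdownSession in this model would destroy the same object twice, which is the kind of failure seen at ‘delete m_proof’.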

Please let me know how it goes if you try it.

Cheers, Gerri

Hi Gerri,

Well, if I don’t delete my TProof object by hand, but let your line take care of it, then the crash indeed disappears. But I still see the problem that the second PROOF connection never gets opened… :-/

Looking through the xrootd logs, I see that I get the following message before my job “freezes”:

… cms_Finder: Waiting for cms path /tmp/.olb/olbd.admin

This message gets repeated a few times, then after about 4-5 minutes my program crashes, while the server’s log says that it couldn’t touch a certain file in /tmp/. As I said, this problem disappears if I let the server rest for a few moments after the first connection has been closed. I could send you more detailed log files if you’re interested.

Cheers,
Attila