TMessage::What() = 65 (!)

christos · April 1, 2005, 6:03pm

Hello,

I am running out of ideas on how to debug the following problem.

I have a TServerSocket that accepts connections from clients. The code is based on the example given here:
root.cern.ch/root/html/examples/hserv.C.html
except that I do not close the TServerSocket (I allow for many clients to connect).

The problem shows up when client and TServerSocket are on different machines, but not when running on the same machine (i.e. “localhost”).

When they run on the same machine, the program works as expected. When running on different machines, I manage to exchange a few TMessages, but very soon I’m running into trouble. It looks like TMessage “corruption”, but I don’t know much more than that.

I tried looking at the message type that the client sends (“type sent”) and the one that TServerSocket receives (“type recv”), as well as the total # of bytes received by TServerSocket for that socket ("# of bytes").

This is the order of events when client and TServerSocket run on the same machine:

      type sent      type recv   # of bytes
(a) kMESS_STRING   kMESS_STRING     28
(b) kMESS_STRING   kMESS_STRING     60
(c) kMESS_STRING   kMESS_STRING    110
(d)     10012         10012        118     (Note: "homemade" message/integer)

Some more details (that may or may not be relevant):

TServerSocket connects with client #2 at this point. A number of kMESS_STRING messages are exchanged w/o problems. TServerSocket then sends two messages of type “kMESS_STRING” and one message of type “10012” to client #1 w/o problems. Client #1 receives them as expected (confirmed by printouts). At this point, client #1 resumes sending stuff to TServerSocket:

      type sent      type recv      # of bytes
(e) kMESS_STRING   kMESS_STRING    139
(f) kMESS_OBJECT   kMESS_OBJECT  12286   (Note: contains one 1D + one 2D histograms)
(g) kMESS_STRING   kMESS_STRING  12310
(h) kMESS_OBJECT   kMESS_OBJECT  15295   (Note: contains four 1D histograms)

When client #1 moves to a different machine, everything works fine (and the same) up to step (f), inclusive. When I try to send (g), I get the following:

(g') kMESS_STRING     65         25602  (with TMessage::GetClass() = 0)

or occasionally

(g'') kMESS_STRING    ----       ----          null TMessage

Client #1 keeps sending stuff to TServerSocket w/o any complaints. In case (g’), TServerSocket does not know how to proceed (there is no message with type 65!). In case (g’’), TServerSocket thinks that the client has been disconnected (even though it has not).

Unfortunately, I have not managed to reproduce my problem with a smaller set of macros. Any ideas would be greatly appreciated. In particular, does anything in the functionality of TSocket change when I switch from

TSocket *sock = new TSocket("localhost", 9090);
to
TSocket *sock = new TSocket("mymachine.mydomain", 9090);

that my code does not take into account?

What else should I be checking?

Thanks a lot!

–Christos

PS which root
/afs/cern.ch/cms/external/lcg/external/root/3.10.02/slc3_ia32_gcc323/root/bin/root

christos · April 1, 2005, 6:25pm

Hi again.

Ok, some more information. I told a little lie when I said that client #1 keeps sending histograms to TServerSocket without complaints, even when running on different machines. I checked the return value of TSocket::Send(), and I get “-4”, which according to documentation means

“Returns -4 in case of kNoBlock and errno == EWOULDBLOCK.”

Client #1 does run in kNoBlock mode. What does “EWOULDBLOCK” mean? That the TServerSocket is not listening, and therefore the message gets lost? (I can’t find a reference to EWOULDBLOCK).

How do I make sure that TServerSocket will receive the message? Why is this not an issue when they both run on the same machine?

Cheers,

–Christos

christos · April 1, 2005, 9:48pm

Here’s another update:

After a couple of failed attempts to use kMESS_ACK (I may not know how to use this properly - is there an example?), I decided to try this: switch to blocking-mode before sending the TObjects and back to non-blocking-mode immediately after that.

Somehow, this seems to have solved the problem. I’m still puzzled about why this happens though. Does the size of the objects that are sent over to TServerSocket have something to do with my failure to send them? Can some expert please explain how the non-blocking mode works for small & large objects? I would like to know if I’d better do this for all objects (regardless of size) when sending, and switch back to non-blocking mode when receiving.

Thanks!

–Christos

ganis · April 2, 2005, 5:06pm

Hi Christos,

By default ROOT sockets are created in blocking mode and the hserv.C / hclient.C tutorial macros assume this.
If you choose to run your clients in non-blocking mode your code should be able to handle the EWOULDBLOCK error condition. This is equivalent to EAGAIN, and it means that the requested operation would block the process: it is up to the caller to decide what to do when this happens.
In your case, given that the client and server are synchronized, you should retry the same operation, which is equivalent to running the clients in blocking mode.

The reason you see the problem only when the client and the server are on different machines is probably that the “operation would block” condition is more likely to occur in that case. You’ll probably encounter the same problem with a very large number of clients on the local machine.
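To make the "retry the same operation" advice concrete, here is a minimal sketch using plain POSIX sockets rather than TSocket (so none of this is the ROOT API; `set_nonblocking`, `TransferStats` and `transfer_with_retry` are hypothetical names). It pushes a large buffer through a non-blocking socket pair. A single `send()` may accept only part of the buffer and the next call may fail with EWOULDBLOCK/EAGAIN; the loop then lets the other end drain and resumes from the byte where it stopped. Whether a partially written TMessage is what produced the unknown type 65 is an assumption on my part, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <vector>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Put a descriptor into non-blocking mode (plain POSIX, not the TSocket API).
void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    int rc = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    assert(flags != -1 && rc != -1);
    (void)rc;
}

// Hypothetical bookkeeping for the demo below.
struct TransferStats {
    std::size_t sent;      // bytes accepted by send()
    std::size_t received;  // bytes read back on the other end
    int would_block;       // how often send() reported EWOULDBLOCK/EAGAIN
};

// Send `total` bytes over a local socket pair in non-blocking mode,
// retrying after every "would block" instead of giving up.
TransferStats transfer_with_retry(std::size_t total) {
    int fds[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    assert(rc == 0);
    (void)rc;
    set_nonblocking(fds[0]);
    set_nonblocking(fds[1]);

    std::vector<char> payload(total, 'x');
    std::vector<char> sink(65536);
    TransferStats st = {0, 0, 0};
    ssize_t r;

    while (st.sent < total) {
        ssize_t n = send(fds[0], payload.data() + st.sent, total - st.sent, 0);
        if (n > 0) {
            st.sent += static_cast<std::size_t>(n);  // may be a partial write
            continue;
        }
        if (n == -1 && (errno == EWOULDBLOCK || errno == EAGAIN)) {
            ++st.would_block;
            // Stand-in for the server: read whatever is pending, then retry.
            while ((r = recv(fds[1], sink.data(), sink.size(), 0)) > 0)
                st.received += static_cast<std::size_t>(r);
            continue;
        }
        break;  // a real error: stop instead of looping forever
    }
    // Read the tail that is still sitting in the socket buffer.
    while ((r = recv(fds[1], sink.data(), sink.size(), 0)) > 0)
        st.received += static_cast<std::size_t>(r);

    close(fds[0]);
    close(fds[1]);
    return st;
}
```

Calling `transfer_with_retry(8 * 1024 * 1024)` should report equal `sent` and `received` counts. In a real client the "drain" step would be replaced by waiting until the socket is writable again.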

Hope it helps.

Gerri

christos · April 2, 2005, 5:34pm

Hi ganis.

Thanks for the answer. I think I see what you mean by that.

My problem is that in my code I want to do the following: Clients and server occasionally send string messages to each other (but not on every cycle). Therefore, I assumed that I want a non-blocking mode, to be able to catch that message if it arrives, but move on if it doesn’t.

At the same time, client #1 regularly sends objects to the server, and the server will then send them over to client #2. I thought that I could switch to a blocking mode when it’s time to send a message (whether object or not).

If I understand correctly, a node should switch to blocking mode when it wants to send something, and to non-blocking mode when it wants to see if a message (that is not expected) has come. Does this make sense?

This seems to work in my code, except when it’s time for the server to send the objects to client #2. At that point, client #2 claims to have received a null message (and therefore assumes that the server has gone down), whereas the server claims that TSocket::Send returns a positive number (i.e. everything was sent out as it should be) and at the same time reports

SysError in <TUnixSystem::UnixRecv>: recv (Connection reset by peer)
Error in <TUnixSystem::RecvRaw>: cannot receive buffer

originating from client #2.

I don’t really understand this part…

Of course these problems show up when running on different machines. When running on “localhost”, everything makes sense…

ganis · April 2, 2005, 6:54pm

Hi Christos,

Yes, it may make sense if you can correctly handle the absence of a message in non-blocking mode. In particular, one should try to understand what “null message” means, and avoid shutting down the connection unless Recv returns -5 (which indicates that the counterpart went away).
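As an illustration of why "nothing received" must not be treated as "peer gone", here is a small sketch with plain POSIX sockets (not TSocket; `RecvState`, `try_recv`, `Observed` and `observe_recv_states` are hypothetical names). At this level a non-blocking `recv()` that finds no data fails with EWOULDBLOCK/EAGAIN, while a return value of 0 means the other side closed the connection. The mapping to the TSocket return codes mentioned in this thread (-4, -5) is only an analogy.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum class RecvState { Data, NothingYet, PeerClosed, Error };

// Classify the outcome of one recv() call on a non-blocking socket.
RecvState try_recv(int fd, char *buf, std::size_t len) {
    ssize_t n = recv(fd, buf, len, 0);
    if (n > 0) return RecvState::Data;
    if (n == 0) return RecvState::PeerClosed;   // orderly shutdown by the peer
    if (errno == EWOULDBLOCK || errno == EAGAIN)
        return RecvState::NothingYet;           // peer is alive, just silent
    return RecvState::Error;
}

struct Observed {
    RecvState idle;         // before anything was sent
    RecvState after_send;   // after the peer sent a string
    RecvState after_close;  // after the peer closed its end
};

Observed observe_recv_states() {
    int fds[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    assert(rc == 0);
    (void)rc;
    // Only the receiving end needs to be non-blocking here.
    int flags = fcntl(fds[1], F_GETFL, 0);
    fcntl(fds[1], F_SETFL, flags | O_NONBLOCK);

    char buf[16];
    Observed o;
    o.idle = try_recv(fds[1], buf, sizeof buf);

    ssize_t sent = send(fds[0], "go 0", 5, 0);  // 4 characters plus the NUL
    assert(sent == 5);
    (void)sent;
    o.after_send = try_recv(fds[1], buf, sizeof buf);

    close(fds[0]);
    o.after_close = try_recv(fds[1], buf, sizeof buf);
    close(fds[1]);
    return o;
}
```

Only the third state justifies tearing down the connection; the first one just means "move on and check again later".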

However, if I understand correctly what you are trying to do, a better solution could be to use Select with a short timeout:

// Check if there is something ready to be received
if (sock->Select(TSocket::kRead, 100) > 0) {
      TMessage *mess = 0;
      // Recv is not a static method: call it on the socket
      Int_t n = sock->Recv(mess);
      if (n > 0) {
          // Analyse the received message
          ...
          delete mess;   // the caller owns the received message
      } else if (n == -5) {
          // The server has gone ... cleanup
          delete sock;
          sock = 0;
          return ERROR;   // or whatever ...
      }
}

// Move on
...

where 100 is the timeout in millisec (you should of course tune the number to your needs). This does not require switching to non-blocking mode.
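For readers without a ROOT version that provides this method, the same idea can be sketched with the POSIX `select()` call directly. This is not the TSocket implementation; `wait_readable`, `SelectDemo` and `run_select_demo` are hypothetical names, and the 100 ms timeout is just the value used above. The socket stays in blocking mode throughout.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Wait up to timeout_ms for fd to become readable.
// Returns 1 if readable, 0 on timeout, -1 on error.
int wait_readable(int fd, long timeout_ms) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &readfds, nullptr, nullptr, &tv);
}

struct SelectDemo {
    int before;     // result of the check before anything was sent
    int after;      // result of the check after the peer sent a string
    ssize_t bytes;  // bytes read once the socket was reported readable
};

SelectDemo run_select_demo() {
    int fds[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    assert(rc == 0);
    (void)rc;

    SelectDemo d;
    d.before = wait_readable(fds[1], 100);      // nothing to read: times out

    ssize_t sent = send(fds[0], "go 0", 5, 0);  // 4 characters plus the NUL
    assert(sent == 5);
    (void)sent;

    d.after = wait_readable(fds[1], 100);       // data is pending now
    char buf[16];
    // Only call the (blocking) recv once select says it will not block.
    d.bytes = (d.after > 0) ? recv(fds[1], buf, sizeof buf, 0) : -1;

    close(fds[0]);
    close(fds[1]);
    return d;
}
```

The first check returns after the timeout with nothing to do, so the caller can simply move on; the second finds the pending string and reads it.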

Cheers, Gerri

christos · April 2, 2005, 10:07pm

Hi ganis.

I would love to try your idea, but I believe this requires root 4.0.x. For now, I’m stuck with 3.10.2. Is there any way of introducing a timeout in a TSocket-receive in my root version?

Thanks!

ganis · April 4, 2005, 9:35am

Hi Christos

Sorry, I wrongly assumed that you were using 4.02.00, where an explicit Select method for TSocket was first introduced. In version 3.10.02 you can use TMonitor to achieve the same functionality. Try the attached macro in the following way (assuming that you saved it in monitor.C):

1. In window #1 start a TServerSocket in a ROOT session

root [0] TServerSocket *ss = new TServerSocket(9090)
root [1] TSocket *s = ss->Accept()

2. Open a connection from another ROOT session (window #2) and run the macro

root [0] TSocket *sc = new TSocket("localhost",9090)
root [1] .x ../../test/monitor.C(sc)
Got a timeout

This is normal because there was nothing to read.
3. Now try sending something on the connection from window #1

root [2] s->Send("go 0")
(Int_t)5

4. Now re-run the macro in window #2: this time something should be found

root [2] .x ../../test/monitor.C(sc)
got: go 0

Once this works, you can adapt your construct to your needs. In the macro a new TMonitor is created each time: this is not needed, you can just create it once and add the socket; however, it is perhaps better to DeActivate the socket when not needed, to avoid spurious signals.

Hope it helps.

Gerri

monitor.C (956 Bytes)

christos · April 4, 2005, 7:46pm

Hi Gerri,

Thanks for going to the trouble of producing the macro. I’ll adapt it and let you know if I run into trouble.

I also found out how to switch to a newer ROOT version when compiling my (gcc) code, so I can also try your earlier suggestion.

Thanks again.

–C