We’re setting up a Tier3 facility in the same physical location as our Tier2. There’s a mass storage system (dCache) with many terabytes where the ROOT files will be stored. My question concerns the real need for, and the use of, another distributed storage system such as olbd or ofs.
As far as I understand, the way to tell each PROOF job which files to work with is to fill a TChain using chain.Add("file"). This file can be anywhere, for example (see the sketch after this list):
an external source in the form “root://server/directory”
a path relative to the user’s home “~/whatever/file”
a path in a storage shared with other workers “/proofpool/file”
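For illustration, a minimal sketch of how such a chain could be filled and handed to PROOF (the master name, server, paths, tree name and selector below are placeholders, not taken from any real setup):

// Open a PROOF session on a hypothetical master.
TProof::Open("master.example.org");

// Build a chain of a hypothetical tree "esdTree" from files in different locations.
TChain chain("esdTree");
chain.Add("root://xrootd.example.org//data/run123/file1.root"); // external xrootd server
chain.Add("~/whatever/file2.root");                             // relative to the user's home
chain.Add("/proofpool/run123/file3.root");                      // storage shared with the workers

// Hand the chain to PROOF and process it with a hypothetical selector.
chain.SetProof();
chain.Process("MySelector.C+");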
One question is: does the worker or the master do anything with remote files before reading them remotely? Doesn’t it copy them locally before processing?
Another one: who has to put the files into the /proofpool shared storage? The user, manually (so they need a home directory on the master to be able to copy files into /proofpool there)?
And THE question, which puts the last two together. I understand that the best way to run analysis jobs is to have the file locally on the worker; that’s why olbd is in charge of moving files to the different workers, and the master then launches each job on the machine where the file it processes resides. Is there a way to detach /proofpool from the user’s view (like a cache)?
I mean so that the user does not need to copy anything from the MSS to the /proofpool storage before launching analysis jobs: the user just says chain.Add("root://someserver/somefile") and the file is automatically copied from the MSS to /proofpool, with the file string converted to “/proofpool/somefile” if necessary. Am I dreaming? That’s something I’ve seen mentioned on the xrootd Wisconsin webpage but not described anywhere.
From the point of view of PROOF it’s quite simple.
Answer 1
PROOF just asks xrootd where the files are. It does not copy them from a mass storage system or move them around. Usually part of the files is stored on the PROOF workers (which by default are also xrootd servers); in that case the packetizer indeed tries to make the processing local. The file URIs in the chain must be such that TFile::Open can resolve them on the PROOF nodes.
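One quick way to check that last requirement is to try opening a URI directly before building the chain, from a node that can see the cluster’s xrootd; a minimal sketch with a placeholder redirector and path:

// If the redirector can resolve the file, TFile::Open returns a usable TFile.
TFile *f = TFile::Open("root://redirector.example.org//proofpool/run123/file1.root");
if (!f || f->IsZombie())
   Printf("URI could not be resolved from this node");
else
   Printf("OK: file opened, size %lld bytes", f->GetSize());
delete f;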
Answer 2
That is solution 1: users/admins have accounts on the cluster and copy the files in with xrdcp. Solution 2: xrootd is configured to automatically download files from the MSS.
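For solution 1, the same copy can also be done from inside a ROOT session instead of with the xrdcp command line. A minimal sketch, assuming a ROOT version that provides the static TFile::Cp helper; the dCache door and pool URLs are placeholders to be adapted to the local setup:

// Copy one file from the MSS (dCache xrootd door) into the PROOF pool storage.
TFile::Cp("root://dcache-door.example.org//pnfs/example.org/data/run123/file1.root",
          "root://pool-redirector.example.org//proofpool/run123/file1.root");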
Answer 3
You are not dreaming. It’s done for instance at ALICE CAF: aliceinfo.cern.ch/Offline/Analys … index.html
It’s a matter of writing a script which will download files from your MSS and plugging it into xrootd.
So, let’s see if I understood what happens in CAF:
The URIs used in the jobs are not changed at any time, and PROOF can be configured to stage files from the MSS when required. So, when writing an analysis program for a PROOF cluster (say one “master” and many “slaveXX”), the job should contain something like:
chain.Add("root://master/proofpool/somedirectory/somefile.root")
And then the master would decide what to do with the job:
If the file resides in a shared “/proofpool/somedirectory” on one of the slaves, then the job is sent to that slave for processing (depending on the queuing mechanism), so when the slave tries to access the file the xrootd master will redirect it to itself. Great.
If the file does not exist on any of the slaves (and here things start to get blurrier for me), then what happens? Does the master stage in the file from the MSS, with the file stored on a randomly chosen slave and the job then sent to that slave? Or is the job sent to any slave, and the slave is then in charge of downloading the file from the MSS? How is this done transparently?
I guess there will be a kind of script that translates directories this way:
/proofpool/somedirectory -> root://castor.cern.ch/nfs/data/atlas/an … edirectory and then downloads it to /proofpool/somedirectory, am I right?
Also, what happens when the shared filesystem is full? Is there a way to delete old files when the filesystem fills up, or does it have to be done by sysadmins?
I’ve read some of your presentations about PROOF, and the webpage you pointed me to. There are still some things that I don’t understand:
I’ve taken a look at the ESD82XX_30K.txt file where the URIs are stored (CAF tutorial). They look like: root://lxb6046.cern.ch//alien/alice/sim … iESDs.root
But that does not seem to be an address easily translatable to an MSS address. In the tutorial offline manual they say:
The CASTOR tree structure will look like "castor/cern.ch/…/‹Year›/‹Month›/‹Day›/‹Hour›/"
But then analysis addresses look like: "/sim/‹Year›/‹ProductionType›/‹RunNumber›/"
Is there any other translation between both? Maybe LFC?
Another thing that puzzles me is a section on the same webpage (aliceinfo.cern.ch/Offline/Analys … index.html), named “Staging files from AliEn”, that I don’t quite understand. Is it a manual procedure to stage files into the distributed storage from the MSS? If the files are supposed to be staged in from the MSS automatically when they don’t exist in the distributed storage, then why is this needed? Maybe to allow users to plan ahead and save time before they run their analysis (but it’s not mandatory)?
Also, what happens with the files created by the analysis jobs: are they also staged out to the MSS? Where?
I’ve also been looking for CAF’s PROOF config files without success. Do you know where to look for them? I’ve also seen this webpage: xrootd.slac.stanford.edu/examples/ but CAF is not there. There are references to external MSS with very little explanation (there is not really enough documentation anywhere I’ve looked to configure this), and a very important part is always missing: the scripts to stage in and out of the MSS. Any ideas?
Thanks for all this, I really appreciate the help. Tier3 sites are coming, and very soon there will be many people asking similar questions; this forum will become essential.
Hi again.
Well, maybe a more “technical” description would help. I mean, the presentations at that link (and others) talk about theory, and so far I’ve seen two different approaches:
CAF’s way: gProof->RegisterDataSet("myDS", ds), as shown in the slides you link to and in the CAF offline project. It’s supposed to be a manual procedure (from the user’s view) that relies on some daemons and staging scripts (AFAIK).
The “automatic” way, very briefly described on the xrootd Scalla webpage. Quoting verbatim from xrootd.slac.stanford.edu/doc/ofs … config.htm:
“By default, MSS support is not enabled. However, if you specify the mssgwcmd or stagecmd directives, then all paths not explicitly designated as notmigratable are assumed to reside, by some name transformation, in the Mass Storage System. Conversely, designating a path as migratable, or specifically specifying the check directive requires that you also specify the mssgwcmd and the stagecmd directives.
When MSS support is enabled, special files are created in the file system to coordinate the staging files from the Mass Storage System as well as migrating files to the Mass Storage System. Additionally, the Mass Storage system is consulted whenever a file is referenced but cannot be found on local disk.”
So, this really seems to be an automatic way of staging files in/out of the MSS. Is there any relation between the two methods?
Anyway, my real problem is not about theory. I’ve got a very small PROOF cluster set up with olbd, and I want to somehow attach that storage (olbd) to the big MSS in dCache.
So, it would be really nice to have some examples from sites that combine both storage systems. For the few sites that publish their xpd.cf files, the staging scripts are not published. Does anybody have them?
And also, for the CAF registration method there seems to be a number of daemons that take care of everything, but I haven’t seen them anywhere. Any ideas on how to get these scripts/daemons?
Thanks for your questions. You are right that there is not enough information on the web. So far we have mostly worked with physicists who had a cluster and some way of getting the data.
First thing:
Without the ALICE-specific daemons (running on CAF), a call to
gProof->RegisterDataSet("myDS", ds)
just saves ‘ds’ on the PROOF master. If you have xrootd configured to stage files from the MSS, you can just do gProof->VerifyDataSet("myDS") and the files should be staged. Otherwise it’s the admins’/users’ task to make sure the files from ‘ds’ are on the cluster.
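To make the sequence concrete, here is a minimal sketch of building, registering and verifying a dataset from a ROOT session; the master name, file URLs and selector are placeholders, not the CAF configuration:

// Open a PROOF session on a hypothetical master.
TProof::Open("master.example.org");

// Collect the input files into a TFileCollection.
TFileCollection *ds = new TFileCollection("myDS", "example dataset");
ds->Add(new TFileInfo("root://redirector.example.org//proofpool/run123/file1.root"));
ds->Add(new TFileInfo("root://redirector.example.org//proofpool/run123/file2.root"));

// Save the collection on the PROOF master under the name "myDS".
gProof->RegisterDataSet("myDS", ds);

// Check the files; if xrootd is configured for MSS staging, this triggers it.
gProof->VerifyDataSet("myDS");

// Later, process the dataset by name with a hypothetical selector.
gProof->Process("myDS", "MySelector.C+");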
Below I answer the PROOF questions from your previous post, but you asked many XROOTD questions for which either the xrootd team will answer on this forum or you should post the same questions to the xrootd mailing list: xrootd.slac.stanford.edu/xrootdlist.html. That is the best place.
In fact, before the processing we do a step called “lookup”: the URIs with the exact worker node names are obtained from the xrootd redirector. If a file is not present on the cluster, the redirector can either download it from the MSS (to the least loaded node) or refuse to serve the file, depending on the xrootd config. In the latter case, PROOF can either skip the file or abort processing, depending on the user preferences.
You are close.
For better performance, the unit of processing is a packet (not a file/job). A packet is a fragment of a file. Packets are distributed dynamically by the packetizer with a strategy that optimizes the completion time. The details are published here: pos.sissa.it//archive/conference … AT_022.pdf
The cooperation with MSS fully depends on the configuration of xrootd, as described above.
A script can be used to automatically remove the least recently used files. Also, a new dataset manager is being prepared which will maintain quotas.
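Just as an illustration of the idea, a minimal sketch of such a cleanup written as a ROOT macro, assuming the pool is a plain directory visible on the node (the path and the number of files to keep are placeholders; in practice a site would more likely run a cron shell script):

#include <algorithm>
#include <utility>
#include <vector>
#include "TList.h"
#include "TString.h"
#include "TSystem.h"
#include "TSystemDirectory.h"
#include "TSystemFile.h"

// Delete the oldest files in a pool directory until at most 'keep' files remain.
void cleanup_pool(const char *dir = "/proofpool", Int_t keep = 1000)
{
   TSystemDirectory pool("pool", dir);
   TList *files = pool.GetListOfFiles();
   if (!files) return;

   // Collect (modification time, path) pairs for regular files.
   std::vector<std::pair<Long_t, TString> > entries;
   TIter next(files);
   while (TSystemFile *sf = (TSystemFile *) next()) {
      if (sf->IsDirectory()) continue;
      TString path = TString::Format("%s/%s", dir, sf->GetName());
      Long_t id, flags, modtime;
      Long64_t size;
      if (gSystem->GetPathInfo(path, &id, &size, &flags, &modtime) == 0)
         entries.push_back(std::make_pair(modtime, path));
   }
   delete files;

   // Oldest first; unlink files until only 'keep' of them are left.
   std::sort(entries.begin(), entries.end());
   for (size_t i = 0; i + keep < entries.size(); ++i)
      gSystem->Unlink(entries[i].second);
}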
Staging from CASTOR can take a long time and PROOF is designed to be interactive, so prestaging is a good idea. In fact it is an ALICE-specific daemon which does the prestaging.
We are working on a good solution for that. By default you just get the output back to the client machine.
You are right.
I hope this helps you put the parts together.