Bad_alloc while merging

I got a bad_alloc while merging object ‘PROOF_TOutputListSelectorDataMap_object’. I’m merging tons (26600) of objects (TH1F, TH2F, TProfile). If I look at the memory usage from the query progress I see that every worker is using 1.8 GB without particular variation during the process. I tried to decrease the number of workers, and to use / not use mergers. The last time I used 8 workers on a cluster of 3 machines for a total of 24 CPUs. PROOF split the jobs equally: 3+3+2. From the ganglia monitoring I can see that the master is using all the memory (16 GB) plus a large amount of virtual memory. In particular, the memory usage remains stable during processing, increases at the end of processing and remains stable during merging. On the other machines the memory usage always remains stable.

Do you have some advice? Is the only solution to reduce the number of objects used (and re-run PROOF several times)?

I have a more general question about PROOF. Why does every worker have a copy of the objects? Think about the histograms, for example: why do I need one replica of the same histogram for every worker? In particular, if the workers are on the same machine, why can’t they use shared memory to store the objects and call Fill on the same variable? I understand that your approach (splitting / merging) is simpler, but I need 1.8 GB x #workers!

Hi,

I have not understood if using sub-mergers helped or not. Others found it helped a lot especially for histograms.

General advice is difficult. The problem of output merging is an open one and everybody suffers from it. In your case, the 26600 histograms are surely not there to be browsed one by one (I can’t believe that!); you are probably using them as a data structure for further analysis. In that case I would try to understand whether a better structure exists. If, for example, some variables appear in many TH2s correlated with different other ones, perhaps a TTree would fit better as an intermediate structure. But it is difficult to say without the details of the problem.
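For example, something along these lines (just a sketch, variable names invented):

```cpp
// Sketch: instead of one TH2 per pair of correlated variables, write the
// variables once per entry into a small intermediate TTree and make the
// 2D projections later on demand.
TFile *fout = TFile::Open("intermediate.root", "RECREATE");
TTree *vars = new TTree("vars", "correlated variables");
Float_t e, eta, q1, q2;                 // invented example variables
vars->Branch("e",   &e,   "e/F");
vars->Branch("eta", &eta, "eta/F");
vars->Branch("q1",  &q1,  "q1/F");
vars->Branch("q2",  &q2,  "q2/F");
// ... in the event loop: set e, eta, q1, q2 and call vars->Fill() ...
// later, any correlation can be projected when needed, e.g.
//   vars->Draw("q2:q1", "TMath::Abs(eta)<0.5 && e>10");
fout->Write();
```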

Because PROOF does multi-process parallelism and each process has its own output. In this respect there is no difference wrt multiple batch or grid jobs.

This is in principle a nice idea, but unfortunately C++ objects have parts specific to the process (e.g. the virtual table), so it will not work; using the shared memory to stream an object back and forth will not solve the problem. Perhaps something can be done in this direction to (at least) share the largest part of the object (the internal array, in case of histograms), but this needs investigation, and it will be object specific.
I have seen that in ‘boost’ they have a way to create containers in shared memory. Perhaps that could be investigated.

Gerri

[quote=“ganis”]Hi,

I have not understood if using sub-mergers helped or not. Others found it helped a lot especially for histograms.
[/quote]
It seems not; in particular it seems that with sub-mergers the amount of memory used increases.

What is PROOF_TOutputListSelectorDataMap_object? For simplicity I save all objects inside fOutput using a loop. Is it a very big object?
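Roughly like this (the container name is just an example from my code):

```cpp
// In SlaveBegin(), after booking: everything goes into fOutput in one loop
// (fHistos is an invented name for my container of booked histograms)
for (TH1 *h : fHistos)
   fOutput->Add(h);
```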

I tried commenting out all the histogram declarations and all the processing instructions; the only variables declared are the ones in the TTree, about 5000 floats plus their pointers, < 100 kB in principle. Why, in this configuration, does every slave need 200 MB?

No, a TTree is not good for me. I start from a TTree and with my PROOF program I fill a lot of histograms. Then I have another program that fits these histograms and does other things. I start from single-particle MCs; I have 14 MCs, with 14 different energies. I need to fill one histogram for every quantity, for every energy and for every eta cell (there are 100 eta cells). Sometimes the quantities are two-dimensional.
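The booking looks roughly like this (names invented), which is where the 26600 objects come from:

```cpp
// Rough sketch of the booking (names invented): one histogram per quantity,
// per MC energy point (14) and per eta cell (100)
for (int ie = 0; ie < 14; ++ie) {
   for (int icell = 0; icell < 100; ++icell) {
      TString name = TString::Format("h_resp_E%d_eta%d", ie, icell);
      fOutput->Add(new TH1F(name, name, 100, 0., 2.));
      // ... repeated for every quantity, some of them as TH2F ...
   }
}
```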

One solution is to run one PROOF session for every energy (1/14 of the objects), because the MC datasets are divided by energy. But at the end I would need to merge some histograms manually, because I need some global histograms that do not look at the true energy.
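I suppose that final merge could then be done offline with something like this (file names invented):

```cpp
#include "TFileMerger.h"

// Hypothetical offline merge of the per-energy outputs: histograms with the
// same name are added across the input files.
void mergeEnergies() {
   TFileMerger merger;
   merger.OutputFile("all_energies.root");
   merger.AddFile("output_E05.root");
   merger.AddFile("output_E10.root");
   // ... one AddFile per energy point ...
   merger.Merge();
}
```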

I’ve reduced the amount of memory by using TH1S instead of TH1F; now I need 1.1 GB for every slave instead of 1.8.

Is there a way to tell the slaves to save objects to files instead of keeping them in memory?

[quote=“ganis”]

Because PROOF does multi-process parallelism and each process has its own output. In this respect there is no difference wrt multiple batch or grid jobs.

This is in principle a nice idea, but unfortunately C++ objects have parts specific to the process (e.g. the virtual table), so it will not work; using the shared memory to stream an object back and forth will not solve the problem. Perhaps something can be done in this direction to (at least) share the largest part of the object (the internal array, in case of histograms), but this needs investigation, and it will be object specific.
I have seen that in ‘boost’ they have a way to create containers in shared memory. Perhaps that could be investigated.
Gerri[/quote]

Do you mean Boost? Like Boost.Interprocess?
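Something like this, I guess (untested sketch, names invented):

```cpp
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>

namespace bip = boost::interprocess;

// Untested sketch: the bin contents of a "histogram" live in a named
// shared-memory segment, so all workers on the same node fill the same array.
typedef bip::allocator<float, bip::managed_shared_memory::segment_manager> ShmAlloc;
typedef bip::vector<float, ShmAlloc> ShmVector;

int main() {
   bip::managed_shared_memory segment(bip::open_or_create, "HistoBins", 1 << 20);

   // Every worker finds (or creates) the same vector by name
   ShmAlloc alloc(segment.get_segment_manager());
   ShmVector *bins = segment.find_or_construct<ShmVector>("h_pt")(alloc);
   if (bins->empty()) bins->resize(100, 0.f);

   // "Fill" the shared bins (a real version would need an interprocess
   // mutex around this)
   (*bins)[42] += 1.f;

   return 0;
}
```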

In the workers that become mergers … anyhow, let’s leave this for the moment.

This is an internal object used for automatic member mapping: it fills the members of the client selector object automatically from the output list, so that you do not have to do FindObject in Terminate. The TMap contains a TNamed object per selector member; in your case, 26600 * ~100 bytes = ~2.7 MB, I would say … if you run out of memory even this is an issue. I will ask the author of this to add an option to disable the functionality.

The TTree declaration will only use space when you load entries, but each entry re-uses the space of the previous one, so that should not be an issue. What libraries do you load for your analysis? That could explain the 200 MB …

As I wrote, it depends case by case …

Yes, have a look at root.cern.ch/drupal/content/hand … root-files .
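The basic pattern (see e.g. the ProofSimpleFile tutorial; the member names below are just for illustration) is something like:

```cpp
// Sketch of the TProofOutputFile pattern: each worker writes its histograms
// into a local ROOT file and only the small file descriptor travels through
// the output list; fProofFile and fFile are selector data members.
// In SlaveBegin():
fProofFile = new TProofOutputFile("worker_histos.root");
fFile = fProofFile->OpenFile("RECREATE");   // histograms booked after this
                                            // are attached to the local file
// ...
// In SlaveTerminate():
if (fFile) {
   fFile->cd();
   fFile->Write();
   fOutput->Add(fProofFile);   // the master merges the per-worker files
   fFile->Close();
}
```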

Yes