TTree Close failure under Hadoop MapReduce

Hi,
I am trying to get a simple mapreduce job running using the Hadoop Pipes API that uses Root 5.32 to store some vectors in a TTree. The job fails at the point where Root attempts to close the TTree file and I seek any input in resolving the problem.
The mapreduce task receives binary data – essentially vectors, performs a transform on the vector, and writes the resulting vector to a Root TTree that is stored in a file. The job fails at the point where Root tries to write the data to a file – the Hadoop+Root job crashes shortly after the “Close” method of the TTree is called.

This same code completes successfully when not run in the MapReduce framework.

It looks like some memory is being clobbered at the point where Root cleans up after the file close.

I have the stack trace and code details below. The alternatives I can see are:

  1. Run the Root I/O and computation in a separate process that is
    fork/exec-ed; this separation might remove the memory conflicts that seem to be happening between the two packages (see the sketch after this list).
  2. Perhaps this is a “known” problem associated with an improperly configured build of Root? This is under Ubuntu 11.10 with Root 5.32. Perhaps a re-compile of Root will help?
  3. What else?
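
To illustrate option 1, the kind of isolation I have in mind is roughly the following (just a sketch; root_io_worker is a hypothetical separate executable that would contain the Root/TTree code):

[code]#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork, then exec a separate worker binary that performs the Root I/O,
// so any memory conflict stays confined to the child's address space.
int runRootIOIsolated(const char* rootFileName) {
  pid_t pid = fork();
  if (pid < 0) return -1;                        // fork failed
  if (pid == 0) {                                // child: run the worker
    execl("./root_io_worker", "root_io_worker", rootFileName, (char*)0);
    _exit(127);                                  // exec failed
  }
  int status = 0;                                // parent: wait for the worker
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}[/code]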

Here is the Hadoop MapReduce debug trace:
A. Stack trace.

attempt_201202230611_0003_m_000000_1: #5 <signal handler called>
attempt_201202230611_0003_m_000000_1: #6 0x00007f4ff0989d49 in free () from /lib/x86_64-linux-gnu/libc.so.6
attempt_201202230611_0003_m_000000_1: #7 0x00007f4ff2061d44 in TTree::~TTree() () from /root-5.32.00/lib/libTree.so
attempt_201202230611_0003_m_000000_1: #8 0x00007f4ff2061ed9 in TTree::~TTree() () from /root-5.32.00/lib/libTree.so
attempt_201202230611_0003_m_000000_1: #9 0x00007f4ff17dbd99 in TCollection::GarbageCollect(TObject*) () from /root-5.32.00/lib/libCore.so
attempt_201202230611_0003_m_000000_1: #10 0x00007f4ff17df765 in TList::Delete(char const*) () from /root-5.32.00/lib/libCore.so
attempt_201202230611_0003_m_000000_1: #11 0x00007f4ff1234711 in TDirectoryFile::Close(char const*) () from /root-5.32.00/lib/libRIO.so
attempt_201202230611_0003_m_000000_1: #12 0x00007f4ff1246883 in TFile::Close(char const*) () from /root-5.32.00/lib/libRIO.so
attempt_201202230611_0003_m_000000_1: #13 0x0000000000427c33 in ProcessFile::transformRow (this=0x20442d0, arraySize=12, row=0x7fffac1a1160, rootFileName=0x2044268 "/hadoop-distro/rootFile_2.root") at ProcessFile.cpp:64

B. Code.
The above line ProcessFile.cpp:64 is at the end of the following method:

[code]void ProcessFile::transformRow(int arraySize, float row[], const char* rootFileName) {
  std::cout << "ProcessFile::transformRow creating matrixes and vectors " << std::endl;
  VectorXf v(arraySize), o(arraySize);
  createdMatrix = MatrixXf::Random(arraySize, arraySize);
  eventArray = new Float_t(arraySize);
  std::cout << "ProcessFile::transformRow creating local root file " << std::endl;
  TFile local(rootFileName, "recreate");

  std::cout << "ProcessFile::transformRow creating root structures " << std::endl;
  aTree = new TTree("test_tree", "simple_event_tree");
  copyDoubleIntoArray(row, v);

  // Using Eigen math library perform multiply that models the transform
  o = createdMatrix*v;
  std::cout << "ProcessFile::transformRow copying into Root structures " << std::endl;
  aTree->Branch("arraySize", &arraySize, "arraySize/I");
  aTree->Branch("arrays", eventArray, "eventArray[arraySize]/F");
  std::cout << "ProcessFile::transformRow copying into Eigen structure and do transform " << std::endl;
  copyValueToRootFile(o, aTree, eventArray);
  // ** Looks like this is where the problem happens **
  local.Close();
}[/code]

The final line, local.Close(), triggers the error. Further, if I comment out the call copyValueToRootFile(o, aTree, eventArray); then the crash does not occur.

C.
Looking at the copyValueToRootFile method:

[code]void ProcessFile::copyValueToRootFile(VectorXf& v, TTree* aTree, Float_t* eventArray) {
  std::cout << "ProcessFile::copyValueToRootFile: Verifying values for Vector" << std::endl;
  for (int i = 0; i < v.size(); i++) {
    std::cout << "O(" << i << "): " << v(i) << std::endl;
    eventArray[i] = v(i);
  }
  // Save the data to disk
  aTree->Fill();
  aTree->FlushBaskets();
  aTree->Reset();
}[/code]

As long as there is no assignment to the eventArray elements, there is no crash. That is, if I comment out the line
eventArray[i] = v(i);
then the MapReduce task completes.
So it seems fair to say that there is some memory or I/O issue associated with writing to the data backing aTree.

Hi,

Quick question: why did you use both aTree->FlushBaskets(); and aTree->Reset();? Are you sure that is what you really want/need?

Philippe.

I had thought that Reset might help, but I believe the correct thing would just be to do Flush.

Hi,

‘Reset’ would tell the TTree to ‘lose’ all of its existing information. This is usually only used in monitoring applications, to reuse the same TTree to present just ‘recent’ information.

‘FlushBaskets’ makes sure that the data is written to disk but does not write the meta data. It is usually sub-optimal to call it directly (resulting in a more fragmented file and a larger amount of meta data).

In ‘ProcessFile::transformRow’, the call to Close is superfluous: it is implicitly executed as part of the TFile destructor, and that destructor is itself implicitly called at the end of the function since the TFile was allocated on the stack.

A very important step is missing: local.Write(); which is necessary to ensure the meta data is stored in the file.
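
For illustration, using the names from your transformRow, the usual pattern is roughly:

[code]TFile local(rootFileName, "recreate");
aTree = new TTree("test_tree", "simple_event_tree");
aTree->Branch("arraySize", &arraySize, "arraySize/I");
aTree->Branch("arrays", eventArray, "eventArray[arraySize]/F");
// ... fill eventArray, then for each entry:
aTree->Fill();
// Write the tree and its meta data into the file; the TFile destructor
// (or an explicit Close) then takes care of closing the file.
local.Write();[/code]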

None of this explains the segmentation fault. Can you run your example with valgrind to pinpoint the problem?

Cheers,
Philippe.

Maybe you could try to replace:
TFile local(rootFileName, "recreate");
with:
TFile *local = new TFile(rootFileName, "recreate");
and then replace:
local.Close();
with:
aTree->Write();
delete aTree;
delete local;
(And, of course, do not do "aTree->FlushBaskets();" nor "aTree->Reset();".)
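
Putting those changes together, the end of transformRow would then look roughly like this:

[code]TFile *local = new TFile(rootFileName, "recreate");
aTree = new TTree("test_tree", "simple_event_tree");
// ... create the branches and fill the tree as before ...
aTree->Write();   // writes the tree and its meta data into the file
delete aTree;
delete local;     // closes the file[/code]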

Thanks!
I will run with valgrind shortly.
I did try the suggested changes:
// local.Close();
aTree->Write();
delete aTree;
delete local;
but the program now segfaults at the Write call:
attempt_201202230611_0006_m_000001_2: #5 <signal handler called>
attempt_201202230611_0006_m_000001_2: #6 0x00007f9b0a83c15f in TObject::Write(char const*, int, int) const () from /root-5.32.00/lib/libCore.so
attempt_201202230611_0006_m_000001_2: #7 0x0000000000427bdf in ProcessFile::transformRow (this=0x22d7290, arraySize=12, row=0x7fff90b27310, rootFileName=0x22d7268 "/hadoop-distro/rootFile_1.root") at ProcessFile.cpp:66

Philippe and Pepe,
Thanks. Mysteriously, when I run it in valgrind, the error does not occur. I attach the valgrind results.
The only difference I can see is that in the segfault cases the executable was stored in the Hadoop DFS cache, whereas when I ran with Valgrind the executable was used directly from the native filesystem. I am checking whether anything is going on with storing it in the HDFS cache.
Charles
ValgrindResults.txt (148 KB)

Now I have confirmed that the program crashes when NOT called within valgrind (local or hdfs, doesn’t matter), but does not crash when called within valgrind.
Any thoughts? I will look carefully at the valgrind output.
Charles

Hi,

The first 5 valgrind reports are real problems in your code. They (especially the first 3) will lead to undefined behavior. Can you correct them and retry? (I.e., ProcessRow.cpp:67 reads too much and ProcessFile.cpp:36 writes too much.)

Cheers,
Philippe.

That did the trick. I should say, at least fixing the allocation and de-allocation of the array into which the byte data is read, and of the Float_t array that is stored in the TTree, seems to prevent the crash.
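
For reference, the essential change for the Float_t array was along these lines (sketch):

[code]// Before: allocates a single Float_t whose initial value is arraySize
// eventArray = new Float_t(arraySize);

// After: allocate an array of arraySize Float_t's for the branch buffer
eventArray = new Float_t[arraySize];

// ... and release it later with the matching array form
delete[] eventArray;[/code]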
There are still two lines in the original Hadoop distribution that can be fixed by changing delete to delete[], and I will file a JIRA for the correction. These are noted by:

==20394== Memcheck, a memory error detector
==20394== Copyright © 2002-2010, and GNU GPL’d, by Julian Seward et al.
==20394== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==20394== Command: /hadoop-distro/ProcessRow
==20394==
==20394== Mismatched free() / delete / delete []
==20394== at 0x4C27FF2: operator delete(void*) (vg_replace_malloc.c:387)
==20394== by 0x4328A5: HadoopPipes::runTask(HadoopPipes::Factory const&) (HadoopPipes.cc:1171)
==20394== by 0x424C33: main (ProcessRow.cpp:118)
==20394== Address 0x9c5b540 is 0 bytes inside a block of size 131,072 alloc’d
==20394== at 0x4C2864B: operator new[](unsigned long) (vg_replace_malloc.c:305)
==20394== by 0x431E5D: HadoopPipes::runTask(HadoopPipes::Factory const&) (HadoopPipes.cc:1121)
==20394== by 0x424C33: main (ProcessRow.cpp:118)
==20394==
==20394== Mismatched free() / delete / delete []
==20394== at 0x4C27FF2: operator delete(void*) (vg_replace_malloc.c:387)
==20394== by 0x4328AF: HadoopPipes::runTask(HadoopPipes::Factory const&) (HadoopPipes.cc:1172)
==20394== by 0x424C33: main (ProcessRow.cpp:118)
==20394== Address 0x9c7b580 is 0 bytes inside a block of size 131,072 alloc’d
==20394== at 0x4C2864B: operator new[](unsigned long) (vg_replace_malloc.c:305)
==20394== by 0x431E6A: HadoopPipes::runTask(HadoopPipes::Factory const&) (HadoopPipes.cc:1122)
==20394== by 0x424C33: main (ProcessRow.cpp:118)

You should not mix “new” with “delete[]” nor “new[]” with “delete”: http://www.cplusplus.com/reference/std/new/
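
For example (arr, one and n are just illustrative names):

[code]Float_t *arr = new Float_t[n];   // array form: new[]
delete[] arr;                    // must be paired with delete[]

Float_t *one = new Float_t(n);   // single object, initialized to n
delete one;                      // paired with plain delete[/code]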

Yes, I have now corrected the blunder in my code: new[] is now matched with delete[] and new with delete. The Hadoop library, however, contains a new[] paired with a plain delete.
C