Build a DataFrame from a vector of strings

Hi,

I am building a DataFrame from a tree and I can’t figure out how to access the columns of the tree from the filter I have defined. The data I want to fill the TTree with comes from a vector of strings where each string has the form sensor_band_date. The data frame should have 3 columns (sensor, band, date). Here is my code:

using TDF = ROOT::Experimental::TDataFrame;

void MakeFeatureDataFrame(const std::vector<std::string>& features, TTree* tree)
{
  tree = new TTree("features","features from column names");
  struct Feature : public TObject{
    const char* sensor;
    const char* band;
    Int_t date; 
  };

  Feature feat{};

  tree->Branch("features", &feat, "sensor/C:band/C:date/i");
  for(const auto& feature : features)
    {
    //split the string with my custom function
    using cbutils::string::split;
    auto toks = split(feature, "_");
    if(toks.size() == 3)
      {
      feat.sensor = toks[0].c_str();
      feat.band = toks[1].c_str();
      feat.date = atoi(toks[2].c_str());
      }
    tree->Fill();
    }

  tree->Print(); 
  TDF df(*tree, {"features"});
  auto metCut = [](Feature x) { return std::strcmp(x.band,"ndvi")==0; }; 
  auto count = df.Filter(metCut, {"features"}).Count();
  std::cout << *count << '\n';
}

The tree->Print() statement shows

******************************************************************************
*Tree    :features  : features from column names                             *
*Entries :      237 : Total =            6140 bytes  File  Size =          0 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :features  : sensor/C:band/C:date/i                                 *
*Entries :      237 : Total  Size=       5811 bytes  One basket in memory    *
*Baskets :        0 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

But then I get the following error:

Error in <TTreeReaderValueBase::GetBranchDataType()>: The branch features was created using a leaf list and cannot be represented as a C++ type. Please access one of its siblings using a TTreeReaderArray:
Error in <TTreeReaderValueBase::GetBranchDataType()>:    features.sensor
Error in <TTreeReaderValueBase::GetBranchDataType()>:    features.band
Error in <TTreeReaderValueBase::GetBranchDataType()>:    features.date
Error in <TTreeReaderValueBase::CreateProxy()>: The template argument type T of MakeFeatureDataFrame(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TTree*)::Feature accessing branch features (which contains data of type (null)) is not known to ROOT. You will need to create a dictionary for it.

I don’t understand what is wrong with my code. Maybe I don’t need to go through a tree at all to fill the data frame with my data?

Thanks.

Garjola.

Hi,
the exact reason for this error is a bit technical. In short, TTreeReader, which is the type-safe interface that TDataFrame uses internally to read the data, does not support the kind of branch you are creating in

tree->Branch("features", &feat, "sensor/C:band/C:date/i");

Things would work if you created three separate branches, or one branch that contains a struct with those three data-members.

But there is a more direct way to get what you want: what we call a “TDataFrame from scratch”. This solution requires that you know beforehand how many rows you will have (i.e. how many entries the tree in your example has). If that’s the case, you can create a TDataFrame with N empty rows and then populate the rows with Defined columns, like so:

using namespace ROOT::Experimental;
TDataFrame d(N); // a TDF with N rows (empty for now)
auto d_with_columns= d.Define("sensor", [&toks] { return toks[0].c_str(); })
                      .Define("band", [&toks] { return toks[1].c_str(); })
                      .Define("date", [&toks] { return atoi(toks[2].c_str()); });

Hope this helps, I’m available for more questions if need be.
Cheers,
Enrico

Update: re-reading your question I saw that you do indeed know how many entries you have in advance (it’s the size of features), so I could put some more detail into the example code. Here is the new version:

using namespace ROOT::Experimental;
using TokenType = std::vector<std::string>; // the type returned by your split function

void ProcessFeatures(const std::vector<std::string> &features) {
  TDataFrame empty_d(features.size());
  auto d = empty_d
    .DefineSlotEntry("tokens",
                     [&features](unsigned int slot, ULong64_t entry) {
                         return split(features[entry], "_");
                     })
    .Define("sensor", [](TokenType &t) { return t[0].c_str(); }, {"tokens"})
    .Define("band", [](TokenType &t) { return t[1].c_str(); }, {"tokens"})
    .Define("date", [](TokenType &t) { return atoi(t[2].c_str()); }, {"tokens"});
  // use d as usual -- no computation has been performed yet
}

Cheers,
Enrico

Hi,

This works perfectly for me. Thanks for your help! Just another quick question: is it a good idea to return a TDataFrame by value from this kind of function? This would allow me to factor this kind of code to prepare the data and re-use it somewhere else.

TDataFrame CreateTableFromFeatures(const FeatureVectorType& fv)
{
TDataFrame d(fv.size());
//fill the DF as in previous example
return d;
}

The thing is that, since I don’t really understand the scope of the proxy objects being created, I don’t know whether I am leaking memory by doing this.

I am mostly working with compiled C++.

Thanks again.

Garjola

You would have to work quite hard to leak memory using TDataFrame – everything is RAII and we employ value semantics wherever possible.

You can pass TDataFrame nodes around, with two caveats:

  1. you should not assume that the type of the returned TDF node is TDataFrame: different nodes have different types (see the signatures of Filter and Define in the reference guide)
  2. the starting TDataFrame object, the “head node” of your graph of computations, must be in scope when the event loop runs (you should get an exception with an explanatory message otherwise)

One way to do what you ask, using C++14 automatic return type deduction for functions:

auto AddFeatures(TDataFrame &d, const FeatureVectorType &f) {
     return d.Define(...).Define(...).Define(...);
}

int main() {
  FeatureVectorType f = /*get your feature vector*/;
  TDataFrame empty_d(f.size());
  auto d = AddFeatures(empty_d, f);
}

If you don’t have access to C++14, you can use trailing return types, or you can state the return type of AddFeatures explicitly as ROOT::Experimental::TDF::TInterface<TLoopManager> (which works specifically for this case).

Hi,

I am using C++14, so I can do as you suggest. However, I get a segmentation violation when I do:


using namespace ROOT::Experimental;

auto BuildSensorBandDateDF(TDataFrame& d, const std::vector<std::string>& features) {
  using cbutils::string::split;
  using TokenType = std::vector<std::string>;
  return d.DefineSlotEntry("tokens",
                           [&features](unsigned int slot, ULong64_t entry) {
                     return split(features[entry], "_");
                     })
     .Define("sensor", [](TokenType &t) { return t[0].c_str(); }, {"tokens"})
     .Define("band", [](TokenType &t) { return t[1].c_str(); }, {"tokens"})
     .Define("date", [](TokenType &t) { return atoi(t[2].c_str()); }, {"tokens"});

}


int main()
{
  auto fileName = "../data/input_samples.csv";
  auto tdf = ROOT::Experimental::TDF::MakeCsvDataFrame(fileName);

  //How many "events" (i.e. samples) have code 211?
  auto filteredEvents =
    tdf.Filter("code == 211").Count();

  auto columnNames = tdf.GetColumnNames();
  // build a DF from the features available in the input file
  TDataFrame empty_d(columnNames.size());
  auto colDF = BuildSensorBandDateDF(empty_d, columnNames);
  
  // How many features correspond to ndvi?
  auto isNDVI = [](const char* band) { return std::strcmp(band,"ndvi")==0; }; 
  // Limit to landsat8
  auto isLandsat8 = [](const char* sensor){return std::strcmp(sensor,"landsat8")==0; }; 
  auto ndviLandsat8Features = colDF.Filter(isNDVI,{"band"}).Filter(isLandsat8,{"sensor"});
  std::cout << *(ndviLandsat8Features.Count()) << '\n';

  return 0;
}

Executing this code sometimes works, but sometimes I get:


===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f1178db1fba in __GI___waitpid (pid=17923, stat_loc=stat_loc@entry=0x7ffcf6c15f20, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:29
#1  0x00007f1178d3907b in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:148
#2  0x00007f117d2ddec2 in TUnixSystem::Exec (shellcmd=<optimized out>, this=0x55bd43a26570) at /home/garjola/src/root/core/unix/src/TUnixSystem.cxx:2118
#3  TUnixSystem::StackTrace (this=0x55bd43a26570) at /home/garjola/src/root/core/unix/src/TUnixSystem.cxx:2412
#4  0x00007f117d2e034c in TUnixSystem::DispatchSignals (this=0x55bd43a26570, sig=kSigSegmentationViolation) at /home/garjola/src/root/core/unix/src/TUnixSystem.cxx:3643
#5  <signal handler called>
#6  __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:31
#7  0x000055bd428e4827 in <lambda(char const*)>::operator()(const char *) const (__closure=0x55bd489e768c, band=0x7265745f6c6c6163 <error: Cannot access memory at address 0x7265745f6c6c6163>) at /home/garjola/Dev/RootLearning/sql/csv-dataframe.cxx:42
#8  0x000055bd428ee5a3 in ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager>::CheckFilterHelper<0>(unsigned int, Long64_t, ROOT::Internal::TDF::StaticSeq<0>) (this=0x55bd489e75f0, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:647
#9  0x000055bd428eddb8 in ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager>::CheckFilters(unsigned int, Long64_t) (this=0x55bd489e75f0, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:635
#10 0x000055bd428ed926 in ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager> >::CheckFilters(unsigned int, Long64_t) (this=0x55bd475fe7c0, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:630
#11 0x000055bd428ed6e5 in ROOT::Internal::TDF::TAction<ROOT::Internal::TDF::CountHelper, ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager> >, ROOT::TypeTraits::TypeList<> >::Run(unsigned int, Long64_t) (this=0x55bd48cac270, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:402
#12 0x00007f117b10cfba in ROOT::Detail::TDF::TLoopManager::RunAndCheckFilters (this=this@entry=0x55bd48ce4330, slot=slot@entry=0, entry=entry@entry=2) at /home/garjola/src/root/tree/treeplayer/src/TDFNodes.cxx:282
#13 0x00007f117b10d1db in ROOT::Detail::TDF::TLoopManager::RunEmptySource (this=0x55bd48ce4330) at /home/garjola/src/root/tree/treeplayer/src/TDFNodes.cxx:182
#14 0x00007f117b110025 in ROOT::Detail::TDF::TLoopManager::Run (this=0x55bd48ce4330) at /home/garjola/src/root/tree/treeplayer/src/TDFNodes.cxx:379
#15 0x000055bd428f42bd in ROOT::Experimental::TDF::TResultProxy<unsigned long long>::TriggerRun (this=0x7ffcf6c18bb0) at /home/garjola/src/root-build/include/ROOT/TResultProxy.hxx:283
#16 0x000055bd428f279e in ROOT::Experimental::TDF::TResultProxy<unsigned long long>::Get (this=0x7ffcf6c18bb0) at /home/garjola/src/root-build/include/ROOT/TResultProxy.hxx:123
#17 0x000055bd428f0d62 in ROOT::Experimental::TDF::TResultProxy<unsigned long long>::operator* (this=0x7ffcf6c18bb0) at /home/garjola/src/root-build/include/ROOT/TResultProxy.hxx:150
#18 0x000055bd428e4c5d in main () at /home/garjola/Dev/RootLearning/sql/csv-dataframe.cxx:46
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum.
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:31
#7  0x000055bd428e4827 in <lambda(char const*)>::operator()(const char *) const (__closure=0x55bd489e768c, band=0x7265745f6c6c6163 <error: Cannot access memory at address 0x7265745f6c6c6163>) at /home/garjola/Dev/RootLearning/sql/csv-dataframe.cxx:42
#8  0x000055bd428ee5a3 in ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager>::CheckFilterHelper<0>(unsigned int, Long64_t, ROOT::Internal::TDF::StaticSeq<0>) (this=0x55bd489e75f0, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:647
#9  0x000055bd428eddb8 in ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager>::CheckFilters(unsigned int, Long64_t) (this=0x55bd489e75f0, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:635
#10 0x000055bd428ed926 in ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager> >::CheckFilters(unsigned int, Long64_t) (this=0x55bd475fe7c0, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:630
#11 0x000055bd428ed6e5 in ROOT::Internal::TDF::TAction<ROOT::Internal::TDF::CountHelper, ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TFilter<main()::<lambda(char const*)>, ROOT::Detail::TDF::TLoopManager> >, ROOT::TypeTraits::TypeList<> >::Run(unsigned int, Long64_t) (this=0x55bd48cac270, slot=0, entry=2) at /home/garjola/src/root-build/include/ROOT/TDFNodes.hxx:402
#12 0x00007f117b10cfba in ROOT::Detail::TDF::TLoopManager::RunAndCheckFilters (this=this@entry=0x55bd48ce4330, slot=slot@entry=0, entry=entry@entry=2) at /home/garjola/src/root/tree/treeplayer/src/TDFNodes.cxx:282
#13 0x00007f117b10d1db in ROOT::Detail::TDF::TLoopManager::RunEmptySource (this=0x55bd48ce4330) at /home/garjola/src/root/tree/treeplayer/src/TDFNodes.cxx:182
#14 0x00007f117b110025 in ROOT::Detail::TDF::TLoopManager::Run (this=0x55bd48ce4330) at /home/garjola/src/root/tree/treeplayer/src/TDFNodes.cxx:379
#15 0x000055bd428f42bd in ROOT::Experimental::TDF::TResultProxy<unsigned long long>::TriggerRun (this=0x7ffcf6c18bb0) at /home/garjola/src/root-build/include/ROOT/TResultProxy.hxx:283
#16 0x000055bd428f279e in ROOT::Experimental::TDF::TResultProxy<unsigned long long>::Get (this=0x7ffcf6c18bb0) at /home/garjola/src/root-build/include/ROOT/TResultProxy.hxx:123
#17 0x000055bd428f0d62 in ROOT::Experimental::TDF::TResultProxy<unsigned long long>::operator* (this=0x7ffcf6c18bb0) at /home/garjola/src/root-build/include/ROOT/TResultProxy.hxx:150
#18 0x000055bd428e4c5d in main () at /home/garjola/Dev/RootLearning/sql/csv-dataframe.cxx:46
===========================================================

The test file is here input_samples.csv.gz (111.0 KB)

Any idea on what is wrong?

Thanks.

Garjola

Hi,
thank you for providing the means to reproduce your issue. I see the crash too.
I see that sometimes you access t[1] even if t has only one element, so I think that’s your bug (easy to verify by putting some debug prints in your Define lambdas).
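
A quick way to run that check outside of TDataFrame is sketched below in plain C++ (no ROOT needed). The split function here is only a hypothetical stand-in for cbutils::string::split, whose exact behaviour is not shown in this thread, and malformedNames is a name made up for this sketch: it lists the column names that do not split into exactly three tokens.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for cbutils::string::split (an assumption):
// splits s on a single-character delimiter, keeping empty tokens.
std::vector<std::string> split(const std::string& s, char delim) {
  std::vector<std::string> tokens;
  std::string::size_type start = 0;
  while (true) {
    const auto pos = s.find(delim, start);
    if (pos == std::string::npos) {
      tokens.push_back(s.substr(start));  // last (or only) token
      break;
    }
    tokens.push_back(s.substr(start, pos - start));
    start = pos + 1;
  }
  return tokens;
}

// Returns the names that do not have the form sensor_band_date,
// i.e. those for which accessing t[1] or t[2] would be out of bounds
// or would not mean what the Define lambdas expect.
std::vector<std::string> malformedNames(const std::vector<std::string>& names) {
  std::vector<std::string> bad;
  for (const auto& name : names) {
    if (split(name, '_').size() != 3) bad.push_back(name);
  }
  return bad;
}
```

Calling malformedNames on the vector returned by GetColumnNames() and printing the result should show whether a column such as "code" ends up in the list.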

I also suggest that you change your const char * to std::string, because otherwise TDF stores and passes around pointers to data whose lifetime is not well defined.
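
To illustrate the lifetime point in plain C++ (again without ROOT): a std::string copies its characters, so it stays valid after the token vector it was built from has gone away, whereas a const char * obtained from c_str() only points into that vector's storage. The Feature struct, makeFeature and isNDVI below are hypothetical names for this sketch, not part of TDataFrame.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical record type for this sketch: every field owns its data.
struct Feature {
  std::string sensor;
  std::string band;
  int date = 0;
};

// Fills out from already-split tokens and returns true, or returns false
// and leaves out untouched unless there are exactly three tokens.
bool makeFeature(const std::vector<std::string>& tokens, Feature& out) {
  if (tokens.size() != 3) return false;
  out.sensor = tokens[0];  // copies the characters
  out.band = tokens[1];
  out.date = std::atoi(tokens[2].c_str());
  return true;
}

// Predicate on an owning string: no strcmp and no raw pointer involved.
bool isNDVI(const std::string& band) { return band == "ndvi"; }
```

The same idea carries over to the Define lambdas: returning t[1] as a std::string instead of t[1].c_str() lets a filter take a const std::string & and compare with ==.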

Cheers,
Enrico

Thanks for your answer, even on weekends! It’s great to have this kind of support.

I am ashamed of this error: I was checking the number of tokens in the original code, but then I did not think of it here because I was looking for more complex causes, like the return data type.

Garjola

No pressure, happy to help!
