Home | News | Documentation | Download

Help to extend Covariance Functor for RDataFrame functionality

Dear ROOT experts,

I wrote back in time a functor for Covariance calculation between 2 colums in a RDataFrame

Here the C++ code and the example usage .
Now what i would like to do is to expand the functor to actually create a
TMatrixSymD object and pass Any set of columns.
Can anyone give me some hint how to expand the code i already have for this purpose?

Thanks
Renato

/**
 * T is the type of the Scalar Column used
 * Example of usage with RDataFrame
 * auto covi = Covariance<double>();            
 * auto covXY = df.Book<double,double>(std::move(covi), {"xColumn", "yColumn"} ) );
*/
template< typename T> 
class Covariance : public ROOT::Detail::RDF::RActionImpl<Covariance<T>>{ 
    public : 
        using Covariance_t = T;    
        using Result_t = Covariance_t;
    private : 
        std::vector<Covariance_t>  _xyproductSUM;  //one per data processing slot
        std::vector<Covariance_t>  _xStatsSUM;     //one per data processing slot
        std::vector<Covariance_t>  _yStatsSUM;     //one per data processing slot
        std::vector<int> _nEntries;                //one per data processing slot
        std::shared_ptr<Covariance_t> _covariance;        
    public : 
        Covariance( ){
            const auto nSlots = ROOT::IsImplicitMTEnabled() ?  ROOT::GetImplicitMTPoolSize() : 1;
            for (auto i : ROOT::TSeqU(nSlots)){
                 _xyproductSUM.emplace_back(0.);
                 _xStatsSUM.emplace_back(0.);
                 _yStatsSUM.emplace_back(0.);   
                 _nEntries.emplace_back(0);
                 (void)i;
            }
            _covariance =  std::make_shared<double>(0.);
        }
        Covariance( Covariance &&)= default;
        Covariance( const Covariance &) = delete;
        std::shared_ptr<Covariance_t> GetResultPtr() const { 
            return  _covariance;
        }
        void Initialize() {}
        void InitTask(TTreeReader *, unsigned int) {}
        template <typename... ColumnTypes>
        void Exec(unsigned int slot, ColumnTypes... values){
            std::array<double, sizeof...(ColumnTypes)> valuesArr{static_cast<double>(values)...};     
            _nEntries[slot] ++;
            _xyproductSUM[slot] += valuesArr[0]*valuesArr[1];
            _xStatsSUM[slot] += valuesArr[0];
            _yStatsSUM[slot] += valuesArr[1];
        }
        void Finalize(){
            for( auto  slot : ROOT::TSeqU(1, _xyproductSUM.size())){
                _xyproductSUM[0] += _xyproductSUM[slot];
                _xStatsSUM[0]    += _xStatsSUM[slot];
                _yStatsSUM[0]    += _yStatsSUM[slot];
                _nEntries[0]     += _nEntries[slot];
            }
            /*

            */
            *_covariance  =  (1./( _nEntries[0]-1.)) * (  ( _xyproductSUM[0] ) -  1./(_nEntries[0]) * ( (_xStatsSUM[0])) * ( (_yStatsSUM[0])) )  ;
        }
        std::string GetActionName(){
            return "Covariance";
        }
};

Please read tips for efficient and successful posting and posting code

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided


Hi Renato,
about creating a TMatrixSymD: how would you do it? accumulate the results of each data processing slot and then aggregate them into a TMatrixSymD in Finalize?

About passing “any set of columns”, isn’t that already the case? Exec takes a variadic template parameter pack.

Cheers,
Enrico

Hi @eguiraud, in practice yes, rather than returning a double, i would return a TMatrixSymD, but for this i don’t know how it would be optimal to make some nested vectors etc… Basically i want that if one runs Covariance( a, b, c) it spit out a 3x3 matrix. I am failing to undertand how to best pack up the sumsX, sumsY and sumsXY, and wheter there would be any special container class which is easy to debug. I wamt to avoid arrays of arrays or vector of vectors basically

Also probably i would need to template the nxn dimensionality and throw some errors out if the operator call sees more or less than n columns

Maybe @moneta has some pointers for a multi-thread algorithm that produces a covariance matrix.

For that you could compare sizeof...(ColumnTypes) in Exec with the value you expect (sizeof... returns the size of the template parameter pack).