Help to extend Covariance Functor for RDataFrame functionality

Dear ROOT experts,

Some time ago I wrote a functor that calculates the covariance between two columns of an RDataFrame.

Below are the C++ code and an example of its usage.
What I would like to do now is extend the functor so that it creates a
TMatrixSymD object and accepts any set of columns.
Can anyone give me a hint on how to extend the code I already have for this purpose?

Thanks
Renato

/**
 * T is the type of the Scalar Column used
 * Example of usage with RDataFrame
 * auto covi = Covariance<double>();            
 * auto covXY = df.Book<double,double>(std::move(covi), {"xColumn", "yColumn"});
*/
template< typename T> 
class Covariance : public ROOT::Detail::RDF::RActionImpl<Covariance<T>>{ 
    public : 
        using Covariance_t = T;    
        using Result_t = Covariance_t;
    private : 
        std::vector<Covariance_t>  _xyproductSUM;  //one per data processing slot
        std::vector<Covariance_t>  _xStatsSUM;     //one per data processing slot
        std::vector<Covariance_t>  _yStatsSUM;     //one per data processing slot
        std::vector<int> _nEntries;                //one per data processing slot
        std::shared_ptr<Covariance_t> _covariance;        
    public : 
        Covariance( ){
            const auto nSlots = ROOT::IsImplicitMTEnabled() ?  ROOT::GetImplicitMTPoolSize() : 1;
            for (auto i : ROOT::TSeqU(nSlots)){
                 _xyproductSUM.emplace_back(0.);
                 _xStatsSUM.emplace_back(0.);
                 _yStatsSUM.emplace_back(0.);   
                 _nEntries.emplace_back(0);
                 (void)i;
            }
            _covariance =  std::make_shared<Covariance_t>(0.);
        }
        Covariance( Covariance &&)= default;
        Covariance( const Covariance &) = delete;
        std::shared_ptr<Covariance_t> GetResultPtr() const { 
            return  _covariance;
        }
        void Initialize() {}
        void InitTask(TTreeReader *, unsigned int) {}
        template <typename... ColumnTypes>
        void Exec(unsigned int slot, ColumnTypes... values){
            std::array<double, sizeof...(ColumnTypes)> valuesArr{static_cast<double>(values)...};     
            _nEntries[slot] ++;
            _xyproductSUM[slot] += valuesArr[0]*valuesArr[1];
            _xStatsSUM[slot] += valuesArr[0];
            _yStatsSUM[slot] += valuesArr[1];
        }
        void Finalize(){
            for( auto  slot : ROOT::TSeqU(1, _xyproductSUM.size())){
                _xyproductSUM[0] += _xyproductSUM[slot];
                _xStatsSUM[0]    += _xStatsSUM[slot];
                _yStatsSUM[0]    += _yStatsSUM[slot];
                _nEntries[0]     += _nEntries[slot];
            }
            // Unbiased sample covariance: ( sum(x*y) - sum(x)*sum(y)/N ) / (N-1)
            *_covariance  =  (1./( _nEntries[0]-1.)) * (  ( _xyproductSUM[0] ) -  1./(_nEntries[0]) * ( (_xStatsSUM[0])) * ( (_yStatsSUM[0])) )  ;
        }
        std::string GetActionName(){
            return "Covariance";
        }
};


ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided


Hi Renato,
About creating a TMatrixSymD: how would you do it? Would you accumulate the results of each data processing slot and then aggregate them into a TMatrixSymD in Finalize?

About passing “any set of columns”, isn’t that already the case? Exec takes a variadic template parameter pack.

Cheers,
Enrico

Hi @eguiraud, in practice yes: rather than returning a double, I would return a TMatrixSymD. What I don't know is the best way to do this, for example whether to use nested vectors. Basically, I want Covariance(a, b, c) to produce a 3x3 matrix. I am failing to understand how best to pack the sumsX, sumsY and sumsXY, and whether there is a special container class that is easy to debug. I would like to avoid arrays of arrays or vectors of vectors.

Also, I would probably need to make the n x n dimensionality a template parameter and throw an error if the call operator sees more or fewer than n columns.
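
One way to avoid nested containers is to store the sums of products in a single flat array that covers only the upper triangle of the symmetric matrix. The sketch below is a minimal illustration under that assumption, using only the standard library; `CovSums`, `Index`, `Fill`, `Merge` and `Cov` are hypothetical names, not part of ROOT. One such object per slot would play the role of `_xyproductSUM`, `_xStatsSUM`, `_yStatsSUM` and `_nEntries` in the functor above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of ROOT): running sums for N columns.
// The N*(N+1)/2 distinct products x_i*x_j (i <= j) live in one flat array.
template <std::size_t N>
struct CovSums {
    std::array<double, N> sum{};                 // sum of x_i
    std::array<double, N * (N + 1) / 2> prod{};  // sum of x_i*x_j for i <= j
    unsigned long long n = 0;                    // number of entries seen

    // Flat index of the symmetric pair (i, j), upper triangle in row-major order.
    static std::size_t Index(std::size_t i, std::size_t j) {
        if (i > j) { const std::size_t t = i; i = j; j = t; }
        return i * N - i * (i + 1) / 2 + j;
    }

    // Accumulate one entry (what Exec would do for its slot).
    void Fill(const std::array<double, N> &x) {
        ++n;
        for (std::size_t i = 0; i < N; ++i) {
            sum[i] += x[i];
            for (std::size_t j = i; j < N; ++j)
                prod[Index(i, j)] += x[i] * x[j];
        }
    }

    // Add the sums of another slot to this one (what Finalize would do).
    void Merge(const CovSums &other) {
        n += other.n;
        for (std::size_t i = 0; i < N; ++i) sum[i] += other.sum[i];
        for (std::size_t k = 0; k < prod.size(); ++k) prod[k] += other.prod[k];
    }

    // Unbiased sample covariance of columns i and j; needs at least 2 entries.
    double Cov(std::size_t i, std::size_t j) const {
        assert(n > 1);
        const double dn = static_cast<double>(n);
        return (prod[Index(i, j)] - sum[i] * sum[j] / dn) / (dn - 1.);
    }
};
```

In Finalize, the merged object could then be copied element by element into the result matrix, for example by looping over i and j and assigning `Cov(i, j)` to each matrix element.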

Maybe @moneta has some pointers for a multi-thread algorithm that produces a covariance matrix.
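
As one possible starting point, partial results from different slots can also be merged without keeping raw sums of products, which tends to be less sensitive to cancellation when the means are large compared to the spread. The sketch below is not what the functor above does: it swaps in the well-known single-pass co-moment update and the pairwise merge C_AB = C_A + C_B + (meanX_B - meanX_A)(meanY_B - meanY_A) n_A n_B / (n_A + n_B), shown for two columns only. `CoMoment` and its members are hypothetical names, and nothing here depends on ROOT.

```cpp
#include <cassert>

// Hypothetical per-slot state for two columns: running means and the
// co-moment c = sum((x - meanX) * (y - meanY)) over the entries seen so far.
struct CoMoment {
    unsigned long long n = 0;
    double meanX = 0., meanY = 0., c = 0.;

    // Single-pass update with one new entry.
    void Fill(double x, double y) {
        ++n;
        const double dx = x - meanX;  // deviation from the old mean of x
        meanX += dx / n;
        meanY += (y - meanY) / n;
        c += dx * (y - meanY);        // uses the updated mean of y
    }

    // Merge the partial result of another slot into this one.
    void Merge(const CoMoment &o) {
        if (o.n == 0) return;
        if (n == 0) { *this = o; return; }
        const double total = static_cast<double>(n + o.n);
        const double dx = o.meanX - meanX;
        const double dy = o.meanY - meanY;
        c += o.c + dx * dy * (static_cast<double>(n) * o.n / total);
        meanX += dx * o.n / total;
        meanY += dy * o.n / total;
        n += o.n;
    }

    // Unbiased sample covariance; needs at least 2 entries.
    double Cov() const {
        assert(n > 1);
        return c / (n - 1.);
    }
};
```

For a full matrix, the same update and merge would be applied to every pair of columns, with one mean per column and one co-moment per pair.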

For that you could compare sizeof...(ColumnTypes) in Exec with the value you expect (sizeof... returns the size of the template parameter pack).
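
A minimal sketch of that check, assuming the dimension is a template parameter N of the class: a `static_assert` on `sizeof...(ColumnTypes)` turns a wrong number of columns into a compile-time error instead of a runtime one. `ColumnPacker` and `Pack` are hypothetical names used only for illustration; in the functor the same assertion would sit at the top of Exec.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: packs exactly N column values into an array of doubles.
template <std::size_t N>
struct ColumnPacker {
    template <typename... ColumnTypes>
    static std::array<double, N> Pack(ColumnTypes... values) {
        // Compilation fails here if the number of arguments differs from N.
        static_assert(sizeof...(ColumnTypes) == N,
                      "number of columns must match the matrix dimension N");
        return {{static_cast<double>(values)...}};
    }
};
```

With this, `ColumnPacker<3>::Pack(1, 2.5f, 3.0)` compiles, while passing two or four values to `ColumnPacker<3>::Pack` does not.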
