Same content but different file sizes ROOT

I have two output ROOT files generated by the same function:

Output File 1: generated by running the function in stand-alone mode
Output File 2: generated by running the function through spark-submit in parallel mode

Both contain the same number of entries. However, Output File 1 is 100 MB while Output File 2 is 5 GB.

I suspect it has something to do with the baskets, but I am not sure how to start resolving this.

Can you please help?

branch_merged = tree_tracks_kine.Branch('Events', 'MergedDataKine', AddressOf(tracks_kine), 32000, 99)

Output File 1

******************************************************************************
*Tree    :kine_tracks: Tree containing merged tracks                          *
*Entries :      542 : Total =        19961375 bytes  File  Size =   10047112 *
*        :          : Tree compression factor =   1.98                       *
******************************************************************************
*Branch  :Events                                                             *
*Entries :      542 : BranchElement (see below)                              *
*............................................................................*
*Br    0 :fUniqueID : UInt_t                                                 *
*Entries :      542 : Total  Size=       8158 bytes  File Size  =       5707 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.18     *
*............................................................................*
*Br    1 :fBits     : UInt_t                                                 *
*Entries :      542 : Total  Size=      10530 bytes  File Size  =       7339 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.24     *
*............................................................................*
*Br    2 :event     : Long_t                                                 *
*Entries :      542 : Total  Size=      10090 bytes  File Size  =       6924 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.25     *
*............................................................................*
*Br    3 :timestamp : Long_t                                                 *
*Entries :      542 : Total  Size=      10326 bytes  File Size  =       8319 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.07     *
*............................................................................*
*Br    4 :initial_cluster_size : Int_t                                       *
*Entries :      542 : Total  Size=       8807 bytes  File Size  =       7338 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    5 :dbscan_cluster_size : Int_t                                        *
*Entries :      542 : Total  Size=       8748 bytes  File Size  =       7283 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    6 :gmm_cluster_size : Int_t                                           *
*Entries :      542 : Total  Size=       8571 bytes  File Size  =       7118 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    7 :diff_tot_db : Int_t                                                *
*Entries :      542 : Total  Size=       8276 bytes  File Size  =       6843 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    8 :diff_db_gmm : Int_t                                                *
*Entries :      542 : Total  Size=       8276 bytes  File Size  =       5817 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.18     *
*............................................................................*
*Br    9 :no_clusters_db : Int_t                                             *
*Entries :      542 : Total  Size=       8453 bytes  File Size  =       7006 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   10 :no_clusters_gmm : Int_t                                            *
*Entries :      542 : Total  Size=       8512 bytes  File Size  =       7063 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   11 :no_except_clusters : Int_t                                         *
*Entries :      542 : Total  Size=       8689 bytes  File Size  =       7226 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   12 :merged_cluster_size : Int_t                                        *
*Entries :      542 : Total  Size=       8748 bytes  File Size  =       7283 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   13 :merged_tracks_number : Int_t                                       *
*Entries :      542 : Total  Size=       8807 bytes  File Size  =       7285 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.01     *
*............................................................................*
*Br   14 :input_output_merge_points_diff : Int_t                             *
*Entries :      542 : Total  Size=       9397 bytes  File Size  =       7888 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   15 :event_multiplicity : Int_t                                         *
*Entries :      542 : Total  Size=       8689 bytes  File Size  =       6660 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.09     *
*............................................................................*
*Br   16 :diff_merge_list_form_merge : vector<int>                           *
*Entries :      542 : Total  Size=      15021 bytes  File Size  =       8776 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.54     *
*............................................................................*
*Br   17 :diff_merge_list_reduced_matrix : vector<int>                       *
*Entries :      542 : Total  Size=      15257 bytes  File Size  =       8996 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.53     *
*............................................................................*
*Br   18 :diff_form_merge_reduced_matrix : vector<int>                       *
*Entries :      542 : Total  Size=      15257 bytes  File Size  =       8996 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.53     *
*............................................................................*
*Br   19 :merged_tracks : Int_t merged_tracks_                               *
*Entries :      542 : Total  Size=      51734 bytes  File Size  =       8417 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.14     *
*............................................................................*
*Br   20 :merged_tracks.fUniqueID : UInt_t fUniqueID[merged_tracks_]         *
*Entries :      542 : Total  Size=      19084 bytes  File Size  =       8155 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   2.15     *
*............................................................................*
*Br   21 :merged_tracks.fBits : UInt_t fBits[merged_tracks_]                 *
*Entries :      542 : Total  Size=      18848 bytes  File Size  =       8216 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   2.10     *
*............................................................................*
*Br   22 :merged_tracks.clusterid : Int_t clusterid[merged_tracks_]          *
*Entries :      542 : Total  Size=      19084 bytes  File Size  =      10721 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.63     *
*............................................................................*
*Br   23 :merged_tracks.cluster_size : Int_t cluster_size[merged_tracks_]    *
*Entries :      542 : Total  Size=      19261 bytes  File Size  =      13722 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.29     *
*............................................................................*
*Br   24 :merged_tracks.centroid_X : Float_t centroid_X[merged_tracks_]      *
*Entries :      542 : Total  Size=      19143 bytes  File Size  =      17564 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   25 :merged_tracks.centroid_Y : Float_t centroid_Y[merged_tracks_]      *
*Entries :      542 : Total  Size=      19143 bytes  File Size  =      17497 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   26 :merged_tracks.centroid_Z : Float_t centroid_Z[merged_tracks_]      *
*Entries :      542 : Total  Size=      19143 bytes  File Size  =      17566 *
*Baskets :       55 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

Output File 2

******************************************************************************
*Tree    :kine_tracks: Tree containing Kine tracks                            *
*Entries :      542 : Total =      6861311153 bytes  File  Size = 5427165811 *
*        :          : Tree compression factor =   1.26                       *
******************************************************************************
*Branch  :Events                                                             *
*Entries :      542 : BranchElement (see below)                              *
*............................................................................*
*Br    0 :fUniqueID : UInt_t                                                 *
*Entries :      542 : Total  Size=      58319 bytes  File Size  =      47154 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    1 :fBits     : UInt_t                                                 *
*Entries :      542 : Total  Size=      62639 bytes  File Size  =      51490 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    2 :event     : Long_t                                                 *
*Entries :      542 : Total  Size=      58303 bytes  File Size  =      47154 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    3 :timestamp : Long_t                                                 *
*Entries :      542 : Total  Size=      60487 bytes  File Size  =      49322 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    4 :initial_cluster_size : Int_t                                       *
*Entries :      542 : Total  Size=      64325 bytes  File Size  =      53116 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    5 :dbscan_cluster_size : Int_t                                        *
*Entries :      542 : Total  Size=      63779 bytes  File Size  =      52574 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    6 :gmm_cluster_size : Int_t                                           *
*Entries :      542 : Total  Size=      62141 bytes  File Size  =      50948 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    7 :diff_tot_db : Int_t                                                *
*Entries :      542 : Total  Size=      59411 bytes  File Size  =      48238 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    8 :diff_db_gmm : Int_t                                                *
*Entries :      542 : Total  Size=      59411 bytes  File Size  =      48238 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br    9 :no_clusters_db : Int_t                                             *
*Entries :      542 : Total  Size=      61049 bytes  File Size  =      49864 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   10 :no_clusters_gmm : Int_t                                            *
*Entries :      542 : Total  Size=      61595 bytes  File Size  =      50406 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   11 :no_except_clusters : Int_t                                         *
*Entries :      542 : Total  Size=      63233 bytes  File Size  =      52032 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   12 :merged_cluster_size : Int_t                                        *
*Entries :      542 : Total  Size=      63779 bytes  File Size  =      52574 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   13 :merged_tracks_number : Int_t                                       *
*Entries :      542 : Total  Size=      64325 bytes  File Size  =      53116 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   14 :input_output_merge_points_diff : Int_t                             *
*Entries :      542 : Total  Size=      69785 bytes  File Size  =      58536 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   15 :event_multiplicity : Int_t                                         *
*Entries :      542 : Total  Size=      63233 bytes  File Size  =      52032 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   16 :diff_merge_list_form_merge : vector<int>                           *
*Entries :      542 : Total  Size=      77357 bytes  File Size  =      66124 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   17 :diff_merge_list_reduced_matrix : vector<int>                       *
*Entries :      542 : Total  Size=      79541 bytes  File Size  =      68292 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   18 :diff_form_merge_reduced_matrix : vector<int>                       *
*Entries :      542 : Total  Size=      79541 bytes  File Size  =      68292 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   19 :merged_tracks : Int_t merged_tracks_                               *
*Entries :      542 : Total  Size=     327119 bytes  File Size  =      55826 *
*Baskets :      542 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br   20 :merged_tracks.fUniqueID : UInt_t fUniqueID[merged_tracks_]         *
*Entries :      542 : Total  Size=      79959 bytes  File Size  =      65628 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.05     *
*............................................................................*
*Br   21 :merged_tracks.fBits : UInt_t fBits[merged_tracks_]                 *
*Entries :      542 : Total  Size=      77775 bytes  File Size  =      65304 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.02     *
*............................................................................*
*Br   22 :merged_tracks.clusterid : Int_t clusterid[merged_tracks_]          *
*Entries :      542 : Total  Size=      79959 bytes  File Size  =      68562 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   23 :merged_tracks.cluster_size : Int_t cluster_size[merged_tracks_]    *
*Entries :      542 : Total  Size=      81597 bytes  File Size  =      70261 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   24 :merged_tracks.centroid_X : Float_t centroid_X[merged_tracks_]      *
*Entries :      542 : Total  Size=      80505 bytes  File Size  =      69188 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   25 :merged_tracks.centroid_Y : Float_t centroid_Y[merged_tracks_]      *
*Entries :      542 : Total  Size=      80505 bytes  File Size  =      69188 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*
*Br   26 :merged_tracks.centroid_Z : Float_t centroid_Z[merged_tracks_]      *
*Entries :      542 : Total  Size=      80505 bytes  File Size  =      69188 *
*Baskets :      542 : Basket Size=      15174 bytes  Compression=   1.00     *
*............................................................................*

Hi,
Yes, the basket sizes and compression factors are different. @pcanal should know more.

Lorenzo

The unit of work in the Spark case seems too small. It results in only one entry per basket (around 100 bytes per basket instead of the usual 32 KB or so). Each Spark “unit” of work should (probably) process around 32 MB worth of output data to be more efficient.
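As a rough check of the "one entry per basket" observation, the numbers for the fUniqueID branch can be taken from the two TTree::Print outputs above. This is only a sketch: the helper name basket_stats is made up for illustration, and it assumes that "Total Size" divided by the basket count is a fair estimate of the per-basket footprint.

```python
# Values copied from the TTree::Print output above (branch fUniqueID).
ENTRIES = 542

branches = {
    "stand-alone": {"total_bytes": 8158, "baskets": 55},
    "spark": {"total_bytes": 58319, "baskets": 542},
}


def basket_stats(total_bytes, baskets, entries=ENTRIES):
    """Return (entries per basket, uncompressed bytes per basket)."""
    return entries / baskets, total_bytes / baskets


for label, info in branches.items():
    per_basket, bytes_per_basket = basket_stats(info["total_bytes"], info["baskets"])
    print(f"{label}: {per_basket:.1f} entries/basket, "
          f"{bytes_per_basket:.0f} bytes/basket")
```

In the Spark file every basket holds exactly one entry. Since a UInt_t payload is only 4 bytes, most of the roughly 100 bytes per basket is presumably bookkeeping, which would also leave the compressor almost nothing to work with.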

You should be able to recover the compression after the fact (at the cost of reading and rewriting the whole file) by running:

hadd -O -f newfile.root oldfile.root

That said, the change in compression factor (from 2 to 1) does not explain the huge increase.

Can you provide the result of file->Map(); for both the good and the bad file? (This will be a long output.)

Can you remind us how the file is created and merged in the “Spark-Submit” case?

Thank you for the response.

map_op_files.zip (206.5 KB)

I have attached the Map output of both files in the zip file.

I am using Spark's mapPartitionsWithIndex function:

result_track_kine = parallel_instances.mapPartitionsWithIndex(track_kine.get_kine_tracks_info).collect()

I use the partition index to select the input file to read, process it with get_kine_tracks_info, and write the result back. Each partition processes one file and writes one output file, and the partitions are executed in parallel. At the end of the function, the Fill and Write functions of TTree are called.

from ROOT import TFile, TTree, AddressOf  # PyROOT names used in this excerpt

# Excerpt of a class method; settings, file_format and run_num are defined elsewhere.
def get_kine_tracks_info(self, partition_index, partition_iterator):
    # Input and output file names are derived from the partition index.
    new_file_name = 'merged_' + file_format[0] + str(run_num) + '_' + str(partition_index) + file_format[1]
    new_file_name_output = 'kine_' + file_format[0] + str(run_num) + '_' + str(partition_index) + file_format[1]
    # Open the input file for this partition and fetch its tree.
    f2 = TFile(settings.base_location + settings.output_files_loc + settings.root_files_loc + settings.merged_tracks_root_files + new_file_name)
    myTr = f2.Get("merged_tracks")
    partition_index = new_file_name
    entry = myTr.GetEntries()
    # Create the output file and the tree that will hold the processed entries.
    kine_output_tree = settings.base_location + settings.output_files_loc + settings.root_files_loc + settings.kine_tracks_root_files + new_file_name_output
    f3 = TFile(kine_output_tree, 'RECREATE')
    tree_tracks_kine = TTree('kine_tracks', 'Tree containing Kine tracks')
    tracks_kine = self.kine_output_data_
    branch_merged = tree_tracks_kine.Branch('Events', 'MergedDataKine', AddressOf(tracks_kine), 32000, 99)

Thank you for helping out. I have attached the info that pcanal requested.

Compared to the original post, the new TTree::Print output contains:

*Br   35 :merged_tracks.centroid_circle_X :                                  *
*         | vector<float> centroid_circle_X[merged_tracks_]                  *
*Entries :      542 : Total  Size= 1140785131 bytes  File Size  =  789811895 *
*Baskets :      542 : Basket Size=    7587840 bytes  Compression=   1.44     *

which fully explains the problem.
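To put that branch in perspective, here is a small back-of-the-envelope calculation using only numbers quoted in this thread. The variable names are illustrative, and the comparison with Output File 1 assumes the totals from the original post are still representative.

```python
# All values are copied from the TTree::Print outputs in this thread.
ENTRIES = 542
BRANCH_TOTAL_BYTES = 1_140_785_131   # centroid_circle_X, "Total Size"
BRANCH_FILE_BYTES = 789_811_895      # centroid_circle_X, "File Size"
BAD_TREE_FILE_BYTES = 5_427_165_811  # Output File 2, tree "File Size"
GOOD_TREE_TOTAL_BYTES = 19_961_375   # Output File 1, tree "Total"

# Average uncompressed payload of this single branch per tree entry.
bytes_per_entry = BRANCH_TOTAL_BYTES / ENTRIES

# Fraction of the whole bad file taken up by this single branch on disk.
share_of_file = BRANCH_FILE_BYTES / BAD_TREE_FILE_BYTES

# How many times larger this one branch is than the entire good tree.
ratio_to_good_tree = BRANCH_TOTAL_BYTES / GOOD_TREE_TOTAL_BYTES

print(f"{bytes_per_entry / 1e6:.1f} MB per entry")
print(f"{share_of_file:.1%} of the bad file")
print(f"{ratio_to_good_tree:.0f}x the whole good tree")
```

A single vector<float> branch averaging about 2 MB per entry is far larger than the complete stand-alone tree, so a handful of similar branches would be enough to account for the 5 GB.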

This branch (and other similar branches) is far too big. These branches are std::vector, and the most likely cause is that for each TTree entry the vector contains not only that entry's data but also the data for *all* the previous entries.

This suggests that all that is missing is a call to

vector_data_object.clear();

at the beginning (or end) of the event loop for each std::vector stored in the TTree.
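The effect can be reproduced without ROOT. The sketch below uses a plain Python list as a stand-in for the std::vector attached to the branch and records its length where TTree::Fill would be called; fill_tree and the sample events are hypothetical and not taken from the poster's code.

```python
def fill_tree(events, clear_each_event):
    """Simulate filling one vector branch; return the stored length per entry."""
    buffer = []          # stands in for the std::vector attached to the branch
    stored_lengths = []  # stands in for what Fill would write for each entry
    for hits in events:
        if clear_each_event:
            buffer.clear()              # the missing call
        buffer.extend(hits)             # append this event's data
        stored_lengths.append(len(buffer))
    return stored_lengths


events = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]]
print(fill_tree(events, clear_each_event=True))   # [2, 1, 3]
print(fill_tree(events, clear_each_event=False))  # [2, 3, 6]

# With 542 equally sized events, the uncleared buffer stores (542 + 1) / 2
# times as much data in total as the cleared one.
uniform = [[0.0] * 4 for _ in range(542)]
ratio = sum(fill_tree(uniform, False)) / sum(fill_tree(uniform, True))
print(ratio)  # 271.5
```

Without the clear, the stored data grows quadratically with the number of entries, which is the kind of blow-up seen in the large branches.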

Cheers,
Philippe.


Dear pcanal,

Thank you for helping with this. I am sorry, I overlooked the missing clear() call.
Adding it solves my problem.