Understanding the meaning of variables in the PyTorch BatchGenerator tutorial

Dear ROOT experts,

I found the batch generator tutorial for PyTorch extremely useful; however, I would like to better understand the meaning of some of its variables, in particular:

  • chunk_size
  • block_size
  • target

What should I set the chunk and block sizes to? Low? High? How do they affect execution time and memory consumption?
What is the target? In the example it seems to be the string literal "Type", and it is not entirely clear what it can or should be.

Thanks!

Another question:

As far as I understand, the BatchGenerator does the job of both a Dataset and a DataLoader in PyTorch terms.
Would it be possible to use the ROOT generators to construct a Dataset, but still use the PyTorch DataLoader to form the batches?
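
For illustration, something like this in plain PyTorch (a rough sketch; root_event_generator below is just a stand-in for whatever event-wise generator ROOT would provide, not an actual ROOT API):

```python
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

def root_event_generator():
    """Stand-in for a ROOT-backed generator yielding one (features, target) pair per event."""
    for _ in range(1000):
        yield np.random.rand(4).astype("float32"), np.float32(0.0)

class RootIterableDataset(IterableDataset):
    """Wrap an event generator so that DataLoader forms the batches."""
    def __init__(self, make_generator):
        self.make_generator = make_generator  # callable returning a fresh generator

    def __iter__(self):
        for x, y in self.make_generator():
            yield torch.as_tensor(x), torch.as_tensor(y)

loader = DataLoader(RootIterableDataset(root_event_generator), batch_size=256)
for features, targets in loader:
    pass  # features: (up to 256, 4), targets: (up to 256,)
```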

@martinfoell might be of help here

Dear @FoxWise ,

Thank you for reaching out on the forum, and for your feedback on the batch generator!

First of all, this interface is experimental and under development, so the input parameters may change in the future; we welcome any feedback on the interface. Here is a short description of the input parameters you asked about, followed by a more detailed explanation with some recommendations (and a small code sketch right after the list):

  • chunk_size: the number of entries loaded into memory at once. A chunk is further built up of blocks.

  • block_size: the number of consecutive entries from the dataframe that make up a block.

  • target: the column(s) from the dataframe used as the target for the training, e.g. the class label. The target can be either a single column or a list of columns if you have several targets.
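
To make these concrete, here is roughly how the parameters enter the call in the tutorial (a sketch from memory with placeholder tree/file names; the exact signature of CreatePyTorchGenerators, in particular how block_size is passed, may differ between ROOT versions, so please check the tutorial and docstring shipped with your installation):

```python
import ROOT

# Replace the tree and file names with your own dataset.
rdataframe = ROOT.RDataFrame("myTree", "myFile.root")

batch_size = 128    # entries per batch handed to PyTorch
chunk_size = 5_000  # entries loaded into memory at once
target = "Type"     # column used as the training label

# block_size may be an additional argument in recent ROOT versions;
# check the CreatePyTorchGenerators docstring for the exact signature.
gen_train, gen_validation = ROOT.TMVA.Experimental.CreatePyTorchGenerators(
    rdataframe, batch_size, chunk_size, target=target, validation_split=0.3
)

# Each generator yields (features, target) batches as PyTorch tensors.
for x_train, y_train in gen_train:
    ...  # feed x_train, y_train to your model
```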

The main motivation of the batch generator is to enable training on datasets that do not fit in memory. The chunk and block sizes are introduced so that only part of the dataset is loaded into memory at a time, while the data on disk is read from different parts of the dataset to obtain a mixed chunk. This mixing is important in the training to avoid bias towards one or several classes in the dataset.

The chunk size determines how much of the dataset you load into memory and, as a first recommendation, should be set as high as you can afford. The block size determines how shuffled the loaded chunk is: the ratio chunk_size/block_size gives the number of blocks in a chunk, and more blocks in a chunk means that you read from more different parts of the dataset, which gives a better mixing of the chunk. A high block size therefore means the chunk is less mixed and you might get some bias in the training, while a low block size gives a better-mixed chunk, which is better for the training, but you might see a higher execution time. Once a chunk is loaded into memory, its entries are further shuffled before the batches are created.
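
As a concrete (made-up) numerical example of that ratio:

```python
# Illustrative numbers only.
chunk_size = 100_000
block_size = 1_000
n_blocks = chunk_size // block_size  # 100 blocks per chunk

# Each chunk is assembled from 100 blocks of 1000 consecutive entries,
# read from 100 different parts of the dataset -> a well-mixed chunk.
# With block_size = 50_000 there would be only 2 blocks per chunk,
# i.e. the chunk comes from just 2 regions of the file and is much less mixed.
```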

It is difficult to give exact recommendations, since it depends on the dataset size and how the classes are distributed in the dataset. However, one strategy you can use is to fix the chunk size as high as you can (i.e. as much data as you can afford to keep in memory) and then only adjust the block size. If you see strange behaviour when e.g. comparing the training and validation losses, it might be an indication that the block size is too high. You can then lower the block size until you get a stable training, possibly at the cost of a higher execution time.
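
In (hypothetical) pseudocode, that strategy could look like the following, where train_with() stands in for whatever builds the generators with the given sizes, trains, and returns the final losses:

```python
chunk_size = 1_000_000  # as large as your memory budget allows; keep it fixed

def looks_stable(train_loss, val_loss):
    # Placeholder criterion: the two losses should roughly track each other.
    return abs(train_loss - val_loss) / max(val_loss, 1e-9) < 0.1

for block_size in (100_000, 10_000, 1_000):  # scan from coarse to fine mixing
    train_loss, val_loss = train_with(chunk_size, block_size)  # hypothetical helper
    if looks_stable(train_loss, val_loss):
        break  # sufficient mixing found; smaller blocks would only cost more I/O
```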

Hope this clears up the questions that you had!

Thank you again for trying it out; we are also open to further discussions or a meeting if you want to show us your use case for the data loader.

Cheers,

Martin
