A possible new approach for faster repeated selections in HEP workflows

Marcin_Lesniak · June 15, 2026, 8:10am

Hello everyone,

I am new to this forum and I am not a physicist. I am the author of a computational method I call Signature Index (SI), and I joined because I’m curious whether it could be genuinely useful to physicists working with large HEP datasets and their specific style of data exploration.

SI is designed for workloads in which the same dataset is queried repeatedly through many related selections: exact conditions, changing thresholds, allowed-value sets, projections, segments, or nearby variants of previously tested states.

The core of SI is not simply that it builds another representation of the data. Its advantage comes from the specific form of that representation: it constructs a deterministic memory of the multidimensional states that were actually observed in the dataset, together with their support and optional additive payloads. Subsequent queries are evaluated against this reusable observed-support structure rather than by repeatedly scanning the original rows.
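To make the idea concrete, here is a minimal toy sketch of an observed-state memory. It is not the reference implementation, and every name in it (`build_index`, `query`, the event fields) is hypothetical. It assumes that "support" means the number of rows sharing a state and that the additive payload is a per-event weight. Rows are collapsed once into distinct observed states, and each later selection iterates over those states rather than over the original rows.

```python
from collections import defaultdict


def build_index(rows, dims, payload):
    """Collapse rows into distinct observed states.

    Returns a dict mapping each state tuple to [support, payload_sum].
    """
    index = defaultdict(lambda: [0, 0.0])
    for row in rows:
        state = tuple(row[d] for d in dims)
        entry = index[state]
        entry[0] += 1              # support: rows observed in this state
        entry[1] += row[payload]   # additive payload, e.g. summed weights
    return dict(index)


def query(index, dims, conditions):
    """Evaluate a selection against the observed states only.

    `conditions` maps a dimension name to a predicate on its value.
    Returns (total support, total payload) of the matching states.
    """
    checks = [(dims.index(name), pred) for name, pred in conditions.items()]
    support, total = 0, 0.0
    for state, (count, payload_sum) in index.items():
        if all(pred(state[i]) for i, pred in checks):
            support += count
            total += payload_sum
    return support, total


# Hypothetical, already-discretized event records.
events = [
    {"n_jets": 2, "n_leptons": 1, "pt_bin": 3, "weight": 0.5},
    {"n_jets": 2, "n_leptons": 1, "pt_bin": 3, "weight": 1.5},
    {"n_jets": 3, "n_leptons": 0, "pt_bin": 1, "weight": 1.0},
    {"n_jets": 4, "n_leptons": 1, "pt_bin": 5, "weight": 2.0},
]
DIMS = ["n_jets", "n_leptons", "pt_bin"]
index = build_index(events, DIMS, "weight")

# A threshold, an allowed-value set, and a second threshold in one selection.
loose = query(index, DIMS, {
    "n_jets": lambda v: v >= 2,
    "n_leptons": lambda v: v in {1},
    "pt_bin": lambda v: v >= 3,
})

# A nearby variant: only the pt_bin threshold changes, the index is reused.
tight = query(index, DIMS, {
    "n_jets": lambda v: v >= 2,
    "n_leptons": lambda v: v in {1},
    "pt_bin": lambda v: v >= 4,
})
```

In this toy version the per-query cost scales with the number of distinct observed states, so any gain depends on that number being much smaller than the row count. Continuous variables would have to be binned first, which this sketch simply assumes has already happened.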

This makes SI most relevant when:

· the underlying dataset remains relatively stable,

· many related hypotheses or selections are tested,

· queries involve several dimensions or state conditions,

· the cost of building the memory can be amortized across repeated exploration.
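The amortization condition in the last point can be written as a simple break-even estimate. The symbols are illustrative and do not come from the technical report: $C_{\text{build}}$ is the one-time cost of building the memory, $c_{\text{scan}}$ the cost of one query answered by scanning the rows, $c_{\text{SI}}$ the cost of one query answered from the memory, and $N$ the number of queries.

```latex
N \, c_{\text{scan}} > C_{\text{build}} + N \, c_{\text{SI}}
\quad\Longleftrightarrow\quad
N > \frac{C_{\text{build}}}{c_{\text{scan}} - c_{\text{SI}}},
\qquad \text{assuming } c_{\text{scan}} > c_{\text{SI}} .
```

If $c_{\text{scan}} \le c_{\text{SI}}$, no number of queries recovers the build cost, which is consistent with the small or absent advantage for simple one-off queries.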

I have tested SI on several large datasets and against different public tools and baselines. Most recently, I tested it on HEP-style event-level workloads. In the repeated and more complex query scenarios I examined, SI consistently outperformed every baseline I compared it with. For simple one-off queries the advantage was, as expected, much smaller or absent.

It is not intended as a replacement for ROOT, RDataFrame, storage formats, or existing HEP analysis frameworks. My question is whether this kind of reusable observed-state memory could be useful as an additional layer in the way HEP analysts repeatedly explore selections and regions over the same data.

In the technical report, I describe the HEP-style tests in more detail and discuss where SI might fit within an HEP analysis workflow.

Public materials:

· Overview: https://mlesniak75.github.io/signature-index/

· Technical report: https://zenodo.org/records/20584373

· Reference implementation: https://github.com/mlesniak75/signature-index-artifact

The public implementation is intentionally general and focused on explaining the method and making its semantics reproducible. If anyone here is interested in evaluating SI more practically, I can also provide a private, more performance-oriented HEP evaluation package prepared for local testing on event-level CSV or Parquet data.

I am simply curious whether this kind of approach could genuinely help in your specific way of working with data, or whether there are important aspects of HEP analysis that make it less useful than it appears from my external perspective.

Kind regards,
Marcin