Slow rendering of large 2d histograms

jrtomps · May 8, 2015, 2:16pm

Hello,

I am writing an application based on ROOT that will routinely have to deal with displaying many histograms at once. The intent for this application is to replace an old piece of software with something that has the utilities of ROOT. The issue I have found is that the rendering of 2d histograms in ROOT is quite slow. I benchmarked this in the attached script (run in compiled mode), which creates a canvas displaying 16 separate TH2s with 1000 bins on each axis. Any time I call an Update() on the canvas it takes about 20 seconds to render them all, in the meantime blocking the main thread. To my users, this is unacceptable in the new application despite it being the standard ROOT experience. They expect similar rendering performance to the older application, which can do the same update in less than 1 second while keeping the GUI responsive. For the record, all of these tests are done over a very fast ssh connection with windows forwarding to the linux cluster at our lab. This is the standard use case for all interaction with ROOT and I must base my performance in this context.

Are there any tricks that I should consider to knock down the rendering time by about a factor of 20?

Thank you in advance,
Jeromy

P.S. the test script is attached. The top of the file describes how to run it.
P.P.S. These tests are being run on ROOT 5.34.10

jrtomps · May 8, 2015, 2:17pm

I forgot the attachment. Here it is.
root_test.C (3.65 KB)

tpochep · May 9, 2015, 1:45pm

Are you kidding?
Do the math, there’s no miracle: 1000x1000 == 1000000 bins. If you are using a scatter plot (btw you did not mention the type of plot you are using), it’s presented as a ‘cloud’ of points, so to fill a million squares with hundreds/thousands of points you call random (or whatever the name is) hundreds of millions of times, and you have 16 such histograms. Oh, and the fact that you work over ssh, and that it’s a fast ssh with windows forwarding, is helping a lot!!!

Generating all these points for scatter plot can take time, I guess. And rendering millions of boxes if you’re using colz or another option is also not cheap or fast. And using all this mess over ssh also is not very smart.
Probably, something is very wrong with your approach.

couet · May 11, 2015, 8:49am

Note that with such a number of bins, each bin is smaller than a pixel on screen if you use a 500x500 TCanvas.

jrtomps · May 11, 2015, 10:15am

@tpochep

Yes, I understand that looking for a roughly 20x improvement from some trick is a bit ludicrous. I was hoping that someone might provide better insight into why the performance is slow, while being clear that a factor of 20 improvement is what I need to achieve.

These performance goals are the demands of the users, not mine. It sounds like a crazy goal to achieve, but they demand 20x better rendering performance because it is the level of performance they have grown used to. It is the performance achieved in the program I am replacing. That program is the Xamine viewer, which is part of the SpecTcl analysis framework used at the NSCL.

Having to iterate through 1 million bins is certainly a chore as you said, however, our processors run on the order of a GHz. If the code is optimized for speed, it is not unreasonable to iterate through 1 million bins and perform some simple computations in fractions of a second. If Xamine can do it, ROOT should be able to as well.

I did not mention the style of drawing because it was in the script I attached. I am passing the “colz” option, which is comparable to the style produced by Xamine.

@couet
At the moment, I am profiling the “colz” style rendering in the THistPainter::PaintColorLevels() method to see where the slowdown is. My first guess is that the rendering code is a bit bloated, but that is just a hypothesis. I need to study my profiling results and the code a bit further.

If I am able to improve performance for rendering of a “colz”, what would have to happen for it to be merged back into a deployed version of ROOT? I ask because unless my work has the option of being accepted by you all, my superiors are not so keen on me working on it (nor am I).

couet · May 11, 2015, 10:18am

We are always open to accept improvements.

tpochep · May 11, 2015, 2:55pm

[quote=“jrtomps”]@tpochep

I was hoping that someone might provide some better insight for why the performance is slow while being clear that a factor of 20 improvement is what I need to achieve.
[/quote]

Because it’s expensive to draw 16000000 filled rectangles, even running without ssh/locally.

jrtomps · May 13, 2015, 9:16pm

@couet

After some discussion here at the NSCL, we believe there are two major differences between Xamine and ROOT that produce the performance difference.

Xamine computes the image of a histogram through multiple calls to XPutPixel whereas ROOT draws a histogram via calls to XDrawRectangle. The difference between these two is that XPutPixel operates on an XImage that is local to the client of the X server. Each call to XPutPixel causes no protocol requests to be made to the X server. It is therefore fast. XDrawRectangle on the other hand appears to communicate with the server. If this is the proper understanding, it means that ROOT performance is severely hindered by the overhead of communicating with the server, since it tries to do so for every single bin in the histogram. In Xamine, the only communication that occurs with the server is after the XImage is fully drawn. It does so by calling XPutImage, which transfers the local XImage to the X server in one fell swoop.

To test the difference, I replaced the XPutPixel calls in Xamine with XDrawRectangle calls. Doing so degraded Xamine’s performance to a level equivalent to, or even worse than, the rendering performance of ROOT. Nothing else changed in Xamine, so clearly XDrawRectangle is a major culprit.

Xamine only accesses the histogram contents as many times as there are pixels to draw. So if there are only 1000 pixels available to draw in and 1M bins in the histogram, Xamine will only access bin content 1000 times. ROOT on the other hand draws a rectangle for every bin in the histogram. This is extra effort.
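A minimal sketch of the pixel-driven idea described above, written in plain C++ with no X11 or ROOT dependency. The names `Hist2D`, `colorLevel`, and `renderToImage` are hypothetical, and nearest-bin sampling is an assumption for illustration; Xamine's actual code may combine bins differently. The point is only that the loop runs over pixels, so the histogram is read once per pixel regardless of how many bins it has, and all writes go to a client-side buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 2D histogram: row-major bin contents.
struct Hist2D {
    std::size_t nx, ny;
    std::vector<double> bins;  // size nx * ny
    double at(std::size_t ix, std::size_t iy) const { return bins[iy * nx + ix]; }
};

// Map a bin content to one of nLevels color indices (linear scale, clamped).
inline std::uint8_t colorLevel(double v, double vmax, int nLevels) {
    if (vmax <= 0.0 || v <= 0.0) return 0;
    int level = static_cast<int>(v / vmax * (nLevels - 1));
    return static_cast<std::uint8_t>(level < nLevels ? level : nLevels - 1);
}

// Fill a client-side image of width x height pixels, reading the histogram
// once per pixel (nearest-bin sampling) rather than once per bin.
std::vector<std::uint8_t> renderToImage(const Hist2D& h, std::size_t width,
                                        std::size_t height, double vmax,
                                        int nLevels = 256) {
    assert(h.bins.size() == h.nx * h.ny);
    assert(nLevels >= 1 && nLevels <= 256);
    std::vector<std::uint8_t> image(width * height);
    for (std::size_t py = 0; py < height; ++py) {
        std::size_t iy = py * h.ny / height;      // pixel row -> bin row
        for (std::size_t px = 0; px < width; ++px) {
            std::size_t ix = px * h.nx / width;   // pixel column -> bin column
            image[py * width + px] = colorLevel(h.at(ix, iy), vmax, nLevels);
        }
    }
    // In an X11 client, a buffer like this would be sent with one XPutImage.
    return image;
}
```

For a 1000x1000 histogram shown in a 250x250 pad, this reads 62,500 bins instead of 1,000,000. Nearest-bin sampling skips the bins in between, so a real implementation would have to decide whether to sum or take the maximum over the bins that share a pixel.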

Because of the test I did by replacing XPutPixel with XDrawRectangle, I believe that addressing the first issue is where the most performance is to be gained. It is clear that implementing this would introduce a bit of a paradigm shift to the way ROOT handles its rendering and would likely affect a significant amount of code. How much? I am not so sure.
I am curious to know your or anyone else’s thoughts on how to proceed, considering that this has the potential to affect a significant amount of code.

couet · May 18, 2015, 8:41am

Yes by definition ROOT does that… that’s what the COL option is… it renders the bins as boxes not as pixels. The COL option is not meant to produce images pixel per pixel.

ferhue · June 10, 2015, 8:55am

A colleague of my group had the same problem and he found a workaround, namely using

before creating the canvases. The update process is then much faster. Maybe jrtomps could give it a try.

couet · June 10, 2015, 9:02am

Yes OpenGL will use the local GPU. But that does not change the rendering algorithm. It is still boxes… simply drawn a bit faster.

rfoxnscl · June 19, 2015, 11:51am

As the author of Xamine I should clarify a bit for historical purposes. Both Xamine and Root draw rectangles for each channel. The difference is in how those rectangles are drawn. Dusting off my rusty Xwindows programming knowledge…

When you draw a rectangle using XDrawRectangle each channel requires a round trip interaction with the X server if that is your display engine.

What Xamine does is create an XImage object. Those are client-side entities rather than server-side entities. XPutPixel is called on the image, possibly several times per channel, to draw a rectangle, but since these calls execute entirely on the client they run much faster than XDrawRectangle. Once the image is drawn, a single server interaction (XPutImage) transfers that image to the server and hence to the display itself.

There is a simplification in this description and a potential for an improvement in Root that does not break its current model (maybe):

Simplification: If X11 batching is enabled, several XDrawRectangle operations can be batched into a single client/server interaction.

Potential optimization:
If, in general, several channels map to the same color value, the algorithm for plotting a 2d in col form could, instead of scanning the histogram in coordinate space, drawing a rectangle at a time, locate all channels that map to each color and perform an XDrawRectangles for each batch of rectangles it needs to draw with a common color. That would reduce the server interactions to one per color level rather than one per rectangle – under the assumption that XDrawRectangles is not just doing a client side loop to do a bunch of XDrawRectangle calls but sending a single message to the X11 server with several RECTANGLE atoms.
Unfortunately XDrawRectangles only takes a single GC (which is where the color is stored), hence the need in this optimization to sort the rectangles by color level.

This optimization could be done in a display-engine-independent manner by adding a drawRectangles method to the display API and then letting X11 use XDrawRectangles while Windows, e.g., uses several Rectangle calls to implement it.
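The grouping step of this optimization can be sketched in plain C++ as follows. `Rect` and `groupByColor` are hypothetical names, and the color index per bin is assumed to be precomputed; nothing here is ROOT's or Xamine's actual code. Each entry of the returned map is one batch that a backend could hand to a single multi-rectangle call (for filled boxes under Xlib that call would be XFillRectangles rather than XDrawRectangles, which draws outlines).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical rectangle record; a real backend would use its own type
// (XRectangle in the X11 case).
struct Rect { int x, y, w, h; };

// Build one rectangle per bin and group the rectangles by color level.
// `levels` holds a precomputed color index per bin, row-major, nx * ny entries.
std::map<int, std::vector<Rect>> groupByColor(const std::vector<int>& levels,
                                              std::size_t nx, std::size_t ny,
                                              int binW, int binH) {
    assert(levels.size() == nx * ny);
    std::map<int, std::vector<Rect>> batches;
    for (std::size_t iy = 0; iy < ny; ++iy) {
        for (std::size_t ix = 0; ix < nx; ++ix) {
            Rect r{static_cast<int>(ix) * binW, static_cast<int>(iy) * binH,
                   binW, binH};
            batches[levels[iy * nx + ix]].push_back(r);
        }
    }
    return batches;  // one draw request per key, instead of one per bin
}
```

With this layout the number of draw requests is `batches.size()`, which is bounded by the number of color levels in use, while the client still visits every bin once to build the batches.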

Ron.

rfoxnscl · June 19, 2015, 11:56am

The interesting thing about my proposed optimization is that, if communication with the X server is what dominates, it turns the drawing algorithm from O(n), where n is the number of channels, into O(m), where m is the number of color levels, independent of the histogram size. Since the number of color levels is typically small in the X11 server, this approximates to O(1); that is, the time to draw a histogram becomes independent of its size.

couet · June 22, 2015, 12:20pm

The ROOT COL option is implemented in THistPainter, which is a high-level interface using the high-level box drawing method in TPad. There is no direct interface to X11. Gaining speed would mean reimplementing this method using the TImage class or a cell array method. But we cannot access PutPixel at this level.

rfoxnscl · June 22, 2015, 12:28pm

That I recognize and understand, having talked with Jeromy about your interactions outside the forum. The improvement I was suggesting, however, was not along those lines. It involved having THistPainter sort the boxes by color level before passing them off to the drawing methods, and adding a mechanism to plot multiple rectangles of the same color in one call.
That allows you, when you get to the X11 driver level, to use XDrawRectangles to batch the drawing of all (or many in any event) rectangles of the same color, which may reduce substantially the communication overhead that you currently suffer.
In the best case, if you have a 1K by 1K histogram with 256 color levels, and XDrawRectangles batches all of its rectangles into one server interaction, you’ve gone from 10^6 server interactions to 256 server interactions, which should be about 4 orders of magnitude improvement in performance if, as this thread suggests, performance is now dominated by server interaction.
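Worked out under the same best-case assumptions (one server interaction per color level, and cost proportional to the number of interactions), the ratio is:

$$
\frac{10^{6}\ \text{interactions}}{256\ \text{interactions}} \approx 3.9 \times 10^{3},
\qquad
\log_{10}\!\left(3.9 \times 10^{3}\right) \approx 3.6
$$

So the estimate rounds up to 4 orders of magnitude; the factor itself is a little under 4,000.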

couet · June 22, 2015, 12:38pm

Yes that’s not a small job. If you have the code let me know.

rfoxnscl · June 22, 2015, 12:42pm

I’ll let you know if/when either Jeromy or I get the time for that.