TParallelCoord questions/problems

Dear ROOTers

Since I am interested in using TParallelCoord as another option to display my data, which are stored
in sometimes more than 100 trees, each with more than 50,000 entries, I am enclosing a macro, which
should demonstrate my questions/problems. To use this macro, do:

// create files with trees
.L macroParallelCoord.C 
CreateFile("Plot4.root", 4, 10000) 
CreateFile("Plot20.root", 20, 60000) 
CreateFile("Plot100.root", 100, 60000) 

// draw plots
DrawParallelCoord("Plot4.root", "*", "random") 
DrawParallelCoord("Plot4.root", "*", "random", kFALSE, kTRUE) 
DrawParallelCoord("Plot4.root", "Tree2", "px:random:pz", kFALSE) 
DrawParallelCoord("Plot20.root") 
DrawParallelCoord("Plot100.root") 

  1. Problem with axes labels:
    As the attached figures show, when using only 4 trees, the axes labels are readable. However, for 20 trees,
    the axis labels already become unreadable.
    Is there a way to display axes labels vertically?

  2. Problem with candle chart:
    In my code I have to use

       if (can) para->SetCandleChart(can);

    otherwise the histograms will not be drawn when using my default setting “can=kFALSE”.
    Furthermore, when using the popup menu “ParaCoord” and checking the item “SetCandleChart”, the drawn
    lines disappear. After un-checking the item “SetCandleChart” the lines reappear, but now the histograms
    disappear. Is this intended or a bug?

  3. Highlight selected entries:
    It would be great if I could select one or more tree entries, e.g. entries [127, 2567, 7654], and
    display these entries as bold lines and/or in a different color. In principle class TParallelCoord
    already has the methods SetLineWidth() and SetLineColor(), so this should not be a large problem.
    The reason is that in my case each tree entry has a certain meaning (i.e. a gene name), and the
    users would like to see how their entries of interest behave.

  4. Using scroll bars:
    Since it is hard to see how more than 100 trees behave, it would be great if the current pad size could
    be set to a fixed width/height, and displayed in the canvas using scroll bars.

Thank you in advance.
Best regards
Christian

P.S.: I am using root 5.20/00 on Intel-Mac Tiger.
macroParallelCoord.C (4.45 KB)




Christian,

Olivier will process your mail once he is back from holidays.

Rene

Dear Rene

Thank you, I am looking forward to Olivier’s response.

BTW, I am attaching a new version of my macro, since the old one contained many bugs, which I did not immediately realize.

Best regards
Christian
macroParallelCoord.C (4.68 KB)

Hi Christian,

I am back from holidays and I have looked at your example using the parallel coordinates plots. I need a bit of time to investigate all the details, but here are my first thoughts:

First of all, you do not need to create the TParallelCoord object yourself. The Draw() method of TTree does it for you when you use the option “para”. You may need a pointer to the TParallelCoord object to change its attributes, as in $ROOTSYS/tutorials/tree/parallelcoord.C.

I have produced the various plots suggested in your macro. The last one in particular is almost unreadable if the window is too small. I guess it has 100 variables, right? Do you really intend to use it that way? I am curious to know what kind of plot you are trying to achieve with that example. I do not think that plotting the histograms on the bars is a good idea in such a case. You should not use parallel coordinates as a way to plot several histograms next to each other; that is not what they are made for.

I will answer your 4 points in a further post.

Dear Olivier

Thank you for investigating my problems, I hope you had nice holidays.

Maybe I should give you some background so that you can understand my intended use of the plots:

Let me use breast cancer as an example. In order to find new treatments people are interested to know which genes
are specifically active in breast tumors but not in normal breast tissue. For this purpose people use today the
so-called “DNA-chips” which contain gene fragments of all human genes on a slide (1cm x 1cm), onto which either
all genes active in the tumor or all genes active in normal tissue are spotted. A laser scanner is scanning the
DNA-chip and (in theory) only those genes on the chip will give a signal which are active in the tissue. However,
the data are really very noisy for many reasons. Thus, in order to get better statistics, people use samples
from e.g. 20 normal breast tissues and from 20 breast tumor tissues, which results in data from 40 DNA-chips.

Since you now need to find “differentially active” genes in a table consisting of 40 columns (samples) and more
than 25,000 rows (genes), which is in principle impossible, most statistics labs worldwide became interested in
this problem and have developed many new algorithms with the task to minimize false positives. The standard
statistics language for this purpose became R ( cran.r-project.org/ ) with a special section, called
“Bioconductor” ( bioconductor.org/ ) containing only R-packages used for DNA-chip analysis.
One of these packages is my package “xps”, which is based on ROOT, see:
bioconductor.org/packages/2.2/bioc/html/xps.html

In order to get a first impression about the quality of the data, people routinely use boxplots, see:
stat.ethz.ch/R-manual/R-patched/ … xplot.html
en.wikipedia.org/wiki/Box_plot
In addition, some labs also use the commercial program “Spotfire Decisionsite”, see:
spotfire.tibco.com/products/deci … alysis.cfm
which also contains parallel coordinates, called “profile plots”.

When I saw class TParallelCoord in ROOT I was really excited, because it contains both “profile plots”
and “boxplots”, and in addition the possibility to show the distribution of the data. (BTW, I do not quite
understand why you call the boxplots “candle charts” since this name is used mainly in the financial world).
The possibility to show boxplots and the distribution is in my opinion a real advantage when estimating the
quality of the DNA-chip data, since boxplots alone are often not sufficient to get a first impression about
the quality of the data, see also a real example on page 24 of vignette “xps.pdf” in my package, see:
bioconductor.org/packages/2. … oc/xps.pdf

As already mentioned boxplots are used to get a first impression of the quality of the raw data, but also
about the ability of different “normalization” methods to correct for differences in the raw data.
The breast cancer example above is a typical experimental setup, using between 20 and 60 samples, but people
often normalize raw data containing up to 500 samples, and they ROUTINELY use boxplots to evaluate the quality,
although they have to enlarge the width of the boxplots accordingly.
Thus my primary use of TParallelCoord will be the “candle” option to draw boxplots, but the ability to draw
histograms together with the boxplots is a major advantage compared to the R-function boxplot!
In summary, I really intend to use it that way!!! and it is a great option!!!

Sadly, TParallelCoord has an even more severe problem than the issues I have listed in my initial mail:
To understand this problem you need to know that the DNA-chips which are used by most labs do not contain
one gene fragment per gene on the chip, i.e. about 25,000 fragments, but 40 gene fragments per gene, i.e.
about 1.4 million gene fragments (=data points). The newest chips contain even 6.5 million gene fragments!

The problem with TParallelCoord is that for 1.4 million data it creates artifacts (and for 6.5 million data
it simply crashes.) In order to show you the problem of the artifacts I have updated my macro (see the
attached macro “macroParallelCoord.C”).

When you do:

.L macroParallelCoord.C 
CreateFile("Plot6.root", 6, 100000) 
DrawParallelCoord("Plot6.root", "*", "random", 0, 0, kTRUE, kTRUE, kTRUE) 
DrawParallelCoord("Plot6.root", "*", "random", 0, 0, kTRUE, kTRUE, kFALSE) 

everything is ok, see the attached figure “Plot6log.png”.

However, when you do:

.L macroParallelCoord.C+
CreateFile("Plot6M.root", 6, 1500000) 
DrawParallelCoord("Plot6M.root", "*", "random", 0, 0, kTRUE, kTRUE, kTRUE) 
DrawParallelCoord("Plot6M.root", "*", "random", 0, 0, kTRUE, kTRUE, kFALSE) 

you get an artifact, which creates a wrong boxplot, see figures “Plot6Mlog.png” and “Plot6Mloghist.png”.

Due to the severity of this problem, i.e. compared to R boxplots people will get wrong boxplots, I would
appreciate if you would consider this problem to be of the highest priority.
(The crash with 6.5 million data is probably due to memory problems. R boxplots are able to handle this case,
but need machines with at least 16GB RAM.)

Thank you in advance.
Best regards
Christian
macroParallelCoord.C (5.4 KB)






Hi Christian,

I will now look closely at your problem, especially the one you mentioned in the last post. I can already answer the easy question you asked:

The other name of “box plots” is “candle plots”. We chose the latter because the BOX option is already used as a plotting option in ROOT…

Am I right in saying that you are making a TParallelCoord plot using several different TTrees?
If so, that might be a problem, because this configuration has never been tested and was not initially in the specifications of TParallelCoord.

As I said in my first post, TParallelCoord should not (normally) be used directly by users. It is an object used internally by TTree::Draw() when there are more than 4 variables to be drawn.

Obviously, when one uses TTree::Draw() only one ntuple is involved. The way you seem to use it is outside the specifications and has never been tested.

Let me know if I am right in saying that you are making a TParallelCoord plot with several TTrees.

Dear Olivier

Yes, as you can see from my macro, I am plotting data from different trees and use TTree::AddFriend()
to add the branches from different trees. As far as I understand, this allows one to add “virtual” branches
to the original tree, and as you see, IT REALLY WORKS :-)

As I have explained in my last mail, for every DNA-chip (i.e. sample) one tree is created containing all
data from this DNA-chip. Often users add more samples at a later point of the experiment, or they
combine samples from different experiments for analysis. Thus, it is not possible for me to create
only one tree with multiple branches. Furthermore, each tree has a different name (i.e. the name of the
sample) and this name is displayed below each “candle chart”.

Since in principle everything works also for my setting, I hope that you will be able to solve the
problems that I mentioned, especially the last problem resulting in wrong boxplots.

Best regards
Christian

Ok I will investigate. For me, when I use Plot6M.root, I do not get a wrong box plot, it simply crashes.

Dear Olivier

Meanwhile I have created a new macro which creates only one tree with six branches (see the
attached macro “macroParallelCoordTree.C”).

When you do:

.L macroParallelCoordTree.C+
CreateFileTree("PlotTreeM15.root", 1500000) 
ParallelCoordTree("PlotTreeM15.root","log(random0):log(random1):log(random2):log(random3):log(random4):log(random5)",0,0,1,1) 

everything is ok, see the attached figure “PlotTreeM15.png”.

However, when you do:

.L macroParallelCoordTree.C+
CreateFileTree("PlotTreeM15.root", 1500000) 
DrawParallelCoord("PlotTreeM15.root","random0:random1:random2:random3:random4:random5",0,0,1,1,1) 

you get the same artifact as before, which creates a wrong boxplot, see figure “DrawTreeM15.png”.
Sometimes it simply crashes.

Thus the problem is not the use of tree friends but the code:

   TParallelCoord* para = new TParallelCoord(tree, nentries);
   para->AddVariable(varname);
   para->Draw();

Although you say that this is not recommended, it should in principle work.

Best regards
Christian





macroParallelCoordTree.C (5.27 KB)

Hi Christian,

I had already modified your macro to use only one tree and I see the same crashes. It seems to be a size problem. I am still investigating. It may take time.

Cheers, Olivier

Hi Christian,

I think I have located the problem. In TSelectorDraw::Begin there is the following code:

   for(i=0;i<fDimension;++i){
      if(!fVal[i] && fVar[i]) {
         fVal[i] = new Double_t[(Int_t)fTree->GetEstimate()];
      }
   }

When I run your macro, fTree->GetEstimate() returns 1000000 and fVal is created with that length. 1000000 is exactly the index from which the variable values become wrong in TParallelCoordVar::TParallelCoordVar. A printout in the loop:

for(Long64_t ui = 0;ui<fParallel->GetNentries();++ui) fVal[ui]=val[ui];

showed me that fParallel->GetNentries() = 1500000 while val is dimensioned to 1000000.

So I have no solution yet, but I wanted to let you know that I think I am on the right track.
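The mismatch described above can be reproduced outside ROOT with a small model. The following is only a sketch in plain C++, not ROOT code: fillSelected() and copyChecked() are hypothetical names that stand in for the step that fills a buffer of "estimate" values and for the loop that later copies "nentries" values out of it. It illustrates why the copy goes wrong once nentries exceeds the estimate, and what a readable error could look like instead of a crash.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the draw step: it keeps at most `estimate`
// values of a column, however many entries the column really has.
std::vector<double> fillSelected(const std::vector<double>& column,
                                 std::size_t estimate)
{
   const std::size_t kept = std::min(column.size(), estimate);
   return std::vector<double>(column.begin(), column.begin() + kept);
}

// Hypothetical guard for the consumer: copy `nentries` values only if the
// buffer really holds that many, otherwise fail with a readable message
// instead of reading past the end of the buffer.
std::vector<double> copyChecked(const std::vector<double>& buffer,
                                std::size_t nentries)
{
   if (nentries > buffer.size()) {
      throw std::length_error(
         "buffer holds " + std::to_string(buffer.size()) + " values but " +
         std::to_string(nentries) + " entries were requested");
   }
   return std::vector<double>(buffer.begin(), buffer.begin() + nentries);
}
```

With an estimate of 1000000 and 1500000 entries, copyChecked() would throw rather than read the 500000 values that were never stored; with the estimate raised to the number of entries the copy succeeds.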

In the loop over the trees you should do:

         treek->SetEstimate(treek->GetEntries());

As said in the TTree doc:

Dear Olivier

Thank you for your efforts, your solution does indeed solve the problem. I have tried it even with
6.5 million tree entries and it works great.
One question:
Is there a protection in case there is no sufficient RAM, e.g. 200 trees with 6.5 million entries each?

Furthermore:
What I do not understand is why and how fEstimate is estimated?
Why is fEstimate not set to e.g. GetSelectedRows(), or in case that no cut is made to number of tree entries?

I hope that you may also be able to solve the original 4 questions, especially the first one, i.e.
having an option to draw the axis labels vertically.

BTW, interestingly for your last two replies I did not receive any email information.

Best regards
Christian

[quote]What I do not understand is why and how fEstimate is estimated? [/quote]The default value for fEstimate is one million (it used to be 10000).

[quote]
Is there a protection in case there is no sufficient RAM, e.g. 200 trees with 6.5 million entries each? [/quote]No, there is not. That is actually the point of fEstimate: it gives you control over how much processed data is kept.

[quote]
Why is fEstimate not set to e.g. GetSelectedRows(), or in case that no cut is made to number of tree entries? [/quote]Because in most use cases this would waste memory (in most cases the information is used only once, to fill a histogram). Note also that keeping fEstimate from being too small allows the histogram fills to be done in bunches, which is more efficient, and allows a better guess of the histogram limits.

Cheers,
Philippe.
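For the RAM question above, a rough lower bound can be worked out by hand. The sketch below is plain C++ and rests on assumptions: bufferBytes() is a hypothetical helper, it counts only one 8-byte double per entry and per variable (the buffer size shown in the TSelectorDraw::Begin code quoted earlier, with the estimate raised to the full number of entries), and it ignores anything else that is allocated.

```cpp
#include <cstdint>

// Bytes needed to keep one double per entry for each variable.
// Assumption: the estimate equals the number of entries, and only the
// value buffers are counted, so this is a lower bound, not a measurement.
constexpr std::uint64_t bufferBytes(std::uint64_t nvariables,
                                    std::uint64_t nentries)
{
   return nvariables * nentries * sizeof(double);
}
```

Under these assumptions, 200 variables with 6.5 million entries each need 200 x 6500000 x 8 = 10400000000 bytes, i.e. roughly 10.4 GB, whereas the default estimate of one million corresponds to 8 MB per variable.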

Philippe:

I understand fEstimate should be under user control, but would it be possible to have a protection when it is too small and generate an error message instead of crashing, as in the example Christian made?

Christian:

I am now looking again at the original question you asked. I’ll let you know via the forum.

There is no way to change the axis labels. As one can see in the code:
root.cern.ch/root/html/src/TPara … tml#gx.8lC

but you can draw the axis horizontally as on the attached picture:


Dear Olivier, dear Philippe

Thank you for your comments to my questions about fEstimate.

Regarding axes-labels: I know that it is currently not possible to draw the text vertically,
but in principle it should be possible as the tutorial “graphs/labels1.C” demonstrates.

I know that it is possible to draw the boxplot horizontally, but most users of my package
are used to draw boxplots vertically, and I think that generally users should always have
the option to draw axes labels horizontally or vertically, independently of the type of
graph or histogram drawn.

Best regards
Christian

I did not say it is not possible to draw the labels at another angle, I simply said it is not implemented. If you look at the code I sent you, you can see that the labels are always drawn horizontally.

Dear Olivier

It would be great if you could put this on the “to-do” list, thank you.

Best regards
Christian