Analysis of categorical variables in BDT

KarenDmn · February 4, 2018, 6:57am

Hello friends of the ROOT forum,

I am using ROOT to analyze a survey of a group of people that records their age, grades, profession, race, sex, country of origin, and whether their annual income is less or more than 50,000 dollars. I divided the data into two halves: the first half is for training and the second for testing. I need to be able to determine the probability that any given person in the second group has an annual income of less or more than 50,000 dollars.

Would it be mathematically and computationally correct if I convert (for example)…
Origin country: France -> 1, Mexico -> 2, … Australia -> 40, and then use a BDT in ROOT with the transformed data?

By the way, please excuse my English mistakes.
I am looking forward to your answer.

Regards, Karen

Axel · February 5, 2018, 5:11pm

@moneta maybe you can help here?

kialbert · February 7, 2018, 5:05pm

Hi Karen,

This is possible to do, but it might lower the final accuracy of the predictor. This is because coding the labels as integers introduces an artificial arithmetic relation between them. In your example, France * Mexico = Mexico (since 1 * 2 = 2).
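To make the point concrete, here is a minimal Python sketch of the integer coding from the question. The dictionary `codes` is only an illustration built from the three countries mentioned above; it is not part of ROOT or TMVA.

```python
# Integer coding as proposed in the question (illustrative values only).
codes = {"France": 1, "Mexico": 2, "Australia": 40}

# The codes now obey arithmetic that has no meaning for countries.
product = codes["France"] * codes["Mexico"]
print(product == codes["Mexico"])

# They also imply an ordering between countries that the survey never stated.
implied_order = codes["France"] < codes["Mexico"] < codes["Australia"]
print(implied_order)
```

Both checks evaluate to true, which is exactly the kind of relation the data itself does not contain.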

An alternative is to use a separate variable for each label and indicate its presence with a 1. So if a person can come from France, Mexico, or someplace else, the resulting vector could be, e.g., <1, 0> for a French person, <0, 1> for a Mexican, and <0, 0> for a person originating from Norway.
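The indicator-variable coding described above could be sketched in plain Python as follows. The helper `indicator_vector` and the list `labels` are hypothetical names chosen for this illustration; nothing here is a ROOT or TMVA call.

```python
def indicator_vector(country, labels):
    """Return one 0/1 entry per label; all zeros means 'someplace else'."""
    return [1 if country == label else 0 for label in labels]

# Hypothetical label list matching the example in the post.
labels = ["France", "Mexico"]

french = indicator_vector("France", labels)
mexican = indicator_vector("Mexico", labels)
norwegian = indicator_vector("Norway", labels)

print(french, mexican, norwegian)
```

Each entry of the vector would then be passed to the training as its own input variable, which is why this approach gets tedious when there are many labels.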

This is a bit tedious to implement in TMVA if there are many labels, so try the coding you suggest first.

Some more information can be found here.

Cheers,
Kim

KarenDmn · March 5, 2018, 11:12pm

I tried it and it works fine. I also checked statistical articles and books where they did the same as I do.
Thanks for answering me.