TThread in fitting and calculating function

This is going to be a very general question, but I hope someone will be able to answer it.

I have a 2D function that requires a time-consuming 2D integral to be calculated for each of its points. I fit this function to a TH2D. I decided to give TThread a try, to make use of multiple cores and speed up the fit, but without success.

The fitting procedure asks for one point of the function at a time, so I cannot split the calculation by, for example, giving one half of the points to one thread and the other half to another. Instead, I decided to split the integration for each point into parts, each running in a separate thread. So now the TF2 that is fitted to the TH2D calls the function fun_int(), which creates the TThreads and runs the function int_part() in each of them.

It seems that each thread finishes very quickly. I think that is why my code now uses only about 10% of the CPU on average, instead of 100% as it did with a single main thread. I suspect that the overhead of creating, running, joining and then deleting the threads for every point is what makes my program behave this way.

The question is: what would be the proper approach to this kind of problem, that is, using threads to speed up the fit of a function that takes a long time to calculate?

I cannot easily supply working code for my problem, but in general the algorithm looks like this:

typedef struct int_part_pars
 ...parameters passed to the function called from within a thread...

// function called from within a thread, calculating integral for a given point
void *int_part(void *pargs)
 for(int x=0; x<100; x++) for(int y=0; y<100; y++)

// function called from within TF2 fitted to TH2
void fun_int(Double_t *px, Double_t *par, Double_t &re1, Double_t &im1, Double_t k)
 ...filling the structure for thread

	for(int thr_no=0; thr_no<thread_cnt; thr_no++) {
		targ[thr_no] = sarg;
		targ[thr_no].st_it = 1+thr_no*pstep;
		targ[thr_no].end_it = (thr_no+1)*pstep;
		targ[thr_no].sum_r = &tsum_r[thr_no];
		targ[thr_no].sum_i = &tsum_i[thr_no];
		parg = &targ[thr_no];
		th[thr_no] = new TThread("MyThread", int_part, (void*)parg);
		th[thr_no]->Run();	// start the thread
	}

	// wait for all threads to finish
	for(int thr_no=0; thr_no<thread_cnt; thr_no++)
		th[thr_no]->Join();

	for(int thr_no=0; thr_no<thread_cnt; thr_no++)
		delete th[thr_no];

 ...put calculated values into references and finish


Instead of using TThread, I would try OpenMP. With OpenMP I did not have these problems, and I got good scalability when parallelizing the calculation of the least-squares sum (or log-likelihood) across many threads.
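
To illustrate what I mean, here is a minimal sketch of a least-squares sum parallelized over the bins with OpenMP. It is not taken from the code above: `Bin`, `model()` and `chi2()` are hypothetical names, and `model()` is only a cheap stand-in for the expensive function (in the real case it would perform the 2D integral). The idea is that the threads split the loop over all points, so each thread gets a large chunk of work instead of a small piece of a single integral.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for one histogram bin: centre, content and error.
struct Bin { double x, y, content, error; };

// Hypothetical stand-in for the expensive model. It must be thread-safe,
// i.e. it must not write to shared or static data.
double model(double x, double y, const double *par) {
    return par[0] * std::exp(-(x * x + y * y) / (2.0 * par[1] * par[1]));
}

// Least-squares sum over all bins. Every iteration is independent, so the
// loop is shared among the threads and the partial sums are combined by
// the reduction clause.
double chi2(const std::vector<Bin> &bins, const double *par) {
    assert(par != nullptr);
    double sum = 0.0;
    const long n = static_cast<long>(bins.size());
    #pragma omp parallel for reduction(+:sum) schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        const Bin &b = bins[i];
        if (b.error <= 0.0) continue;  // skip bins without a usable error
        const double r = (b.content - model(b.x, b.y, par)) / b.error;
        sum += r * r;
    }
    return sum;
}
```

With GCC or Clang this needs the `-fopenmp` flag; without it the pragma is ignored and the loop simply runs serially. A function like `chi2()` could then be handed to the minimizer as the objective function, instead of letting the fit request one point at a time.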