TChain::CopyTree only works the second+ time?

kratsg · July 27, 2016, 9:50pm

I have code like this:

      for treename in args.trees:
        logger.debug("Building a TChain for {0:s}".format(treename))
        tc = ROOT.TChain(treename)
        for f in files:
          tc.Add(f)
        # apply a selection/filter if defined
        if did in filters: logger.debug("Applying filter: {0:s}".format(filters[did]))
        # open output file for storing this
        output_file = ROOT.TFile.Open(tmpFile, 'UPDATE')
        copy_tree = False
        while not copy_tree:
          copy_tree = tc.CopyTree(filters.get(did, ''))
        # clone the tree now into the current output file
        logger.debug("Cloning {0:s} to temporary file {1:s}".format(treename, tmpFile))
        clone_tree = copy_tree.CloneTree(-1, "fast")
        logger.debug("Adding the sample weight for {0:s} to the tree".format(did))
        add_sampleWeight_branch(did, weights, clone_tree)
        
        logger.debug("Setting an alias for the MJSum branch")
        clone_tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10")
        clone_tree.SetDirectory(output_file)
        output_file.cd()
        clone_tree.Write()

        #output_file.Write()
        del tc
        # flush contents and update streamerinfo
        output_file.Close()

Notice this particular block of code here (my hack)

        copy_tree = False
        while not copy_tree:
          copy_tree = tc.CopyTree(filters.get(did, ''))

If I just run tc.CopyTree(filters.get(did, '')) once, it returns a 0x(nil) tree.

[2016-07-27 16:38:00,149][DEBUG  ]  Building a TChain for nominal (mergeTrees.py:350)
[2016-07-27 16:38:00,496][DEBUG  ]  <ROOT.TTree object at 0x(nil)> (mergeTrees.py:381)
> /share/home/kratsg/HistFitter/analysis/analysis_multib/input/mergeTrees.py(384)<module>()
-> logger.debug("Cloning {0:s} to temporary file {1:s}".format(treename, tmpFile))
(Pdb) copy_tree
<ROOT.TTree object at 0x(nil)>
(Pdb) bool(copy_tree)
False
(Pdb) l
379  	        output_file = ROOT.TFile.Open(tmpFile, 'UPDATE')
380  	        copy_tree = tc.CopyTree(filters.get(did, ''))
381  	        logger.debug(copy_tree)
382  	        import pdb; pdb.set_trace()
383  	        # clone the tree now into the current output file
384  ->	        logger.debug("Cloning {0:s} to temporary file {1:s}".format(treename, tmpFile))
385  	        clone_tree = copy_tree.CloneTree(-1, "fast")
386  	        logger.debug("Adding the sample weight for {0:s} to the tree".format(did))
387  	        add_sampleWeight_branch(did, weights, clone_tree)
388  	
389  	
(Pdb) tc
<ROOT.TChain object ("nominal") at 0x204a6c0>
(Pdb) copy_tree = tc.CopyTree(filters.get(did, ''))
(Pdb) copy_tree
<ROOT.TTree object ("nominal") at 0x3c093f0>
(Pdb) bool(copy_tree)
True
(Pdb) q

As you can see, the behavior is very strange. I would expect the first call to CopyTree to return a non-nil pointer to the copied tree, which I can then either clone or write elsewhere. Instead, it only seems to work on the second consecutive call (or sometimes the third or fourth, hence the while loop), which makes the behavior seem unstable.

Any ideas?

pcanal · July 28, 2016, 5:33pm

[quote]If I just run tc.CopyTree(filters.get(did, '')) once, it returns a `0x(nil)` tree.[/quote] This is strange and should not happen. There is something odd about ‘tc’.

In addition, doing both:

        copy_tree = tc.CopyTree(filters.get(did, ''))
        # clone the tree now into the current output file
        logger.debug("Cloning {0:s} to temporary file {1:s}".format(treename, tmpFile))
        clone_tree = copy_tree.CloneTree(-1, "fast")

seems like a waste of time. Why not just use the result of CopyTree?

I.e. a priori the following should have worked (and been more efficient):

        output_file.cd()
        copy_tree = tc.CopyTree(filters.get(did, ''))
        add_sampleWeight_branch(did, weights, copy_tree)
        copy_tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10")
        output_file.Write()

Cheers,
Philippe.

PS. Note that clone_tree.SetDirectory(output_file) only changes the directory/file that the TTree will use for future operations. It does not affect what has been done so far; most importantly, it does not move any of the data that might already have been written to a file. Likewise, clone_tree.Write() will only write meta data (unless the TTree is an in-memory TTree).

kratsg · July 29, 2016, 2:31pm

The purpose of a CopyTree then a CloneTree is because we only want to clone a subset of the tree into a new file. If I do CopyTree – then it looks like the tree is “soft-linked” and if I delete the original file, the new file doesn’t have the full set of data. But I’ll try without CloneTree then…

kratsg · July 29, 2016, 2:38pm

Also, why do you not call TTree::Write() (i.e. copy_tree.Write()) and instead opt for TFile::Write()?

pcanal · July 29, 2016, 2:43pm

Because once the TTree is properly attached to the TFile, TFile::Write will both flush/write the TTree meta-data and the TFile meta-data to the disk. (i.e. the TTree::Write is implied).

Cheers,
Philippe.

pcanal · July 29, 2016, 2:45pm

Hi,

[quote]because we only want to clone a subset of the tree into a new file. If I do CopyTree – then it looks like the tree is “soft-linked”[/quote] This is likely because, with the outputfile->cd() call ‘missing’, the copied TTree got attached to the input file and the data ended up there. If this is not enough to solve the issue, then we need to dig into it more, as CopyTree must be sufficient to copy the TTree into a new file.

Cheers,
Philippe.
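
A minimal sketch of the pattern described above, not the thread's final code: make the output file current before calling CopyTree, stop immediately if the result is null instead of retrying, and check which file the copied tree is attached to. The helper name `copy_into` and its arguments are illustrative assumptions; cd, CopyTree, GetCurrentFile and GetName are the usual TFile/TTree methods, reached here through PyROOT.

```python
def copy_into(chain, out_file, selection=""):
    """Copy `chain` (optionally filtered) into `out_file` and return the new tree.

    `copy_into` is a hypothetical helper; `chain` is expected to behave like a
    ROOT.TChain and `out_file` like a writable ROOT.TFile.
    """
    # Per the explanation above, the copy is attached to whatever directory is
    # current, so select the output file first.
    out_file.cd()
    tree = chain.CopyTree(selection)

    # A null TTree proxy is falsy in PyROOT; fail loudly rather than retry.
    if not tree:
        raise RuntimeError("CopyTree returned a null tree for selection %r" % selection)

    # Confirm the copy is attached to the intended output file.
    current = tree.GetCurrentFile()
    if not current or current.GetName() != out_file.GetName():
        raise RuntimeError("copied tree is not attached to %s" % out_file.GetName())
    return tree
```

In the loop from the first post this would replace the `while not copy_tree` retry, e.g. `copy_tree = copy_into(tc, output_file, filters.get(did, ''))`.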

kratsg · July 29, 2016, 2:48pm

I was running into a LOT of issues with the file not getting flushed, due to the number of large trees I’m writing, O(100). I found that the only way to get ROOT to behave nicely was to constantly open and close the file after writing a single TTree.

But according to what you’re saying, I can do something like

out_file = ROOT.TFile.Open(..., 'UPDATE')
for tree in treenames:
  tc = ROOT.TChain(tree)
  for f in files:
    tc.Add(f)
    
  out_file.cd()
  tc.CopyTree('')
out_file.Write()
out_file.Close()

and that will be ok? Even if I can’t hold all the trees in memory? Or do I need to call a Flush or write and re-open/re-close in the loop to make sure?

pcanal · July 29, 2016, 2:57pm

[quote]But according to what you’re saying, I can do something like[/quote]
[quote]Even if I can’t hold all the trees in memory?[/quote]

With that code at most the TTree meta data and one basket per branch would be in memory.

One more caveat: this is PyROOT, and I do not recall the exact rule in PyROOT. It is plausible that PyROOT (because there is no reference left once the loop iteration is done) might delete the copied tree from memory before the file has a chance to write it to disk. Because of this, you might still need to write the TTree meta data explicitly.

[code]out_file = ROOT.TFile.Open(..., 'UPDATE')
for tree in treenames:
  tc = ROOT.TChain(tree)
  for f in files:
    tc.Add(f)

  out_file.cd()
  copiedtree = tc.CopyTree('')
  out_file.cd() # making sure Write writes in the right file.
  copiedtree.Write()
out_file.Write()
out_file.Close()[/code]

Cheers,
Philippe.
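
If the only worry is Python dropping the last reference to a copied tree, another way to rule that out is to hold each tree in a list until the file has been written. This is a sketch under assumptions, not something established in this thread: whether PyROOT would actually delete the tree is left open above, and `copy_all` and `make_chain` are hypothetical names chosen for illustration.

```python
def copy_all(treenames, files, out_file, make_chain, selection=""):
    """Copy every named tree from `files` into `out_file`; return the names copied.

    `copy_all` and `make_chain` are hypothetical names; with PyROOT one would
    pass make_chain=ROOT.TChain and an out_file opened with ROOT.TFile.Open.
    """
    kept = []  # Python references held until the file has been written
    for name in treenames:
        chain = make_chain(name)
        for path in files:
            chain.Add(path)
        out_file.cd()  # the copy should land in the output file
        tree = chain.CopyTree(selection)
        if not tree:
            raise RuntimeError("CopyTree returned a null tree for %s" % name)
        kept.append(tree)
    # As stated earlier in the thread, TFile::Write also writes the meta data
    # of the trees attached to the file.
    out_file.Write()
    copied = [tree.GetName() for tree in kept]
    out_file.Close()
    return copied
```

The names are collected before Close() because the trees belong to the file and should not be used once it is closed.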

kratsg · July 29, 2016, 3:05pm

Note one little gotcha so far (which bothers me) is that when I create a TChain and then run tc.CopyTree – I see two different copies, both are sort of inaccurate.

    output_file.cd()
    for treename in args.trees:
      logger.debug('Building TChain for {0:s}'.format(treename))
      tc = ROOT.TChain(treename)
      for f in files:
        logger.debug('\t{0:s}'.format(f))
        tc.Add(f)
      logger.debug('Copying TChain now')
      tree = tc.CopyTree("")
      if args.do_backward_compatibility:
        group = group.replace('singletop', 'SingleTop')
        group = group.replace('topEW', 'TopEW')
        group = group.replace('_sherpa', 'jets')
        group = group.replace('_5000', '')
        treename = treename.replace('nominal', 'NoSys')
      # names need to be format <sample>_<systematic>
      tree.SetName("_".join([group, treename]))
      tree.SetTitle("_".join([group, treename]))
      tree.SetDirectory(output_file)
      output_file.cd()
      tree.Write()

When I do this… I actually see two TTrees. One named “nominal” and one named “ttbar_NoSys” (where group = “ttbar”). The really unfortunate thing is that “nominal” seems to have been created because of the TreeChain call while “ttbar_NoSys” is from the tree.Write() call. I don’t want “nominal”. The second problem is that the “ttbar_NoSys” is missing an entire file from it, which exists in “nominal” but not in “ttbar_NoSys” [they should be equivalent according to my code above anyway, yet they aren’t!] This is code I’m converting from using CloneTree to using CopyTree…
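
One way to pin down both symptoms is to list the keys that actually ended up in the output file and to compare entry counts between the chain and the copy. The sketch below is only a diagnostic under assumptions: `key_cycles` and `entries_match` are hypothetical helper names, and the objects passed in are expected to behave like an open ROOT.TFile, a ROOT.TChain and a ROOT.TTree (it relies on GetListOfKeys, the key's GetName and GetCycle, and GetEntries).

```python
from collections import defaultdict


def key_cycles(root_file):
    """Return {key name: sorted list of cycle numbers} for the keys in `root_file`.

    `key_cycles` is a hypothetical helper; `root_file` is expected to behave
    like an open ROOT.TFile.
    """
    cycles = defaultdict(list)
    for key in root_file.GetListOfKeys():
        cycles[key.GetName()].append(key.GetCycle())
    return {name: sorted(nums) for name, nums in cycles.items()}


def entries_match(chain, tree):
    """True when an unfiltered copy has as many entries as its source chain."""
    return chain.GetEntries() == tree.GetEntries()
```

With the code above one would call key_cycles(output_file) before closing the file, and entries_match(tc, tree) right after CopyTree(""); the entry comparison only makes sense when no selection is applied.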

The code I’m working on that’s spawning this thread in the first place is here (gitlab.cern.ch/MultiBJets/HF_MB … geTrees.py) which I’ll copy below.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @file:    mergeTrees.py
# @purpose: merge many files into few files and copy trees and systematics
# @author:  Giordon Stark <gstark@cern.ch>
# @date:    July 2016
#

# __future__ imports must occur at beginning of file
# redirect python output using the newer print function with file description
#   print(string, f=fd)
from __future__ import print_function
import logging

BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE = range(8)
# The background is set with 40 plus the number of the color, and the foreground with 30
# These are the sequences needed to get colored output
RESET_SEQ = "\033[0m"
COLOR_SEQ = "\033[1;%dm"
BOLD_SEQ = "\033[1m"

def formatter_message(message, use_color = True):
  if use_color:
    message = message.replace("$RESET", RESET_SEQ).replace("$BOLD", BOLD_SEQ)
  else:
    message = message.replace("$RESET", "").replace("$BOLD", "")
  return message

COLORS = {
  'WARNING': YELLOW,
  'INFO': WHITE,
  'DEBUG': BLUE,
  'CRITICAL': YELLOW,
  'ERROR': RED
}

class ColoredFormatter(logging.Formatter):
  def __init__(self, msg, use_color = True):
    logging.Formatter.__init__(self, msg)
    self.use_color = use_color

  def format(self, record):
    levelname = record.levelname
    if self.use_color and levelname in COLORS:
      levelname_color = COLOR_SEQ % (30 + COLORS[levelname]) + levelname + RESET_SEQ
      record.levelname = levelname_color
    return logging.Formatter.format(self, record)

# Custom logger class with multiple destinations
class ColoredLogger(logging.Logger):
  FORMAT = "[$BOLD%(asctime)s$RESET][%(levelname)-18s]  %(message)s ($BOLD%(filename)s$RESET:%(lineno)d)"
  #FORMAT = "[$BOLD%(name)-20s$RESET][%(levelname)-18s]  %(message)s ($BOLD%(filename)s$RESET:%(lineno)d)"
  COLOR_FORMAT = formatter_message(FORMAT, True)
  def __init__(self, name):
    logging.Logger.__init__(self, name, logging.DEBUG)
    color_formatter = ColoredFormatter(self.COLOR_FORMAT)
    console = logging.StreamHandler()
    console.setFormatter(color_formatter)
    self.addHandler(console)
    return

root_logger = logging.getLogger()
root_logger.setLevel(logging.NOTSET)
logging.setLoggerClass(ColoredLogger)
logger = logging.getLogger("mergeTrees")

# import the rest of the stuff
import argparse
import os
import subprocess
import sys
from random import choice
import tempfile
from array import array
import re
import json

try:
  import ROOT
except ImportError:
  logger.exception('Please set up ROOT (and PyROOT bindings) before continuing')
  sys.exit(1)

did_regex = re.compile('\.(?:00)?(\d{6})\.')
def get_did(filename):
  global did_regex
  global logger
  m = did_regex.search(filename)
  if m is None:
    logger.warning('Can\'t figure out the DID! Using input filename: {0}'.format(filename))
    return filename.split("/")[-1]
  return m.group(1)

def get_scaleFactor(did, weights):
  global logger
  weight = weights[did]
  logger.debug("Weights for {0:s}".format(did))
  logger.debug("\t {0:s}".format(str(weight)))
  scaleFactor = 1.0
  cutflow = weight.get('num events', 0)
  if cutflow == 0:
    logger.error("Num. events = 0 for {0:s}".format(did))
  scaleFactor /= cutflow
  logger.debug("___________________________________________________________________")
  logger.debug(" {0:8s} Type of Scaling Applied       |        Scale Factor      ".format(did))
  logger.debug("========================================|==========================")
  logger.debug("Cutflow:           {0:20.10f} | {1:0.10f}".format(cutflow, scaleFactor))
  scaleFactor *= weight.get('cross section')
  logger.debug("Cross Section:     {0:20.10f} | {1:0.10f}".format(weight.get('cross section'), scaleFactor))
  scaleFactor *= weight.get('filter efficiency')
  logger.debug("Filter Efficiency: {0:20.10f} | {1:0.10f}".format(weight.get('filter efficiency'), scaleFactor))
  scaleFactor *= weight.get('k-factor')
  logger.debug("k-factor:          {0:20.10f} | {1:0.10f}".format(weight.get('k-factor'), scaleFactor))
  logger.debug( "‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾")
  return scaleFactor

def add_sampleWeight_branch(did, weights, tree, branchname='weight_lumi'):
  scaleFactor = get_scaleFactor(did, weights)
  branch_val = array( 'f', [ scaleFactor ] )
  branch_val_up = array( 'f', [ scaleFactor*(1+weights[did].get('rel uncert', -1.0)) ] )
  branch_val_down = array( 'f', [ scaleFactor*(1-weights[did].get('rel uncert', 1.0)) ] )
  branches = []
  branches.append(tree.Branch('{0:s}'.format(branchname), branch_val, '{0:s}/F'.format(branchname)))
  branches.append(tree.Branch('{0:s}_up'.format(branchname), branch_val_up, '{0:s}_up/F'.format(branchname)))
  branches.append(tree.Branch('{0:s}_down'.format(branchname), branch_val_down, '{0:s}_down/F'.format(branchname)))
  for i in range(tree.GetEntries()):
    tree.GetEntry(i)
    for b in branches:
      b.Fill()

  #tree.Fill()

if __name__ == "__main__":

  # if we want multiple custom formatters, use inheriting
  class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter):
    pass

  __version__ = subprocess.check_output(["git", "describe", "--always"], cwd=os.path.dirname(os.path.realpath(__file__))).strip()

  parser = argparse.ArgumentParser(description='Determine if a sample has MC Weight issues',
                                   usage='\033[93m%(prog)s\033[0m files [options]',
                                   formatter_class=lambda prog: CustomFormatter(prog, max_help_position=30))
  parser.add_argument('files', type=str, nargs='+', help='HistFitter outputs from MBJ code')

  parser.add_argument('--filters', metavar='filters.json', type=str, help='JSON dictionary where key=DID and val=selection to apply')
  parser.add_argument('--weights', metavar='weights.json', type=str, help='JSON dictionary of weights. Can be http:// url or local file!')
  parser.add_argument('--did-to-group', metavar='did_to_group.json', type=str, help='JSON dictionary mapping DID to group. Can be http:// url or local file!')

  parser.add_argument('--trees', metavar='tree', type=str, nargs='+', help='TTrees to merge and copy over in signal and bkgd files',
                      default=['nominal',
                               'JET_GroupedNP_1__1up',
                               'JET_GroupedNP_1__1down',
                               'JET_GroupedNP_2__1up',
                               'JET_GroupedNP_2__1down',
                               'JET_GroupedNP_3__1up',
                               'JET_GroupedNP_3__1down',
                               'JET_JER_SINGLE_NP__1up'])
  """
                               'JET_JMS_GroupedNP_1__1up',
                               'JET_JMS_GroupedNP_1__1down'])
                               'EG_RESOLUTION_ALL__1down',
                               'EG_RESOLUTION_ALL__1up',
                               'EG_SCALE_ALL__1down',
                               'EG_SCALE_ALL__1up',
                               'MET_SoftTrk_ResoPara',
                               'MET_SoftTrk_ResoPerp',
                               'MET_SoftTrk_ScaleDown',
                               'MET_SoftTrk_ScaleUp',
                               'MUONS_ID__1down',
                               'MUONS_ID__1up',
                               'MUONS_MS__1down',
                               'MUONS_MS__1up',
                               'MUONS_SCALE__1down',
                               'MUONS_SCALE__1up"*/
  """

  parser.add_argument('--include-dids', metavar='did', type=str, nargs='+', help='DIDs to use', default=[])

  parser.add_argument('--groups-bkg', metavar='group', type=str, nargs='+', help='Groups of bkgd to merge',
                      default=['ttbar',
                               'singletop',
                               'diboson',
                               'topEW',
                               'W_sherpa',
                               'Z_sherpa'])
  parser.add_argument('--groups-sig', metavar='group', type=str, nargs='+', help='Groups of signal to collect but not merge', default=['Gtt', 'Gbb'])

  parser.add_argument('--output-suffix', metavar='tag#####', type=str, default='', help='Add a suffix to output files, perhaps to specify a nominal only: --output-suffix _nominal')

  parser.add_argument('--do-data', action='store_true', help='Process and merge data files. Otherwise, skip this process.')

  parser.add_argument('--do-jms-fix', action='store_true', help='Add JMS systematic branches according to the prescription in this code.')

  parser.add_argument('--do-backward-compatibility', action='store_true', help='Rename the branch names to get things backwards-compatible. See the code for what this entails.')

  parser.add_argument('-v', '--verbose', dest='verbose', action='count', default=0, help='Enable verbose output of various levels. Default: no verbosity')
  parser.add_argument('--version', action='version', version='\033[93m%(prog)s\033[0m \033[94m{version}\033[0m'.format(version=__version__), default='\033[94m{version}\033[0m'.format(version=__version__))

  args = parser.parse_args()

  # set verbosity for python printing
  if args.verbose < 2:
    logger.setLevel(20 - args.verbose*10)
  else:
    logger.setLevel(logging.NOTSET + 1)

  """
    Load the files in -- do not necessarily need a filter
  """
  try:
    filters = json.load(file(args.filters))
  except IOError:
    logger.warning("No filter file. Is this expected? You gave me '{0:s}'".format(args.filters))
    filters = {}

  if args.weights.startswith('http'):
    import requests
    # requests returns the body as a string, so parse it with json.loads
    weights = json.loads(requests.get(args.weights).text)
  else:
    weights = json.load(file(args.weights))

  if args.did_to_group.startswith('http'):
    import requests
    # requests returns the body as a string, so parse it with json.loads
    did_to_group = json.loads(requests.get(args.did_to_group).text)
  else:
    did_to_group = json.load(file(args.did_to_group))


  """
    Next, ensure we have information on all DIDs used in the files.
  """
  checkedDIDs = []
  for f in args.files:
    # skip data
    if '.data.' in f: continue
    did = get_did(f)
    if did in checkedDIDs: continue
    if did not in args.include_dids: continue
    checkedDIDs.append(did)

    logger.info("Checking {0:s}".format(did))
    if did in weights:
      logger.info("\t has weight")
    else:
      logger.error("\t does not have weight")
      sys.exit(1)
    if did in did_to_group:
      logger.info("\t has group {0:s}".format(did_to_group[did]))
    else:
      logger.error("\t does not have a group assigned")
      sys.exit(1)
    logger.info("\t has filter: {0:s}".format(str(did in filters)))

  # signal is fuzzy-matched, bkg is exact-matched
  all_groups = [group for group in (args.groups_bkg + args.groups_sig) if group]
  # check the groups and see if we have splitting information
  for group in all_groups:
    logger.info("Checking {0:s}".format(group))
    logger.info("\t has defined splits: {0:s}".format(str(any(group in name for name in filters.keys()))))
    for name in filters.keys():
      if group not in name: continue
      selection = filters[name]
      logger.info("\t\t {0:s} \t {1:s}".format(name, selection))

  """
    In this section,
      - we read in all the data files (must have 'data' in the name)
      - create a TChain of the 'nominal' branch
      - create an output file for data
      - clone the TChain as a TTree which gets saved to the output
      - close the output file

    TODO: Add protection if output file already exists.
  """
  data_files = [f for f in args.files if '.data.' in f]
  if args.do_data and len(data_files) > 0:
    logger.info("Doing the data files")
    tc = ROOT.TChain('nominal')
    for f in data_files:
      logger.debug('\t{0:s}'.format(f))
      tc.Add(f)
    logger.info('Creating output file {0:s}{1:s}.root'.format('Data', args.output_suffix))
    output_file = ROOT.TFile('Data{0:s}.root'.format(args.output_suffix), 'UPDATE')
    tree = tc.CloneTree(-1, "fast")
    tree.SetName("Data")
    tree.SetTitle("Data")
    tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10")
    tree.SetDirectory(output_file)
    output_file.cd()
    tree.Write()
    #output_file.Write()
    output_file.Close()
    logger.info('Finished creating output file')

  """
    In this section,
      - we read in all the bkg files
      - sort everything / filter by group
      - make temporary files for things that need hadding or filtering
      - for each group
        - create the output file
        - for each branch
          - create a TChain
          - apply selection if desired (TChain::CopyTree)
          - clone the TChain as a TTree which gets saved to the output
          - apply weights by making a new branch
       - clean up temporary files

    TODO: Add protection if output file already exists
  """
  files_by_did = {}
  # first group by did to figure out if we need to merge files together or not
  for f in args.files:
    if f in data_files: continue
    did = get_did(f)
    if did not in args.include_dids: continue
    if did in files_by_did:
      files_by_did[did].append(f)
    else:
      files_by_did[did] = [f]

  # here, we need to do some temporary TChains and copy the trees we want
  tmpDir = tempfile.mkdtemp(dir=os.getcwd())
  logger.info("Temporary directory created: {0:s}".format(tmpDir))

  # make a dictionary mapping the group to the files inside the group
  files_by_group = {}
  for did,files in files_by_did.iteritems():
    group = did_to_group[did]
    # skip any dids that are not in groups we care about for bkg or signal
    if group not in args.groups_bkg and not any(g in group for g in args.groups_sig if g): continue
    logger.debug("Working on {0:s} which has {1:d} files".format(did, len(files)))
    logger.debug("\t belongs to group {0:s}".format(group))

    """
      always make a temporary file since we need to do at least one of the following
      which is sorted from most likely to least likely
        - add a new branch for sample weight (weight_lumi)
        - merge multiple files of the same DID together
        - apply a selection/filter to the ttrees
    """
    logger.debug("Making a tmp file for {0:s} with {1:d} files".format(did, len(files)))
    # wrap this portion in a try statement because we need to make sure we clean up after ourselves
    try:
      tmpFile = os.path.join(tmpDir, '{0:s}.root'.format(did))
      logger.debug("Creating temporary file: {0:s}".format(tmpFile))
      for treename in args.trees:
        # open the file to update it
        output_file = ROOT.TFile.Open(tmpFile, 'UPDATE')
        # start making the tchain
        logger.debug("Building a TChain for {0:s}".format(treename))
        tc = ROOT.TChain(treename)
        for f in files:
          tc.Add(f)
        # apply a selection/filter if defined
        if did in filters: logger.debug("Applying filter: {0:s}".format(filters[did]))
        """ THIS NEEDS TO BE FIXED ASAP
            For some reason, we get
              (Pdb) tc.CopyTree(filters.get(did, ''))
              <ROOT.TTree object ("nominal") at 0x4948060>
              (Pdb) copy_tree
              <ROOT.TTree object at 0x(nil)>
              (Pdb)

            which results in errors like
              [.[1m2016-07-17 14:06:04,288.[0m][.[1;34mDEBUG.[0m  ]  Cloning nominal to temporary file /tmp/tmpgGr47d363361 (.[1mmergeTrees.py.[0m:342)
              .[?1034hTraceback (most recent call last):
                File "mergeTrees.py", line 343, in <module>
                    copy_tree.CloneTree(-1, "fast")
                    ReferenceError: attempt to access a null-pointer

            which makes it seem like ROOT is being slow at copying trees so we need to do it twice so it actually copies correctly...

            A solution:
              run CloneTree first to get the TTree into the new file, then run CopyTree on that cloned tree and use it instead if it needs to be filtered. This means we do not apply CopyTree to every tree, making the script faster since we only run CopyTree when needed and then will need to do some name mangling to get the CloneTree to have a temp name and the CopyTree to have the correct name.
        """
        # for now, a temporary patch to fix the fucking problem
        #logger.debug(tc.CopyTree(filters.get(did, '')))
        copy_tree = False
        while not copy_tree:
          copy_tree = tc.CopyTree(filters.get(did, ''))
        # clone the tree now into the current output file
        logger.debug("Cloning {0:s} to temporary file {1:s}".format(treename, tmpFile))
        clone_tree = copy_tree.CloneTree(-1, "fast")
        logger.debug("Adding the sample weight for {0:s} to the tree".format(did))
        add_sampleWeight_branch(did, weights, clone_tree)

        if args.do_jms_fix and treename == 'nominal':
          logger.debug("\tDoing the JMS fix")
          for systname in ['JET_JMS_GroupedNP_1__1up','JET_JMS_GroupedNP_1__1down']:
            logger.debug("\t\t{0:s} is being cloned".format(systname))
            syst_tree = clone_tree.CloneTree(-1, "fast")
            syst_tree.SetName(systname)
            syst_tree.SetTitle(systname)
            if systname == 'JET_JMS_GroupedNP_1__1up':
              syst_tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10_JMSUP")
            elif systname == 'JET_JMS_GroupedNP_1__1down':
              syst_tree.SetAlias("MJSum_rc_r08pt10_nominal", "MJSum_rc_r08pt10_JMSDOWN")

            syst_tree.SetDirectory(output_file)
            output_file.cd()
            syst_tree.Write()

        logger.debug("Setting an alias for the MJSum branch")
        clone_tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10")
        clone_tree.SetDirectory(output_file)
        output_file.cd()
        clone_tree.Write()

        #output_file.Write()
        del tc
        # flush contents and update streamerinfo
        output_file.Close()
      try:
        files_by_group[group].append(tmpFile)
      except KeyError:
        files_by_group[group] = [tmpFile]
    except:
      # clean up after ourselves
      import shutil
      shutil.rmtree(tmpDir, ignore_errors=True)
      raise

  # JMS fix requires adding the two new TTrees in our loop and copying them from nominal
  if args.do_jms_fix:
    args.trees.extend(['JET_JMS_GroupedNP_1__1up','JET_JMS_GroupedNP_1__1down'])

  signal_filename = 'Sig{0:s}.root'.format(args.output_suffix)
  logger.info('Signal output file {0:s}'.format(signal_filename))
  bkg_filename = 'Bkg{0:s}.root'.format(args.output_suffix)
  logger.info('Background output file {0:s}'.format(bkg_filename))
  # create the output files for signal and background

  # this contains the output file that the given group is being written to
  output_file = None
  for group, files in files_by_group.iteritems():
    if group in args.groups_bkg:
      logger.debug('Group {0:s} is identified as BACKGROUND'.format(group))
      output_filename = bkg_filename
    else:
      logger.debug('Group {0:s} is identified as SIGNAL'.format(group))
      output_filename = signal_filename

    logger.info('Group {0:s} will be written to {1:s}'.format(group, output_filename))
    # loop over the trees we need to make
    for treename in args.trees:
      logger.debug('Building TChain for {0:s}'.format(treename))
      tc = ROOT.TChain(treename)
      for f in files:
        logger.debug('\t{0:s}'.format(f))
        tc.Add(f)
      # open the file
      output_file = ROOT.TFile(output_filename, 'UPDATE')
      logger.debug('Cloning TChain now')
      tree = tc.CloneTree(-1, "fast")
      if args.do_backward_compatibility:
        group = group.replace('singletop', 'SingleTop')
        group = group.replace('topEW', 'TopEW')
        group = group.replace('_sherpa', 'jets')
        group = group.replace('_5000', '')
        treename = treename.replace('nominal', 'NoSys')
      # names need to be format <sample>_<systematic>
      tree.SetName("_".join([group, treename]))
      tree.SetTitle("_".join([group, treename]))
      tree.SetDirectory(output_file)
      output_file.cd()
      tree.Write()
      # check if we need to do some splitting
      for name,selection in filters.get(group, []):
        copy_tree = tree.CopyTree(selection)
        # we're technically creating a new sample called <sample>_<name of selection>
        copy_tree.SetName("_".join([group, name, treename]))
        copy_tree.SetTitle("_".join([group, name, treename]))
        copy_tree.Write()
      output_file.Close()

  import shutil
  shutil.rmtree(tmpDir, ignore_errors=True)

  logger.info("Don't forget to remove the temp directory {0:s}".format(tmpDir))
  logger.info("Completely finished! Enjoy.")

kratsg · July 29, 2016, 4:54pm

Updating this code to use only CopyTree, as well as TTree::Write() and TFile::Write() everywhere gives us this error

*** glibc detected *** python: malloc(): smallbin double linked list corrupted: 0x00000000035643b0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3464275f4e]
/lib64/libc.so.6[0x346427a528]
/lib64/libc.so.6(__libc_malloc+0x5c)[0x346427ab1c]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc493_x86_64_slc6/slc6/gcc49/lib64/libstdc++.so.6(_Znwm+0x18)[0x7f62121d0808]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc493_x86_64_slc6/slc6/gcc49/lib64/libstdc++.so.6(_Znam+0x9)[0x7f62121d08b9]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libRIO.so(_ZN20TStreamerInfoActions12VectorLooper18ReadCollectionBoolER7TBufferPvPKNS_14TConfigurationE+0xc5)[0x7f6212e54105]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libRIO.so(_ZN11TBufferFile13ApplySequenceERKN20TStreamerInfoActions15TActionSequenceEPv+0x75)[0x7f6212eeeac5]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libTree.so(_ZN14TBranchElement16ReadLeavesMemberER7TBuffer+0x109)[0x7f6213a6d159]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libTree.so(_ZN7TBranch8GetEntryExi+0xe2)[0x7f6213a4d822]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libTree.so(_ZN14TBranchElement8GetEntryExi+0x161)[0x7f6213a78ae1]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libTree.so(_ZN5TTree8GetEntryExi+0xa3)[0x7f6213a27a33]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libTreePlayer.so(_ZN11TTreePlayer8CopyTreeEPKcS1_xx+0x220)[0x7f620816a2b0]
[0x7f621c94609f]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libCling.so(_ZNK14TClingCallFunc4execEPvS0_+0x20f)[0x7f620fdb4cef]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libCling.so(_ZNK14TClingCallFunc23exec_with_valref_returnEPvPN5cling5ValueE+0x1aa)[0x7f620fdb59da]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libCling.so(_ZN14TClingCallFunc7ExecIntEPv+0x4b)[0x7f620fdbfe1b]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libPyROOT.so(_ZN6PyROOT18TCppObjectExecutor7ExecuteElPvPNS_12TCallContextE+0x39)[0x7f6213d36b79]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libPyROOT.so(_ZN6PyROOT13TMethodHolder8CallSafeEPvlPNS_12TCallContextE+0x6c)[0x7f6213d2e9bc]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libPyROOT.so(_ZN6PyROOT13TMethodHolder7ExecuteEPvlPNS_12TCallContextE+0x1a)[0x7f6213d2d6aa]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libPyROOT.so(_ZN6PyROOT13TMethodHolder4CallEPNS_11ObjectProxyEP7_objectS4_PNS_12TCallContextE+0xf5)[0x7f6213d2c315]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libPyROOT.so(+0x4a5f5)[0x7f6213d1b5f5]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7f621cb5eb63]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3b2e)[0x7f621cc131ae]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x830)[0x7f621cc164c0]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f621cc165e9]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/../lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x8a)[0x7f621cc3a25a]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/../lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xd7)[0x7f621cc3b7e7]
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/../lib/libpython2.7.so.1.0(Py_Main+0xc25)[0x7f621cc51605]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x346421ed5d]
python[0x4006d9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 00:1b 339616074                          /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/python2.7
00600000-00601000 rw-p 00000000 00:1b 339616074                          /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.7.9p1-x86_64-slc6-gcc49/sw/lcg/releases/LCG_81b/Python/2.7.9.p1/x86_64-slc6-gcc49-opt/bin/python2.7
009c0000-1acb0000 rw-p 00000000 00:00 0                                  [heap]
3463e00000-3463e20000 r-xp 00000000 08:02 8912898                        /lib64/ld-2.12.so
346401f000-3464020000 r--p 0001f000 08:02 8912898                        /lib64/ld-2.12.so
3464020000-3464021000 rw-p 00020000 08:02 8912898                        /lib64/ld-2.12.so
3464021000-3464022000 rw-p 00000000 00:00 0 
3464200000-346438a000 r-xp 00000000 08:02 8912899                        /lib64/libc-2.12.so
346438a000-346458a000 ---p 0018a000 08:02 8912899                        /lib64/libc-2.12.so
346458a000-346458e000 r--p 0018a000 08:02 8912899                        /lib64/libc-2.12.so
346458e000-346458f000 rw-p 0018e000 08:02 8912899                        /lib64/libc-2.12.so
346458f000-3464594000 rw-p 00000000 00:00 0 
3464600000-3464617000 r-xp 00000000 08:02 8912901                        /lib64/libpthread-2.12.so
3464617000-3464817000 ---p 00017000 08:02 8912901                        /lib64/libpthread-2.12.so
3464817000-3464818000 r--p 00017000 08:02 8912901                        /lib64/libpthread-2.12.so
Aborted

pcanal · July 29, 2016, 11:02pm

I see that you still rely on tree.SetDirectory(output_file) rather than an outputfile->cd(); placed before the call to CopyTree. This may or may not be the core of the problem; however, calling SetDirectory where you have it is semantically not what you mean. It literally only means "start writing the data and meta-data for the TTree in another file"; any data previously moved to a file on disk will not be affected by that statement.
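
A minimal PyROOT-style sketch of that ordering, assuming `chain` is the TChain and `output_file` is an already opened, writable TFile (the function name `copy_into_file` is made up for illustration):

```python
def copy_into_file(chain, output_file, selection=""):
    """Copy the selected entries of `chain` into a new tree owned by `output_file`."""
    # Make the output file the current directory first: the tree created by
    # CopyTree is attached to whatever directory is current at that moment.
    output_file.cd()
    tree = chain.CopyTree(selection)
    if not tree:
        # PyROOT wraps a null TTree* as an object that evaluates to False.
        raise RuntimeError("CopyTree returned a null tree")
    return tree
```

With this ordering there should be no need for a later SetDirectory call on the copied tree.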

Cheers,
Philippe.

pcanal · July 29, 2016, 11:04pm

[quote]Updating this code to use only CopyTree, as well as TTree::Write() and TFile::Write() everywhere gives us this error …
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libTreePlayer.so(_ZN11TTreePlayer8CopyTreeEPKcS1_xx+0x220)[0x7f620816a2b0]
[0x7f621c94609f]
[/quote]
Strange: the TTree being used by CopyTree seems corrupted (which is odd, since you already had that call …). Running with valgrind may provide more information.

Cheers,
Philippe.

kratsg · July 30, 2016, 10:58am

The current version with CopyTree seems to be performing better:

gist.github.com/198f8e9adfcdc41 … abda95afec

however I’m seeing a lot of duplicated trees.

root [1] .ls
TFile**		Bkg_2.4.15-2-0_ttbarFix.root	
 TFile*		Bkg_2.4.15-2-0_ttbarFix.root	
  KEY: TTree	nominal;13	nominal
  KEY: TTree	ttbar_NoSys;14	ttbar_NoSys
  KEY: TTree	ttbar_NoSys;13	ttbar_NoSys
  KEY: TTree	ttbar_bb_NoSys;2	ttbar_bb_NoSys
  KEY: TTree	ttbar_bb_NoSys;1	ttbar_bb_NoSys
  KEY: TTree	ttbar_cc_NoSys;2	ttbar_cc_NoSys
  KEY: TTree	ttbar_cc_NoSys;1	ttbar_cc_NoSys
  KEY: TTree	ttbar_light_NoSys;2	ttbar_light_NoSys
  KEY: TTree	ttbar_light_NoSys;1	ttbar_light_NoSys
  ...

I think it’s because I’m doing output_file.Write() and tree.Write() – and each one is rewriting. It would help to make this look a lot cleaner if it only had one copy of each. So I can probably rely on TFile::Write() instead and keep closing/reopening my files just to be safe?

kratsg · July 30, 2016, 11:07am

In fact, looking even more closely at the first two trees (nominal and ttbar_NoSys), which should by definition in my code be identical, they aren’t!

From GetEntries on the input files, I have the following numbers

kratsg@tier3:~/HistFitter/analysis/analysis_multib/input (matttest_debug)$ cat test
407009
1010232L

407010
333639L

407011
112051L

However, when I look at the tree saved and how many entries exist, “nominal” is missing entries that “ttbar_NoSys” isn’t! Which is strange as technically, “ttbar_NoSys” is a CopyTree of “nominal”!

kratsg@tier3:~/HistFitter/analysis/analysis_multib/input (matttest_debug)$ root -b Bkg_2.4.15-2-0_ttbarFix.root 
   ------------------------------------------------------------
  | Welcome to ROOT 6.04/14                http://root.cern.ch |
  |                               (c) 1995-2014, The ROOT Team |
  | Built for linuxx8664gcc                                    |
  | From tag v6-04-14, 3 February 2016                         |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q' |
   ------------------------------------------------------------


Applying ATLAS style settings...

root [0] 
Attaching file Bkg_2.4.15-2-0_ttbarFix.root as _file0...
(class TFile *) 0x253e830
root [1] nominal->GetEntries("channel_number==407009")
(Long64_t) 1010232
root [2] nominal->GetEntries("channel_number==407010")
(Long64_t) 331694
root [3] nominal->GetEntries("channel_number==407011")
(Long64_t) 0
root [4] ttbar_NoSys->GetEntries("channel_number==407009")
(Long64_t) 1010232
root [5] ttbar_NoSys->GetEntries("channel_number==407010")
(Long64_t) 333639
root [6] ttbar_NoSys->GetEntries("channel_number==407011")
(Long64_t) 112051
root [7]
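
The comparison above can be automated. The sketch below is plain Python using only the numbers quoted in this post; `missing_entries` is an illustrative helper, not part of mergeTrees.py.

```python
def missing_entries(expected, observed):
    """Return {channel: entries lost} for channels with fewer entries than expected."""
    return {
        channel: count - observed.get(channel, 0)
        for channel, count in expected.items()
        if observed.get(channel, 0) < count
    }


# GetEntries() on the input files, keyed by channel_number
expected = {407009: 1010232, 407010: 333639, 407011: 112051}
# GetEntries("channel_number==...") on the trees in the merged file
nominal = {407009: 1010232, 407010: 331694, 407011: 0}
ttbar_nosys = {407009: 1010232, 407010: 333639, 407011: 112051}

print(missing_entries(expected, nominal))
print(missing_entries(expected, ttbar_nosys))
```

For these numbers, "nominal" has lost 1945 entries for 407010 and all 112051 entries for 407011, while "ttbar_NoSys" is complete.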

kratsg · July 30, 2016, 11:10am

[quote=“pcanal”][quote]Updating this code to use only CopyTree, as well as TTree::Write() and TFile::Write() everywhere gives us this error …
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/6.04.14-x86_64-slc6-gcc49-opt/lib/libTreePlayer.so(_ZN11TTreePlayer8CopyTreeEPKcS1_xx+0x220)[0x7f620816a2b0]
[0x7f621c94609f]
[/quote]
Strange: the TTree being used by CopyTree seems corrupted (which is odd, since you already had that call …). Running with valgrind may provide more information.

Cheers,
Philippe.[/quote]

I was able to solve this by not doing a CopyTree of a CopyTree – and instead doing a CloneTree of a CopyTree (or vice-versa). I wonder if this is just an issue with things being stored in memory in python.

pcanal · July 30, 2016, 3:19pm

Hi,

[quote]however I’m seeing a lot of duplicated trees.[/quote]
Those are not ‘duplicated’ TTrees but rather backup copies of the TTree meta-data. Indeed, one is written by TTree::Write and another by TFile::Write (if the TTree is still in memory at the time). For example, doing exactly (i.e. one just after the other): tree.Write(); output_file.Write() is superfluous; doing just ‘output_file.Write()’ will be enough.
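
The number after the semicolon in the `.ls` output is the key's cycle, and the highest cycle is the most recent one. As an illustration in plain Python (no ROOT needed; `latest_cycles` is a made-up helper), part of the listing quoted earlier in the thread can be reduced to the newest cycle per tree:

```python
import re

# Matches lines such as "  KEY: TTree  ttbar_NoSys;14  ttbar_NoSys"
KEY_LINE = re.compile(r"KEY:\s+(\S+)\s+(\S+);(\d+)")


def latest_cycles(listing):
    """Map each key name in a `.ls` listing to its highest cycle number."""
    cycles = {}
    for line in listing.splitlines():
        match = KEY_LINE.search(line)
        if match:
            name, cycle = match.group(2), int(match.group(3))
            cycles[name] = max(cycles.get(name, 0), cycle)
    return cycles


listing = """
  KEY: TTree    nominal;13    nominal
  KEY: TTree    ttbar_NoSys;14    ttbar_NoSys
  KEY: TTree    ttbar_NoSys;13    ttbar_NoSys
  KEY: TTree    ttbar_bb_NoSys;2    ttbar_bb_NoSys
  KEY: TTree    ttbar_bb_NoSys;1    ttbar_bb_NoSys
"""
print(latest_cycles(listing))
```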

You can also pass the option TObject::kOverwrite to Write; instead of doing a backup, it will first ‘mark as removed’ the old version and write over it (the ‘risk’ is that if the process crashes during this operation, the file is not recoverable). So you can do either
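
A minimal PyROOT sketch of the two options described in this post, assuming `tree` is a TTree attached to an open, writable `output_file` (both names, and the two function names, are placeholders for illustration):

```python
def write_via_file(output_file):
    """Option 1: a single TFile::Write() stores the in-memory objects of the file."""
    output_file.Write()


def write_tree_overwriting(tree):
    """Option 2: overwrite the previous key instead of keeping it as a backup cycle."""
    import ROOT  # imported lazily; needs a working PyROOT installation
    # An empty name means "use the object's own name".
    tree.Write("", ROOT.TObject.kOverwrite)
```

Only one of the two would be called per tree; calling both brings back the extra cycle.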

Cheers,
Philippe.

pcanal · July 30, 2016, 3:22pm

[quote]Crash… I was able to solve this by not doing a CopyTree of a CopyTree … I wonder if this is just an issue with things being stored in memory in python.[/quote]
This is unfortunate as this is a waste of our CPU time. I suspect this is indeed likely to be due to the way python/PyROOT manages the memory. Running valgrind on the failing example might help pinpoint the issue. Another way to work around it might be to keep the handles to the copied trees in a python collection whose lifetime is the same as the output file.
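
A sketch of that last work-around, assuming `chain` is the TChain (or TTree) being copied and `keep_alive` is an ordinary Python list created next to the output file. The helper name `copy_and_hold` is hypothetical, and whether this actually avoids the crash is untested:

```python
def copy_and_hold(chain, selection, keep_alive):
    """Run CopyTree and store the result so Python does not release it early."""
    tree = chain.CopyTree(selection)
    if not tree:
        # PyROOT wraps a null TTree* as an object that evaluates to False.
        raise RuntimeError("CopyTree returned a null tree for %r" % selection)
    # Hold a reference for as long as the caller keeps the list around.
    keep_alive.append(tree)
    return tree
```

The list would be created where the output file is opened and dropped only after `output_file.Close()`.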

Cheers,
Philippe.

kratsg · July 30, 2016, 3:56pm

Even more updates: it looks like having this thing in python is entirely a pipedream. For whatever reason, events are being dropped and the entire tree is not being copied over to the files correctly.

Current code: gist.github.com/8698607e2372a1c … 0e1c35c0a0 . In trying to validate input against output, we’re still seeing duplicated trees and entries going missing for no apparent reason. There are no errors anywhere in the run, so I’m not sure what’s going on here.

#!/usr/bin/env python
# -*- coding: utf-8 -*-,
# @file:    mergeTrees.py
# @purpose: merge many files into few files and copy trees and systematics
# @author:  Giordon Stark <gstark@cern.ch>
# @date:    July 2016
#

# __future__ imports must occur at beginning of file
# redirect python output using the newer print function with file description
#   print(string, f=fd)
from __future__ import print_function
import logging

BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE = range(8)
#The background is set with 40 plus the number of the color, and the foreground with 30
#These are the sequences needed to get colored output
RESET_SEQ = "\033[0m"
COLOR_SEQ = "\033[1;%dm"
BOLD_SEQ = "\033[1m"

def formatter_message(message, use_color = True):
  if use_color:
    message = message.replace("$RESET", RESET_SEQ).replace("$BOLD", BOLD_SEQ)
  else:
    message = message.replace("$RESET", "").replace("$BOLD", "")
  return message

COLORS = {
  'WARNING': YELLOW,
  'INFO': WHITE,
  'DEBUG': BLUE,
  'CRITICAL': YELLOW,
  'ERROR': RED
}

class ColoredFormatter(logging.Formatter):
  def __init__(self, msg, use_color = True):
    logging.Formatter.__init__(self, msg)
    self.use_color = use_color

  def format(self, record):
    levelname = record.levelname
    if self.use_color and levelname in COLORS:
      levelname_color = COLOR_SEQ % (30 + COLORS[levelname]) + levelname + RESET_SEQ
      record.levelname = levelname_color
    return logging.Formatter.format(self, record)

# Custom logger class with multiple destinations
class ColoredLogger(logging.Logger):
  FORMAT = "[$BOLD%(asctime)s$RESET][%(levelname)-18s]  %(message)s ($BOLD%(filename)s$RESET:%(lineno)d)"
  #FORMAT = "[$BOLD%(name)-20s$RESET][%(levelname)-18s]  %(message)s ($BOLD%(filename)s$RESET:%(lineno)d)"
  COLOR_FORMAT = formatter_message(FORMAT, True)
  def __init__(self, name):
    logging.Logger.__init__(self, name, logging.DEBUG)
    color_formatter = ColoredFormatter(self.COLOR_FORMAT)
    console = logging.StreamHandler()
    console.setFormatter(color_formatter)
    self.addHandler(console)
    return

root_logger = logging.getLogger()
root_logger.setLevel(logging.NOTSET)
logging.setLoggerClass(ColoredLogger)
logger = logging.getLogger("mergeTrees")

# import the rest of the stuff
import argparse
import os
import subprocess
import sys
from random import choice
import tempfile
from array import array
import re
import json

try:
  import ROOT
except ImportError:
  logger.exception('Please set up ROOT (and PyROOT bindings) before continuing')
  sys.exit(1)

did_regex = re.compile('\.(?:00)?(\d{6})\.')
def get_did(filename):
  global did_regex
  global logger
  m = did_regex.search(filename)
  if m is None:
    logger.warning('Can\'t figure out the DID! Using input filename: {0}'.format(filename))
    return filename.split("/")[-1]
  return m.group(1)

def get_scaleFactor(did, weights):
  global logger
  weight = weights[did]
  logger.debug("Weights for {0:s}".format(did))
  logger.debug("\t {0:s}".format(str(weight)))
  scaleFactor = 1.0
  cutflow = weight.get('num events', 0)
  if cutflow == 0:
    logger.error("Num. events = 0 for {0:s}".format(did))
  scaleFactor /= cutflow
  logger.debug("___________________________________________________________________")
  logger.debug(" {0:8s} Type of Scaling Applied       |        Scale Factor      ".format(did))
  logger.debug("========================================|==========================")
  logger.debug("Cutflow:           {0:20.10f} | {1:0.10f}".format(cutflow, scaleFactor))
  scaleFactor *= weight.get('cross section')
  logger.debug("Cross Section:     {0:20.10f} | {1:0.10f}".format(weight.get('cross section'), scaleFactor))
  scaleFactor *= weight.get('filter efficiency')
  logger.debug("Filter Efficiency: {0:20.10f} | {1:0.10f}".format(weight.get('filter efficiency'), scaleFactor))
  scaleFactor *= weight.get('k-factor')
  logger.debug("k-factor:          {0:20.10f} | {1:0.10f}".format(weight.get('k-factor'), scaleFactor))
  logger.debug( "‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾")
  return scaleFactor

def add_sampleWeight_branch(did, weights, tree, branchname='weight_lumi'):
  scaleFactor = get_scaleFactor(did, weights)
  branch_val = array( 'f', [ scaleFactor ] )
  branch_val_up = array( 'f', [ scaleFactor*(1+weights[did].get('rel uncert', -1.0)) ] )
  branch_val_down = array( 'f', [ scaleFactor*(1-weights[did].get('rel uncert', 1.0)) ] )
  branches = []
  branches.append(tree.Branch('{0:s}'.format(branchname), branch_val, '{0:s}/F'.format(branchname)))
  branches.append(tree.Branch('{0:s}_up'.format(branchname), branch_val_up, '{0:s}_up/F'.format(branchname)))
  branches.append(tree.Branch('{0:s}_down'.format(branchname), branch_val_down, '{0:s}_down/F'.format(branchname)))
  for i in range(tree.GetEntries()):
    tree.GetEntry(i)
    for b in branches:
      b.Fill()

  #tree.Fill()

if __name__ == "__main__":

  # if we want multiple custom formatters, use inheriting
  class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter):
    pass

  __version__ = subprocess.check_output(["git", "describe", "--always"], cwd=os.path.dirname(os.path.realpath(__file__))).strip()

  parser = argparse.ArgumentParser(description='Determine if a sample has MC Weight issues',
                                   usage='\033[93m%(prog)s\033[0m files [options]',
                                   formatter_class=lambda prog: CustomFormatter(prog, max_help_position=30))
  parser.add_argument('files', type=str, nargs='+', help='HistFitter outputs from MBJ code')

  parser.add_argument('--filters', metavar='filters.json', type=str, help='JSON dictionary where key=DID and val=selection to apply')
  parser.add_argument('--weights', metavar='weights.json', type=str, help='JSON dictionary of weights. Can be http:// url or local file!')
  parser.add_argument('--did-to-group', metavar='did_to_group.json', type=str, help='JSON dictionary mapping DID to group. Can be http:// url or local file!')

  parser.add_argument('--trees', metavar='tree', type=str, nargs='+', help='TTrees to merge and copy over in signal and bkgd files',
                      default=['nominal',
                               'JET_GroupedNP_1__1up',
                               'JET_GroupedNP_1__1down',
                               'JET_GroupedNP_2__1up',
                               'JET_GroupedNP_2__1down',
                               'JET_GroupedNP_3__1up',
                               'JET_GroupedNP_3__1down',
                               'JET_JER_SINGLE_NP__1up'])
  """
                               'JET_JMS_GroupedNP_1__1up',
                               'JET_JMS_GroupedNP_1__1down'])
                               'EG_RESOLUTION_ALL__1down',
                               'EG_RESOLUTION_ALL__1up',
                               'EG_SCALE_ALL__1down',
                               'EG_SCALE_ALL__1up',
                               'MET_SoftTrk_ResoPara',
                               'MET_SoftTrk_ResoPerp',
                               'MET_SoftTrk_ScaleDown',
                               'MET_SoftTrk_ScaleUp',
                               'MUONS_ID__1down',
                               'MUONS_ID__1up',
                               'MUONS_MS__1down',
                               'MUONS_MS__1up',
                               'MUONS_SCALE__1down',
                               'MUONS_SCALE__1up"*/
  """

  parser.add_argument('--include-dids', metavar='did', type=str, nargs='+', help='DIDs to use', default=[])

  parser.add_argument('--groups-bkg', metavar='group', type=str, nargs='+', help='Groups of bkgd to merge',
                      default=['ttbar',
                               'singletop',
                               'diboson',
                               'topEW',
                               'W_sherpa',
                               'Z_sherpa'])
  parser.add_argument('--groups-sig', metavar='group', type=str, nargs='+', help='Groups of signal to collect but not merge', default=['Gtt', 'Gbb'])

  parser.add_argument('--output-suffix', metavar='tag#####', type=str, default='', help='Add a suffix to output files, perhaps to specify a nominal only: --output-suffix _nominal')

  parser.add_argument('--do-data', action='store_true', help='Process and merge data files. Otherwise, skip this process.')

  parser.add_argument('--do-jms-fix', action='store_true', help='Add JMS systematic branches according to the prescription in this code.')

  parser.add_argument('--do-backward-compatibility', action='store_true', help='Rename the branch names to get things backwards-compatible. See the code for what this entails.')

  parser.add_argument('-v', '--verbose', dest='verbose', action='count', default=0, help='Enable verbose output of various levels. Default: no verbosity')
  parser.add_argument('--version', action='version', version='\033[93m%(prog)s\033[0m \033[94m{version}\033[0m'.format(version=__version__), default='\033[94m{version}\033[0m'.format(version=__version__))

  args = parser.parse_args()

  # set verbosity for python printing
  if args.verbose < 2:
    logger.setLevel(20 - args.verbose*10)
  else:
    logger.setLevel(logging.NOTSET + 1)

  """
    Load the files in -- do not necessarily need a filter
  """
  try:
    filters = json.load(file(args.filters))
  except IOError:
    logger.warning("No filter file. Is this expected? You gave me '{0:s}'".format(args.filters))
    filters = {}

  if args.weights.startswith('http'):
    import requests
    # .text is a string, so parse it with json.loads (json.load expects a file object)
    weights = json.loads(requests.get(args.weights).text)
  else:
    weights = json.load(file(args.weights))

  if args.did_to_group.startswith('http'):
    import requests
    # .text is a string, so parse it with json.loads (json.load expects a file object)
    did_to_group = json.loads(requests.get(args.did_to_group).text)
  else:
    did_to_group = json.load(file(args.did_to_group))


  """
    Next, ensure we have information on all DIDs used in the files.
  """
  checkedDIDs = []
  for f in args.files:
    # skip data
    if '.data.' in f: continue
    did = get_did(f)
    if did in checkedDIDs: continue
    if did not in args.include_dids: continue
    checkedDIDs.append(did)

    logger.info("Checking {0:s}".format(did))
    if did in weights:
      logger.info("\t has weight")
    else:
      logger.error("\t does not have weight")
      sys.exit(1)
    if did in did_to_group:
      logger.info("\t has group {0:s}".format(did_to_group[did]))
    else:
      logger.error("\t does not have a group assigned")
      sys.exit(1)
    logger.info("\t has filter: {0:s}".format(str(did in filters)))

  # signal is fuzzy-matched, bkg is exact-matched
  all_groups = [group for group in (args.groups_bkg + args.groups_sig) if group]
  # check the groups and see if we have splitting information
  for group in all_groups:
    logger.info("Checking {0:s}".format(group))
    logger.info("\t has defined splits: {0:s}".format(str(any(group in name for name in filters.keys()))))
    for name in filters.keys():
      if group not in name: continue
      selection = filters[name]
      logger.info("\t\t {0:s} \t {1:s}".format(name, selection))

  """
    In this section,
      - we read in all the data files (must have 'data' in the name)
      - create a TChain of the 'nominal' branch
      - create an output file for data
      - clone the TChain as a TTree which gets saved to the output
      - close the output file

    TODO: Add protection if output file already exists.
  """
  data_files = [f for f in args.files if '.data.' in f]
  if args.do_data and len(data_files) > 0:
    logger.info("Doing the data files")
    tc = ROOT.TChain('nominal')
    for f in data_files:
      logger.debug('\t{0:s}'.format(f))
      tc.Add(f)
    logger.info('Creating output file {0:s}{1:s}.root'.format('Data', args.output_suffix))
    output_file = ROOT.TFile('Data{0:s}.root'.format(args.output_suffix), 'UPDATE')
    output_file.cd()
    tree = tc.CopyTree('')
    #tree = tc.CloneTree(-1, "fast")
    tree.SetName("Data")
    tree.SetTitle("Data")
    tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10")
    #tree.SetDirectory(output_file)
    #output_file.cd()
    tree.Write()
    #output_file.Write()
    output_file.Close()
    logger.info('Finished creating output file')

  """
    In this section,
      - we read in all the bkg files
      - sort everything / filter by group
      - make temporary files for things that need hadding or filtering
      - for each group
        - create the output file
        - for each branch
          - create a TChain
          - apply selection if desired (TChain::CopyTree)
          - clone the TChain as a TTree which gets saved to the output
          - apply weights by making a new branch
       - clean up temporary files

    TODO: Add protection if output file already exists
  """
  files_by_did = {}
  # first group by did to figure out if we need to merge files together or not
  for f in args.files:
    if f in data_files: continue
    did = get_did(f)
    if did not in args.include_dids: continue
    if did in files_by_did:
      files_by_did[did].append(f)
    else:
      files_by_did[did] = [f]

  # here, we need to do some temporary TChains and copy the trees we want
  tmpDir = tempfile.mkdtemp(dir=os.getcwd())
  logger.info("Temporary directory created: {0:s}".format(tmpDir))

  # make a dictionary mapping the group to the files inside the group
  files_by_group = {}
  for did,files in files_by_did.iteritems():
    group = did_to_group[did]
    # skip any dids that are not in groups we care about for bkg or signal
    if group not in args.groups_bkg and not any(g in group for g in args.groups_sig if g): continue
    logger.debug("Working on {0:s} which has {1:d} files".format(did, len(files)))
    logger.debug("\t belongs to group {0:s}".format(group))

    """
      always make a temporary file since we need to do at least one of the following
      which is sorted from most likely to least likely
        - add a new branch for sample weight (weight_lumi)
        - merge multiple files of the same DID together
        - apply a selection/filter to the ttrees
    """
    logger.debug("Making a tmp file for {0:s} with {1:d} files".format(did, len(files)))
    # wrap this portion in a try statement because we need to make sure we clean up after ourselves
    try:
      tmpFile = os.path.join(tmpDir, '{0:s}.root'.format(did))
      logger.debug("Creating temporary file: {0:s}".format(tmpFile))
      for treename in args.trees:
        # start making the tchain
        logger.debug("\tBuilding a TChain for {0:s}".format(treename))
        tc = ROOT.TChain(treename)
        for f in files:
          tc.Add(f)
        # apply a selection/filter if defined
        if did in filters: logger.debug("\tApplying filter: {0:s}".format(filters[did]))
        # open the file to update it
        output_file = ROOT.TFile.Open(tmpFile, 'UPDATE')
        output_file.cd()
        copy_tree = False
        while not copy_tree:
          logger.debug("\tCopying {0:s} to temporary file {1:s}".format(treename, tmpFile))
          copy_tree = tc.CopyTree(filters.get(did, ''))
        # clone the tree now into the current output file
        #logger.debug("Cloning {0:s} to temporary file {1:s}".format(treename, tmpFile))
        #clone_tree = copy_tree.CloneTree(-1, "fast")
        logger.debug("\tAdding the sample weight for {0:s} to the tree".format(did))
        #add_sampleWeight_branch(did, weights, clone_tree)
        add_sampleWeight_branch(did, weights, copy_tree)

        if args.do_jms_fix and treename == 'nominal':
          logger.debug("\tDoing the JMS fix")
          for systname in ['JET_JMS_GroupedNP_1__1up','JET_JMS_GroupedNP_1__1down']:
            logger.debug("\t\t{0:s} is being cloned".format(systname))
            #syst_tree = clone_tree.CloneTree(-1, "fast")
            syst_tree = copy_tree.CloneTree(-1, "fast")
            #syst_tree = copy_tree.CopyTree('')
            syst_tree.SetName(systname)
            syst_tree.SetTitle(systname)
            if systname == 'JET_JMS_GroupedNP_1__1up':
              syst_tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10_JMSUP")
            elif systname == 'JET_JMS_GroupedNP_1__1down':
              syst_tree.SetAlias("MJSum_rc_r08pt10_nominal", "MJSum_rc_r08pt10_JMSDOWN")

            #syst_tree.SetDirectory(output_file)
            #output_file.cd()
            syst_tree.Write()

        logger.debug("\tSetting an alias for the MJSum branch")
        #clone_tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10")
        copy_tree.SetAlias("MJSum_rc_r08pt10_nominal","MJSum_rc_r08pt10")
        #clone_tree.SetDirectory(output_file)
        #copy_tree.SetDirectory(output_file)
        #output_file.cd()
        #clone_tree.Write()
        copy_tree.Write()

        #output_file.Write()
        #del tc
        # flush contents and update streamerinfo
        #output_file.Write()
        output_file.Close()
      try:
        files_by_group[group].append(tmpFile)
      except KeyError:
        files_by_group[group] = [tmpFile]
    except:
      # clean up after ourselves
      import shutil
      shutil.rmtree(tmpDir, ignore_errors=True)
      raise

  # JMS fix requires adding the two new TTrees in our loop and copying them from nominal
  if args.do_jms_fix:
    args.trees.extend(['JET_JMS_GroupedNP_1__1up','JET_JMS_GroupedNP_1__1down'])

  signal_filename = 'Sig{0:s}.root'.format(args.output_suffix)
  logger.info('Signal output file {0:s}'.format(signal_filename))
  bkg_filename = 'Bkg{0:s}.root'.format(args.output_suffix)
  logger.info('Background output file {0:s}'.format(bkg_filename))
  # create the output files for signal and background

  # this contains the output file that the given group is being written to
  output_file = None
  for group, files in files_by_group.iteritems():
    if group in args.groups_bkg:
      logger.debug('Group {0:s} is identified as BACKGROUND'.format(group))
      output_filename = bkg_filename
    else:
      logger.debug('Group {0:s} is identified as SIGNAL'.format(group))
      output_filename = signal_filename

    logger.info('\tGroup {0:s} will be written to {1:s}'.format(group, output_filename))
    # loop over the trees we need to make
    for treename in args.trees:
      logger.debug('\tBuilding TChain for {0:s}'.format(treename))
      tc = ROOT.TChain(treename)
      for f in files:
        logger.debug('\t\t{0:s}'.format(f))
        tc.Add(f)
      # open the file
      output_file = ROOT.TFile(output_filename, 'UPDATE')
      output_file.cd()
      logger.debug('\tCopying TChain now')
      tree = tc.CopyTree('')
      #tree = tc.CloneTree(-1, "fast")

      out_group = group
      out_treename = treename
      if args.do_backward_compatibility:
        # chain the replacements so that each one builds on the previous result
        out_group = group.replace('singletop', 'SingleTop')
        out_group = out_group.replace('topEW', 'TopEW')
        out_group = out_group.replace('_sherpa', 'jets')
        out_group = out_group.replace('_5000', '')
        out_treename = treename.replace('nominal', 'NoSys')
      # names need to be format <sample>_<systematic>
      tree.SetName("_".join([out_group, out_treename]))
      tree.SetTitle("_".join([out_group, out_treename]))
      #tree.SetDirectory(output_file)
      #output_file.cd()
      tree.Write()
      logger.info('\tTree {0:s} was written to file'.format("_".join([out_group, out_treename])))
      # check if we need to do some splitting
      for name,selection in filters.get(group, []):
        logger.info('\tApplying filter {0:s} with selection {1:s}'.format(name, selection))
        copy_tree = tree.CopyTree(selection)
        # we're technically creating a new sample called <sample>_<name of selection>
        copy_tree.SetName("_".join([out_group, name, out_treename]))
        copy_tree.SetTitle("_".join([out_group, name, out_treename]))
        copy_tree.Write()
        logger.info('\t\tTree {0:s} was written to file'.format("_".join([out_group, name, out_treename])))
      #output_file.Write()
      output_file.Close()

  import shutil
  shutil.rmtree(tmpDir, ignore_errors=True)

  logger.info("Don't forget to remove the temp directory {0:s}".format(tmpDir))
  logger.info("Completely finished! Enjoy.")

The end conclusion is that we will most likely need to drop trying to do this in python (the equivalent in C-code works perfectly) due to instability in pyroot and how it handles trees.