Python-ROOT-Snippets [ENG]

Useful snippets and code-examples for evaluating data from CSV with ROOT

This article features some Python and PyROOT (root.cern.ch) snippets that are useful for evaluating data sets from CSV files.

The typical scenario is an Excel sheet containing rows of questionnaire data into which one wants a quick insight. To be brutally honest, there are probably a thousand better ways to do this with sklearn and pandas, but what you gain by going the "hard way" is a better understanding of the math behind the functions. If you want the fast way, I would rather refer you to the excellent book

"Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" by Aurélien Géron.

0.) Data preparation

While not actually part of the analysis, it is important to format the data correctly so that Python and ROOT can interpret it easily. Usually each data set (one respondent) occupies one row. Each column contains one question, and column 1 (A in Excel) contains the ID of the data set. Remember to set Excel to an English locale before exporting: the decimal separator depends on the locale, and in German it is ",", which really doesn't make much sense when writing a CSV (comma-separated values) file. A simple layout fulfilling these criteria would be:

ID   Question 1   Question 2   Question 3
0    1            2            1
1    2            3            3
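
If changing the Excel locale is not an option, the export can also be repaired afterwards. The following is only a minimal sketch, assuming a German-locale export that uses ";" as the field delimiter and "," as the decimal separator, and assuming that no text field contains a comma of its own. The function name german_csv_to_standard and the sample text are made up for illustration.

```python
import csv
import io

def german_csv_to_standard(text):
    """Convert ';'-delimited text with decimal commas to ','-delimited text with decimal points."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    for row in reader:
        # Replace the decimal comma in every field.
        # Assumption: fields are numbers or comma-free text, so nothing else is altered.
        writer.writerow([field.replace(",", ".") for field in row])
    return out.getvalue()

# Hypothetical export with two data sets
sample = "ID;Question 1;Question 2\n0;1,5;2\n1;2;3,25\n"
converted = german_csv_to_standard(sample)
print(converted)
```

The converted text can be written to a new file and then read with the comma delimiter used in the rest of this article.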

1.) The imports

Later on we will make use of several packages, so we create one Jupyter cell that imports all of them. Typically, we can set the gROOT and gStyle options directly there as well.

###Imports, inputs, global variables
## imports
import csv
import math  # needed for math.isnan and math.sqrt in the t-test (section 8)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm
import numpy as np
import scipy
from scipy import stats
import ROOT
myCanvas = ROOT.TCanvas("main","mainCanvas",1280,720)
## input
dataFile = "./CSV_echtdaten.csv"  # "echtdaten" = real data
dataFileControl = "./CSV_kontrollgruppe.csv"  # "kontrollgruppe" = control group

## styling for the ROOT plots
ROOT.gStyle.SetStatX(0.35)
ROOT.gStyle.SetStatY(.9)

ROOT.gROOT.Macro('./phd_style.C')
ROOT.gROOT.SetStyle("STD")
#ROOT.gStyle.SetPalette(ROOT.kPigeon)
ROOT.gStyle.SetPalette(ROOT.kViridis)

ROOT.gStyle.SetTitleAlign(33)
ROOT.gStyle.SetTitleX(.8)
ROOT.gStyle.SetOptStat(1111)

2.) The standardificator

We will create a lot of very similar plots, one for each question! So we define a function that sets up all the margins and labels. We put it into another cell since we will probably want to experiment with this function a lot.

###apply all your styling code here!
def standardify(RootHisto):
    RootHisto.SetFillColor(1)
    RootHisto.SetFillStyle(3345)
    RootHisto.GetXaxis().SetTitleOffset(1.0)
    RootHisto.GetXaxis().SetTitleSize(.05)

    RootHisto.GetXaxis().SetLabelSize(0.05)
    RootHisto.GetYaxis().SetLabelSize(0.05)
    RootHisto.SetLineColor(1)
    RootHisto.SetLineWidth(3)
    RootHisto.GetYaxis().SetTitle("Number of entries")
    RootHisto.GetYaxis().SetTitleOffset(1.0)
    RootHisto.GetYaxis().SetTitleSize(0.05)

3.) Reading CSV Data into your numpy arrays

This does the actual work of reading the CSV into a numpy array, a big internal matrix or table if you will. Python makes this simple; almost everything is done for us already. To make it easy to swap the data in between, we put this part into a cell of its own!

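###Read the same data in a numpy array - easier to handle for plots
## note: with skip_header=0 a textual header row is read as a row of NaN; use skip_header=1 to drop it
csvnpR = np.genfromtxt(dataFile,delimiter=",",skip_header=0)
csvnp = np.transpose(csvnpR)
## we take the transpose for some of the evaluation later on: csvnp[i] is then column i of the CSV (csvnp[0] holds the IDs)
## same for the control group
csvnpRCG = np.genfromtxt(dataFileControl,delimiter=",",skip_header=0)
csvnpCG = np.transpose(csvnpRCG)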
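
To see what the transpose buys us, here is a small self-contained sketch. It assumes the layout from section 0 with one textual header row; the in-memory text csv_text is a hypothetical stand-in for a real file, and the names rows and columns are chosen for this example only.

```python
import io
import numpy as np

# Hypothetical stand-in for a CSV file on disk, using the layout from section 0.
csv_text = "ID,Question 1,Question 2,Question 3\n0,1,2,1\n1,2,3,3\n"

# skip_header=1 drops the textual header row, which would otherwise become a row of NaN.
rows = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
columns = np.transpose(rows)

print(rows.shape)     # one row per data set
print(columns.shape)  # one row per CSV column
print(columns[0])     # the ID column
print(columns[1])     # all answers to question 1
```

After the transpose, all answers to one question sit in a single row, which is why the later snippets index with questionId+1 to skip the ID column.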

4.) Setting the conditions and texts for each question

In the next cell we create arrays/lists containing the question-texts, ranges of possible answers and – if needed – strings for each number.

#This should really be self-explanatory, right? By the way: Our example would allow for some GREAT correlations! 😉
questionsTexts = [
    "Age",
    "Porn consumption per week",
    "annual income",
    # …..
]
questionsAxisRanges = [
    [80,0.0,80.0], #Okay, maybe this binning is too fine?
    [6,1.0,7.0], #We start at 1.0, because honestly who watches less porn than once a week?
    [5,1.0,6.0],
    # …..
]

# in the last question (income) we want to use a scale from 1 to 5, each value standing for a special string-bin, see next list!

questionWantAxisLabels = [
    False,
    False,
    True,
    # …..
]

questionsAxisLabels = [
    ["DO_NOT_WANT"],
    ["DO_NOT_WANT"],
    ["below 10k","10-20k","20-40k","40-80k","more than 80k"],
    # ….
]
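
Four parallel lists are easy to get out of sync, for example five bins but only four labels. The following is a minimal sketch of a sanity check one could run before plotting. The function name check_question_config and its snake_case parameter names are made up for this example; the data are the three example questions from the lists above.

```python
def check_question_config(texts, axis_ranges, want_labels, axis_labels):
    """Return a list of human-readable problems; an empty list means the lists are consistent."""
    problems = []
    if not (len(texts) == len(axis_ranges) == len(want_labels) == len(axis_labels)):
        problems.append("the four lists differ in length")
        return problems
    for i, text in enumerate(texts):
        n_bins, low, high = axis_ranges[i]
        if low >= high:
            problems.append(f"{text}: lower border {low} is not below upper border {high}")
        # Labelled axes need exactly one label per bin.
        if want_labels[i] and len(axis_labels[i]) != n_bins:
            problems.append(f"{text}: {len(axis_labels[i])} labels for {n_bins} bins")
    return problems

# The three example questions from the lists above.
texts = ["Age", "Porn consumption per week", "annual income"]
axis_ranges = [[80, 0.0, 80.0], [6, 1.0, 7.0], [5, 1.0, 6.0]]
want_labels = [False, False, True]
axis_labels = [["DO_NOT_WANT"], ["DO_NOT_WANT"],
               ["below 10k", "10-20k", "20-40k", "40-80k", "more than 80k"]]

print(check_question_config(texts, axis_ranges, want_labels, axis_labels))
```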

5.) The basic histogramming magic for each column

The magic now happens in a loop, which takes all the clicky-clicky work off your hands and creates (hopefully) beautiful plots from each column:

for questionId in range(0,len(questionsTexts)-2): ##loop over questions (note: the -2 skips the last two list entries, adjust to your data)
  plotName = questionsTexts[questionId]
  myActPlot = ROOT.TH1D(plotName,plotName,questionsAxisRanges[questionId][0],questionsAxisRanges[questionId][1],questionsAxisRanges[questionId][2])
  myActPlot.SetTitle("")
  ## fill all answers first; csvnp[0] holds the IDs, hence the +1
  for entry in range(0,len(csvnp[questionId+1])):
    myActPlot.Fill(csvnp[questionId+1][entry])
  #set axis labels
  if questionWantAxisLabels[questionId]:
    for binNum in range(1,questionsAxisRanges[questionId][0]+1):
      myActPlot.GetXaxis().SetBinLabel(binNum,questionsAxisLabels[questionId][binNum-1])
  ## style, draw and save once per question, after the fill loop
  standardify(myActPlot)
  myActPlot.Draw()
  myCanvas.Draw()
  #set your own path here!
  myCanvas.SaveAs("./"+str(questionId)+".png")
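
To cross-check the bin contents of such a histogram without ROOT, the same binning can be reproduced with numpy. This is only a sketch: bin_counts is a made-up helper name and the answers are invented. It assumes the same (number of bins, lower border, upper border) convention as in questionsAxisRanges.

```python
import numpy as np

def bin_counts(values, n_bins, low, high):
    """Count values per bin with the same (n_bins, low, high) convention as TH1D."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop missing answers
    # ROOT bins are half-open [low, high); np.histogram closes only the last bin,
    # so a value exactly equal to `high` is counted here but is overflow in ROOT.
    counts, edges = np.histogram(values, bins=n_bins, range=(low, high))
    return counts, edges

# Invented answers on a 1-to-5 scale, one of them missing
answers = [1, 2, 2, 3, 5, 5, 5, np.nan]
counts, edges = bin_counts(answers, 5, 1.0, 6.0)
print(counts)
```

With the range 1.0 to 6.0 and 5 bins, each integer answer from 1 to 5 lands in its own bin, which is the reason for the upper border of 6.0 in the income question.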

6.) This was too easy! Normalization please!

Normalizing a histogram or a distribution is very useful for comparing data sets of different statistical power. Doing it correctly means remembering that the statistical uncertainty of each bin has to be scaled by the same factor as the bin content. Take this as a warning when working with uncertainties! If ROOT already stores the correct per-bin errors (see Sumw2()), Scale() rescales them for you. Normalizing a histogram is as easy as follows:

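#Integrate the histogram (for unweighted fills without under-/overflow the number of entries equals the integral; otherwise use myActPlot.Integral())

int1=myActPlot.GetEntries()

#scale by the inverse of the integral

myActPlot.Scale(1.0/int1)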

Hence, the complete code for drawing a normalized distribution from one of the columns is:

plotName="Normalized Data from Question 2"
myActPlot = ROOT.TH1D(plotName,plotName,5,1,6)
indexToPlot = 1
for entry in range(0,len(csvnp[indexToPlot+1])):
    myActPlot.Fill(csvnp[indexToPlot+1][entry])
standardify(myActPlot)
myActPlot.SetTitle("")
myActPlot.GetYaxis().SetTitle("Normalized entries")

int1=myActPlot.GetEntries()
myActPlot.Scale(1.0/int1)
myActPlot.GetXaxis().SetTitle("")

myActPlot.Draw("hist")

# with errors one could use
# myActPlot.Draw()

myCanvas.Draw()
myCanvas.SaveAs("./normalized.png")

Not too difficult!
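
What the scaling does to the uncertainties can be sketched with plain numpy. The sketch assumes unweighted fills, so that each raw bin count N carries a Poisson uncertainty of sqrt(N), and it treats the total as a fixed number. The function name normalize_counts and the bin contents are made up for illustration.

```python
import numpy as np

def normalize_counts(counts):
    """Scale raw bin counts to unit sum and scale their Poisson errors by the same factor."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    errors = np.sqrt(counts)  # Poisson uncertainty of each raw count
    return counts / total, errors / total

# Invented bin contents, 25 entries in total
fractions, fraction_errors = normalize_counts([4, 16, 0, 5])
print(fractions)
print(fraction_errors)
```

The relative uncertainty of each bin is unchanged by the normalization: the bin with 16 entries has 4/16 = 25 % before and after scaling.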

7.) Plot differences of questions

This may sound silly at first: why would we ever want this? Typically, it is used to display the difference between two identical questions asked at different times. This can be a measure of the effectiveness of something that happened in between these questions. The plotting is relatively simple:

#difference histograms

plotName = "Difference annual income"

wwh = 42 #question 1 column number

wwhac = 42 #question 2 column number

# syntax is: TH1D(NAME,TITLE, NUMBER_OF_BINS,LOWER_RANGE_BORDER,UPPER_RANGE_BORDER)
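# 61 bins from -30.5 to 30.5 give a bin width of 1 and center every integer difference in its own bin
myActPlot = ROOT.TH1D(plotName,plotName,61,-30.5,30.5)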
for entry in range(0,len(csvnpCG[wwh+1])):
    myActPlot.Fill(csvnpCG[wwhac+1][entry]-csvnpCG[wwh+1][entry])
standardify(myActPlot)
myActPlot.GetXaxis().SetTitle("#Delta t [h]")
myActPlot.Draw()

myCanvas.SetLogy(1)

myCanvas.Draw()

myCanvas.SaveAs("./difference_after_time.png")
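
The differences themselves can also be computed in one go with numpy, which makes it easy to drop respondents who skipped one of the two questions. This is a minimal sketch with invented answers; paired_differences is a made-up helper name.

```python
import numpy as np

def paired_differences(before, after):
    """Return after - before for all pairs in which both answers are present."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    valid = ~np.isnan(before) & ~np.isnan(after)
    return after[valid] - before[valid]

# Invented answers; NaN marks a skipped question
before = [3, 5, np.nan, 2]
after = [4, 5, 1, np.nan]
differences = paired_differences(before, after)
print(differences)
```

In terms of the variables used above, the call would be paired_differences(csvnpCG[wwh+1], csvnpCG[wwhac+1]), and the result can be filled into the histogram entry by entry as before.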

8.) t-test for two independent samples

The t-test for two independent samples can be implemented quickly in Python. The equation has been taken from Wikipedia (https://de.wikipedia.org/wiki/Zweistichproben-t-Test).

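import math

sumX=sumY=0.0
n = 0 # number of valid entries in the control group
m = 0 # number of valid entries in the real data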

#sums and means
for entry in range(0,len(csvnp[wwh+1])):
  if not math.isnan(csvnp[wwhac+1][entry]) and not math.isnan(csvnp[wwh+1][entry]):
    sumX=sumX+(csvnp[wwhac+1][entry]-csvnp[wwh+1][entry])
    m=m+1

sumX = sumX / m

for entry in range(0,len(csvnpCG[wwh+1])):
  if not math.isnan(csvnpCG[wwhac+1][entry]) and not math.isnan(csvnpCG[wwh+1][entry]):
    n=n+1
    sumY=sumY+(csvnpCG[wwhac+1][entry]-csvnpCG[wwh+1][entry])

sumY=sumY / n

# std deviations

tmpX=tmpY =0.0
for entry in range(0,len(csvnp[wwh+1])):
  if not math.isnan(csvnp[wwhac+1][entry]) and not math.isnan(csvnp[wwh+1][entry]):
    tmpX=tmpX+((csvnp[wwhac+1][entry]-csvnp[wwh+1][entry])-sumX)**2

tmpX = tmpX / (m-1) # m entries went into the mean sumX

for entry in range(0,len(csvnpCG[wwh+1])):
  if not math.isnan(csvnpCG[wwhac+1][entry]) and not math.isnan(csvnpCG[wwh+1][entry]):
    tmpY=tmpY+((csvnpCG[wwhac+1][entry]-csvnpCG[wwh+1][entry])-sumY)**2 # deviation from the control-group mean sumY

tmpY = tmpY / (n-1)

stdDevX = math.sqrt(tmpX)
stdDevY = math.sqrt(tmpY)

 

print(sumX,"+-",stdDevX," nMeas=",m)
print(sumY,"+-",stdDevY," nMeas=",n)

##calculate the pooled standard deviation for the two-sample t-test
s=((m-1)*stdDevX**2+(n-1)*stdDevY**2) / (m+n-2)
s=math.sqrt(s)

entropyPrefix = math.sqrt((m*n)/(m+n))
t=entropyPrefix* (sumX-sumY)/s

print("Pooled standard deviation",s,"t-value",t)
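
The same statistic can be written compactly with numpy, which is handy as a cross-check of the loops. This is a sketch under the same assumptions as above (pooled variance, i.e. both samples are assumed to have equal variance); two_sample_t is a made-up function name and the numbers are invented.

```python
import math
import numpy as np

def two_sample_t(x, y):
    """Student's t for two independent samples with pooled variance; NaN entries are ignored."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x[~np.isnan(x)]
    y = y[~np.isnan(y)]
    m, n = len(x), len(y)
    # Pooled variance: both sample variances (ddof=1) weighted by their degrees of freedom.
    pooled_var = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
    return math.sqrt(m * n / (m + n)) * (x.mean() - y.mean()) / math.sqrt(pooled_var)

# Invented samples; the NaN in the first one is dropped
t = two_sample_t([1.0, 2.0, 3.0, np.nan], [2.0, 3.0, 4.0])
print(t)
```

On NaN-free input, scipy.stats.ttest_ind(x, y) with its default equal_var=True computes the same pooled-variance statistic and returns a p-value on top.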

9.) Pearson-correlations

The typical correlations are perfectly implemented in scipy already; we just use them and fill a TH2D with the results:

ROOT.gStyle.SetOptStat(0)
nQuestions = len(csvnp)
korrelationsPlot = ROOT.TH2D("pearson correlations","pearson correlations #rho(a,b)",nQuestions,0,nQuestions,nQuestions,0,nQuestions)
pearsonWCM = np.zeros([nQuestions, nQuestions])
## PEARSON correlations, implemented in scipy
for fragenId in range(1,len(csvnp)): ## question index (csvnp[0] holds the IDs)
  for korrPartnerId in range(fragenId,len(csvnp)): ## correlation partner, upper triangle only
    q1=csvnp[fragenId]
    q2=csvnp[korrPartnerId]
    ## replace NaN (missing answers) by 0 so that pearsonr gets finite input
    q1Proof = np.nan_to_num(q1)
    q2Proof = np.nan_to_num(q2)
    rho = scipy.stats.pearsonr(q1Proof,q2Proof)[0]
    pearsonWCM[fragenId][korrPartnerId]=rho
    korrelationsPlot.SetBinContent(korrPartnerId,fragenId,rho)
#a little different styling for TH2D
korrelationsPlot.GetYaxis().SetTitle("Question a")
korrelationsPlot.GetXaxis().SetTitle("Question b")
korrelationsPlot.GetXaxis().SetTitleSize(.05)
korrelationsPlot.GetYaxis().SetTitleSize(.05)
#save to csv for maybe further processing?
np.savetxt("pearson_correlations_nocm.csv",pearsonWCM)
korrelationsPlot.Draw("colz")

myCanvas.Draw()
#make z-color-range visible
myCanvas.Update() # the palette object only exists after the histogram has been painted
palette = korrelationsPlot.GetListOfFunctions().FindObject("palette")
print(palette)
palette.SetX2(37.09)
myCanvas.Draw()
myCanvas.SaveAs("./pearson_correlations.png")
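
One caveat of np.nan_to_num is that a missing answer is counted as the answer 0, which can distort the correlation. A possible alternative, shown here only as a sketch, is to use just those respondents who answered both questions. The function name pearson_pairwise and the answers are invented, and the sketch assumes that neither question has the same answer from everybody (otherwise the denominator is zero).

```python
import numpy as np

def pearson_pairwise(a, b):
    """Pearson correlation of a and b using only entries where both are not NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    valid = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[valid], b[valid]
    da, db = a - a.mean(), b - b.mean()
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))

# Invented answers: q2 is exactly twice q1 wherever both are present
q1 = [1.0, 2.0, np.nan, 4.0, 5.0]
q2 = [2.0, 4.0, 6.0, np.nan, 10.0]

r_pairwise = pearson_pairwise(q1, q2)                               # three complete pairs only
r_zero_filled = pearson_pairwise(np.nan_to_num(q1), np.nan_to_num(q2))  # NaN counted as answer 0
print(r_pairwise, r_zero_filled)
```

In this invented example the complete pairs are perfectly correlated, while the zero-filled version gives a much weaker correlation, so it is worth checking how many answers are missing before trusting the plot.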

10.) Cronbach’s alpha

The equation for the standardised Cronbach’s alpha can be taken from Wikipedia (https://de.wikipedia.org/wiki/Cronbachs_Alpha). The implementation is relatively easy once the correlations are known, see 9.).

#cronbach alpha
#calculate the mean
mean=0.0
normFac=0
for i in range(1,len(csvnp)):
  for a in range(1,len(csvnp)):
    ## don't use 0 or 1: these are the lower triangle that 9.) never fills, the diagonal, or typically badly formulated questions!
    if pearsonWCM[i][a] != 1 and pearsonWCM[i][a] != 0:
      mean = mean+pearsonWCM[i][a]
      normFac = normFac+1

mean = mean/normFac

print("r bar",mean)

#number of questions; csvnp[0] holds the IDs and is not a question
N=len(csvnp)-1
#calculate the alphas as in wikipedia
alpha = (mean*N) / (1.+(N-1.)*mean)
print("Alpha_Cronbach",alpha)
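
For a quick check of the formula, here is a compact numpy sketch that computes the standardised alpha directly from a correlation matrix. The function name standardized_alpha is made up, and the 3x3 matrix with all correlations equal to 0.5 is an invented example.

```python
import numpy as np

def standardized_alpha(corr):
    """Standardised Cronbach's alpha from a square item correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    n_items = corr.shape[0]
    # Mean of the strict upper triangle, i.e. every item pair once and no diagonal.
    upper = corr[np.triu_indices(n_items, k=1)]
    r_bar = upper.mean()
    return n_items * r_bar / (1.0 + (n_items - 1) * r_bar)

# Invented example: three items, all pairwise correlations 0.5
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
alpha = standardized_alpha(corr)
print(alpha)
```

Since only the strict upper triangle is read, the function could also be applied to pearsonWCM[1:, 1:] from 9.). Unlike the loop above, it does not skip pairs whose correlation happens to be exactly 0.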

 

 
