Useful snippets and code examples for evaluating CSV data with ROOT
This article features some Python and PyROOT (root.cern.ch) snippets that can be useful for evaluating data sets from CSV files.
The typical scenario is an Excel sheet containing rows of data from a questionnaire for which one wants a quick insight. To be brutally honest, there are probably a thousand better ways to do this with sklearn and pandas – but what you gain by going the "hard way" is probably a better understanding of the math behind the functions. If you want a fast way, I would much rather refer you to the excellent book
"Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" by Aurélien Géron.
0.) Data preparation
While not actually part of the evaluation itself, it is important to format the data correctly so Python and ROOT can interpret it easily. Usually one data set occupies one row: each column contains a question, and column 1 (or A in Excel) contains the ID of the data set. Remember to set your Excel locale to English, otherwise the decimal separator (.) can vary (it is "," in German, which really doesn't make much sense when trying to parse a CSV (comma-separated values) list!). A simple layout fulfilling these criteria would be:
ID | Question 1 | Question 2 | Question 3 |
0 | 1 | 2 | 1 |
1 | 2 | 3 | 3 |
…
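Exported with an English locale, such a sheet becomes plain comma-separated text, one data set per line, for example:
ID,Question 1,Question 2,Question 3
0,1,2,1
1,2,3,3
…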
1.) The imports
Later on we will make use of several packages, so we create one Jupyter cell for all the imports. Typically, we can also set the gROOT options directly there.
###Imports, inputs, global variables
## imports
import csv
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm
import numpy as np
import scipy
from scipy import stats
import math  ## needed later on for the t-test (isnan, sqrt)
import ROOT
myCanvas = ROOT.TCanvas("main","mainCanvas",1280,720)
## input
dataFile = "./CSV_echtdaten.csv"
dataFileControl = "./CSV_kontrollgruppe.csv"
## styling for the ROOT plots
ROOT.gStyle.SetStatX(0.35)
ROOT.gStyle.SetStatY(.9)
ROOT.gROOT.Macro('./phd_style.C')
ROOT.gROOT.SetStyle("STD")
#ROOT.gStyle.SetPalette(ROOT.kPigeon)
ROOT.gStyle.SetPalette(ROOT.kViridis)
ROOT.gStyle.SetTitleAlign(33)
ROOT.gStyle.SetTitleX(.8)
ROOT.gStyle.SetOptStat(1111)
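If you run these cells on a machine without a display, or simply do not want ROOT to pop up canvas windows while the loops below save their PNG files, one optional line (not part of the original setup) does the trick:
ROOT.gROOT.SetBatch(True)  ## draw everything off-screen; SaveAs() still writes the files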
2.) The standardificator
We will create a lot of very similar plots – one for each question! So we define a function that sets up all the margins and labels. We put this into another cell, since we will probably want to experiment with this function a lot.
###apply all your styling code here!
def standardify(RootHisto):
    RootHisto.SetFillColor(1)
    RootHisto.SetFillStyle(3345)
    RootHisto.GetXaxis().SetTitleOffset(1.0)
    RootHisto.GetXaxis().SetTitleSize(0.05)
    RootHisto.GetXaxis().SetLabelSize(0.05)
    RootHisto.GetYaxis().SetLabelSize(0.05)
    RootHisto.SetLineColor(1)
    RootHisto.SetLineWidth(3)
    RootHisto.GetYaxis().SetTitle("Number of entries")
    RootHisto.GetYaxis().SetTitleOffset(1.0)
    RootHisto.GetYaxis().SetTitleSize(0.05)
3.) Reading CSV Data into your numpy arrays
This does the actual work of reading the CSV into a numpy array – a big internal matrix or table, if you will. Python makes it simple: almost everything is already done for us. To easily switch between data sets, we can create another cell containing just this part!
###Read the CSV data into a numpy array - easier to handle for plots
# note: use skip_header=1 if your CSV file still contains the header row
csvnpR = np.genfromtxt(dataFile,delimiter=",",skip_header=0)
csvnp = np.transpose(csvnpR)
## we take the transpose for some of the evaluation later on
csvnpRCG = np.genfromtxt(dataFileControl,delimiter=",",skip_header=0)
csvnpCG = np.transpose(csvnpRCG)
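A quick sanity check (just a suggestion, not part of the original recipe) never hurts to make sure the file was parsed as expected:
##sanity check: after the transpose, csvnp[0] holds the IDs and each csvnp[i] (i >= 1) one question column
print(csvnp.shape)            # (number of columns, number of data sets)
print(np.isnan(csvnp).sum())  # number of cells that could not be parsed (e.g. empty answers)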
4.) Setting the conditions and texts for each question
In the next cell we create lists containing the question texts, the ranges of possible answers and – if needed – strings for each numeric answer.
#This should really be self-explanatory, right? By the way: our example would allow for some GREAT correlations! 😉
questionsTexts = [
    "Age",
    "Porn consumption per week",
    "annual income",
    ...]
questionsAxisRanges = [
    [80,0.0,80.0], #Okay, maybe this binning is too fine?
    [6,1.0,7.0],   #We start at 1.0, because honestly who watches less porn than once a week?
    [5,1.0,6.0],
    ...]
# for the last question (income) we want to use a scale from 1 to 5 – each number standing for a special string bin, see the next list!
questionWantAxisLabels = [
    False,
    False,
    True,
    ...]
questionsAxisLabels = [
    ["DO_NOT_WANT"],
    ["DO_NOT_WANT"],
    ["below 10k","10-20k","20-40k","40-80k","more than 80k"],
    ...]
5.) The basic histogramming magic for each column
The magic now happens in a loop, which takes all the clicky-clicky work off your hands and creates (hopefully) beautiful plots for each column:
for questionId in range(0,len(questionsTexts)-2): ##loop over questions
    plotName = questionsTexts[questionId]
    myActPlot = ROOT.TH1D(plotName,plotName,questionsAxisRanges[questionId][0],questionsAxisRanges[questionId][1],questionsAxisRanges[questionId][2])
    myActPlot.SetTitle("")
    for entry in range(0,len(csvnp[questionId+1])):
        myActPlot.Fill(csvnp[questionId+1][entry])
    #set axis labels
    if questionWantAxisLabels[questionId]:
        for binNum in range(1,questionsAxisRanges[questionId][0]+1):
            myActPlot.GetXaxis().SetBinLabel(binNum,questionsAxisLabels[questionId][binNum-1])
    standardify(myActPlot)
    myActPlot.Draw()
    myCanvas.Draw()
    #set your own path here!
    myCanvas.SaveAs("./"+str(questionId)+".png")
6.) This was too easy! Normalization please!
Normalizing a histogram or a distribution is a very useful method for comparing data sets of different statistical power. Doing it correctly usually means remembering that the statistical uncertainties of each bin have to be scaled accordingly. Take this as a warning when working with uncertainties! If ROOT is already tracking the bin errors (see the note right after the snippet), the normalization will rescale them for you. Normalizing a histogram is as easy as this:
#get the number of entries (alternatively use myActPlot.Integral())
int1=myActPlot.GetEntries()
#scale by the inverse
myActPlot.Scale(1.0/int1)
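If you also care about the statistical uncertainties mentioned above, one way to make ROOT track them (a small addition, not part of the original snippet) is to call Sumw2() before scaling; Scale() then rescales the bin errors along with the contents:
myActPlot.Sumw2()          # store per-bin uncertainties from now on
int1 = myActPlot.GetEntries()
myActPlot.Scale(1.0/int1)  # bin contents AND errors are divided by int1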
Hence, the complete code for drawing a normalized distribution from one of the columns is:
plotName="Normalized Data from Question 2"
myActPlot = ROOT.TH1D(plotName,plotName,5,1,6)
indexToPlot = 1
for entry in range(0,len(csvnp[indexToPlot+1])):
    myActPlot.Fill(csvnp[indexToPlot+1][entry])
standardify(myActPlot)
myActPlot.SetTitle("")
myActPlot.GetYaxis().SetTitle("Normalized entries")
int1=myActPlot.GetEntries()
myActPlot.Scale(1.0/int1)
myActPlot.GetXaxis().SetTitle("")
myActPlot.Draw("hist")
# with errors one could use
# myActPlot.Draw()
myCanvas.Draw()
myCanvas.SaveAs("./normalized.png")
Not too difficult!
7.) Plot differences of questions
This may sound silly at first – why would we ever want this? Typically, this is used to display the difference between two identical questions referring to different points in time. This can be a measure of the effectiveness of something that happened between these two questions. The plotting is relatively simple:
#difference histograms
plotName = "Difference annual income"
wwh = 42 #question 1 column number
wwhac = 42 #question 2 column number
# syntax is: TH1D(NAME,TITLE, NUMBER_OF_BINS,LOWER_RANGE_BORDER,UPPER_RANGE_BORDER)
myActPlot = ROOT.TH1D(plotName,plotName,61,-30,30)
for entry in range(0,len(csvnpCG[wwh+1])):
    myActPlot.Fill(csvnpCG[wwhac+1][entry]-csvnpCG[wwh+1][entry])
standardify(myActPlot)
myActPlot.GetXaxis().SetTitle("#Delta t [h]")
myActPlot.Draw()
myCanvas.SetLogy(1)
myCanvas.Draw()
myCanvas.SaveAs("./difference_after_time.png")
8.) t-test for two independent samples
The t-test for two independent samples can quickly be implemented in Python. The equation has been taken from Wikipedia (https://de.wikipedia.org/wiki/Zweistichproben-t-Test) and is restated below in the notation of the code.
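Spelled out (sumX and sumY end up holding the two sample means of the differences, stdDevX and stdDevY the sample standard deviations, m and n the numbers of valid entries), the statistic computed below is:
s = sqrt( ((m-1)*stdDevX^2 + (n-1)*stdDevY^2) / (m+n-2) )
t = sqrt( m*n/(m+n) ) * (sumX - sumY) / s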
sumX=sumY=0.0
n = 0
m = 0
#sums and means
for entry in range(0,len(csvnp[wwh+1])):
    if not math.isnan(csvnp[wwhac+1][entry]) and not math.isnan(csvnp[wwh+1][entry]):
        sumX=sumX+(csvnp[wwhac+1][entry]-csvnp[wwh+1][entry])
        m=m+1
sumX = sumX / m   # mean of the differences in the real-data sample
for entry in range(0,len(csvnpCG[wwh+1])):
    if not math.isnan(csvnpCG[wwhac+1][entry]) and not math.isnan(csvnpCG[wwh+1][entry]):
        n=n+1
        sumY=sumY+(csvnpCG[wwhac+1][entry]-csvnpCG[wwh+1][entry])
sumY=sumY / n     # mean of the differences in the control group
# std deviations
tmpX=tmpY=0.0
for entry in range(0,len(csvnp[wwh+1])):
    if not math.isnan(csvnp[wwhac+1][entry]) and not math.isnan(csvnp[wwh+1][entry]):
        tmpX=tmpX+((csvnp[wwhac+1][entry]-csvnp[wwh+1][entry])-sumX)**2
tmpX = tmpX / (m-1)   # divide by m-1 (size of THIS sample)
for entry in range(0,len(csvnpCG[wwh+1])):
    if not math.isnan(csvnpCG[wwhac+1][entry]) and not math.isnan(csvnpCG[wwh+1][entry]):
        tmpY=tmpY+((csvnpCG[wwhac+1][entry]-csvnpCG[wwh+1][entry])-sumY)**2
tmpY = tmpY / (n-1)
stdDevX = math.sqrt(tmpX)
stdDevY = math.sqrt(tmpY)
print(sumX,"+-",stdDevX," nMeas=",m)
print(sumY,"+-",stdDevY," nMeas=",n)
##calculate the pooled standard deviation for the two-sample t-test
s=((m-1)*stdDevX**2+(n-1)*stdDevY**2) / (m+n-2)
s=math.sqrt(s)
entropyPrefix = math.sqrt((m*n)/(m+n))
t=entropyPrefix* (sumX-sumY)/s
print("Pooled std deviation",s,"t-value",t)
9.) Pearson correlations
The classic correlation coefficient is already perfectly implemented in scipy – we just use it and fill a TH2D with the results:
ROOT.gStyle.SetOptStat(0)
nQuestions = len(csvnp)
korrelationsPlot = ROOT.TH2D("pearson correlations","pearson correlations #rho(a,b)",nQuestions,0,nQuestions,nQuestions,0,nQuestions)
pearsonWCM = np.zeros([nQuestions, nQuestions])
## PEARSON correlations, implemented in scipy
for fragenId in range(1,len(csvnp)):
    for korrPartnerId in range(fragenId,len(csvnp)):
        q1=csvnp[fragenId]
        q2=csvnp[korrPartnerId]
        q1Proof = np.nan_to_num(q1)
        q2Proof = np.nan_to_num(q2)
        pearsonWCM[fragenId][korrPartnerId]=scipy.stats.pearsonr(q1Proof,q2Proof)[0]
        korrelationsPlot.SetBinContent(korrPartnerId,fragenId,pearsonWCM[fragenId][korrPartnerId])
#a little different styling for TH2D
korrelationsPlot.GetYaxis().SetTitle("Question a")
korrelationsPlot.GetXaxis().SetTitle("Question b")
korrelationsPlot.GetXaxis().SetTitleSize(.05)
korrelationsPlot.GetYaxis().SetTitleSize(.05)
#save to csv for maybe further processing?
np.savetxt("pearson_correlations_nocm.csv",pearsonWCM)
korrelationsPlot.Draw("colz")
myCanvas.Draw()
#make z-color-range visible
palette = korrelationsPlot.GetListOfFunctions().FindObject("palette")
print(palette)
palette.SetX2(37.09)
myCanvas.Draw()
myCanvas.SaveAs("./pearson_correlations.png")
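By the way, scipy.stats.pearsonr also returns the two-sided p-value as its second element; if you want to keep those as well, a small extension along the same lines (hypothetical names, not in the original code) could look like this:
pearsonPValues = np.zeros([nQuestions, nQuestions])
for fragenId in range(1,len(csvnp)):
    for korrPartnerId in range(fragenId,len(csvnp)):
        r, p = scipy.stats.pearsonr(np.nan_to_num(csvnp[fragenId]), np.nan_to_num(csvnp[korrPartnerId]))
        pearsonPValues[fragenId][korrPartnerId] = p
np.savetxt("pearson_pvalues.csv", pearsonPValues)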
10.) Cronbach’s alpha
The equation for the standardized Cronbach's alpha can be taken from Wikipedia (https://de.wikipedia.org/wiki/Cronbachs_Alpha) and is restated below. The implementation is relatively easy if the correlations are already known – see 9.).
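In terms of the number of items N and the mean inter-item correlation r_bar, the standardized alpha is simply:
alpha = (N * r_bar) / (1 + (N - 1) * r_bar)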
#cronbach alpha
#calculate the mean inter-item correlation
mean=0.0
normFac=0
for i in range(1,len(csvnp)):
    for a in range(1,len(csvnp)):
        ## skip the diagonal (r = 1) and the unfilled lower-triangle entries (r = 0)
        if pearsonWCM[i][a] != 1 and pearsonWCM[i][a] != 0:
            mean = mean+pearsonWCM[i][a]
            normFac = normFac+1
mean = mean/normFac
print("r bar",mean)
N=len(csvnp)  # note: if csvnp[0] is the ID column, N = len(csvnp)-1 may be the better choice
#calculate alpha as in the Wikipedia article
alpha = (mean*N) / (1.+(N-1.)*mean)
print("Alpha_Cronbach",alpha)