In this notebook, we look at UNEB results for the PLE, UCE and UACE. We provide some analysis to try and get a better understanding of the results. The results are listed by school and show performance in terms of how many students passed in each division.
This dataset is available from the Data.ug website, which hosts a number of datasets about Uganda that can be used for analysis.
We can then dive into the analysis.
ls
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline
import numpy as np
ple = pd.read_csv('ple-results-by-school-2010-2015.csv')
uce = pd.read_csv('uce-results-by-school-2011-2016.csv')
uace = pd.read_csv('uace-results-2011-2015.csv')
ple.head()
uce.head()
uace.head()
To make the different columns easier to work with, I will change them to lower case and remove any whitespace around them.
for dataset in [ple, uce, uace]:
    dataset.columns = dataset.columns.map(str.lower)
    dataset.columns = dataset.columns.map(str.strip)
ple.columns
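As an aside, pandas' vectorised string accessor can do both normalisation steps in one chained call. A minimal sketch on a toy frame (the messy column names below are made up):

```python
import pandas as pd

# Toy frame with messy headers, mimicking the raw CSVs (names are made up)
df = pd.DataFrame({'  SCHOOL ': ['A'], ' % DIV 1': [50.0]})

# Strip surrounding whitespace, then lower-case, in a single pass
df.columns = df.columns.str.strip().str.lower()

print(df.columns.tolist())  # ['school', '% div 1']
```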
I decided to create a score that aggregates the percentages in the different divisions, attaching a different weight to each. The score ranges from 0 to 100.
def score(x):
    return (x['% div 1'] + x['% div 2']*0.5 + x['% div 3']/3.0
            + x['% div 4']*0.25 + x['% u']*0.1 + x['% x']*0)
ple['score'] = ple.apply(score, axis=1)
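As a quick sanity check of the weighting, a school with every candidate in Division 1 should score exactly 100, while one with every candidate ungraded should score 10. The rows below are hypothetical, not from the dataset:

```python
import pandas as pd

def score(x):
    return (x['% div 1'] + x['% div 2'] * 0.5 + x['% div 3'] / 3.0
            + x['% div 4'] * 0.25 + x['% u'] * 0.1 + x['% x'] * 0)

# Two made-up schools: all candidates in Division 1, and all ungraded
toy = pd.DataFrame([
    {'% div 1': 100.0, '% div 2': 0.0, '% div 3': 0.0, '% div 4': 0.0, '% u': 0.0, '% x': 0.0},
    {'% div 1': 0.0, '% div 2': 0.0, '% div 3': 0.0, '% div 4': 0.0, '% u': 100.0, '% x': 0.0},
])

print([round(s, 2) for s in toy.apply(score, axis=1)])  # [100.0, 10.0]
```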
Using this score we can then obtain the top performing schools per year.
ple_by_year = ple.groupby('year')
ple_year_ranks={}
for year, group in ple_by_year:
    ple_year_ranks[year] = group[['school', 'score']].sort_values(by='score', ascending=False).drop_duplicates()
ple_year_ranks[2014].iloc[:20]
We notice that some schools are repeated, but with different integer values prepended to their names, making them appear unique. This is why they were not removed by drop_duplicates in the previous statement. We write a function called trim_name that strips the integer prefix so we can then drop the duplicates.
def trim_name(x):
    words = x.split()
    try:
        int(words[0])                      # does the name start with an integer?
        return ' '.join(words[1:]).strip()
    except (ValueError, IndexError):
        return x.strip()
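As a quick check on made-up names (the function is restated so the cell runs on its own), trim_name drops a leading integer token and strips surrounding whitespace, and leaves other names untouched:

```python
def trim_name(x):
    words = x.split()
    try:
        int(words[0])                      # does the name start with an integer?
        return ' '.join(words[1:]).strip()
    except (ValueError, IndexError):
        return x.strip()

print(trim_name('2 EXAMPLE PRIMARY SCHOOL'))   # EXAMPLE PRIMARY SCHOOL
print(trim_name('  EXAMPLE PRIMARY SCHOOL '))  # EXAMPLE PRIMARY SCHOOL
```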
for year in ple_year_ranks:
    ple_year_ranks[year]['school'] = ple_year_ranks[year]['school'].apply(trim_name)
    ple_year_ranks[year].drop_duplicates(inplace=True)
ple_year_ranks[2014].iloc[:20]
With the duplicates removed we get a clearer picture of the schools' performance. We can now see the best performing schools for each year.
for year in ple_year_ranks:
    print('The best schools in ' + str(year))
    print(ple_year_ranks[year][:10].reset_index(drop=True))
    print('\n')
We can also look at the overall best performing schools over the 6 years of the survey. First we apply the trim_name function to the original list of schools.
ple['school']=ple['school'].apply(trim_name)
We can then drop the schools that are duplicated in a particular year
ple.drop_duplicates(['school', 'year'], inplace=True)
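Passing the `['school', 'year']` subset means rows are considered duplicates when both fields match, and by default the first occurrence is kept. A minimal sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: the same school listed twice in 2014, once in 2015
df = pd.DataFrame({
    'school': ['EXAMPLE P.S.', 'EXAMPLE P.S.', 'EXAMPLE P.S.'],
    'year':   [2014, 2014, 2015],
    'score':  [80.0, 75.0, 82.0],
})

# Deduplicate on (school, year); the first 2014 row survives
df = df.drop_duplicates(['school', 'year'])
print(df['score'].tolist())  # [80.0, 82.0]
```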
We can then proceed with our analysis. We compute each school's average score over the entire study period and find which schools performed best over that time.
ple_schools = ple.groupby('school')
overall_schools= ple_schools['score'].mean()
overall_schools.sort_values(ascending=False)[:20]
We can see that the schools that feature here are mainly the schools we saw when we did a year by year performance analysis.
We can also look at the representation of the different districts among the top performing schools.
We can also look at this in terms of the schools that are perennially among the top schools every year. For this we check which schools are among the top 100 schools each year from 2010 to 2015, and then order them by their score.
We already have the ple_year_ranks dictionary that ranks the schools in order of their score per year, so we shall use this and a set operation to obtain these schools.
best_schools = []
for year in ple_year_ranks:
    best_schools.append(set(ple_year_ranks[year][:100].school.tolist()))
perenial_best = list(set.intersection(*(best_schools)))
perenial_best
Interestingly, only 10 schools are in the top 100 every year. It is important to note that the score we came up with relies on the percentages of students in each division, with a different weight attached to each. As a result, a school where many students really excel but a few score in the lower divisions may not consistently be at the top, even though it performs well perennially.
This should explain why many of the schools that we know to be among the best do not show up in the list above. This kind of scoring, based on the percentages in each division, also means a school can easily fall out of the top schools in a particular year. In addition, PLE is relatively competitive, as evidenced by the data: a very large number of schools have many students in the top divisions.
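To make this sensitivity concrete, compare two hypothetical schools: one with every candidate in Division 1, and a clearly excellent one where 10% of candidates land in Division 3. The second loses almost 7 points under our weighting:

```python
# Weights for (% div 1, % div 2, % div 3, % div 4, % u, % x), as defined above
weights = [1.0, 0.5, 1 / 3.0, 0.25, 0.1, 0.0]

def score(pcts):
    return sum(p * w for p, w in zip(pcts, weights))

all_div1 = [100, 0, 0, 0, 0, 0]   # every candidate in Division 1
strong   = [90, 0, 10, 0, 0, 0]   # still excellent, but 10% in Division 3

print(score(all_div1))            # 100.0
print(round(score(strong), 2))    # 93.33
```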
With that explanation in mind, we can plot the perennially well performing schools on a map to see where they are located and how they are distributed.
indexer = []
ple2015 = ple[ple['year'] == 2015]
for school in perenial_best:
    i = ple2015[ple2015['school'] == school][['school', 'district']].index.values.tolist()
    indexer = indexer + i
perennial_df = ple2015.loc[indexer][['school', 'district']]
perennial_df
import geocoder

def add_coordinates(district):
    coords = geocoder.google(district.split()[0] + ', UGANDA').latlng
    return coords
perennial_df['coordinates'] = perennial_df['district'].apply(add_coordinates)
perennial_df
import folium
map_ug = folium.Map(location=[1.373333, 32.290275], zoom_start=7)
ind= perennial_df.index
for i in range(len(ind)):
    folium.Marker(perennial_df['coordinates'].iloc[i],
                  popup=perennial_df['school'].iloc[i]).add_to(map_ug)
map_ug
sorted_ple = ple[['school', 'district', 'score']].sort_values(by= 'score', ascending=False)
sorted_ple[sorted_ple['score']==100].head()
ple90_districts = list(set(sorted_ple[sorted_ple['score']>=90]['district'].tolist()))
districts_map ={}
for district in ple90_districts:
    districts_map[district] = add_coordinates(district)
ple90 = sorted_ple[sorted_ple['score']>=90]
jitter = np.random.random(len(ple90))*0.2
def add_jitter(x):
    jittered = [x[0] + np.random.choice(jitter), x[1] + np.random.choice(jitter)]
    return jittered
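A self-contained, seeded variant of the jitter step (the coordinates below are arbitrary). The idea is simply to shift each marker by up to 0.2 degrees so that schools sharing a district centroid do not stack exactly on top of one another:

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded so the jitter is reproducible

def add_jitter(coords, scale=0.2):
    # Shift latitude and longitude by a random amount in [0, scale)
    return [coords[0] + rng.random_sample() * scale,
            coords[1] + rng.random_sample() * scale]

base = [0.3476, 32.5825]  # roughly Kampala, for illustration
jittered = add_jitter(base)
print(all(0 <= j - b < 0.2 for j, b in zip(jittered, base)))  # True
```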
ple90.loc[:, 'coords'] = ple90.loc[:, 'district'].map(districts_map)
ple90.loc[:, 'jittered coords'] = ple90.loc[:, 'coords'].apply(add_jitter)
ple90.head()
map_ug = folium.Map(location=[1.373333, 32.290275], zoom_start=7)
for i in range(len(ple90)):
    folium.Marker(ple90['jittered coords'].iloc[i],
                  popup=ple90['school'].iloc[i]).add_to(map_ug)
map_ug
As initially suspected, we can see that many of the best performing schools are primarily in the Central and Western regions, with the Eastern region posting a good number as well.
It is easy to see, and almost expected, that the Northern region posts very low numbers in comparison to the rest, and that large expanses of districts do not have a single top performing school over the entire 6-year period. Given the history of poor quality schools due to underfunding and instability, among other reasons, this is not entirely surprising. However, it does raise serious issues of governance and resource distribution.
Lastly we shall examine the performance of females contrasted against that of their male counterparts. We create a score column similar to the one we created above, but computed per sex: one for males and another for females. The weights attached to the different divisions remain unchanged.
For this we use the same score function as before, with a few adjustments made to fit the purpose.
ple.head()
Using the gender-specific columns shown above, we define the two score functions.
def male_score(x):
    return (x['male % div1'] + x['male % div2']*0.5 + x['male % div3']/3.0
            + x['male % div4']*0.25 + x['male % u']*0.1 + x['male % x']*0)

def female_score(x):
    return (x['female % div1'] + x['female % div2']*0.5 + x['female % div3']/3.0
            + x['female % div4']*0.25 + x['female % u']*0.1 + x['female % x']*0)
ple['mscore'] = ple.apply(male_score, axis=1)
ple['fscore'] = ple.apply(female_score, axis =1)
year_gender_avg = ple.groupby('year')[['mscore','fscore']].mean()
year_gender_avg
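The gap between the two averages can be computed directly as a new column. A sketch on a hypothetical two-year table (the real values come from `year_gender_avg` above):

```python
import pandas as pd

# Hypothetical yearly averages, purely to illustrate the computation
avg = pd.DataFrame({'mscore': [60.0, 62.0], 'fscore': [55.0, 57.5]},
                   index=[2010, 2011])
avg['gap'] = avg['mscore'] - avg['fscore']
print(avg['gap'].tolist())  # [5.0, 4.5]
```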
ind = range(len(year_gender_avg.index))
f, ax = plt.subplots(figsize=(8,6))
plt.plot(ind, year_gender_avg['mscore'].values, marker='o')
plt.plot(ind, year_gender_avg['fscore'].values, marker='o')
plt.title('Male vs Female performance over the years')
plt.xlabel('Year')
plt.ylabel('Score')
plt.xticks(ind, year_gender_avg.index);
From the plot above, as well as the table before it, we can see that boys on average perform better than girls, country-wide, year after year. The reasons for this could be various; one commonly cited factor is the pressures on girls in rural areas. We can make the same plot for Kampala and some of the other urban areas to investigate this.
It is also important to note that the plots for boys and girls have very similar shapes and the distance between them is almost constant throughout. This suggests that within each gender the score varies by the same amount each year, probably reacting to the difficulty of the exams or the strictness of the examiners. Whichever factor it is, it affects the sexes equally.
kla_gender_avg = ple[ple['district']=='KAMPALA'].groupby('year')[['mscore','fscore']].mean()
kla_gender_avg
f, ax = plt.subplots(figsize=(8,6))
plt.plot(ind, kla_gender_avg['mscore'].values, marker='o')
plt.plot(ind, kla_gender_avg['fscore'].values, marker ='o')
plt.title('Male vs Female performance over the years in Kampala')
plt.xlabel('Year')
plt.ylabel('Score')
plt.xticks(ind, kla_gender_avg.index);
The graphs don't differ greatly from the ones we had before, and they share a similar shape with each other. We can note that, as would be expected, the scores in Kampala are better than the national average. The difference between boys and girls is a bit smaller in Kampala than in the national results. This may suggest that at least some of the issues that cause girls to perform worse than boys around the country are not unique to those areas.
Of course this is still open to further analysis, possibly combined with other datasets. We have also only used the PLE dataset of the three that we had. We can dive deeper and search for trends and insights using the UCE and UACE results, as well as check whether our findings here show up in those results too.