What is Simpson's paradox (Python version)?

Simpson's paradox

Simpson's paradox is a phenomenon in probability and statistics, in which a trend appears in several 
different groups of data but disappears or reverses when these groups are combined.

Example: UC Berkeley gender bias

One of the best-known examples of Simpson's paradox is a study of gender bias in graduate 
school admissions at the University of California, Berkeley. The admission figures for the fall 
of 1973 showed that men applying were more likely than women to be admitted, and the difference 
was so large that it was unlikely to be due to chance.

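import pandas as pd

# URL of the CSV file with the 1973 admission counts used in this post
file = ('https://raw.githubusercontent.com/xie186/Coursera_StatisticsWithPython/'
        'master/data/UCBGradAdmData1973.csv')

# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv(file)
df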

    Admit     Gender  Dept  Freq
0   Admitted  Male    A      512
1   Rejected  Male    A      313
2   Admitted  Female  A       89
3   Rejected  Female  A       19
4   Admitted  Male    B      353
5   Rejected  Male    B      207
6   Admitted  Female  B       17
7   Rejected  Female  B        8
8   Admitted  Male    C      120
9   Rejected  Male    C      205
10  Admitted  Female  C      202
11  Rejected  Female  C      391
12  Admitted  Male    D      138
13  Rejected  Male    D      279
14  Admitted  Female  D      131
15  Rejected  Female  D      244
16  Admitted  Male    E       53
17  Rejected  Male    E      138
18  Admitted  Female  E       94
19  Rejected  Female  E      299
20  Admitted  Male    F       22
21  Rejected  Male    F      351
22  Admitted  Female  F       24
23  Rejected  Female  F      317


Gender bias

Aggregated over the six departments in this data set, men were admitted at a clearly higher 
rate than women (roughly 45% versus 30%), which at first sight looks like a bias against women.

# Total applicants by gender (rows) and admission outcome (columns)
df_sum = pd.pivot_table(df, values="Freq", index=["Gender"], 
                        columns=["Admit"], aggfunc="sum")

# Admission rate in percent
df_sum["%"] = 100*df_sum["Admitted"] / (df_sum["Admitted"] + df_sum["Rejected"])
df_sum

Admit   Admitted  Rejected          %
Gender
Female       557      1278  30.354223
Male        1198      1493  44.518766
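
Whether a gap of this size could plausibly be due to chance can be checked with a Pearson chi-square test on the aggregated 2x2 table. The sketch below is only one way to do that and is not necessarily the test used in the original study. It hard-codes the totals from the table above, uses only the standard library, and the helper name chi_square_2x2 is made up for this illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) and p-value
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Aggregated counts from the table above: admitted and rejected per gender
male_admitted, male_rejected = 1198, 1493
female_admitted, female_rejected = 557, 1278

stat, p_value = chi_square_2x2(male_admitted, male_rejected,
                               female_admitted, female_rejected)
print(f"chi-square = {stat:.1f}, p-value = {p_value:.2e}")
```

With these counts the statistic comes out at roughly 92 on one degree of freedom, so the aggregated difference is very unlikely to be a chance fluctuation. That says nothing yet about its cause.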

df_sum["%"].plot(kind="bar")

Department level summary

When examining the individual departments, it appeared that six out of 85 departments were 
significantly biased against men, whereas only four were significantly biased against women. 
In fact, the pooled and corrected data showed a "small but statistically significant bias in 
favor of women". The data from the six largest departments are listed below.

# Applicant counts per department, split by gender and admission outcome
df_pivot = df.pivot_table(values="Freq", index=["Dept"], 
                          columns=["Gender", "Admit"], aggfunc="sum")

# Admission rate (%) per department for each gender
df_pivot["%(Female)"] = 100*df_pivot["Female"]["Admitted"]/(df_pivot["Female"]["Rejected"] + df_pivot["Female"]["Admitted"])
df_pivot["%(Male)"] = 100*df_pivot["Male"]["Admitted"]/(df_pivot["Male"]["Rejected"] + df_pivot["Male"]["Admitted"])
df_pivot

Gender    Female               Male              %(Female)    %(Male)
Admit   Admitted  Rejected  Admitted  Rejected
Dept
A             89        19       512       313   82.407407  62.060606
B             17         8       353       207   68.000000  63.035714
C            202       391       120       205   34.064081  36.923077
D            131       244       138       279   34.933333  33.093525
E             94       299        53       138   23.918575  27.748691
F             24       317        22       351    7.038123   5.898123

# Keep only the two percentage columns and flatten the two-level column index
df_perc = df_pivot[["%(Female)", "%(Male)"]].copy()
print(df_perc.columns)
df_perc.columns = ['%(Female)', '%(Male)']
print(df_perc.columns)
df_perc.plot(kind="bar")

The research paper by Bickel et al. concluded that women tended to apply to competitive 
departments with low rates of admission even among qualified applicants (such as in the
 English Department), whereas men tended to apply to less-competitive departments with 
high rates of admission among the qualified applicants (such as in engineering and chemistry).
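
This mechanism can be seen directly in the six-department data: each gender's overall admission rate is an average of its department-level rates, weighted by the share of that gender's applications going to each department. The following sketch uses plain Python with the counts copied from the table above, so it runs without downloading the CSV. The names counts, weighted_rate and results are illustrative only.

```python
# (admitted, rejected) counts per department, copied from the table above
counts = {
    "Male":   {"A": (512, 313), "B": (353, 207), "C": (120, 205),
               "D": (138, 279), "E": (53, 138),  "F": (22, 351)},
    "Female": {"A": (89, 19),   "B": (17, 8),    "C": (202, 391),
               "D": (131, 244), "E": (94, 299),  "F": (24, 317)},
}

def weighted_rate(by_dept):
    """Return (overall admission rate,
    {dept: (share of this gender's applications, admission rate)})."""
    total = sum(adm + rej for adm, rej in by_dept.values())
    parts = {}
    for dept, (adm, rej) in by_dept.items():
        applicants = adm + rej
        parts[dept] = (applicants / total, adm / applicants)
    # The overall rate is the share-weighted average of the department rates
    overall = sum(share * rate for share, rate in parts.values())
    return overall, parts

results = {gender: weighted_rate(by_dept) for gender, by_dept in counts.items()}

for gender, (overall, parts) in results.items():
    print(f"{gender}: overall admission rate {100 * overall:.1f}%")
    for dept, (share, rate) in parts.items():
        print(f"  Dept {dept}: {100 * share:5.1f}% of applications, "
              f"{100 * rate:5.1f}% admitted")
```

Departments A and B, which have the highest admission rates, receive about half of the men's applications but well under a tenth of the women's. That difference in weights, rather than the department-level rates themselves, is what pulls the women's overall rate below the men's.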





















