What is Simpson's paradox (Python version)?

April 28, 2019

Simpson's paradox

Simpson's paradox is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.

Example: UC Berkeley gender bias

One of the best-known examples of Simpson's paradox is a study of gender bias in graduate school admissions at the University of California, Berkeley. The admission figures for the fall of 1973 showed that men applying were more likely than women to be admitted, and the difference was so large that it was unlikely to be due to chance.

import pandas as pd

# Admission counts by outcome, gender and department (fall 1973)
file = ('https://raw.githubusercontent.com/xie186/Coursera_StatisticsWithPython/'
        'master/data/UCBGradAdmData1973.csv')

df = pd.read_csv(file)  # read_csv already returns a DataFrame
df

	Admit	Gender	Dept	Freq
0	Admitted	Male	A	512
1	Rejected	Male	A	313
2	Admitted	Female	A	89
3	Rejected	Female	A	19
4	Admitted	Male	B	353
5	Rejected	Male	B	207
6	Admitted	Female	B	17
7	Rejected	Female	B	8
8	Admitted	Male	C	120
9	Rejected	Male	C	205
10	Admitted	Female	C	202
11	Rejected	Female	C	391
12	Admitted	Male	D	138
13	Rejected	Male	D	279
14	Admitted	Female	D	131
15	Rejected	Female	D	244
16	Admitted	Male	E	53
17	Rejected	Male	E	138
18	Admitted	Female	E	94
19	Rejected	Female	E	299
20	Admitted	Male	F	22
21	Rejected	Male	F	351
22	Admitted	Female	F	24
23	Rejected	Female	F	317

Gender bias

Pooled across the six departments in the data, the 1973 figures show that men were more likely than women to be admitted: about 44.5% of male applicants versus about 30.4% of female applicants.

# Total admitted/rejected counts per gender, pooled over all departments
df_sum = pd.pivot_table(df, values="Freq", index="Gender",
                        columns="Admit", aggfunc="sum")

# Admission rate in percent
df_sum["%"] = 100 * df_sum["Admitted"] / (df_sum["Admitted"] + df_sum["Rejected"])
df_sum

Gender	Admitted	Rejected	%
Female	557	1278	30.354223
Male	1198	1493	44.518766

df_sum["%"].plot(kind="bar", y="%")
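The claim that the gap is "unlikely to be due to chance" can be checked with a two-proportion z-test on the pooled counts. The sketch below is only an illustration, not the analysis used in the original study. The helper name `two_proportion_z` is hypothetical, and it relies on the standard library alone, so it runs without the CSV download.

```python
import math

def two_proportion_z(admit_a, total_a, admit_b, total_b):
    """Return (z, two-sided p-value) for H0: both groups share one admission rate."""
    p_a = admit_a / total_a
    p_b = admit_b / total_b
    # Under H0 the best estimate of the common rate is the pooled proportion.
    pooled = (admit_a + admit_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Pooled counts copied from the summary table: men first, then women.
z, p_value = two_proportion_z(1198, 1198 + 1493, 557, 557 + 1278)
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

With these counts z comes out at roughly 9.6, so the pooled difference is far outside what sampling noise alone would produce. That says nothing yet about its cause, which is what the department-level view addresses.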

[Figure: bar chart of admission rate (%) by gender]

Department level summary

When examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas only four were significantly biased against women. In fact, the pooled and corrected data showed a "small but statistically significant bias in favor of women" [15]. The data from the six largest departments are listed below.

# Counts per department, split by gender and admission outcome
df_pivot = df.pivot_table(values="Freq", index="Dept",
                          columns=["Gender", "Admit"], aggfunc="sum")

# Admission rate in percent for each gender within each department
for gender in ["Female", "Male"]:
    admitted = df_pivot[(gender, "Admitted")]
    rejected = df_pivot[(gender, "Rejected")]
    df_pivot[f"%({gender})"] = 100 * admitted / (admitted + rejected)
df_pivot

Dept	Female Admitted	Female Rejected	Male Admitted	Male Rejected	%(Female)	%(Male)
A	89	19	512	313	82.407407	62.060606
B	17	8	353	207	68.000000	63.035714
C	202	391	120	205	34.064081	36.923077
D	131	244	138	279	34.933333	33.093525
E	94	299	53	138	23.918575	27.748691
F	24	317	22	351	7.038123	5.898123
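Reading the two percentage columns row by row shows the reversal: women have the higher admission rate in four of the six departments, even though the pooled rate favors men. Here is a minimal check in plain Python, with the counts copied from the table above. The `counts` dictionary and its tuple layout are assumptions made for this sketch.

```python
# dept -> (female admitted, female rejected, male admitted, male rejected)
counts = {
    "A": (89, 19, 512, 313),
    "B": (17, 8, 353, 207),
    "C": (202, 391, 120, 205),
    "D": (131, 244, 138, 279),
    "E": (94, 299, 53, 138),
    "F": (24, 317, 22, 351),
}

def rate(admitted, rejected):
    """Admission rate in percent."""
    return 100 * admitted / (admitted + rejected)

# Departments where the female admission rate exceeds the male one.
female_higher = [
    dept
    for dept, (fa, fr, ma, mr) in counts.items()
    if rate(fa, fr) > rate(ma, mr)
]

# Pooled rates: sum the counts first, then divide.
pooled_female = rate(sum(c[0] for c in counts.values()),
                     sum(c[1] for c in counts.values()))
pooled_male = rate(sum(c[2] for c in counts.values()),
                   sum(c[3] for c in counts.values()))

print("Female rate higher in:", female_higher)
print(f"Pooled: female {pooled_female:.1f}%, male {pooled_male:.1f}%")
```

The list contains departments A, B, D and F, while the pooled line still shows the lower overall rate for women. That combination is the paradox in this dataset.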

# Keep only the two rate columns and flatten the MultiIndex column labels
df_perc = df_pivot[["%(Female)", "%(Male)"]].copy()
df_perc.columns = ["%(Female)", "%(Male)"]
df_perc.plot(kind="bar")

The research paper by Bickel et al. concluded that women tended to apply to competitive departments with low rates of admission even among qualified applicants (such as in the English Department), whereas men tended to apply to less-competitive departments with high rates of admission among the qualified applicants (such as in engineering and chemistry).
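This explanation can be made concrete with the six-department data. A gender's pooled admission rate is a weighted average of its department rates, with each department weighted by the share of that gender's applications it received. The sketch below uses plain Python with counts copied from the department table. The names `data`, `applicant_share`, `dept_rate` and `pooled_rate` are hypothetical. The letters A to F are the dataset's own labels, and this post does not say which real departments they stand for.

```python
# dept -> {gender: (admitted, rejected)}, copied from the department table
data = {
    "A": {"Female": (89, 19), "Male": (512, 313)},
    "B": {"Female": (17, 8), "Male": (353, 207)},
    "C": {"Female": (202, 391), "Male": (120, 205)},
    "D": {"Female": (131, 244), "Male": (138, 279)},
    "E": {"Female": (94, 299), "Male": (53, 138)},
    "F": {"Female": (24, 317), "Male": (22, 351)},
}

def applicant_share(gender):
    """Fraction of this gender's applications that went to each department."""
    total = sum(sum(data[d][gender]) for d in data)
    return {d: sum(data[d][gender]) / total for d in data}

def dept_rate(dept, gender):
    """Admission rate (0..1) for one gender within one department."""
    admitted, rejected = data[dept][gender]
    return admitted / (admitted + rejected)

def pooled_rate(gender):
    """Admission rate (0..1) for one gender over all departments."""
    admitted = sum(data[d][gender][0] for d in data)
    total = sum(sum(data[d][gender]) for d in data)
    return admitted / total

shares = {g: applicant_share(g) for g in ("Female", "Male")}

print("Dept  share of women  share of men")
for dept in data:
    print(f"{dept}     {shares['Female'][dept]:>6.1%}        {shares['Male'][dept]:>6.1%}")

# The pooled rate equals the share-weighted average of the department rates.
weighted = {
    g: sum(shares[g][d] * dept_rate(d, g) for d in data)
    for g in shares
}
for g in weighted:
    print(f"{g}: weighted {weighted[g]:.3f}, pooled {pooled_rate(g):.3f}")
```

In this data, more than half of the male applications went to departments A and B, the two with the highest admission rates, against well under a tenth of the female applications. The weights therefore pull the male average up and the female average down, even where the within-department rates favor women.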
