Statistical inference with the GSS data
Setup
Load packages
library(ggplot2)
library(dplyr)
library(statsr)
library(magrittr)
library(doBy)
Load data
load("gss.Rdata")
Part 1: Data
I think results from this data set can be generalized, because it was sampled from the GSS data, although it was pre-processed. The ‘GSS.html’ documentation says: "Unlike the full General Social Survey Cumulative File, we have removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R." So the form of the data set changed slightly, but its content did not. Causality cannot be inferred from this data set: the data were collected by survey, so there was no random assignment.
Part 2: Research question
Question
Among the respondents, would people who are satisfied with their job continue working even if they became rich? One reason people work is to provide for the years when they are too old to work. I wondered whether people who had enough money for the rest of their lives would stop working even though they are satisfied with their job.
Part 3: Exploratory data analysis
satjob<-gss%>%select(satjob)
# Drop rows where satjob is missing (is.na() tests for NA directly)
satjob%<>%filter(!is.na(satjob))
richwork<-gss%>%select(richwork)%>%na.omit
satjobRichwork<-gss%>%select(satjob,richwork)%>%na.omit
summary.satjobRichwork<-satjobRichwork%>%group_by(satjob,richwork)%>%summarize(count=n())
summary.satjobRichwork%<>%group_by(satjob)%>%mutate(totalCount=sum(count))
summary.satjobRichwork%<>%mutate(prop=round((count/totalCount)*100,2))
summary.satjobRichwork%>%select(-prop)
## # A tibble: 8 x 4
## # Groups: satjob [4]
## satjob richwork count totalCount
## <fct> <fct> <int> <int>
## 1 Very Satisfied Continue Working 7708 10245
## 2 Very Satisfied Stop Working 2537 10245
## 3 Mod. Satisfied Continue Working 5647 8538
## 4 Mod. Satisfied Stop Working 2891 8538
## 5 A Little Dissat Continue Working 1394 2163
## 6 A Little Dissat Stop Working 769 2163
## 7 Very Dissatisfied Continue Working 544 878
## 8 Very Dissatisfied Stop Working 334 878
summary.satjobRichwork%>%select(-c(count,totalCount))
## # A tibble: 8 x 3
## # Groups: satjob [4]
## satjob richwork prop
## <fct> <fct> <dbl>
## 1 Very Satisfied Continue Working 75.2
## 2 Very Satisfied Stop Working 24.8
## 3 Mod. Satisfied Continue Working 66.1
## 4 Mod. Satisfied Stop Working 33.9
## 5 A Little Dissat Continue Working 64.4
## 6 A Little Dissat Stop Working 35.6
## 7 Very Dissatisfied Continue Working 62.0
## 8 Very Dissatisfied Stop Working 38.0
ggplot(satjobRichwork,aes(x=satjob,fill=richwork))+geom_bar(position = "dodge")
The bar plot above shows, for each level of the ‘satjob’ variable, the number of respondents who would continue working and the number who would stop.
ggplot(satjobRichwork,aes(x=satjob,fill=richwork))+geom_bar(position = "fill")
The bar plot above shows, for each level of the ‘satjob’ variable, the proportion of respondents who would continue working and the proportion who would stop.
Interpretation and conclusion
I used two categorical variables, ‘satjob’ and ‘richwork’. The satjob variable records how satisfied the respondent is with their current job, and the richwork variable records whether the respondent would continue or stop working if they had enough money for the rest of their life. I removed the ‘NA’ values from both variables. The satjob variable had 41288 non-missing observations and the richwork variable had 21948, so at most 21948 observations could be used; 21824 have both variables recorded (the sum of the totalCount groups in the table above). I created the ‘count’ variable, which holds the number of Continue Working and Stop Working answers within each level of satjob. Because the total number of answers (Continue Working plus Stop Working) differs across the levels of satjob, the raw counts are not directly comparable, so I created the ‘totalCount’ variable to hold the total for each level. Finally, I divided ‘count’ by ‘totalCount’ to get the proportions of Continue Working and Stop Working within each level of ‘satjob’. In conclusion, the proportion of people who would continue working rises with job satisfaction, from 62.0% among the very dissatisfied to 75.2% among the very satisfied.
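The row-wise proportions described above can also be reproduced directly from the printed counts. This is a minimal sketch in base R, not the code used in this post: the matrix `counts` and the name `row_props` are introduced here for illustration, and the numbers are copied from the summary table.

```r
# Counts copied from the summary table (rows: satjob, columns: richwork).
counts <- matrix(
  c(7708, 2537,
    5647, 2891,
    1394,  769,
     544,  334),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    satjob   = c("Very Satisfied", "Mod. Satisfied",
                 "A Little Dissat", "Very Dissatisfied"),
    richwork = c("Continue Working", "Stop Working")
  )
)

# margin = 1 divides each cell by its row total, i.e. count / totalCount.
row_props <- round(prop.table(counts, margin = 1) * 100, 2)
print(row_props)

# Number of respondents with both variables recorded.
print(sum(counts))
```

Each row of `row_props` sums to 100 (up to rounding), which is what makes the satisfaction levels comparable despite their different sizes.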
Part 4: Inference
Hypotheses
People who are satisfied with their job want to continue working more than people who are not satisfied with their job.
H0 : The population proportion of people satisfied with their job who would continue working is equal to the population proportion of people dissatisfied with their job who would continue working.
HA : The population proportion of people satisfied with their job who would continue working is not equal to the population proportion of people dissatisfied with their job who would continue working.
Variables
satjob: categorical variable, 4 levels
satjob.f: categorical variable, 2 levels. I reduced the number of satjob levels from 4 to 2 by combining “Very Satisfied” and “Mod. Satisfied” into “Satisfied”, and “A Little Dissat” and “Very Dissatisfied” into “Dissatisfied”.
gss$satjob.f<-factor(recodeVar(gss$satjob,src = list(c("Very Satisfied","Mod. Satisfied"),c("A Little Dissat","Very Dissatisfied")),tgt = list("Satisfied","Dissatisfied")))
richwork: categorical variable, 2 levels
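The collapsing of the four satjob levels into two can also be done in base R by assigning a named list to `levels()`. The sketch below uses a small hypothetical factor (`satjob_toy`) rather than the GSS data, and it is an alternative to the `recodeVar` call used in this post, not a description of it.

```r
# Hypothetical toy factor with the same four levels as satjob, plus one NA.
satjob_toy <- factor(c("Very Satisfied", "Mod. Satisfied",
                       "A Little Dissat", "Very Dissatisfied", NA))

# Assigning a named list to levels() merges the old levels on the right
# into the new level named on the left; NA values stay NA.
satjob_f <- satjob_toy
levels(satjob_f) <- list(
  Satisfied    = c("Very Satisfied", "Mod. Satisfied"),
  Dissatisfied = c("A Little Dissat", "Very Dissatisfied")
)

print(table(satjob_toy, satjob_f, useNA = "ifany"))
```

The cross-tabulation is a quick way to confirm that every old level landed in the intended new level.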
Check conditions
1. Each group is a simple random sample of less than 10% of the population, so the observations are independent, both within the samples and between the samples.
2. The success-failure condition is met.
Thus, the normal model can be used for the point estimate of the difference.
Method
I used the theoretical method, because the variables I used are categorical and my sample meets the conditions above. To test the hypothesis, I used the pooled proportion to compute the standard error of p1-p2, since p1 and p2 are equal when H0 is true.
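The success-failure condition can be checked numerically from the group sizes. A minimal sketch, assuming the usual rule of thumb of at least 10 expected successes and 10 expected failures per group under the pooled proportion; the counts are copied from the satjob.f by richwork table in this post, and the names `n`, `x`, `p_pool` and `expected` are introduced here for illustration.

```r
# Group sizes and Continue Working counts, copied from the satjob.f table.
n <- c(Satisfied = 18783, Dissatisfied = 3041)
x <- c(Satisfied = 13355, Dissatisfied = 1938)

# Pooled proportion: the common proportion assumed under H0.
p_pool <- sum(x) / sum(n)

# Expected successes and failures in each group under H0.
expected <- rbind(success = n * p_pool,
                  failure = n * (1 - p_pool))
print(round(expected, 1))

# TRUE when every expected count reaches the rule-of-thumb threshold.
print(all(expected >= 10))
```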
modiSatjob<-gss%>%select(satjob.f,richwork)%>%na.omit
modiSatjob%<>%group_by(satjob.f,richwork)%>%summarize(count=n())
modiSatjob%<>%group_by(satjob.f)%>%mutate(totalCount=sum(count))
modiSatjob%<>%mutate(prop=round((count/totalCount)*100,2))
modiSatjob
## # A tibble: 4 x 5
## # Groups: satjob.f [2]
## satjob.f richwork count totalCount prop
## <fct> <fct> <int> <int> <dbl>
## 1 Dissatisfied Continue Working 1938 3041 63.7
## 2 Dissatisfied Stop Working 1103 3041 36.3
## 3 Satisfied Continue Working 13355 18783 71.1
## 4 Satisfied Stop Working 5428 18783 28.9
pHat<-round((modiSatjob$count[1]+modiSatjob$count[3])/(modiSatjob$totalCount[1]+modiSatjob$totalCount[3]),2)
# Pooled proportion: number of people who chose Continue Working / number of people in the entire study (rounded to 2 decimals)
Sat.pHat<-round(modiSatjob$count[3]/modiSatjob$totalCount[3],4)
#The point estimate of Satisfied people
Dis.pHat<-round(modiSatjob$count[1]/modiSatjob$totalCount[1],4)
#The point estimate of Dissatisfied people
pointEstimate<-Sat.pHat-Dis.pHat
#The point estimate of the difference between the two proportions
standardError<-round(sqrt((pHat*(1-pHat)/modiSatjob$totalCount[1])+(pHat*(1-pHat)/modiSatjob$totalCount[3])),5)
# Standard error is calculated using pooled proportion
z<-round((pointEstimate-0)/standardError,4)
# Test statistic (z-score)
result<-c("PHat"=pHat,"Sat.pHat"=Sat.pHat,"Dis.pHat"=Dis.pHat,"pointEstimate"=pointEstimate,"standardError"=standardError,"z"=z)
result
## PHat Sat.pHat Dis.pHat pointEstimate standardError
## 0.70000 0.71100 0.63730 0.07370 0.00896
## z
## 8.22540
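As a cross-check, the same quantities can be recomputed from the counts without rounding the intermediate values (the code above rounds pHat to two decimals before it enters the standard error). This is a minimal sketch; the names `x_sat`, `n_sat`, `x_dis`, `n_dis`, `p_pool`, `se_pool` and `z_full` are introduced here for illustration, and the counts are copied from the modiSatjob table.

```r
# Counts copied from the modiSatjob table.
x_sat <- 13355; n_sat <- 18783   # Satisfied: Continue Working / group total
x_dis <- 1938;  n_dis <- 3041    # Dissatisfied: Continue Working / group total

p_sat  <- x_sat / n_sat
p_dis  <- x_dis / n_dis
p_pool <- (x_sat + x_dis) / (n_sat + n_dis)   # pooled proportion under H0

# Standard error of the difference, using the pooled proportion.
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_sat + 1 / n_dis))

# Test statistic for H0: p_sat - p_dis = 0.
z_full <- (p_sat - p_dis) / se_pool

print(c(p_pool = p_pool, diff = p_sat - p_dis, se = se_pool, z = z_full))
```

The unrounded z should come out close to the value reported above, so the rounding does not affect the conclusion.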
Inference conclusion
The test statistic (z-score) was too large to look up in the z-score table in the appendix; the p-value, which can be computed with the ‘pnorm’ function, is practically 0. The p-value is less than the significance level (0.05), so I rejected H0 in favor of HA.
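The p-value can be computed explicitly rather than read from a table. A minimal sketch: the first part uses `pnorm` on the reported z-score, and the second part cross-checks it with base R's `prop.test` (without continuity correction), whose X-squared statistic is the square of the pooled z computed from unrounded values. The counts are copied from the modiSatjob table; the names `p_value` and `pt` are introduced here for illustration.

```r
# Two-sided p-value from the reported test statistic.
z <- 8.2254
p_value <- 2 * pnorm(-abs(z))
print(p_value)

# Cross-check with the built-in two-sample test of equal proportions.
# x = Continue Working counts, n = group totals (Satisfied, Dissatisfied).
pt <- prop.test(x = c(13355, 1938), n = c(18783, 3041), correct = FALSE)

# sqrt(X-squared) is comparable to z; it differs slightly because
# prop.test does not round any intermediate value.
print(sqrt(unname(pt$statistic)))
print(pt$p.value)
```

Both p-values fall far below 0.05, consistent with rejecting H0.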