Exploring the BRFSS data
Setup
Load packages
library(ggplot2)
library(dplyr)
Load data
load("brfss2013.RData")
Part 1: Data
BRFSS data are collected through both landline and cellular telephone surveys. In the landline survey, interviewers collect data from a randomly selected adult in a household. Because respondents are sampled at random, I think the results can be generalized to the population. However, the interviews only record what respondents report under the surveillance system, and there is no random assignment. Thus causal relationships cannot be inferred.
Part 2: Research questions
Research question 1: I have heard that it is hard to own a home in New York because of house prices. So I wondered whether people with higher incomes are more likely to own their residence than others. I selected four variables: ‘X_state’ to limit the data to New York, ‘income2’ to group respondents by income level, ‘renthom1’ to identify people who own their residence, and a new variable that holds the proportion of home owners in each income level.
Research question 2: New York is home to people of various races. I wondered what proportion each race makes up and which race is the most common in New York. I used three variables: ‘X_state’ to limit the data to New York, ‘X_race’ to group respondents, and a new variable holding the total count across races, which is used to compute the proportion of each race.
Research question 3: I wanted to know the distribution of sleep time and the average sleep time of people in New York. I picked two variables and created one new variable: ‘X_state’ to limit the data to New York, ‘sleptim1’, which records hours of sleep, for grouping, and a new variable holding the total count of ‘sleptim1’ responses for New York.
* * *
Part 3: Exploratory data analysis
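Research question 1:
# Count New York respondents per income level who do not own their residence
NY_income_stateres_not_own <- brfss2013 %>% group_by(income2) %>% filter(X_state == "New York", !is.na(income2), renthom1 != "Own") %>% summarize(count = n())
# Same count for respondents who own their residence
NY_income_stateres_own <- brfss2013 %>% group_by(income2) %>% filter(X_state == "New York", !is.na(income2), renthom1 == "Own") %>% summarize(count = n())
# Total per income level (assumes both tables list the same income levels in the same order)
total_count <- NY_income_stateres_not_own$count + NY_income_stateres_own$count
NY_income_stateres_own <- cbind(NY_income_stateres_own, total_count)
# Proportion of owners within each income level
NY_income_stateres_own <- NY_income_stateres_own %>% mutate(per = round(count / total_count, 2))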
NY_income_stateres_own
##             income2 count total_count  per
## 1 Less than $10,000    74         507 0.15
## 2 Less than $15,000   113         505 0.22
## 3 Less than $20,000   209         624 0.33
## 4 Less than $25,000   311         710 0.44
## 5 Less than $35,000   358         784 0.46
## 6 Less than $50,000   566         992 0.57
## 7 Less than $75,000   662        1008 0.66
## 8   $75,000 or more  1858        2481 0.75
ggplot(NY_income_stateres_own,aes(x=per,y=factor(income2)))+geom_point(colour='red')+geom_text(aes(label=per),hjust=0,vjust=0)+ylab("Income level")+xlab("Percentage")+ggtitle("NY Income point")
Author’s summary statistics and narrative:
Used variables: income2, X_state, renthom1, NY_income_stateres_not_own (author defined), NY_income_stateres_own (author defined), total_count (author defined)
I computed, for each income level, the proportion of New York respondents who own their residence. All calculations exclude ‘NA’ values. First, I counted the respondents in each income level who own their residence in New York. Second, I repeated the process with the condition changed to respondents who do not own their residence. Third, I added the two counts to get the total count for each income level. I then divided the owner count by the total count for each income level and drew the result as a point plot. In the plot, every point lies between 0 and 1, and each value is the proportion of people in that income level who own their home. The proportion is highest (0.75) at the highest income level, ‘$75,000 or more’, and lowest (0.15) at ‘Less than $10,000’. Overall, the proportion of home owners rises steadily with income level.
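The three counting steps described above can also be done in a single grouped summary, which avoids having to line up two separate tables by row. The following is only a minimal sketch under stated assumptions: `toy` is a made-up stand-in for `brfss2013` with hypothetical values, and `own_by_income` is an illustrative name that does not appear in the original analysis.

```r
library(dplyr)

# Toy stand-in for brfss2013 (hypothetical values, not real BRFSS records)
toy <- data.frame(
  X_state  = c("New York", "New York", "New York", "New York", "New York", "Ohio"),
  income2  = c("Less than $10,000", "Less than $10,000", "$75,000 or more",
               "$75,000 or more", NA, "$75,000 or more"),
  renthom1 = c("Own", "Rent", "Own", "Own", "Rent", "Rent")
)

own_by_income <- toy %>%
  # Keep New York rows with a known income level and a known residence status
  filter(X_state == "New York", !is.na(income2), !is.na(renthom1)) %>%
  group_by(income2) %>%
  summarise(
    count       = sum(renthom1 == "Own"),         # owners in this income level
    total_count = n(),                            # all respondents in this income level
    per         = round(count / total_count, 2)   # proportion of owners
  )

own_by_income
```

Because owners and the total are counted inside the same group, an income level with no owners would still appear with `per` equal to 0 instead of silently shifting the rows of one table against the other.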
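Research question 2:
# Count New York respondents per race, excluding NA
race_NY <- brfss2013 %>% filter(X_state == "New York", !is.na(X_race)) %>% group_by(X_race) %>% summarize(count = n())
people_total <- sum(race_NY$count)
# Share of each race as a percentage of all counted respondents
race_NY <- race_NY %>% mutate(per = round((count / people_total * 100), 2))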
race_NY
## # A tibble: 8 x 3
##   X_race                                                       count   per
##   <fct>                                                        <int> <dbl>
## 1 White only, non-Hispanic                                      5913  67.9
## 2 Black only, non-Hispanic                                      1014  11.6
## 3 American Indian or Alaskan Native only, Non-Hispanic            57  0.65
## 4 Asian only, non-Hispanic                                       359  4.12
## 5 Native Hawaiian or other Pacific Islander only, Non-Hispanic    31  0.36
## 6 Other race only, non-Hispanic                                   17   0.2
## 7 Multiracial, non-Hispanic                                       86  0.99
## 8 Hispanic                                                      1227  14.1
ggplot(race_NY,aes(x=X_race,y=per,fill=X_race))+geom_bar(stat="identity")+theme(axis.text.x=element_blank())+ylab("Percentage")+xlab("Race")+ggtitle("NY Race bar")+ labs(fill='Race')
Author’s summary statistics and narrative:
Used variables: X_race, X_state, race_NY (author defined)
I wanted to know the racial composition of New York. I excluded ‘NA’ responses and counted the respondents of each race in New York. Next, I summed the counts across races to get the total. I divided the count for each race by the total and multiplied by 100 to express it as a percentage. I used a bar plot: the x-axis represents race and the y-axis represents the percentage of each race in New York. The percentages sum to 100. The largest group was ‘White only, non-Hispanic’.
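The divide-by-total step can be checked in base R with `prop.table()`. This is a small sketch, not the code used in the analysis: the counts are copied from the `race_NY` output above, and the vector names `counts` and `per` are chosen here for illustration.

```r
# Counts per race, copied from the race_NY output
counts <- c(
  "White only, non-Hispanic" = 5913,
  "Black only, non-Hispanic" = 1014,
  "American Indian or Alaskan Native only, Non-Hispanic" = 57,
  "Asian only, non-Hispanic" = 359,
  "Native Hawaiian or other Pacific Islander only, Non-Hispanic" = 31,
  "Other race only, non-Hispanic" = 17,
  "Multiracial, non-Hispanic" = 86,
  "Hispanic" = 1227
)

# prop.table() divides each count by the sum of all counts
per <- round(prop.table(counts) * 100, 2)

sum(counts)               # total number of counted respondents
per                       # percentage per race
names(per)[which.max(per)]  # race with the largest share
```

The tibble output prints `per` to three significant digits, so a stored value such as 67.93 is displayed as 67.9.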
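Research question 3:
# Count New York respondents per reported hours of sleep, excluding NA
sleep <- brfss2013 %>% filter(X_state == "New York", !is.na(sleptim1)) %>% group_by(sleptim1) %>% summarize(count = n())
sleep_total <- sum(sleep$count)
# Share of each sleep time as a percentage of all counted respondents
sleep <- sleep %>% mutate(per = round(count / sleep_total * 100, 2))
# Sum of hours over all respondents, used for the average below
a <- sum(sleep$sleptim1 * sleep$count)
avg_sleep_time <- c("Average sleep time", round(a / sleep_total, 2))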
sleep
## # A tibble: 20 x 3
##    sleptim1 count   per
##       <int> <int> <dbl>
##  1        1     4  0.05
##  2        2    22  0.25
##  3        3    94  1.06
##  4        4   305  3.45
##  5        5   737  8.34
##  6        6  2204  25.0
##  7        7  2667  30.2
##  8        8  2199  24.9
##  9        9   320  3.62
## 10       10   179  2.03
## 11       11    14  0.16
## 12       12    57  0.65
## 13       13     1  0.01
## 14       14     7  0.08
## 15       15     3  0.03
## 16       16     7  0.08
## 17       17     1  0.01
## 18       18     5  0.06
## 19       20     5  0.06
## 20       22     1  0.01
avg_sleep_time
## [1] "Average sleep time" "6.88"
ggplot(sleep,aes(x=sleptim1,y=per,fill=sleptim1))+geom_bar(stat="identity")+ylab("Percentage")+xlab("Sleep time")+ggtitle("NY Sleep time histogram")+ labs(fill='Sleep time')
Author’s summary statistics and narrative:
Used variables: X_state, sleptim1, sleep_total (author defined), avg_sleep_time (author defined)
The third question concerns the distribution of sleep time and the average sleep time. I counted respondents by reported sleep time in one-hour intervals and calculated the total number of respondents. I divided the count in each interval by the total count and multiplied by 100 to express it as a percentage. I used a bar plot: the x-axis represents sleep time in one-hour bins and the y-axis represents the percentage. The most common sleep time for New York respondents is 7 hours. To get the average sleep time, I multiplied each hour value by the number of people in that interval, summed the products, and divided by the total number of people. The average was 6.88 hours.
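The average described above is a weighted mean of the hour values, with the respondent counts as weights. As a rough cross-check, base R's `weighted.mean()` can be applied to the counts copied from the `sleep` output; the data frame below is rebuilt by hand from that table, and `mode_hours` is an illustrative name not used in the original code.

```r
# Hours of sleep and respondent counts, copied from the sleep output
sleep <- data.frame(
  sleptim1 = c(1:18, 20, 22),
  count    = c(4, 22, 94, 305, 737, 2204, 2667, 2199, 320, 179,
               14, 57, 1, 7, 3, 7, 1, 5, 5, 1)
)

# Weighted mean: sum(hours * count) / sum(count)
avg_sleep_time <- weighted.mean(sleep$sleptim1, sleep$count)
round(avg_sleep_time, 2)   # 6.88

# Most frequently reported sleep time
mode_hours <- sleep$sleptim1[which.max(sleep$count)]
mode_hours                 # 7
```

Keeping `avg_sleep_time` numeric, instead of combining it with a label in `c()`, avoids the coercion to a character string seen in the printed `"6.88"`.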