Exploring the BRFSS data

zionadd 2018. 10. 15. 22:23

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

BRFSS data are collected through both landline and cellular telephone surveys. For the landline survey, interviewers collect data from a randomly selected adult in each household. Because respondents are randomly sampled, I think the results can be generalized to the population. However, the survey only observes respondents through the surveillance system, with no random assignment, so causal relationships cannot be inferred.

Part 2: Research questions

Research question 1: I have heard that it is hard to own a residence in New York because of house prices, so I wondered whether people with higher incomes are more likely to own their residence than others. I selected four variables: ‘X_state’ to restrict the data to New York, ‘income2’ to group respondents by income level, ‘renthom1’ to identify respondents who own their residence, and a new variable that holds the proportion of owners at each income level.

Research question 2: People of various races live in New York. I wondered what proportion each race makes up and which race is the most common in New York. I used three variables: ‘X_state’ to restrict the data to New York, ‘X_race’ to group respondents by race, and a new variable holding the total respondent count, which is the denominator for each race’s proportion.

Research question 3: I wanted to know the distribution of sleep time and the average sleep time of people in New York. I picked two variables and created one new variable: ‘X_state’ restricts the data to New York, ‘sleptim1’ (which holds the reported sleep time) is the grouping variable, and the new variable is the total count of ‘sleptim1’ responses from New York.
* * *

Part 3: Exploratory data analysis

Research question 1:

NY_income_stateres_not_own<-brfss2013%>%group_by(income2)%>%filter(X_state=="New York",!is.na(income2),renthom1!="Own")%>%summarize(count=n())
NY_income_stateres_own<-brfss2013%>%group_by(income2)%>%filter(X_state=="New York",!is.na(income2),renthom1=="Own")%>%summarize(count=n())
total_count<-NY_income_stateres_not_own$count+NY_income_stateres_own$count
NY_income_stateres_own<-cbind(NY_income_stateres_own,total_count)
NY_income_stateres_own<-NY_income_stateres_own%>%mutate(per=round(count/total_count,2))
NY_income_stateres_own

##             income2 count total_count  per
## 1 Less than $10,000    74         507 0.15
## 2 Less than $15,000   113         505 0.22
## 3 Less than $20,000   209         624 0.33
## 4 Less than $25,000   311         710 0.44
## 5 Less than $35,000   358         784 0.46
## 6 Less than $50,000   566         992 0.57
## 7 Less than $75,000   662        1008 0.66
## 8   $75,000 or more  1858        2481 0.75

ggplot(NY_income_stateres_own,aes(x=per,y=factor(income2)))+geom_point(colour='red')+geom_text(aes(label=per),hjust=0,vjust=0)+ylab("Income level")+xlab("Proportion")+ggtitle("NY Income point")

Author’s summary statistics and narrative:

Used variables: income2, X_state, renthom1, NY_income_stateres_not_own (author defined), NY_income_stateres_own (author defined), total_count (author defined)

I computed, for each income level, the proportion of New York respondents who own their residence. ‘NA’ values were excluded from every step. First, I counted the respondents at each income level who own their residence in New York. Second, I repeated the same process with the condition changed to respondents who do not own their residence. Third, I added the two counts to get the total number of respondents at each income level. I then divided the owner count by the total count for each income level and plotted the result as a point plot. In the plot, every point lies between 0 and 1; each value is the proportion of respondents at that income level who own their home. The highest proportion, 0.75, is at the highest income level, ‘$75,000 or more’, and the lowest, 0.15, is at ‘Less than $10,000’. The proportion increases steadily with income level.
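The three steps described above can also be written as a single grouped summary. The following is only a minimal sketch, not the code that produced the table above: it runs on a small made-up data frame (`toy`, whose rows are hypothetical and not BRFSS records) and assumes dplyr is installed. The names `toy` and `own_by_income` are illustrative.

```r
library(dplyr)

# Hypothetical toy rows standing in for brfss2013 (NOT real BRFSS records).
toy <- data.frame(
  X_state  = c("New York", "New York", "New York", "New York",
               "New York", "New York", "Ohio"),
  income2  = c("Less than $10,000", "Less than $10,000", "$75,000 or more",
               "$75,000 or more", "$75,000 or more", NA, "$75,000 or more"),
  renthom1 = c("Own", "Rent", "Own", "Own", NA, "Own", "Rent"),
  stringsAsFactors = FALSE
)

own_by_income <- toy %>%
  # Keep New York rows with both income and ownership recorded.
  filter(X_state == "New York", !is.na(income2), !is.na(renthom1)) %>%
  group_by(income2) %>%
  summarize(
    count       = sum(renthom1 == "Own"),       # owners in this income level
    total_count = n(),                          # owners + non-owners
    per         = round(count / total_count, 2) # proportion of owners
  )

own_by_income
```

Because the owner count and the total come from the same grouped rows, this version does not rely on two separate summary tables listing the income levels in the same order.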

Research question 2:

race_NY<-brfss2013%>%filter(X_state=="New York",!is.na(X_race))%>%group_by(X_race)%>%summarize(count=n())
people_total<-sum(race_NY$count)
race_NY<-race_NY%>%mutate(per=round((count/people_total*100),2))
race_NY

## # A tibble: 8 x 3
##   X_race                                                       count   per
##   <fct>                                                        <int> <dbl>
## 1 White only, non-Hispanic                                      5913 67.9 
## 2 Black only, non-Hispanic                                      1014 11.6 
## 3 American Indian or Alaskan Native only, Non-Hispanic            57  0.65
## 4 Asian only, non-Hispanic                                       359  4.12
## 5 Native Hawaiian or other Pacific Islander only, Non-Hispanic    31  0.36
## 6 Other race only, non-Hispanic                                   17  0.2 
## 7 Multiracial, non-Hispanic                                       86  0.99
## 8 Hispanic                                                      1227 14.1

ggplot(race_NY,aes(x=X_race,y=per,fill=X_race))+geom_bar(stat="identity")+theme(axis.text.x=element_blank())+ylab("Percentage")+xlab("Race")+ggtitle("NY Race bar")+ labs(fill='Race')

Author’s summary statistics and narrative:

Used variables: X_race, X_state, race_NY (author defined)

I wanted to know the racial composition of New York respondents. I excluded ‘NA’ values and counted the respondents of each race in New York. Next, I summed the counts to get the total, divided each race’s count by that total, and multiplied by 100 to express it as a percentage. I used a bar plot: the x-axis shows race and the y-axis shows each race’s percentage of New York respondents. The percentages sum to 100. The largest group was ‘White only, non-Hispanic’.
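As a quick cross-check of the percentage calculation, the counts printed in the race_NY table can be fed to base R’s prop.table(). This is a sketch that works only from the printed counts, not from the raw data; `race_counts` and `race_per` are illustrative names.

```r
# Counts copied from the race_NY table above.
race_counts <- c(
  "White only, non-Hispanic" = 5913,
  "Black only, non-Hispanic" = 1014,
  "American Indian or Alaskan Native only, Non-Hispanic" = 57,
  "Asian only, non-Hispanic" = 359,
  "Native Hawaiian or other Pacific Islander only, Non-Hispanic" = 31,
  "Other race only, non-Hispanic" = 17,
  "Multiracial, non-Hispanic" = 86,
  "Hispanic" = 1227
)

# prop.table() divides each count by the sum of all counts.
race_per <- round(prop.table(race_counts) * 100, 2)

# Largest group first.
sort(race_per, decreasing = TRUE)
```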

Research question 3:

sleep<-brfss2013%>%filter(X_state=="New York",!is.na(sleptim1))%>%group_by(sleptim1)%>%summarize(count=n())
sleep_total<-sum(sleep$count)
sleep<-sleep%>%mutate(per=round(count/sleep_total*100,2))
a<-sum(sleep$sleptim1*sleep$count)
avg_sleep_time<-c("Average sleep time",round(a/sleep_total,2))
sleep

## # A tibble: 20 x 3
##    sleptim1 count   per
##       <int> <int> <dbl>
##  1        1     4  0.05
##  2        2    22  0.25
##  3        3    94  1.06
##  4        4   305  3.45
##  5        5   737  8.34
##  6        6  2204 25.0 
##  7        7  2667 30.2 
##  8        8  2199 24.9 
##  9        9   320  3.62
## 10       10   179  2.03
## 11       11    14  0.16
## 12       12    57  0.65
## 13       13     1  0.01
## 14       14     7  0.08
## 15       15     3  0.03
## 16       16     7  0.08
## 17       17     1  0.01
## 18       18     5  0.06
## 19       20     5  0.06
## 20       22     1  0.01

avg_sleep_time

## [1] "Average sleep time" "6.88"

ggplot(sleep,aes(x=sleptim1,y=per,fill=sleptim1))+geom_bar(stat="identity")+ylab("Percentage")+xlab("Sleep time")+ggtitle("NY Sleep time histogram")+ labs(fill='Sleep time')

Author’s summary statistics and narrative:

Used variables: X_state, sleptim1, sleep_total (author defined), avg_sleep_time (author defined)

The third question concerns the distribution of sleep time and the average sleep time. I counted the respondents at each reported sleep time, in one-hour intervals, and calculated the total number of respondents. I divided the count for each interval by the total and multiplied by 100 to express it as a percentage. I used a bar plot: the x-axis shows sleep time in one-hour intervals and the y-axis shows the percentage. The most common sleep time for New York respondents is 7 hours. To get the average sleep time, I multiplied each hour value by the number of respondents in that interval, summed the products, and divided by the total respondent count. The average was 6.88 hours.
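The average described above is a weighted mean of the hour values, with the respondent counts as weights. A minimal sketch using base R’s weighted.mean() on the counts printed in the sleep table (the names `hours`, `counts`, `avg_hours` and `mode_hours` are illustrative):

```r
# Hour values and counts copied from the sleep table above.
hours  <- c(1:18, 20, 22)
counts <- c(4, 22, 94, 305, 737, 2204, 2667, 2199, 320, 179,
            14, 57, 1, 7, 3, 7, 1, 5, 5, 1)

# Weighted mean: sum(hours * counts) / sum(counts).
avg_hours <- weighted.mean(hours, counts)

# Most frequently reported sleep time.
mode_hours <- hours[which.max(counts)]

round(avg_hours, 2)
mode_hours
```

Here the result stays numeric, so it can be reused in later calculations without converting it back from text.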
