0.1 Youth Risk Behavior Surveillance

Every two years, the Centers for Disease Control and Prevention (CDC) conducts the Youth Risk Behavior Surveillance System (YRBSS) survey, collecting data from high schoolers (9th through 12th grade) to analyze health patterns. You will work with a selected group of variables from a random sample of observations from one of the years the YRBSS was conducted.

0.1.1 Load the data

data(yrbss)
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…

0.1.2 Exploratory Data Analysis

You will first start with analyzing the weight of participants in kilograms. Using visualization and summary statistics, describe the distribution of weights. How many observations are we missing weights from?

  • The distribution of weight is skewed to the right, with a mean of 67.91 kg and a standard deviation of 16.90 kg. In this dataset, we are missing weights for 1,004 observations.
skimr::skim(yrbss) #skimmed data
Data summary
Name yrbss
Number of rows 13583
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 12 1.00 4 6 0 2 0
grade 79 0.99 1 5 0 5 0
hispanic 231 0.98 3 8 0 2 0
race 2805 0.79 5 41 0 5 0
helmet_12m 311 0.98 5 12 0 6 0
text_while_driving_30d 918 0.93 1 13 0 8 0
hours_tv_per_school_day 338 0.98 1 12 0 7 0
school_night_hours_sleep 1248 0.91 1 3 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 77 0.99 16.16 1.26 12.00 15.0 16.00 17.00 18.00 ▁▂▅▅▇
height 1004 0.93 1.69 0.10 1.27 1.6 1.68 1.78 2.11 ▁▅▇▃▁
weight 1004 0.93 67.91 16.90 29.94 56.2 64.41 76.20 180.99 ▆▇▂▁▁
physically_active_7d 273 0.98 3.90 2.56 0.00 2.0 4.00 7.00 7.00 ▆▂▅▃▇
strength_training_7d 1176 0.91 2.95 2.58 0.00 0.0 3.00 5.00 7.00 ▇▂▅▂▅
# plot distribution of weights
yrbss %>% 
  ggplot(aes(weight)) +
    geom_histogram() +
    theme_bw() +
  labs(
    title = "Distribution of Weights",
    x = "Weight (kg)",
    y = NULL
  )

Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Let’s create a new variable in the dataframe yrbss, called physical_3plus, which will be yes if they are physically active for at least 3 days a week, and no otherwise. You may also want to calculate the number and % of those who are and are not active for at least 3 days. Use the count() function and see if you get the same results as group_by()… summarise()

The number and % are shown below; 'count()' and 'group_by()…summarise()' give identical results.

tidy_yrbss <- yrbss %>%
  drop_na(physically_active_7d) %>% # delete missing values in 'physically_active_7d'
  mutate(physical_3plus = ifelse(physically_active_7d >= 3, "yes", "no")) 

# using 'count()'
tidy_yrbss %>%  
  count(physical_3plus) %>%
  mutate(percent = n/sum(n))
## # A tibble: 2 × 3
##   physical_3plus     n percent
##   <chr>          <int>   <dbl>
## 1 no              4404   0.331
## 2 yes             8906   0.669
# using 'group_by()...summarise()'
tidy_yrbss %>%
  group_by(physical_3plus) %>%
  summarise(n = n()) %>%
  mutate(percent = n/sum(n))
## # A tibble: 2 × 3
##   physical_3plus     n percent
##   <chr>          <int>   <dbl>
## 1 no              4404   0.331
## 2 yes             8906   0.669

Can you provide a 95% confidence interval for the population proportion of high schoolers who are NOT active 3 or more days per week?

#calculate CI using formula at 95% confidence level
CI_yrbss <- tidy_yrbss %>% 
  count(physical_3plus) %>%
  mutate(sum = sum(n), 
         p = n/sum) %>%
  filter(physical_3plus == "no") %>%
  summarise(se = sqrt(p*(1-p)/sum),
            lower_CI = p - 1.96 * se,
            upper_CI = p + 1.96 * se) 
CI_yrbss
## # A tibble: 1 × 3
##        se lower_CI upper_CI
##     <dbl>    <dbl>    <dbl>
## 1 0.00408    0.323    0.339
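As a cross-check (a sketch, not part of the original workflow), the same interval can be obtained from base R's prop.test(), which takes the raw counts: 4,404 "no" responses out of the 13,310 non-missing observations.

```r
# Wilson score interval for the proportion of "no" responses;
# for n this large it agrees with the hand-computed Wald interval above
ci <- prop.test(4404, 13310, correct = FALSE)$conf.int
round(ci, 3)  # approximately 0.323 to 0.339, matching the formula-based CI
```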

Make a boxplot of physical_3plus vs. weight. Is there a relationship between these two variables? What did you expect and why?

  • The box plots overlap substantially, so it is difficult to establish a relationship between the two variables. Since exercise is widely promoted as part of a healthy lifestyle, and weight is often treated as an indicator of health, we expected more activity to have a visible effect on weight. Surprisingly, the plot shows no clear difference between the two groups.
ggplot(
  tidy_yrbss, 
  mapping = aes(weight, physical_3plus)) +
  geom_boxplot() +
  theme_bw() +
  labs(
    title = "Relationship between Weight and Being Active at Least 3 Days a Week",
    x = "Weight (kg)",
    y = "Physically Active 3+ Days"
  )

0.1.3 Confidence Interval

Boxplots show how the medians of the two distributions compare, but we can also compare the means of the distributions using either a confidence interval or a hypothesis test. Note that when we calculate the mean, SD, etc. of weight in these groups using the mean function, we must ignore any missing values by setting na.rm = TRUE.
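A quick illustration of why na.rm matters (a toy vector, not the survey data):

```r
x <- c(70, NA, 65)
mean(x)               # NA: any missing value makes the mean NA
mean(x, na.rm = TRUE) # 67.5: missing values are dropped before averaging
```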

# group means with a +/- 1 SE interval (a full 95% CI would use mean +/- 1.96*SE)
tidy_yrbss %>% 
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm=TRUE),
            SD_weight = sd(weight, na.rm=TRUE),
            n = n(),
            SE_weight = SD_weight/sqrt(n),
            lower_CI = mean_weight - SE_weight,
            upper_CI = mean_weight + SE_weight)
## # A tibble: 2 × 7
##   physical_3plus mean_weight SD_weight     n SE_weight lower_CI upper_CI
##   <chr>                <dbl>     <dbl> <int>     <dbl>    <dbl>    <dbl>
## 1 no                    66.7      17.6  4404     0.266     66.4     66.9
## 2 yes                   68.4      16.5  8906     0.175     68.3     68.6
  • There is an observed difference of about 1.77 kg (68.44 − 66.67), and the two intervals do not overlap. Note that the intervals above are mean ± 1 SE; even widened to full 95% confidence intervals (mean ± 1.96 × SE) they still do not overlap, which suggests the difference is statistically significant at the 5% level. Let us also conduct a hypothesis test.

0.1.4 Hypothesis test with formula

Write the null and alternative hypotheses for testing whether mean weights are different for those who exercise at least 3 times a week and those who don’t.

  • Null hypothesis H0: the mean weights of the two groups are equal. Alternative hypothesis H1: the mean weights of the two groups differ.

t.test(weight ~ physical_3plus, data = tidy_yrbss) # using 'tidy_yrbss', which excludes missing activity values
## 
##  Welch Two Sample t-test
## 
## data:  weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -2.42 -1.12
## sample estimates:
##  mean in group no mean in group yes 
##              66.7              68.4
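As a sanity check (a sketch using the rounded group summaries from the table above, not part of the original analysis), the Welch t-statistic can be reproduced by hand:

```r
# Welch two-sample t-statistic: t = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)
# Group summaries (rounded) taken from the summarise() table above
m_no  <- 66.7; s_no  <- 17.6; n_no  <- 4404
m_yes <- 68.4; s_yes <- 16.5; n_yes <- 8906

t_stat <- (m_no - m_yes) / sqrt(s_no^2 / n_no + s_yes^2 / n_yes)
round(t_stat)  # -5, matching the t.test() output
```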

0.1.5 Hypothesis test with infer

Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.

But first, we need to initialize the test, which we will save as obs_diff.

obs_diff <- tidy_yrbss %>% 
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

After you have initialized the test, you need to simulate the test on the null distribution, which we will save as null.

set.seed(1234) # for reproducible permutations
null_dist <- tidy_yrbss %>%
  # specify variables
  specify(weight ~ physical_3plus) %>% 
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.106 
##  2         2  0.0620
##  3         3  0.481 
##  4         4  0.138 
##  5         5 -0.435 
##  6         6 -0.224 
##  7         7 -0.0733
##  8         8 -0.275 
##  9         9 -0.407 
## 10        10  0.412 
## # … with 990 more rows

Here, hypothesize is used to set the null hypothesis as a test for independence, i.e., that there is no difference between the two population means. In one-sample cases, the null argument can be set to "point" to test a hypothesis relative to a point estimate.
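Under the hood, generate(type = "permute") shuffles the group labels while keeping the response values fixed. A minimal base-R sketch of the same idea (using hypothetical simulated data, not the yrbss sample):

```r
set.seed(42)
# Hypothetical data: 100 weights, 60 labelled "yes" and 40 "no"
grp <- rep(c("yes", "no"), c(60, 40))
wt  <- c(rnorm(60, mean = 68, sd = 16), rnorm(40, mean = 67, sd = 17))

obs <- mean(wt[grp == "yes"]) - mean(wt[grp == "no"])

# Shuffle labels under the null, recompute the difference in means, repeat
perm_diffs <- replicate(1000, {
  g <- sample(grp)
  mean(wt[g == "yes"]) - mean(wt[g == "no"])
})

# Two-sided permutation p-value
p_val <- mean(abs(perm_diffs) >= abs(obs))
```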

We can visualize this null distribution with the following code:

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram(binwidth = 0.2) +
  theme_bw() +
  labs(
    title = "Null Distribution",
    x = "Stat",
    y = NULL
  )

Now that the test is initialized and the null distribution formed, we can visualise how many of these null permutations have a difference at least as large as the observed difference, obs_diff.

null_dist %>%
  visualise() +
  theme_bw()

We can also calculate the p-value for your hypothesis test using the function infer::get_p_value().

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided") +
    theme_bw()

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

This is the standard workflow for performing hypothesis tests. Note that a reported p-value of exactly 0 simply means that none of the 1,000 permutations produced a difference as extreme as the observed one; it is better read as p < 1/1000 than as literally zero.

0.2 IMDB ratings: Differences between directors

Recall the IMDB ratings data. I would like you to explore whether the mean IMDB ratings for Steven Spielberg and Tim Burton are the same or not.

First, I would like you to reproduce this graph. You may find geom_errorbar() and geom_rect() useful.

movies <- read_csv(here::here("data", "movies.csv"))
movies %>%
  filter(director%in% c("Steven Spielberg","Tim Burton")) %>%
  group_by(director) %>%
  summarise(net=mean(rating), 
            std=sd(rating),
            count=n(),
            t_critical=qt(0.975, count-1),
            se=std/sqrt(count),
            margin_of_error=t_critical*se,
            rating_low=net-margin_of_error,
            rating_high=net+margin_of_error) %>%
  mutate(director=fct_reorder(director,net)) %>%
  
  #plot the CI
  ggplot(aes(net,director,color=director)) +
  geom_point(aes(colour = director, size = 10)) +
  geom_text(aes(label = round(net,2),vjust=-1))+
  #
  geom_rect(aes( 
    xmin = max(rating_low), 
    xmax = min(rating_high),
    ymin = 0, 
    ymax = 3),
    alpha = 0.2,
    color = "grey") +
  geom_errorbar(aes(
    xmin = rating_low, 
    xmax = rating_high), 
    lwd = 2.0, 
    width = 0.2) +
  geom_text(aes(label = round(rating_high,2),
                hjust = -3.75,
                vjust = -1.25)) +
  geom_text(aes(label = round(rating_low,2),
                hjust = 4.5,
                vjust=-1.25)) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(
    title = "Do Spielberg and Burton have the same IMDB rating?",
    subtitle = "95% confidence interval overlap",
    x = "Mean IMDB Rating",
    y = NULL
  )

In addition, you will run a hypothesis test. You should use both the t.test command and the infer package to simulate from a null distribution, where you assume zero difference between the two.

Before anything, write down the null and alternative hypotheses, as well as the resulting test statistic and the associated t-stat or p-value. At the end of the day, what do you conclude?

  • H0: the mean IMDB ratings of Spielberg and Burton films are equal; H1: they differ. The Welch t-test below gives t = 3 with p = 0.01, and the permutation test gives p = 0.008, so we reject the null hypothesis at the 5% level.

final <- movies %>%
  filter(director%in% c("Steven Spielberg","Tim Burton"))
t.test(rating ~director, data=final)
## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93
set.seed(1234)

ratings <- final %>%
  specify(rating ~ director) %>%
  hypothesise(null="independence") %>%
  generate(reps=1000, type="permute") %>%
  calculate(
    stat="diff in means", 
    order=c("Steven Spielberg","Tim Burton"))

observed_difference <- final %>%
  specify(rating ~ director) %>%
  calculate(stat="diff in means", order=c("Steven Spielberg","Tim Burton"))

ratings %>%
  visualise() +
  theme_bw()

ratings %>%
  get_p_value(obs_stat=observed_difference, direction="both")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.008
  • The p-value (0.008) is below 5%, hence we reject the null hypothesis: the mean IMDB ratings of the two directors differ.

0.3 Omega Group plc- Pay Discrimination

At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated against in the company, in the sense that salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.

You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.

0.3.1 Loading the data

omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…

0.3.2 Relationship Salary - Gender ?

The data frame omega contains the salaries for the sample of 50 executives in the company.

Calculate summary statistics on salary by gender. Also, create and print a dataframe where, for each gender, you show the mean, SD, sample size, the t-critical value, the SE, the margin of error, and the low/high endpoints of a 95% confidence interval

# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
##   gender   min    Q1 median    Q3   max  mean   sd  n missing
## 1 female 47033 60338  64618 70033 78800 64543 7567 26       0
## 2   male 54768 68331  74675 78568 84576 73239 7463 24       0
summary_df <- omega %>%
  group_by(gender) %>% 
  # Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size
  summarise(mean_salary = mean(salary), 
            SD_salary = sd(salary), 
            sample_size = n(), 
            # the t-critical value, the standard error, the margin of error,
            t_critical = qt(0.975,sample_size-1), 
            SE = SD_salary/sqrt(sample_size),
            # and the low/high endpoints of a 95% confidence interval 
            low_CI = mean_salary - SE*t_critical,
            UP_CI = mean_salary + SE*t_critical) 

summary_df 
## # A tibble: 2 × 8
##   gender mean_salary SD_salary sample_size t_critical    SE low_CI  UP_CI
##   <chr>        <dbl>     <dbl>       <int>      <dbl> <dbl>  <dbl>  <dbl>
## 1 female      64543.     7567.          26       2.06 1484. 61486. 67599.
## 2 male        73239.     7463.          24       2.07 1523. 70088. 76390.

What can you conclude from your analysis?

  • There is no overlap between the 95% confidence intervals for male and female salaries, which suggests that the true mean salary for men is higher than for women.

You can also run a hypothesis test, assuming as a null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money. You should run your hypothesis test using t.test() and with the simulation method from the infer package.

# hypothesis testing using t.test()

t.test(salary~gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239
# hypothesis testing using infer package
set.seed(1235)

observed_difference <- omega %>% 
  specify(salary ~ gender) %>% 
  calculate(stat = "diff in means",
            order = c("male","female")) 

salary_in_null_world <- omega %>% 
  specify(salary ~ gender) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("male","female"))

# calculating p-value for two-sided hypothesis testing
salary_in_null_world %>% visualise() + 
  shade_p_value(obs_stat = observed_difference,direction = "two-sided")

salary_in_null_world %>% 
  get_p_value(obs_stat = observed_difference,direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

What can you conclude from your analysis?

  • Here we can see that the p-value is below 5%, hence there is a significant difference between male and female salaries. But we don’t know whether this is a causal relationship or whether it is influenced by confounding factors.

0.3.3 Relationship Experience - Gender?

At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire sent out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women have only about 7 years of experience on average (see table below).

# Summary statistics of experience by gender
favstats (experience ~ gender, data=omega)
##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0

Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Perform similar analyses as in the previous section. Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?

  • As mentioned earlier, we don’t know whether there is a causal relationship or an influence of confounding factors. However, this analysis shows a clear difference in mean experience between the male and female executives, which weakens the conclusion that the salary gap reflects discrimination alone.
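The same t.test() workflow used for salaries applies here, i.e. t.test(experience ~ gender, data = omega). A self-contained sketch with hypothetical simulated data standing in for omega, roughly matching the group summaries above (26 women averaging about 7.4 years, 24 men about 21.1 years):

```r
set.seed(1)
# Hypothetical stand-in for 'omega'; experience in years, floored at the group minima
sim_omega <- data.frame(
  gender = rep(c("female", "male"), c(26, 24)),
  experience = c(pmax(rnorm(26, mean = 7.4, sd = 8.5), 0),
                 pmax(rnorm(24, mean = 21.1, sd = 10.9), 1))
)
tt <- t.test(experience ~ gender, data = sim_omega)
tt$p.value  # expect a small p-value for a gap this large
```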

0.3.4 Relationship Salary - Experience ?

Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.

Analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data

ggplot(omega,
       aes(x=salary,y=experience)) + 
  geom_point(aes(color = gender)) + 
  labs(title ="Relationship between salary and experience",
       subtitle ="Scatterplot colored by gender",
       x = "Salary",
       y = "Experience") + 
  theme_bw() 

0.3.5 Check correlations between the data

You can use GGally::ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order in which our variables will appear and put the dependent variable (Y), salary, last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make the plots somewhat transparent (alpha = 0.3).

omega %>% 
  select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

Look at the salary vs experience scatterplot. What can you infer from this plot?

  • There is a positive correlation between salary and experience for both genders, and the correlation coefficient appears stronger for females than for males.

0.4 Challenge 1: Brexit plot

Using your data manipulation and visualisation skills, please use the Brexit results dataframe (the same dataset you used in the pre-programme assignment) and produce the following plot.

brexit <- read_csv("https://raw.githubusercontent.com/kostis-christodoulou/am01/master/data/brexit_results.csv")

# renaming the columns
brexit <- brexit %>%
  rename(Conservative = con_2015,
         Labour = lab_2015,
         LibDems = ld_2015,
         UKIP = ukip_2015)

# pivoting the table longer
brexit_longer <- brexit %>%
  pivot_longer(
    cols = Conservative:UKIP,
    names_to = "party",
    values_to = "count"
  )

color_party = c("#0087DC", "#E32636", "#FAA61A", "#6D3177") #assigning party colors to a vector

ggplot(brexit_longer,
  aes(
    x = count,
    y = leave_share
  )) +
  geom_point(aes(color = party), alpha = 0.4)  +  
  scale_color_manual(values = color_party) +
  #adding trend lines
  geom_smooth(data = brexit,
      aes(x = Conservative, y = leave_share), 
      color = color_party[1],
      method = "lm",
      se = TRUE) +
  geom_smooth(data = brexit,
      aes(x = Labour, y = leave_share),
      color = color_party[2],
      method = "lm",
      se = TRUE) +
  geom_smooth(data = brexit,
      aes(x = LibDems, y = leave_share),
      color = color_party[3],
      method = "lm",
      se = TRUE) +
  geom_smooth(data = brexit,
      aes(x = UKIP, y = leave_share),
      color = color_party[4],
      method = "lm",
      se = TRUE) +
  
  theme_bw() +
  theme(legend.position = "bottom") +
  theme(legend.title = element_blank()) +
  
  
  labs(
    title = "How political affiliation translated to Brexit Voting",
    x = "Party % in the UK 2015 General Elections",
    y = "Leave % in 2016 Brexit Referendum"
  ) 

0.5 Challenge 2 : GDP components over time and among countries

At the risk of oversimplifying things, the main components of gross domestic product (GDP) are personal consumption (C), business investment (I), government spending (G) and net exports (exports − imports). You can read more about GDP and the different approaches to calculating it on the Wikipedia GDP page.

The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries from 1970 to today. We will look at how GDP and its components have changed over time, and compare different countries and how much each component contributes to that country’s GDP.

UN_GDP_data  <-  read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
                sheet="Download-GDPconstant-USD-countr", # Sheet name
                skip=2) # Number of rows to skip

The first thing you need to do is to tidy the data, as it is in wide format and you must make it into long, tidy format. Please express all figures in billions (divide values by 1e9, or \(10^9\)), and you want to rename the indicators into something shorter.

tidy_GDP_data  <-  UN_GDP_data %>% 
  pivot_longer(4:51,names_to = "year",values_to ="value") %>% 
  mutate(value = value/1e9) %>% 
  # Renaming the columns
  mutate(IndicatorName = dplyr::recode(
    IndicatorName,
    'Exports of goods and services' = 'Exports',
    'General government final consumption expenditure' = 'Government_Expenditure',
    'Imports of goods and services' = 'Imports', 
    'Household consumption expenditure (including Non-profit institutions serving households)' = 'Household_expenditure',
    'Gross capital formation' = 'Gross_capital_formation',
    'Gross Domestic Product (GDP)' = 'real_GDP'))
  
glimpse(tidy_GDP_data)
## Rows: 176,880
## Columns: 5
## $ CountryID     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ Country       <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
## $ IndicatorName <chr> "Final consumption expenditure", "Final consumption expe…
## $ year          <chr> "1970", "1971", "1972", "1973", "1974", "1975", "1976", …
## $ value         <dbl> 5.56, 5.33, 5.20, 5.75, 6.15, 6.32, 6.37, 6.90, 7.09, 6.…
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")

First, can you produce this plot?

Indicator_list <- c('Gross_capital_formation','Exports','Government_Expenditure','Household_expenditure','Imports','real_GDP')

tidy_GDP_data %>% 
  filter(Country %in% country_list,
         IndicatorName %in% Indicator_list) %>% 
  # generate plot
  ggplot(aes(x = year,y = value,color = IndicatorName,group = IndicatorName)) +
  geom_line(size = 1.2) +
  facet_wrap(~Country) + 
  theme_bw() + 
  # adjusting x axis
  scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
  labs(
    title = "GDP components over time",
    subtitle = "In constant 2010 USD",
    x = NULL,
    y = "Billion US$", 
    color = "Components of GDP"
  )

Secondly, recall that GDP is the sum of Household Expenditure (Consumption C), Gross Capital Formation (business investment I), Government Expenditure (G) and Net Exports (exports - imports). Even though there is an indicator Gross Domestic Product (GDP) in your dataframe, I would like you to calculate it given its components discussed above.

What is the % difference between what you calculated as GDP and the GDP figure included in the dataframe?

  • The % difference between the GDP figure we calculated and the one included in the dataframe is 0.867% on average.
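The identity used in the chunk below, GDP = C + I + G + (X − M), with toy numbers (hypothetical, in $bn) just to make the arithmetic concrete:

```r
# Hypothetical components, in billions of US$
C <- 500; I <- 150; G <- 120; X <- 90; M <- 110
gdp <- C + I + G + (X - M)
gdp  # 750: net exports of -20 reduce GDP
```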

gdp_proportion_df <- tidy_GDP_data %>% 
  filter(Country %in% country_list,
         IndicatorName %in% Indicator_list) %>%
  pivot_wider(names_from = IndicatorName,values_from = value) %>% 
  mutate(GDP = Household_expenditure + Government_Expenditure + Gross_capital_formation + Exports - Imports, 
         Household_expenditure = Household_expenditure/GDP, 
         Government_Expenditure = Government_Expenditure/GDP, 
         Gross_capital_formation = Gross_capital_formation/GDP, 
         Net_Exports = (Exports-Imports)/GDP,
         GDP_difference = (GDP/real_GDP)-1) %>% 
         select(-Imports,-Exports)

mean(gdp_proportion_df$GDP_difference,na.rm = FALSE)
## [1] 0.00867
gdp_proportion_df %>% 
  select(-GDP,-real_GDP) %>% 
  pivot_longer(4:7,names_to = 'IndicatorName', values_to = 'values') %>% 
  ggplot(aes(year,values,color = IndicatorName,group = IndicatorName)) + 
  geom_line(size = 1.05) + 
  facet_wrap(~Country) +
  theme_bw() + 
  theme(legend.title = element_blank()) + 
  scale_x_discrete(breaks = seq(1970,2020,10)) + 
  labs(title = "GDP and its breakdown at constant 2010 prices in US dollars", 
       y ="proportion",
       x= "") + 
  scale_y_continuous(labels = scales::percent)

What is this last chart telling you? Can you explain in a couple of paragraphs the different dynamic among these three countries?

0.5.1 Deliverables

There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

0.5.1.1 Details

Who did you collaborate with: Group 6: Sonakshi Gupta, Jean Francois Peters, Wybe Harms, Drishti Goyal, Suzy Wang, Zezhou Tang

Approximately how much time did you spend on this problem set: 5 hrs 45 min 10 sec

What, if anything, gave you the most trouble: It was initially difficult to combine all that we had learnt in the lecture to the assignment, but after going through the slides again, the task became easier.

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

0.5.1.2 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.