Homework 1

Rents in San Francisco 2000-2018

Kate Pennington created a panel of historic Craigslist rents by scraping posts archived by the Wayback Machine. You can read more about her work here.

What impact does new housing have on rents, displacement, and gentrification in the surrounding neighborhood? Read our interview with economist Kate Pennington about her article, “Does Building New Housing Cause Displacement?: The Supply and Demand Effects of Construction in San Francisco.”

In our case, we have a clean(ish) dataset with about 200K rows that correspond to Craigslist listings for rental properties in the greater SF area. The data dictionary is as follows:

variable class description
post_id character Unique ID
date double date
year double year
nhood character neighborhood
city character city
county character county
price double price in USD
beds double n of beds
baths double n of baths
sqft double square feet of rental
room_in_apt double room in apartment
address character address
lat double latitude
lon double longitude
title character title of listing
descr character description
details character additional details

The dataset was used in a recent tidyTuesday project.

# download directly from the tidytuesday github repo

rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')

What are the variable types? Do they all correspond to what they really are? Which variables have the most missing values?

There are two variable types: character and numeric. The date column should be of the Date type but is stored as a double. descr has the most missing values (197,542 rows missing), followed by address and lon.
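
A quick way to confirm this, before running skim, is to count the missing values per column (a sketch using the rent data frame created above):

# count the NAs in each column, most missing first
sort(colSums(is.na(rent)), decreasing = TRUE)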

skimr::skim(rent)
Table: Data summary
Name rent
Number of rows 200796
Number of columns 17
_______________________
Column type frequency:
character 8
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
post_id 0 1.00 9 14 0 200796 0
nhood 0 1.00 4 43 0 167 0
city 0 1.00 5 19 0 104 0
county 1394 0.99 4 13 0 10 0
address 196888 0.02 1 38 0 2869 0
title 2517 0.99 2 298 0 184961 0
descr 197542 0.02 13 16975 0 3025 0
details 192780 0.04 4 595 0 7667 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
date 0 1.00 2.01e+07 44694.07 2.00e+07 2.01e+07 2.01e+07 2.01e+07 2.02e+07 ▁▇▁▆▃
year 0 1.00 2.01e+03 4.48 2.00e+03 2.00e+03 2.01e+03 2.01e+03 2.02e+03 ▁▇▁▆▃
price 0 1.00 2.14e+03 1427.75 2.20e+02 1.30e+03 1.80e+03 2.50e+03 4.00e+04 ▇▁▁▁▁
beds 6608 0.97 1.89e+00 1.08 0.00e+00 1.00e+00 2.00e+00 3.00e+00 1.20e+01 ▇▂▁▁▁
baths 158121 0.21 1.68e+00 0.69 1.00e+00 1.00e+00 2.00e+00 2.00e+00 8.00e+00 ▇▁▁▁▁
sqft 136117 0.32 1.20e+03 5000.22 8.00e+01 7.50e+02 1.00e+03 1.36e+03 9.00e+05 ▇▁▁▁▁
room_in_apt 0 1.00 0.00e+00 0.04 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 ▇▁▁▁▁
lat 193145 0.04 3.77e+01 0.35 3.36e+01 3.74e+01 3.78e+01 3.78e+01 4.04e+01 ▁▁▅▇▁
lon 196484 0.02 -1.22e+02 0.78 -1.23e+02 -1.22e+02 -1.22e+02 -1.22e+02 -7.42e+01 ▇▁▁▁▁

Make a plot that shows the top 20 cities in terms of % of classifieds between 2000-2018.

rent %>%
  group_by(city) %>%
  summarize(city_count = n()) %>% #count listings per city
  mutate(percent_city = city_count/sum(city_count)) %>%
  slice_max(order_by = percent_city, n = 20) %>% #keep the top 20 cities
  ggplot(
    aes(
      x = percent_city,
      y = fct_reorder(city, percent_city) #reorder city by % of listings
      )
    ) + 
  geom_col() +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(
    title = "San Francisco accounts for more than a quarter of all rental classifieds", 
    caption = "Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018", 
    subtitle = "% of Craigslist listings, 2000-2018", 
    x = NULL,
    y = NULL) +
  theme_bw(base_size = 14)

Make a plot that shows the evolution of median prices in San Francisco for 0-, 1-, 2-, and 3-bedroom listings.

rent %>%
  filter(city=="san francisco",beds<=3) %>%
  group_by(beds,year) %>%
  summarize(median_price=median(price)) %>%
  ggplot(aes(x=year,y=median_price,color=factor(beds))) + #setting colour for no. of beds
  geom_line() +
  facet_wrap(~beds,ncol=4) + #ncol is used to display the graphs in one line
  labs(
    title = "San Francisco rents have been been steadily increasing",
    subtitle = "0 to 3-bed listings, 2000-2018",
    caption = "Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018",
    x = NULL,
    y = NULL) +
  theme_bw(base_size = 14) +
  theme(legend.position = "none") # to remove the legend

Finally, make a plot that shows median rental prices for the top twelve cities in the Bay Area.

top <- rent %>%
      count(city) %>%
      slice_max(order_by = n, n = 12) #derive the top 12 cities by number of listings

rent %>%
  filter(city %in% top$city, beds == 1) %>% #keep only listings from the top 12 cities
  group_by(year, city) %>%
  summarise(median_price = median(price)) %>%
  ggplot(
    mapping = aes(x = year, y = median_price, colour = factor(city))) +
  geom_line() +
  facet_wrap(~city) +
  labs(
    title = "Rental prices for 1-bedroom flats in the Bay Area",
    caption = "Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018",
    x = NULL,
    y = NULL
  ) +
  theme_bw(base_size = 14) + 
  theme(legend.position = "none") 

What can you infer from these plots?

Looking at the rental prices for 1-bedroom apartments in the Bay Area, we see a clear trend across all locations: rents rose over the nearly 20-year period (2000-2018), with some areas increasing more than others but most areas roughly doubling. Interestingly, major social and economic events are reflected in the rental prices. For example, there is a small dip around 2002, most likely an after-effect of the dot-com crash two years earlier, which hit Silicon Valley and the Bay Area particularly hard.

Similarly, a large drop can be seen around 2010, most likely due to the lingering effects of the financial crisis. Lastly, some areas, with Palo Alto leading the way, show a sharp drop around 2015. There is no obvious external factor that explains this drop, so we can only speculate that the steep rise in prices in the preceding years was followed by a correction.

Analysis of movies - IMDB dataset

We will look at a subset of movies, taken from the Kaggle IMDB 5000 movie dataset.

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of Facebook likes the cast members received
  • votes: the number of people who voted for (or rated) the movie on IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

Use your data import, inspection, and cleaning skills to answer the following:

  • Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?

No, there are no missing values in any of the columns, and all rows are distinct (no duplicate entries).
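
Two quick base-R checks support this (a sketch using the movies data frame loaded above):

anyNA(movies)           #TRUE if any cell in the data frame is missing
sum(duplicated(movies)) #number of fully duplicated rows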

skimr::skim(movies)
Table 1: Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁
unique(movies) #find unique records in the data set
## # A tibble: 2,961 × 11
##    title genre direc…¹  year durat…²  gross budget cast_…³  votes reviews rating
##    <chr> <chr> <chr>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
##  1 Avat… Acti… James …  2009     178 7.61e8 2.37e8    4834 8.86e5    3777    7.9
##  2 Tita… Drama James …  1997     194 6.59e8 2   e8   45223 7.93e5    2843    7.7
##  3 Jura… Acti… Colin …  2015     124 6.52e8 1.5 e8    8458 4.18e5    1934    7  
##  4 The … Acti… Joss W…  2012     173 6.23e8 2.2 e8   87697 9.95e5    2425    8.1
##  5 The … Acti… Christ…  2008     152 5.33e8 1.85e8   57802 1.68e6    5312    9  
##  6 Star… Acti… George…  1999     136 4.75e8 1.15e8   37723 5.35e5    3917    6.5
##  7 Star… Acti… George…  1977     125 4.61e8 1.1 e7   13485 9.11e5    1752    8.7
##  8 Aven… Acti… Joss W…  2015     141 4.59e8 2.5 e8   92000 4.63e5    1752    7.5
##  9 The … Acti… Christ…  2012     164 4.48e8 2.5 e8  106759 1.14e6    3514    8.5
## 10 Shre… Adve… Andrew…  2004      93 4.36e8 1.5 e8    1148 3.15e5     688    7.2
## # … with 2,951 more rows, and abbreviated variable names ¹​director, ²​duration,
## #   ³​cast_facebook_likes
  • Produce a table with the count of movies by genre, ranked in descending order
movies %>%
  count(genre) %>%
  arrange(desc(n)) #sort the genres in descending order on the basis of count
## # A tibble: 17 × 2
##    genre           n
##    <chr>       <int>
##  1 Comedy        848
##  2 Action        738
##  3 Drama         498
##  4 Adventure     288
##  5 Crime         202
##  6 Biography     135
##  7 Horror        131
##  8 Animation      35
##  9 Fantasy        28
## 10 Documentary    25
## 11 Mystery        16
## 12 Sci-Fi          7
## 13 Family          3
## 14 Musical         2
## 15 Romance         2
## 16 Western         2
## 17 Thriller        1
  • Produce a table with the average gross earnings and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ a movie made at the box office for each $ of its budget. Rank genres by return_on_budget in descending order.
movies %>%
  group_by(genre) %>%
  summarise(
    avg_gross_earn = mean(gross), 
    avg_gross_budget = mean(budget)) %>%
  mutate(return_on_budget = avg_gross_earn/avg_gross_budget) %>% #calculate $ made per $ spent
  arrange(desc(return_on_budget))
## # A tibble: 17 × 4
##    genre       avg_gross_earn avg_gross_budget return_on_budget
##    <chr>                <dbl>            <dbl>            <dbl>
##  1 Musical          92084000          3189500          28.9    
##  2 Family          149160478.        14833333.         10.1    
##  3 Western          20821884          3465000           6.01   
##  4 Documentary      17353973.         5887852.          2.95   
##  5 Horror           37713738.        13504916.          2.79   
##  6 Fantasy          42408841.        17582143.          2.41   
##  7 Comedy           42630552.        24446319.          1.74   
##  8 Mystery          67533021.        39218750           1.72   
##  9 Animation        98433792.        61701429.          1.60   
## 10 Biography        45201805.        28543696.          1.58   
## 11 Adventure        95794257.        66290069.          1.45   
## 12 Drama            37465371.        26242933.          1.43   
## 13 Crime            37502397.        26596169.          1.41   
## 14 Romance          31264848.        25107500           1.25   
## 15 Action           86583860.        71354888.          1.21   
## 16 Sci-Fi           29788371.        27607143.          1.08   
## 17 Thriller             2468           300000           0.00823
  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
movies %>%
  group_by(director) %>%
  summarise(
    gross_rev = sum(gross), 
    mean_gross_rev = mean(gross), 
    median_gross_rev = median(gross), 
    std_gross_rev = StdDev(gross)) %>%
  slice_max(order_by = gross_rev, n=15) #deriving top 15 directors by gross revenue
## # A tibble: 15 × 5
##    director           gross_rev mean_gross_rev median_gross_rev std_gross_rev[…¹
##    <chr>                  <dbl>          <dbl>            <dbl>            <dbl>
##  1 Steven Spielberg  4014061704     174524422.       164435221        101421051.
##  2 Michael Bay       2231242537     171634041.       138396624        127161579.
##  3 Tim Burton        2071275480     129454718.        76519172        108726924.
##  4 Sam Raimi         2014600898     201460090.       234903076        162126632.
##  5 James Cameron     1909725910     318287652.       175562880.       309171337.
##  6 Christopher Nolan 1813227576     226653447        196667606.       187224133.
##  7 George Lucas      1741418480     348283696        380262555        146193880.
##  8 Robert Zemeckis   1619309108     124562239.       100853835         91300279.
##  9 Clint Eastwood    1378321100      72543216.        46700000         75487408.
## 10 Francis Lawrence  1358501971     271700394.       281666058        135437020.
## 11 Ron Howard        1335988092     111332341        101587923         81933761.
## 12 Gore Verbinski    1329600995     189942999.       123207194        154473822.
## 13 Andrew Adamson    1137446920     284361730        279680930.       120895765.
## 14 Shawn Levy        1129750988     102704635.        85463309         65484773.
## 15 Ridley Scott      1128857598      80632686.        47775715         68812285.
## # … with abbreviated variable name ¹​std_gross_rev[,1]
  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also min, max, median, SD, and some kind of histogram or density graph that visually shows how ratings are distributed.
top <- movies %>%
  group_by(genre) %>%
  summarise(
    mean_rating = mean(rating), 
    min_rating = min(rating), 
    max_rating=max(rating), 
    median_rating=median(rating), 
    std_rating = StdDev(rating)) 
top #print the table
## # A tibble: 17 × 6
##    genre       mean_rating min_rating max_rating median_rating std_rating[,1]
##    <chr>             <dbl>      <dbl>      <dbl>         <dbl>          <dbl>
##  1 Action             6.23        2.1        9            6.3           1.03 
##  2 Adventure          6.51        2.3        8.6          6.6           1.09 
##  3 Animation          6.65        4.5        8            6.9           0.968
##  4 Biography          7.11        4.5        8.9          7.2           0.760
##  5 Comedy             6.11        1.9        8.8          6.2           1.02 
##  6 Crime              6.92        4.8        9.3          6.9           0.849
##  7 Documentary        6.66        1.6        8.5          7.4           1.77 
##  8 Drama              6.73        2.1        8.8          6.8           0.917
##  9 Family             6.5         5.7        7.9          5.9           1.22 
## 10 Fantasy            6.15        4.3        7.9          6.45          0.959
## 11 Horror             5.83        3.6        8.5          5.9           1.01 
## 12 Musical            6.75        6.3        7.2          6.75          0.636
## 13 Mystery            6.86        4.6        8.5          6.9           0.882
## 14 Romance            6.65        6.2        7.1          6.65          0.636
## 15 Sci-Fi             6.66        5          8.2          6.4           1.09 
## 16 Thriller           4.8         4.8        4.8          4.8          NA    
## 17 Western            5.7         4.1        7.3          5.7           2.26
ggplot(top, mapping = aes(x = mean_rating)) + 
  geom_histogram(binwidth = 0.5) + #histogram of the mean rating of each genre
  labs(
    title = "Distribution of Mean Ratings Across Genres",
    x = "Mean Rating",
    y = "Count"
  ) +
  theme_bw(base_size = 9)

Use ggplot to answer the following

  • Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?

A common assumption is that the number of Facebook likes reflects the popularity of the cast and should therefore translate into more movie-goers; however, this scatterplot shows some significant outliers and only a very slight positive correlation.

We have mapped gross to the x-axis and cast_facebook_likes to the y-axis.

movies %>%
  ggplot(mapping=aes(x=gross, y=cast_facebook_likes)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + #create line of best fit
  labs(
    title = "Scatterplot for Facebook Likes v Gross Earnings at Box Office",
    x = "Gross Earnings",
    y = "Facebook Likes received Cast"
  ) +
  theme_bw(base_size = 14)
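
To put a number on that "very slight positive correlation", we could also compute the Pearson correlation coefficient directly (a sketch; the value itself is not reproduced here):

cor(movies$cast_facebook_likes, movies$gross) #correlation between cast Facebook likes and gross earnings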

  • Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.

The scatterplot shows a positive correlation between budget and gross earnings, plausibly because a bigger budget buys better special effects, more marketing, and wider worldwide distribution, all of which attract a larger audience.

movies %>%
  ggplot(mapping=aes(x=budget, y=gross)) +
  geom_smooth(method = "lm", se = FALSE) +  #create line of best fit
  labs(
    title = "Scatterplot for Gross vs Budget",
    x = "Budget",
    y = "Gross"
  ) +
  geom_point() +
  theme_bw(base_size = 14)

  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?

For some of the genres, like action and adventure, high IMDB ratings correlate with higher gross earnings. This could be because action and adventure movies have higher mass appeal. However, for some genres like thriller and western, we don’t have enough data to make an educated statement about the relationship between the two variables.

The strange thing is that Documentary and Sci-Fi have downward-sloping lines of best fit, i.e. a slight negative correlation between rating and gross.

movies %>%
  ggplot(mapping=aes(x=rating, y=gross)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + #create line of best fit
  labs(
    title = "Scatterplot for Gross vs Ratings per Genre",
    x = "Ratings",
    y = "Gross"
  ) +
  facet_wrap(~genre) +
  theme_bw(base_size = 14)
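
To check the per-genre pattern numerically, including the downward slopes for Documentary and Sci-Fi, one could compute a grouped correlation (a sketch; genres with only one movie, such as Thriller, will return NA):

movies %>%
  group_by(genre) %>%
  summarise(
    n_movies = n(),
    cor_rating_gross = cor(rating, gross)) %>% #per-genre correlation between rating and gross
  arrange(cor_rating_gross)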

Returns of financial stocks

We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

nyse <- read_csv(here::here("data","nyse.csv"))

Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order

nyse1 <- nyse %>%
  group_by(sector) %>%
  summarise(count_comp = n()) %>% #number of companies per sector
  mutate(
    sector = fct_reorder(sector, count_comp, .desc = TRUE)) %>% #order sectors by no. of companies
  arrange(desc(count_comp))

ggplot(nyse1, aes(x=sector,y=count_comp)) +
  geom_col() +
  labs(
    title = "Frequency of Companies per Sector",
    x = "Sector",
    y = "Frequency of Companies"
  ) +
  theme_bw(base_size=14) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

nyse1  #print the table
## # A tibble: 12 × 2
##    sector                count_comp
##    <fct>                      <int>
##  1 Finance                       97
##  2 Consumer Services             79
##  3 Public Utilities              60
##  4 Capital Goods                 45
##  5 Health Care                   45
##  6 Energy                        42
##  7 Technology                    40
##  8 Basic Industries              39
##  9 Consumer Non-Durables         31
## 10 Miscellaneous                 12
## 11 Transportation                10
## 12 Consumer Durables              8

Next, let’s choose some stocks and their ticker symbols and download some data.

myStocks <- c("SPY","A","Y","CMG","EVRG","IT","PKX" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2022-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 20,545
## Columns: 8
## Groups: symbol [7]
## $ symbol   <chr> "SPY", "SPY", "SPY", "SPY", "SPY", "SPY", "SPY", "SPY", "SPY"…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 127, 127, 127, 128, 128, 127, 127, 128, 129, 128, 129, 129, 1…
## $ high     <dbl> 128, 127, 128, 128, 128, 127, 128, 129, 129, 129, 130, 130, 1…
## $ low      <dbl> 126, 126, 126, 127, 126, 126, 127, 127, 128, 128, 129, 128, 1…
## $ close    <dbl> 127, 127, 128, 127, 127, 127, 127, 129, 128, 129, 130, 128, 1…
## $ volume   <dbl> 1.39e+08, 1.37e+08, 1.34e+08, 1.23e+08, 1.56e+08, 1.22e+08, 1…
## $ adjusted <dbl> 102, 102, 102, 102, 102, 102, 102, 103, 103, 104, 104, 103, 1…

Financial performance analysis depends on returns; if I buy a stock today for $100 and sell it tomorrow for $101.75, my one-day return, assuming no transaction costs, is 1.75%. So, given the adjusted closing prices, our first step is to calculate daily and monthly returns.
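
As a tiny standalone illustration of that arithmetic (not part of the tidyquant pipeline below):

p0 <- 100      #price paid today
p1 <- 101.75   #price received tomorrow
(p1 - p0) / p0 #arithmetic (simple) return: 0.0175, i.e. 1.75%
log(p1 / p0)   #log return: ~0.0173, close to the simple return for small moves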

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns")  

#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns") 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns")

Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

myStocks_returns_monthly %>%
  group_by(symbol) %>%
  summarise(
    median_return = median(monthly_returns),
    min_return = min(monthly_returns), 
    max_return = max(monthly_returns),
    mean_return = mean(monthly_returns),
    sd_return = sd(monthly_returns))
## # A tibble: 7 × 6
##   symbol median_return min_return max_return mean_return sd_return
##   <chr>          <dbl>      <dbl>      <dbl>       <dbl>     <dbl>
## 1 A            0.0125      -0.175      0.216     0.0135     0.0697
## 2 CMG          0.0155      -0.231      0.343     0.0186     0.0949
## 3 EVRG         0.0130      -0.171      0.152     0.0117     0.0518
## 4 IT           0.0281      -0.230      0.266     0.0184     0.0763
## 5 PKX         -0.00484     -0.226      0.200    -0.00222    0.0845
## 6 SPY          0.0146      -0.125      0.127     0.0106     0.0404
## 7 Y            0.00623     -0.160      0.280     0.00901    0.0543

Plot a density plot, using geom_density(), for each of the stocks

ggplot(data = myStocks_returns_monthly,aes(monthly_returns)) +
  geom_density() +
  facet_wrap(~symbol) +
  labs(
    title = "Density Plot for the Choosen Stocks",
    x = NULL,
    y = NULL
  ) +
  theme_bw(base_size = 14)

What can you infer from this plot? Which stock is the riskiest? The least risky?

If we take the performance of SPY to represent the broader market, then the stocks A, CMG, PKX, and Y have lagged the broader market's monthly returns more often than not. However, returns are only part of the picture and we must not forget risk. In this chart, the dispersion of the distribution reflects the risk profile of the stock: the flatter and wider the density, the higher the risk, because it becomes harder to predict the stock's performance. By this measure, A, CMG, and PKX are probably the riskiest, and SPY is the least risky.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y-axis and the risk (standard deviation) on the X-axis. Please use ggrepel::geom_text_repel() to label each stock

myStocks_returns_monthly %>%
  group_by(symbol) %>%
  summarise(
    sd_return = sd(monthly_returns), 
    mean_return = mean(monthly_returns)
  ) %>%
  ggplot(aes(x = sd_return, y = mean_return, label = symbol)) + #risk on the x-axis, expected return on the y-axis
    geom_point() +
    ggrepel::geom_text_repel() + #add labels to the scatterplot points
  labs(
    title = "Risk vs Return Scatter Plot",
    x = "Risk (Standard Deviation of Monthly Returns)",
    y = "Expected Monthly Return (Mean)"
  ) +
  theme_bw(base_size = 14)

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

The x-axis of this graph represents risk (the standard deviation of monthly returns) and the y-axis represents the average monthly return. Most of the points follow the common perception that higher risk comes with higher reward. But of the 7 stocks, SPY has the lowest risk for a proportionally higher return, and there is one stock, PKX, which is riskier but does not have a higher expected return.

On your own: Spotify

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

The data dictionary can be found below

variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds

In this dataset, there are only 6 types of playlist_genre, but we can still perform some EDA on it.

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:

  1. What is the distribution of songs’ popularity (track_popularity)? Does it look like a Normal distribution?

No, the distribution of track_popularity does not look like a Normal distribution (see also the Q-Q plot sketch after the histogram below).

popularity <- spotify_songs['track_popularity'] #extract track popularity from the dataset

ggplot(popularity, aes(x=track_popularity))+
  geom_histogram()+
  labs(
    title= "Distribution of Popularity", 
    x="Popularity",
    y=NULL) +
  theme_bw(base_size = 14)
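
Another way to judge the Normality claim is a Q-Q plot against a theoretical Normal distribution (a sketch, assuming ggplot2 is already loaded):

ggplot(spotify_songs, aes(sample = track_popularity)) +
  stat_qq() +      #empirical quantiles vs theoretical Normal quantiles
  stat_qq_line() + #reference line; systematic departures suggest non-Normality
  labs(
    title = "Q-Q plot of track_popularity against a Normal distribution",
    x = "Theoretical quantiles",
    y = "Sample quantiles") +
  theme_bw(base_size = 14)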

  2. There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode. How are they distributed? Can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?

As shown below in the summary statistics of the 12 audio features, we can roughly guess that tempo, danceability, energy, and valence are the closest to a Normal distribution, because their means and medians are roughly equal and their quartiles are roughly symmetric around the median (a compact check is sketched after the summary output below).

features <- c('speechiness', 'acousticness', 'liveness', 'instrumentalness', 'duration_ms', 'energy', 'loudness', 'danceability', 'valence', 'tempo', 'key', 'mode') 
summary_spotify <- summary(spotify_songs[features])
summary_spotify
##   speechiness     acousticness      liveness     instrumentalness
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000   
##  1st Qu.:0.041   1st Qu.:0.015   1st Qu.:0.093   1st Qu.:0.000   
##  Median :0.062   Median :0.080   Median :0.127   Median :0.000   
##  Mean   :0.107   Mean   :0.175   Mean   :0.190   Mean   :0.085   
##  3rd Qu.:0.132   3rd Qu.:0.255   3rd Qu.:0.248   3rd Qu.:0.005   
##  Max.   :0.918   Max.   :0.994   Max.   :0.996   Max.   :0.994   
##   duration_ms         energy         loudness      danceability  
##  Min.   :  4000   Min.   :0.000   Min.   :-46.4   Min.   :0.000  
##  1st Qu.:187819   1st Qu.:0.581   1st Qu.: -8.2   1st Qu.:0.563  
##  Median :216000   Median :0.721   Median : -6.2   Median :0.672  
##  Mean   :225800   Mean   :0.699   Mean   : -6.7   Mean   :0.655  
##  3rd Qu.:253585   3rd Qu.:0.840   3rd Qu.: -4.6   3rd Qu.:0.761  
##  Max.   :517810   Max.   :1.000   Max.   :  1.3   Max.   :0.983  
##     valence          tempo          key             mode      
##  Min.   :0.000   Min.   :  0   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:0.331   1st Qu.:100   1st Qu.: 2.00   1st Qu.:0.000  
##  Median :0.512   Median :122   Median : 6.00   Median :1.000  
##  Mean   :0.511   Mean   :121   Mean   : 5.37   Mean   :0.566  
##  3rd Qu.:0.693   3rd Qu.:134   3rd Qu.: 9.00   3rd Qu.:1.000  
##  Max.   :0.991   Max.   :239   Max.   :11.00   Max.   :1.000
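
The same mean-vs-median heuristic can be tabulated in one step by reusing the features vector defined above; a large gap between mean and median suggests skew and therefore a shape further from Normal (a sketch, assuming the tidyverse, i.e. dplyr and tidyr, is loaded):

spotify_songs %>%
  select(all_of(features)) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  group_by(feature) %>%
  summarise(mean = mean(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            mean_median_gap = mean - median) %>% #a gap near 0 suggests a more symmetric distribution
  arrange(abs(mean_median_gap))
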
  3. Is there any relationship between valence and track_popularity? What about danceability and track_popularity?

The correlation between valence and track_popularity is 0.0332, and between danceability and track_popularity it is 0.0647, so neither pair shows any meaningful relationship.

ggplot(spotify_songs,aes(x=valence,y=track_popularity)) +
  geom_point() +
  labs(
    title="Valence and Popularity", 
    x="Valence", 
    y="Track popularity") +
  theme_bw(base_size = 14)

cor_vp <- cor(spotify_songs$valence,spotify_songs$track_popularity) #find coeff of correlation
cor_vp
## [1] 0.0332
ggplot(spotify_songs,aes(x=danceability,y=track_popularity)) +
  geom_point() +
  labs(
    title="Danceability and Popularity", 
    x="Danceability", 
    y="Track popularity") +
  theme_bw(base_size = 14)

cor_dt <- cor(spotify_songs$danceability,spotify_songs$track_popularity) #find coeff of correlation
cor_dt
## [1] 0.0647
  4. mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. Do songs written on a major scale have higher danceability compared to those on a minor scale? What about track_popularity?

According to this dataset, songs in a major key are slightly more popular (by both mean and median), whereas songs in a minor key are slightly more danceable; in both cases the difference is small. A more compact, grouped version of the same comparison is sketched after the output below.

major <- spotify_songs %>%
  filter(mode == 1) #keep only songs in a major key

major %>%
  summarise(major_mean_d = mean(danceability),
            major_mean_p = mean(track_popularity),
            major_median_d = median(danceability),
            major_median_p = median(track_popularity))
## # A tibble: 1 × 4
##   major_mean_d major_mean_p major_median_d major_median_p
##          <dbl>        <dbl>          <dbl>          <dbl>
## 1        0.647         42.7          0.663             46
minor <- spotify_songs %>%
  filter(mode == 0) #keep only songs in a minor key

minor %>%
  summarise(minor_mean_d = mean(danceability),
            minor_mean_p = mean(track_popularity),
            minor_median_d = median(danceability),
            minor_median_p = median(track_popularity))
## # A tibble: 1 × 4
##   minor_mean_d minor_mean_p minor_median_d minor_median_p
##          <dbl>        <dbl>          <dbl>          <dbl>
## 1        0.665         42.2           0.68             45
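
The same comparison can be written more compactly by grouping on mode instead of filtering twice (a sketch):

spotify_songs %>%
  group_by(mode) %>% #1 = major, 0 = minor
  summarise(
    mean_danceability = mean(danceability),
    median_danceability = median(danceability),
    mean_popularity = mean(track_popularity),
    median_popularity = median(track_popularity))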

Challenge 1: Replicating a chart

Create a graph that shows the cumulative % change for 0-, 1-, and 2-bed flats between 2000 and 2018 for the top twelve cities in the Bay Area, ranked by the number of ads that appeared on Craigslist.

library(scales)

top_12_cities <- rent %>% 
  group_by(city) %>%
  summarise(count_city = n()) %>% #number of listings per city
  arrange(desc(count_city)) %>%
  slice_head(n = 12)  #keep the top 12 cities by number of listings

rent_12_cities <- rent %>%
  filter(city %in% top_12_cities$city) 

medianPricePerCity <- rent_12_cities %>% 
  group_by(city, beds, year) %>% #aggregate per city, number of beds and year
  summarise(median_price = median(price)) %>%
  filter(beds < 3) %>% 
  arrange(city, beds, year) 

cum_top_12 <- medianPricePerCity %>% 
  group_by(city, beds) %>% 
  mutate(pct_change = median_price/lag(median_price)) %>% #price ratio relative to the previous year
  mutate(pct_change = ifelse(is.na(pct_change), 1, pct_change)) %>%  #the first year has no previous value, so set the ratio to 1 (100%)
  mutate(cumulative_change = cumprod(pct_change)) #cumulative % change since the first year
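
Because the year-on-year ratios telescope when multiplied together, the same cumulative change can be computed directly against the first year's median price, without the lag()/cumprod() step (a sketch using the objects defined above):

cum_top_12_alt <- medianPricePerCity %>% 
  group_by(city, beds) %>% 
  mutate(cumulative_change = median_price / first(median_price)) #ratio of each year's median price to the first observed year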

ggplot(cum_top_12,aes(x = year, y = cumulative_change,color= city)) + 
  geom_line() + 
  facet_grid(beds ~ city,scales= "free_y") + #scale of y is not fixed
  theme_bw(base_size = 14)  +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 90)) + 
  scale_y_continuous(labels = scales::percent_format(scale = 100))  + #y-axis in % format
  scale_x_continuous(breaks = seq(2005, 2018, by = 5)) + 
  labs(title = "Cumulative % change in 0,1, and 2-bed rentals in Bay Area", 
       subtitle = "2000-2018")

Challenge 2: 2016 California Contributors plots

Reproduce the plot that shows the top ten cities with the highest amounts raised in political contributions in California during the 2016 US Presidential election.

library(patchwork)

CA_contributors_2016 <- vroom::vroom(here::here("data","CA_contributors_2016.csv")) #reading the csv file
CA_contributors <- CA_contributors_2016 %>%
                       mutate(zip=as.character(zip)) #changing zip data type to character

code <- vroom::vroom(here::here("data","zip_code_database.csv"))

final_data <- left_join(CA_contributors, code, by = "zip")

final_hillary <- final_data %>%
                    filter(cand_nm == "Clinton, Hillary Rodham") %>%
                    group_by(cand_nm, primary_city) %>%
                    summarise(total_amt = sum(contb_receipt_amt)) %>%
                    slice_max(order_by = total_amt, n=10) %>% #derive  top 10 cities
                    mutate(primary_city=fct_reorder(primary_city, total_amt)) %>% #reorder city by total amount
  ggplot(
    mapping=aes(x = total_amt, y=primary_city)) + 
  geom_col(fill="blue") + 
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(
    title = "Clinton, Hillary Rodham"
  )

final_trump <- final_data %>%
                    filter(cand_nm == "Trump, Donald J.") %>%
                    group_by(cand_nm, primary_city) %>%
                    summarise(total_amt = sum(contb_receipt_amt)) %>%
                    slice_max(order_by = total_amt, n=10) %>% #derive  top 10 cities
                    mutate(primary_city=fct_reorder(primary_city, total_amt)) %>% #reorder city by total amount
  
  ggplot(
    mapping=aes(x = total_amt, y=primary_city)) + 
  geom_col(fill="red") +
  scale_x_continuous(labels = scales::dollar_format()) + #x-axis is in $ units
  labs(
    title = "Trump, Donald J."
  ) 


final_hillary + theme_bw(base_size = 10) + labs(x = "Amount Raised", y = "Primary City") +
  final_trump + theme_bw(base_size = 10) + labs(x = "Amount Raised", y = "Primary City") #combine the two plots

library(tidytext)
final_data
## # A tibble: 1,292,843 × 19
##    cand_nm   contb…¹ zip   contb_date type  prima…² accep…³ unacc…⁴ state county
##    <chr>       <dbl> <chr> <date>     <chr> <chr>   <chr>   <chr>   <chr> <chr> 
##  1 Clinton,…    50   94939 2016-04-26 STAN… Larksp… <NA>    <NA>    CA    Marin…
##  2 Clinton,…   200   93428 2016-04-20 STAN… Cambria <NA>    <NA>    CA    San L…
##  3 Clinton,…     5   92337 2016-04-02 STAN… Fontana <NA>    <NA>    CA    San B…
##  4 Trump, D…    48.3 95334 2016-11-21 STAN… Living… <NA>    <NA>    CA    Merce…
##  5 Sanders,…    40   93011 2016-03-04 PO B… Camari… <NA>    <NA>    CA    Ventu…
##  6 Trump, D…   244.  95826 2016-11-24 STAN… Sacram… <NA>    Walsh … CA    Sacra…
##  7 Sanders,…    35   90278 2016-03-05 STAN… Redond… <NA>    <NA>    CA    Los A…
##  8 Sanders,…   100   90278 2016-03-06 STAN… Redond… <NA>    <NA>    CA    Los A…
##  9 Sanders,…    25   92084 2016-03-04 STAN… Vista   <NA>    <NA>    CA    San D…
## 10 Clinton,…    40   92637 2016-04-20 STAN… Laguna… Laguna… Laguna… CA    Orang…
## # … with 1,292,833 more rows, 9 more variables: timezone <chr>,
## #   area_codes <dbl>, latitude <dbl>, longitude <dbl>, world_region <chr>,
## #   country <chr>, decommissioned <dbl>, estimated_population <dbl>,
## #   notes <chr>, and abbreviated variable names ¹​contb_receipt_amt,
## #   ²​primary_city, ³​acceptable_cities, ⁴​unacceptable_cities
top_ten <- final_data %>%
             group_by(cand_nm) %>%
             summarise(total_amt = sum(contb_receipt_amt)) %>%
             slice_max(order_by = total_amt, n=10) #top 10 candidates by total amount raised

top_10_plot <- final_data %>%
                  filter(cand_nm %in% top_ten$cand_nm) %>% #keep only the top 10 candidates
                  group_by(cand_nm, primary_city) %>%
                  summarise(total_amt = sum(contb_receipt_amt)) %>%
                  arrange(cand_nm, desc(total_amt)) %>% #order cities by total amount for each candidate 
                  top_n(10, total_amt) %>% 
                  mutate(primary_city = reorder_within(primary_city, total_amt, cand_nm)) %>% #order the top 10 cities by amount within each candidate
  ggplot(mapping = aes(x = total_amt, y = primary_city)) + 
  geom_col() + 
  facet_wrap(~cand_nm, scales = "free", ncol = 3) + 
  scale_y_reordered() +
  theme_bw(base_size = 8.5) +
  labs(
    title = "Top 10 Cities for the Top Candidates",
    x = "Amount Raised",
    y = "City"
  )

top_10_plot

Details

  • Who did you collaborate with: Group 6 - Sonakshi Gupta, Drishti Goyal, Jean Francois Peters, Wybe Harmes, Suzy Wang, Zezhou Tang
  • Approximately how much time did you spend on this problem set: 7 hrs 30 min 20 sec
  • What, if anything, gave you the most trouble: The challenge questions were tougher, so we turned to our best friends, Google and StackR, to find the code we needed, which took some extra time.