---
title: "The Race to Immortality: An analysis of variables impacting average life expectancy"
author: "Natalie LePera"
date: "West Chester University <br>STA 503: Data Visualization"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: true
    theme: readable
---

```{=html}
<style type="text/css">

div#TOC li {
    list-style:none;
    background-color:lightgray;
    background-image:none;
    background-repeat:no-repeat;
    background-position:0;
    font-family: Arial, Helvetica, sans-serif;
    color: #780c0c;
}

/* mouse over link */
div#TOC a:hover {
  color: red;
}

/* unvisited link */
div#TOC a:link {
  color: blue;
}



h1.title {
  font-size: 24px;
  color: Darkblue;
  text-align: center;
  font-family: Arial, Helvetica, sans-serif;
  font-variant-caps: normal;
}
h4.author { 
    font-size: 18px;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}
h4.date { 
  font-size: 18px;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 {
    font-size: 24px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 {
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { 
    font-size: 15px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and date headers use this too  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* unvisited link */
a:link {
  color: green;
}

/* visited link */
a:visited {
  color: green;
}

/* mouse over link */
a:hover {
  color: red;
}

/* selected link */
a:active {
  color: yellow;
}

</style>
```
```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
options(repos = list(CRAN="http://cran.rstudio.com/"))
if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("cowplot")) {
   install.packages("cowplot")
   library(cowplot)
}
if (!require("latex2exp")) {
   install.packages("latex2exp")
   library(latex2exp)
}
if (!require("plotly")) {
   install.packages("plotly")
   library(plotly)
}
if (!require("gapminder")) {
   install.packages("gapminder")
   library(gapminder)
}
if (!require("png")) {
    install.packages("png")             # Install png package
    library("png")
}
if (!require("RCurl")) {
    install.packages("RCurl")           # Install RCurl package
    library("RCurl")
}
if (!require("colourpicker")) {
    install.packages("colourpicker")              
    library("colourpicker")
}
if (!require("gifski")) {
    install.packages("gifski")              
    library("gifski")
}
if (!require("magick")) {
    install.packages("magick")              
    library("magick")
}
if (!require("grDevices")) {
    install.packages("grDevices")              
    library("grDevices")
}
### ggplot and extensions
if (!require("ggplot2")) {
    install.packages("ggplot2")              
    library("ggplot2")
}
if (!require("gganimate")) {
    install.packages("gganimate")              
    library("gganimate")
}
if (!require("ggridges")) {
    install.packages("ggridges")              
    library("ggridges")
}
if (!require("graphics")) {
    install.packages("graphics")              
    library("graphics")
}
if (!require("openxlsx")) {
    install.packages("openxlsx")              
    library("openxlsx")
}

knitr::opts_chunk$set(echo = TRUE,       
                      warning = FALSE,   
                      results = "markup",   
                      message = FALSE,
                      comment = NA)
```
\

# Introduction to the data
To analyze the World Life Expectancy data, four data sets were downloaded and read into R using RStudio. The following data sets were used for the analysis:

-   Income per person
-   Life expectancy per year
-   Population Size
-   Country Regions

Each data set includes information that could potentially impact life expectancy throughout the world. The public domain data spans 1800 to 2018 and includes variables that are both relevant and irrelevant to this analysis.

```{r}
countries.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/countries_total.csv")
income.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/income_per_person.csv")
life.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/life_expectancy_years.csv")
popu.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/population_total.csv")

```

All four data sets share one common variable: <b>country name</b>.  Additionally, three of the four data sets share a second common variable, <b>year</b>. 
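
In the three longitudinal files the years are stored as column headers rather than as values. A minimal sketch, using a made-up CSV string rather than the real files, of how `read.csv()` treats such headers by default (which is why the year filter later in this report matches on `"X2000"`):

```{r}
# Toy CSV text (made up for illustration) with year-style column headers
toy_csv <- "geo,1800,1801\nAland,10,11\nBorduria,20,NA"

# Default behaviour: check.names = TRUE turns "1800" into "X1800"
toy_default <- read.csv(text = toy_csv)
names(toy_default)

# check.names = FALSE keeps the headers exactly as written in the file
toy_asis <- read.csv(text = toy_csv, check.names = FALSE)
names(toy_asis)
```

The object names `toy_csv`, `toy_default`, and `toy_asis` are hypothetical and are not used elsewhere in the analysis.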

<br>


# Preparing the Data for a Proper Merge

Once the raw data was loaded into R, variables outside the scope of interest needed to be removed from each of the four data frames, and the longitudinal data frames needed to be reshaped. 
<br>


## Income Per Person Data

Starting with <u>Income per person</u> data, the data frame required reshaping to allow for proper merge of all four data sets and subsequent analysis.

<br>The data frame of raw income data (named <font color = "purple"><i>income.raw</i></font>) was reshaped from wide to long format by gathering the year columns into a single year variable.  The country column was excluded from the gather so that it remains as an identifier, giving one row per country per year.  

<br>This process transformed the raw income data of <font color = "red"><b>193 observations</b></font> of <font color = "red"><b>220 variables</b></font> into the reshaped data frame of income data (named <font color = "purple"><i>income</i></font>) that includes <font color = "red"><b>42267 observations</b></font> of <font color = "red"><b>3 variables</b></font>. 

<br><b>All missing (NA) values were omitted from the reshaped income data frame by setting na.rm = TRUE.</b>

```{r}
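# Gather every year column into Year/Income pairs; -geo keeps the country
# column as an identifier, and na.rm = TRUE drops rows with a missing value
income <- income.raw %>% 
  gather(key = "Year",
         value = "Income",
         - geo,
         na.rm = TRUE)
colnames(income) <- c("Country", "Year", "Income")
# Preview the first rows as a tibble
as_tibble(income)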
```

```{r echo = TRUE}
str(income)
```
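
The reshaping above leaves the year as text with an `X` prefix. As a hedged alternative sketch (not the method used in this report), `pivot_longer()` from tidyr can do the same wide-to-long reshape while stripping the prefix and storing the year as an integer. The toy values and the name `toy_wide` are made up for illustration:

```{r}
library(tidyr)

# Toy wide table (values made up) shaped like the raw files: one row per
# country, one column per year
toy_wide <- data.frame(
  geo   = c("A", "B"),
  X1800 = c(603, 667),
  X1801 = c(NA, 668)
)

toy_long <- pivot_longer(
  toy_wide,
  cols            = -geo,                      # keep the country column as an identifier
  names_to        = "Year",
  names_prefix    = "X",                       # strip the prefix added by read.csv()
  names_transform = list(Year = as.integer),   # store the year as a number
  values_to       = "Income",
  values_drop_na  = TRUE                       # same role as na.rm = TRUE in gather()
)
toy_long
```

If this approach were adopted, any later comparison on the year would need to use a number (for example `Year == 2000`) instead of the `"X2000"` string. Row order also differs from `gather()`, which does not affect the merges.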
<br>


## Life Expectancy Data

Second, <u>Life Expectancy</u> data required reshaping to appropriately merge the data sets for analysis.  

<br>The data frame of raw life expectancy data (named <font color = "purple"><i>life.raw</i></font>) was reshaped from wide to long format by gathering the year columns into a single year variable.  The country column was excluded from the gather so that it remains as an identifier, giving one row per country per year.  

<br>This process transformed the raw life expectancy data of <font color = "red"><b>187 observations</b></font> of <font color = "red"><b>220 variables</b></font> into the reshaped data frame of life expectancy data (named <font color = "purple"><i>life</i></font>) that includes <font color = "red"><b>40437 observations</b></font> of <font color = "red"><b>3 variables</b></font>. 

<br><b>All missing (NA) values were omitted from the reshaped life expectancy data frame by setting na.rm = TRUE.</b>

```{r}
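# Gather every year column into Year/Life_Exp pairs; -geo keeps the country
# column as an identifier, and na.rm = TRUE drops rows with a missing value
life <- life.raw %>% 
  gather(key = "Year",
         value = "Life_Exp",
         - geo,
         na.rm = TRUE)
colnames(life) <- c("Country", "Year", "Life_Exp")
# Preview the first rows as a tibble
as_tibble(life)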
```

```{r echo=TRUE}
str(life)
```
<br>


## Population Data

Third, the <i>Population Data</i> required reshaping as well to appropriately merge the data sets for analysis. 

<br>The data frame of raw population data (named <font color = "purple"><i>popu.raw</i></font>) was reshaped from wide to long format by gathering the year columns into a single year variable.  The country column was excluded from the gather so that it remains as an identifier, giving one row per country per year.  

<br>This process transformed the raw population data of <font color = "red"><b>195 observations</b></font> of <font color = "red"><b>220 variables</b></font> into the reshaped data frame of population data (named <font color = "purple"><i>popu</i></font>) that includes <font color = "red"><b>42705 observations</b></font> of <font color = "red"><b>3 variables</b></font>. 

<br><b>All missing (NA) values were omitted from the reshaped population data frame by setting na.rm = TRUE.</b>

```{r}
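# Gather every year column into Year/Population pairs; -geo keeps the country
# column as an identifier, and na.rm = TRUE drops rows with a missing value
popu <- popu.raw %>% 
  gather(key = "Year",
         value = "Population",
         - geo,
         na.rm = TRUE)
colnames(popu) <- c("Country", "Year", "Population")
# Preview the first rows as a tibble
as_tibble(popu)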
```

```{r echo=TRUE}
str(popu)
```
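
The income, life expectancy, and population data frames are all reshaped with the same three steps. A minimal sketch of how those steps could be wrapped in one function; `reshape_long()` is a hypothetical helper that is not part of the original analysis, and the toy values are made up:

```{r}
library(tidyr)

# Hypothetical helper: gather every year column of a wide table (keeping the
# geo column as an identifier) and label the resulting value column
reshape_long <- function(raw, value_name) {
  long <- gather(raw, key = "Year", value = "Value", -geo, na.rm = TRUE)
  colnames(long) <- c("Country", "Year", value_name)
  long
}

# Toy wide table with one missing value
toy_raw  <- data.frame(geo = c("A", "B"), X1800 = c(1, NA), X1801 = c(3, 4))
toy_popu <- reshape_long(toy_raw, "Population")
toy_popu
```

Under this assumption, each of the three reshapes above would reduce to a single call such as `reshape_long(popu.raw, "Population")`.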
<br>


## Country Data

Fourth, the <i>Country Data</i> required subsetting to merge cleanly with the other data sets for analysis. 

<br>The data frame of raw country data (named <font color = "purple"><i>countries.raw</i></font>) was subset to remove all variables except for the country name and region, which were renamed Country and Continent.    

<br>This process transformed the raw country data of <font color = "red"><b>248 observations</b></font> of <font color = "red"><b>11 variables</b></font> into the subset data frame of country data (named <font color = "purple"><i>countries</i></font>) that includes <font color = "red"><b>248 observations</b></font> of <font color = "red"><b>2 variables</b></font>. 


```{r}
countries <- subset(countries.raw, select = c(name, region))

colnames(countries) <- c("Country", "Continent")
```


```{r echo=TRUE}
str(countries)
```
<br><br>


# Merging the Data
With all four data sets (<u>Income Per Person Data</u>, <u>Life Expectancy Data</u>, <u>Population Data</u>, and <u>Country Data</u>) reshaped for a clean merge, the final data set was merged in multiple steps to ensure accuracy. 

<br>


## Merging Longitudinal Data Part 1

First, the following longitudinal data frames were merged into a single data frame (named <font color = "purple"><i>LifeExpIncom</i></font>):

-   Life Expectancy Data (reshaped)
-   Income Per Person Data (reshaped)


```{r}
# Keep only the Country/Year pairs present in both data frames
LifeExpIncom <- inner_join(life, income, by = c("Country", "Year"))
# Preview the first rows as a tibble
as_tibble(LifeExpIncom)
```

Because the merge uses both "Country" and "Year" as keys, the resulting data frame <font color = "purple"><i>LifeExpIncom</i></font> includes a single Country variable and a single Year variable, even though both merged data frames contain these variables. 

<br><b>Only Country and Year combinations present in both data frames were retained, because inner_join() drops rows without a match.</b>

```{r echo=TRUE}
str(LifeExpIncom)
```
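
To make the effect of the join type concrete, here is a small sketch on made-up toy data (the `toy_` names and values are assumptions for illustration only). It contrasts `inner_join()`, as used above, with `left_join()`:

```{r}
library(dplyr)

toy_life   <- data.frame(Country = c("A", "B"), Year = "X1800", Life_Exp = c(30, 35))
toy_income <- data.frame(Country = c("A", "C"), Year = "X1800", Income = c(600, 700))

# inner_join(): only the Country/Year pairs found in both tables survive
toy_inner <- inner_join(toy_life, toy_income, by = c("Country", "Year"))

# left_join(): every row of the first table is kept; missing matches become NA
toy_left <- left_join(toy_life, toy_income, by = c("Country", "Year"))

nrow(toy_inner)
nrow(toy_left)
```

Country "B" has no income record and country "C" has no life expectancy record, so the inner join keeps only country "A", while the left join keeps "B" with a missing income.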

<br>


## Merging Longitudinal Data Part 2
Second, the following longitudinal data frames were merged into a single data frame (named <font color = "purple"><i>LifeIncomPopu</i></font>):

-   Life Expectancy and Income Data (merged data from above)
-   Population Data (reshaped)


```{r}
LifeIncomPopu <- inner_join( LifeExpIncom, popu, by = c("Country", "Year"))
```

Because the merge uses both "Country" and "Year" as keys, the resulting data frame <font color = "purple"><i>LifeIncomPopu</i></font> includes a single Country variable and a single Year variable, even though both merged data frames contain these variables. 

<br><b>Only Country and Year combinations present in both data frames were retained, because inner_join() drops rows without a match.</b>

```{r echo=TRUE}
str(LifeIncomPopu)
```
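
The two longitudinal merges use the same keys, so they could also be expressed as one fold over a list of data frames. A sketch under that assumption, using base R's `Reduce()` and made-up toy tables (this is an alternative formulation, not the approach taken in this report):

```{r}
library(dplyr)

# Toy long tables (values made up) that share the Country and Year keys
toy_l <- data.frame(Country = c("A", "B"), Year = "X1800", Life_Exp = c(30, 35))
toy_i <- data.frame(Country = c("A", "B"), Year = "X1800", Income = c(600, 700))
toy_p <- data.frame(Country = c("A", "B"), Year = "X1800", Population = c(1000, 2000))

# Fold inner_join() over the list, two tables at a time, on the same keys
toy_all <- Reduce(
  function(x, y) inner_join(x, y, by = c("Country", "Year")),
  list(toy_l, toy_i, toy_p)
)
str(toy_all)
```

Merging step by step, as done above, has the advantage that the row count can be inspected after each join.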

<br>


## Merging Longitudinal Data with Categorical Data

Once all longitudinal data had been merged into a single data frame, it was merged with the categorical country data into a final data frame (named <font color = "purple"><i>LifeExp_Clean</i></font>):

-   Life Expectancy, Income, and Population Data (merged data from above)
-   Country Data (reshaped)

```{r}
LifeExp_Clean <- inner_join(LifeIncomPopu, countries, by = "Country")
```

Because the merge uses "Country" as the key, the resulting data frame <font color = "purple"><i>LifeExp_Clean</i></font> includes a single Country variable, even though both merged data frames contain this variable. 

<br>This process transformed the merged Income, Life Expectancy, and Population data of <font color = "red"><b>40437 observations</b></font> of <font color = "red"><b>5 variables</b></font> into the data frame of merged Income, Life Expectancy, Population, and Country Data (named <font color = "purple"><i>LifeExp_Clean</i></font>) that includes <font color = "red"><b>37590 observations</b></font> of <font color = "red"><b>6 variables</b></font>.

<br><b>The number of observations drops because inner_join() retains only countries whose names match exactly in both data frames.</b>

```{r echo=TRUE}
str(LifeExp_Clean)
```
```{r echo = TRUE, eval = FALSE}
write.xlsx(LifeExp_Clean, file = "C:\\Users\\natal\\Downloads\\LifeExp_Clean.xlsx")
```
A copy of the data file written above can be accessed at the following GitHub <a href="https://nlepera.github.io/sta553/w05_ggplot/LifeExp_Clean.xlsx">link</a>.
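
Observations are lost in the final merge whenever a country name in the longitudinal data has no exact match in the country data. A minimal sketch of how such names could be listed with `anti_join()`, shown on made-up toy tables (the `toy_` names and values are assumptions, not results from the real data):

```{r}
library(dplyr)

toy_merged    <- data.frame(Country = c("A", "B", "B"),
                            Year    = c("X1800", "X1800", "X1801"))
toy_countries <- data.frame(Country   = c("A", "C"),
                            Continent = c("Asia", "Europe"))

# Rows of the longitudinal table whose Country has no match in the lookup table
toy_unmatched <- anti_join(toy_merged, toy_countries, by = "Country")

# One line per unmatched country, with the number of rows that would be dropped
count(toy_unmatched, Country)
```

Applied to the real data frames, the analogous call would be `anti_join(LifeIncomPopu, countries, by = "Country")`, which would show which country names need to be reconciled before merging.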

<br><br>


# Pruning the Data for Visualization Purposes
To visualize data of this scale clearly, a subset for the year 2000 was taken (named <font color = "purple"><i>data2000</i></font>).

```{r}
data2000 <- filter(LifeExp_Clean, Year == "X2000")
```
```{r echo=TRUE}
str(data2000)
```
```{r echo = TRUE, eval = FALSE}
write.xlsx(data2000, file = "C:\\Users\\natal\\Downloads\\data2000.xlsx")
```
A copy of the data file written above can be accessed at the following GitHub <a href="https://nlepera.github.io/sta553/w05_ggplot/data2000.xlsx">link</a>.

<br><br>


# Plotting the Data

The prepared and pruned data in <font color = "purple"><i>data2000</i></font> is plotted below. 

```{r}
ggplot(data2000)+
  aes(x = Income, y = Life_Exp, color = Continent)+
  # alpha is a fixed setting, so it sits outside aes() and creates no legend
  geom_point(aes(size = Population), alpha = 0.5)+
  geom_smooth(method ="lm", fill = NA)+
  scale_color_manual(values=c('#648FFF','#785EF0', '#DC267F', '#FE6100', '#FFB000'))+
  labs(x="Income Per Person ($/person/year)", y="Life Expectancy (years)",
       title = "Impact of Income on Average Life Expectancy",
       subtitle = "Controlled for both region (Continent) and population size (Population)",
       color="Continent", size="Population")

```

As illustrated above, there is a positive correlation between average income per person and average life expectancy for all continents.  Continents with a greater range of average incomes per person appear to show a stronger positive relationship between income and life expectancy.  The income range of each continent can be read from the length of its linear regression line: lines that extend to greater x values indicate a wider income range for that continent. Population size is more difficult to assess from this plot, but for Asia there appears to be a minor negative correlation between population size and life expectancy, as well as between population size and income per person.  This pattern is suggested by the clustering of larger points toward the bottom left of the plot (lower values for both income and life expectancy).  
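
The relationships described above are read from the plot by eye. A hedged sketch of how they could be quantified per continent with `cor()`, shown here on made-up toy values rather than on `data2000` (the `toy_` names and all numbers are assumptions for illustration):

```{r}
library(dplyr)

# Toy values (made up for illustration, not taken from data2000)
toy_2000 <- data.frame(
  Continent = rep(c("Asia", "Europe"), each = 4),
  Income    = c(1000, 2000, 4000, 8000, 10000, 20000, 30000, 40000),
  Life_Exp  = c(50, 58, 63, 70, 72, 75, 77, 80)
)

toy_cor <- toy_2000 %>%
  group_by(Continent) %>%
  summarise(
    n              = n(),
    cor_income     = cor(Income, Life_Exp),          # income on a linear scale
    cor_log_income = cor(log10(Income), Life_Exp)    # income on a log scale
  )
toy_cor
```

Running the same summary on `data2000` would give one correlation per continent to set against the visual impression, and adding `Population` to the `cor()` calls would do the same for the population pattern noted for Asia.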

\