Monday, May 4, 2015

Assignment 5: Regression Analysis

Part 1: Crime Rates and Lunches

y=21.819+1.685x

Null hypothesis:
There is no linear relationship between crime rates and lunches.
Alternative hypothesis: There is a linear relationship between crime rates and lunches.
Reject the null hypothesis at a significance level of .05 (95% confidence level). If crime rates (X) were at 79.7, free lunches (Y) would be 2,930.

The data suggest that the two variables have a linear relationship, but the study does not mention why people are receiving free lunches. People who receive free lunches do not have enough money to pay for them; they have most likely fallen below the income threshold that qualifies them for a free lunch. If a school is giving out more free lunches, it has more people who cannot pay for their food. One cannot automatically assume that receiving free lunches causes people to break laws. It would be interesting to compare household income data with crime rates directly, because the number of free lunches is an indirect way of looking at income.


Part 2:

Introduction

The purpose of the assignment is to spatially analyze enrollment data from the UW System. Two UW schools, Eau Claire and Madison, were chosen for the analysis. Although the reasons an individual attends a given college are endless, this analysis looks at overall trends based on population, household income, and number of bachelor's degrees.

Methods

Data were obtained for Wisconsin counties, including the number of bachelor's degrees, population normalized by distance to the university, and household income. The data were opened in SPSS and run through a linear regression. Regression analysis is a statistical tool for investigating the relationship between two variables. It predicts the value of one variable from another, although a significant relationship does not by itself prove causation. The two variables are the independent and dependent variables. The independent variable is plotted on the horizontal x-axis and is what explains the dependent variable. The dependent variable is plotted on the vertical y-axis and is what is explained by the independent variable.

For the analysis, enrollment at Eau Claire and enrollment at Madison are the dependent variables. Each is run through three separate linear regressions, one for each independent variable: the number of bachelor's degrees, population normalized by distance to the university, and household income. Regressions with a significant linear relationship were run again to save the standardized residuals. A residual is the deviation of a point from the line of best fit: the difference between the actual and predicted value of y. The residuals were then opened in ArcMap and mapped using natural breaks.
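To make the regression and residual steps concrete, here is a minimal sketch in Python of what a simple linear regression computes. It is not the SPSS procedure used above: the function names and the five data points are made up for illustration, and dividing each residual by the standard error of the estimate is one common way to standardize residuals that is assumed here.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def regression_summary(xs, ys):
    """R Square, standard error of the estimate, and residuals."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Residual = actual y minus the y predicted by the line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)
    sst = sum((y - mean_y) ** 2 for y in ys)
    std_error = sqrt(sse / (len(xs) - 2))
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1 - sse / sst,
        "std_error": std_error,
        # Assumed convention: residual divided by the standard error.
        "standardized_residuals": [r / std_error for r in residuals],
    }

# Hypothetical data, not the county data from the assignment.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
summary = regression_summary(xs, ys)
print(summary["slope"], summary["r_squared"])
```

A large positive standardized residual marks a county with more enrollment than the line predicts, which is how outliers such as a university's home county would stand out on the map.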


Results

Null hypothesis: There is no linear relationship between percent of bachelor degrees and Eau Claire enrollment.

Alternative hypothesis: There is a linear relationship between percent of bachelor degrees and Eau Claire enrollment.

Reject the null hypothesis with a significance of .003. The R Square, however, shows that the linear relationship is weak. The standard error of the estimate is 209.611, which is very high and suggests that there are outliers. The map shows that the biggest outlier is Eau Claire County. This could be because Eau Claire is a regional university: many of its students come from Eau Claire County itself.


Null hypothesis: There is no linear relationship between population by county and Eau Claire enrollment.

Alternative hypothesis: There is a linear relationship between population by county and Eau Claire enrollment.

Reject the null hypothesis with a significance level of .000. The R Square value shows that there is a strong linear relationship. The map shows mostly flat residual values, with a couple of counties that have more enrollment than expected.

Null hypothesis: There is no linear relationship between percent of bachelor degrees and Madison enrollment.

Alternative hypothesis: There is a linear relationship between percent of bachelor degrees and Madison enrollment.

Reject the null hypothesis with a significance level of .000. The R Square value of .363 shows that there is a weak linear relationship. The map shows Dane County as another very large outlier, because many people from the area attend the school.



Null hypothesis: There is no linear relationship between Household Income and Madison enrollment.

Alternative hypothesis: There is a linear relationship between household income and Madison enrollment.

Reject the null hypothesis with a significance value of .001. The R Square value is very weak at only .154, and the standard error of the estimate is high at 810.123. The map shows Dane County as a large positive outlier and the counties around it as negative outliers. This means that more people in Dane County attend Madison than their income levels predict.

Null hypothesis: There is no linear relationship between Population and Madison enrollment.

Alternative hypothesis: There is a linear relationship between population and Madison enrollment.

Reject the null hypothesis with a significance value of .000. The R Square value of .902 shows a strong positive linear relationship.

Conclusion:

The strongest predictor of both Madison and Eau Claire enrollment was population normalized by distance. All other variables were significant but had weak linear relationships because they had many outliers. It is interesting that the strongest variable is also the only normalized one. Proximity to the school plays a large role in who attends: the UW System has created many regional universities, and they have become quite popular in their regions.


Friday, April 10, 2015

Assignment 4

Part 1



Null Hypothesis: Distance and sound level are not correlated.
Alternative Hypothesis: Distance and sound are correlated.

Reject the null hypothesis because the significance is .000, which is less than .05 at a 95% confidence level.

Distance and sound level have a strong negative correlation, and the points sit close to the best fit line.
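As a rough illustration of what the correlation coefficient measures, the sketch below computes Pearson's r in plain Python. The distance and sound readings are hypothetical values chosen to fall as distance grows; they are not the assignment's data, and the function name is made up for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariation scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical readings: sound level drops as distance increases.
distance = [10, 20, 30, 40, 50]
sound_level = [92, 85, 80, 71, 66]

r = pearson_r(distance, sound_level)
print(round(r, 3))  # close to -1: a strong negative correlation
```

A value near -1 means the points lie close to a downward-sloping line, which matches the tight scatter described above.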

Part 2
Some of the patterns I noticed involve bachelor's degrees. The percent with a bachelor's degree is negatively correlated with percent black, percent with no high school, and percent below poverty. It is positively correlated with percent white, which is the only one of these variables with a strong positive correlation. Put the other way around, where the percent Hispanic, percent black, or percent below poverty is high, the correlation with bachelor's degrees is negative.

There is also a racial divide: no race has a positive correlation with another race. This means that

Part 3
Introduction

The Texas Election Commission is interested in an analysis comparing the 1980 and 2008 presidential elections by county. They have provided all of the data necessary for both years, including voter turnout, percent Democratic vote, and percent Hispanic population.

The purpose of this study is to determine whether the data show spatial autocorrelation. If there is clustering, where does it occur and how do the clusters relate?

Methods

A shapefile of Texas counties was obtained from the American FactFinder. Hispanic county data was also obtained and added to the Texas data sheet. This data sheet was joined to the Texas counties and exported as a shapefile. 

The data were then ready to be opened in GeoDa to perform spatial autocorrelation tests. Spatial autocorrelation is the correlation of a variable with itself through space. First, a Moran's I test was performed on each variable.



Moran's I is a spatial autocorrelation test that compares the value of a variable at each location with the values at other locations, weighted by their spatial relationship. The Moran's I scatterplot has four quadrants of comparison: high-high, low-low, high-low, and low-high.
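The calculation behind the global Moran's I can be sketched in a few lines of Python. This is only an illustration, not what GeoDa does internally: it assumes simple binary weights (1 for neighbors, 0 otherwise) for four hypothetical locations in a row, whereas GeoDa builds its weights from the county shapefile, so the numbers are not comparable to the results reported here.

```python
def morans_i(values, weights):
    """Global Moran's I for values and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    # Cross-products of deviations, counted only where locations are linked.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))

def chain_weights(n):
    """Assumed binary weights: each location neighbors the next in a row."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)]
            for i in range(n)]

w = chain_weights(4)
clustered = morans_i([1, 2, 3, 4], w)    # similar values sit together
alternating = morans_i([1, 4, 1, 4], w)  # neighbors are dissimilar
print(clustered, alternating)
```

Positive values indicate that similar values cluster together, negative values indicate that neighbors tend to differ, and values near zero indicate no spatial pattern.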

Next, Local Indicators of Spatial Autocorrelation (LISA) maps were made for each variable. These maps show where the spatial autocorrelation occurs: they use spatial weights to identify significant clusters and display them on a map.
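The high-high and low-low labels on a LISA map come from comparing each location with the average of its neighbors. The Python sketch below shows only that labeling step under simplifying assumptions: the four values and the neighbor lists are hypothetical, and the permutation test that GeoDa uses to decide which clusters are significant is left out.

```python
def lisa_quadrants(values, neighbors):
    """Label each location by its own value and its neighbors' average,
    both measured relative to the overall mean."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    labels = []
    for i, nbrs in enumerate(neighbors):
        # Spatial lag: average deviation of the neighboring locations.
        lag = sum(dev[j] for j in nbrs) / len(nbrs)
        own = "high" if dev[i] > 0 else "low"
        nearby = "high" if lag > 0 else "low"
        labels.append(f"{own}-{nearby}")
    return labels

# Hypothetical percentages for four locations in a row.
percent = [10, 20, 60, 80]
neighbors = [[1], [0, 2], [1, 3], [2]]

labels = lisa_quadrants(percent, neighbors)
print(labels)
```

Locations labeled high-high or low-low are candidates for clusters; high-low and low-high locations are candidates for spatial outliers.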

Results

For percent Hispanic, there are two cluster types, low-low and high-high. The high-high cluster lies close to the border with Mexico. The low-low cluster lies farthest from the border. The Moran's I of .7787 shows strong spatial autocorrelation.

For the percent Democratic vote in 2008, there is a high-high cluster close to the border with Mexico and a low-low cluster on the northern border of the state. The Moran's I of .6957 shows high spatial autocorrelation.

For the percent Democratic vote in 1980, there are two high-high clusters, one close to the southern border with Mexico and one near the northeastern border. There is a low-low cluster close to the northwestern border. The Moran's I of .5752 shows strong spatial autocorrelation.

Voter turnout in 1980 shows a low-low cluster close to the southern border with Mexico. There is a high-high cluster in the northern part of the state, around both Dallas and Austin, TX. The Moran's I is .3634.

Voter turnout in 2008 shows a low-low cluster close to the southern border with Mexico. There are high-high clusters near the northern border and around Austin, TX. The Moran's I is .4681.

Figure 1



















Conclusion

Comparing the 1980 and 2008 elections, some interesting patterns revealed themselves through both the LISA maps and the Moran's I tests. Hispanic populations cluster strongly near the southern border between Texas and Mexico. Along this same border, there are clusters of low voter turnout and of high Democratic vote share. This means that areas with high Hispanic clustering also tend to have low voter turnout and a high share of Democratic votes.

There are also low-low clusters of Hispanic population in the northern part of the state, excluding the Dallas area. The Dallas area shows high voter turnout but no significant clustering of Democratic vote. The northwestern part of the state shows no significant Hispanic clustering, but significant clusters of high voter turnout and low Democratic vote.

If the TEC is trying to increase voting in the state, it should focus on the southern border between Texas and Mexico. There is a trend of low voter turnout there, so it would be beneficial to get these areas to vote; the votes cast there are mostly Democratic. This area also has clustering of Hispanic populations, which should be taken into consideration.



Monday, March 16, 2015

Quantitative Methods- Assignment 3

1.

2.

Asian Long Horned Beetles
Null hypothesis: the number of this invasive species in the Bucks County sample does not differ from the Pennsylvania state average.

Alternative hypothesis: the number of this invasive species in the Bucks County sample differs from the Pennsylvania state average.

I reject the null hypothesis that there is no difference in the number of this invasive species between the Bucks County sample and the Pennsylvania state average. The Z-score of the sample is -7.7519, which falls outside the critical values of +/- 1.96.

Emerald Ash Borer Beetle
Null hypothesis: the number of this invasive species in the Bucks County sample does not differ from the Pennsylvania state average.

Alternative hypothesis: the number of this invasive species in the Bucks County sample differs from the Pennsylvania state average.

I reject the null hypothesis that there is no difference in the number of this invasive species between the Bucks County sample and the Pennsylvania state average. The Z-score of the sample is 9.249, which falls outside the critical values of +/- 1.96.

Golden Nematode
Null hypothesis: the number of this invasive species in the Bucks County sample does not differ from the Pennsylvania state average.

Alternative hypothesis: the number of this invasive species in the Bucks County sample differs from the Pennsylvania state average.

I reject the null hypothesis that there is no difference in the number of this invasive species between the Bucks County sample and the Pennsylvania state average. The Z-score of the sample is 2.47, which falls outside the critical values of +/- 1.96.

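In conclusion, all three samples reject the null hypothesis. The sign of each Z-score shows the direction of the difference: the Bucks County sample has fewer Asian long-horned beetles than the state average, but more emerald ash borers and golden nematodes. Something in Bucks County appears to make it less habitable for the first species and more habitable for the other two.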
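The decision rule used for the three species can be sketched in Python. The Z-scores below are the ones reported above; the sample mean, state mean, standard deviation, and sample size in the last lines are hypothetical, since the underlying counts are not reproduced in this post, and the function names are made up for the example.

```python
from math import sqrt

CRITICAL_Z = 1.96  # two-tailed critical value at the 95% confidence level

def z_score(sample_mean, population_mean, population_sd, n):
    """Z = (sample mean - population mean) / (sd / sqrt(n))."""
    return (sample_mean - population_mean) / (population_sd / sqrt(n))

def rejects_null(z, critical=CRITICAL_Z):
    """Reject when the Z-score falls outside +/- the critical value."""
    return abs(z) > critical

# Z-scores reported in this assignment.
reported = {
    "Asian long-horned beetle": -7.7519,
    "Emerald ash borer": 9.249,
    "Golden nematode": 2.47,
}

for species, z in reported.items():
    direction = "below" if z < 0 else "above"
    print(species, rejects_null(z), direction + " the state average")

# Hypothetical inputs showing how a Z-score like these is obtained.
example_z = z_score(sample_mean=3.2, population_mean=4.0,
                    population_sd=0.5, n=25)
```

The sign of the Z-score tells whether the sample sits below or above the state average, which matters for interpreting what the rejection means.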


3.
Null hypothesis: The number of people per party has not changed in the intervening years.
Alternative hypothesis: The number of people per party has changed in the intervening years.

t-score: 4.92

The critical value for a one-tailed test at the 95% confidence level is 1.711. Because the t-score of 4.92 exceeds the critical value, the null hypothesis is rejected.



4.

Introduction

In this assignment I have been hired by the tourism board of Wisconsin to analyze the concept of "Up-North." Northern Wisconsin is home to many cabins and is where many people vacation in the summer. Understanding how tourism differs between northern and southern Wisconsin could lead to better marketing and planning for such activities.
   Fishing is the focus of this analysis. Fishing is an activity many people take part in, and northern and southern Wisconsin may differ in who fishes there and how many.

Methods

The State of Wisconsin provided a broad data set (SCORP) from which three variables were to be chosen. The chosen variables are state fishery areas, non-resident fishing licenses, and resident fishing licenses.
A shapefile of Wisconsin was obtained from the U.S. Census FactFinder. This shapefile was joined with the given dataset table. The three variables were broken down using natural breaks into four classes for statistical analysis and mapping.

These classes were added as another field and exported as a dBASE table for use with SPSS.

SPSS was used to run a chi-square analysis of each of the three classified variables against the northern vs. southern classification. A chi-square test determines whether observed values differ from expected values. All three variables were tested at a 95% confidence level to determine significance.
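The chi-square statistic itself is straightforward to compute, as the Python sketch below shows. The table of county counts is hypothetical (72 counties split evenly into north and south, then into the four natural-breaks classes) and is not the SCORP data; the critical value 7.815 is the standard chi-square cutoff for 3 degrees of freedom at the .05 level. SPSS reports a significance value directly, so comparing against a critical value is a simplification assumed here.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if region and class were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts of counties: rows are north and south,
# columns are the four classes.
observed = [
    [12, 10, 8, 6],
    [14, 9, 7, 6],
]
CRITICAL_05_DF3 = 7.815  # chi-square critical value, df = 3, alpha = .05

stat, df = chi_square(observed)
significant = stat > CRITICAL_05_DF3
print(round(stat, 3), df, significant)
```

With these made-up counts the statistic is far below the critical value, so the test would fail to reject the null hypothesis of no difference between north and south.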


Results

Tourism in northern Wisconsin (Figure 1) proves not to be different from the south, except for resident fishing licenses. State Fishery Areas (Figure 2) show a few hot spots for fishing around Wisconsin, but they are not limited to the north. This is backed up by a chi-square test that fails to reject the null hypothesis of no difference between the north and south. The significance value of .192 is greater than .05, so there is no significant difference between northern and southern counties in acres of Wisconsin State Fishery Areas.

The number of non-resident fishing licenses (Figure 3) shows popularity in both northern and southern counties. While the north may look like it has much more non-resident fishing, the difference is not significant. The chi-square test fails to reject the null hypothesis of no difference in non-resident fishing licenses between northern and southern Wisconsin. At a 95% confidence level, the significance value is .144. This is greater than .05, so the data do not show a significant difference.

The number of resident fishing licenses per county shows a different story (Figure 4). The map looks as though there is more resident fishing in southern counties. This is supported by a chi-square test that rejects the null hypothesis of no difference between northern and southern counties. Tested at a 95% confidence level, the significance value is just under .05 at .049. This means that there is a significant difference in resident fishing licenses, and the map suggests that there are more in the south.


Figure 1
Figure 2
Figure 3

Figure 4


Conclusions

Fishing tourism in Wisconsin does not differ between the north and south; only resident fishing does. This could be because there is no significant difference in State Fishery Areas. To investigate further, I would test whether population per county and resident fishing licenses are correlated.

Thursday, February 26, 2015

Quantitative Methods- Assignment 2


Introduction

     In this assignment, I have been hired by an independent research consortium to study the geography of tornadoes in Kansas and Oklahoma. This is a topic of interest because tornadoes are very common in these states. If there is a spatial pattern to where tornadoes touch down and how destructive they are in a given area, safety measures can be implemented in the places that need them most.
     This analysis compares two periods of time: 1995-2006 and 2007-2012. Some people argue that tornado patterns have not changed over the years, so places where they have always occurred should be required to build shelters. Others disagree and say that because not every place sees tornadoes, shelters are a waste of time and money. This project looks at whether tornado patterns change over time and whether there are recurring patterns in the touchdown locations and sizes of tornadoes across the states. This review will help answer whether storm shelters are a necessary precaution.

Methodology

      Two datasets of tornado locations and widths were received, for the years 1995-2006 and 2007-2012, along with a county-level shapefile covering both Kansas and Oklahoma. The first spatial statistical tool used is the mean center. The mean center is the average location of a set of points, calculated from the average of the x values and the average of the y values. A weighted mean center was also used, which is a mean center that takes into account a weight for each point, such as the frequency of grouped data. The mean center was found for both 1995-2006 and 2007-2012. The weighted mean center was also found for both data sets, weighted by tornado width. This study assumes that a wider tornado is more destructive.
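A short Python sketch shows how the two centers differ. The four touchdown coordinates and widths are hypothetical, chosen so that the widest tornadoes are in the south (smaller y), and the function name is made up for this example rather than taken from ArcMap.

```python
def mean_center(points, weights=None):
    """Average x and y of the points; optionally weighted per point."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y

# Hypothetical touchdowns as (x, y); larger y is farther north.
touchdowns = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
widths = [3.0, 3.0, 1.0, 1.0]  # the two southern tornadoes are wider

center = mean_center(touchdowns)
weighted_center = mean_center(touchdowns, widths)
print(center, weighted_center)  # (1.0, 1.0) (1.0, 0.5)
```

Because the wider tornadoes sit in the south, the weighted mean center is pulled south of the unweighted one, which is the kind of shift the maps in this assignment are read for.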
      The next spatial statistical tool used is standard distance. Standard distance is the spatial equivalent of the standard deviation. It measures the degree to which features are concentrated or dispersed around the mean center and is expressed as the radius of a circle. It is calculated around a mean center, which can be weighted or unweighted. Standard distance was found for 1995-2006 and 2007-2012 at 1 standard deviation, both weighted by tornado width.
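As a sketch of the calculation, the Python below computes a standard distance as the square root of the (optionally weighted) average squared distance from the mean center. This is the usual textbook form and is assumed here; the points and widths are hypothetical and the function name is not an ArcMap tool name.

```python
from math import sqrt

def standard_distance(points, weights=None):
    """Radius describing how dispersed points are around their
    (optionally weighted) mean center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(points, weights)) / total
    cy = sum(w * y for (_, y), w in zip(points, weights)) / total
    # Weighted average of squared distances from the center.
    spread = sum(w * ((x - cx) ** 2 + (y - cy) ** 2)
                 for (x, y), w in zip(points, weights)) / total
    return sqrt(spread)

# Hypothetical touchdowns and widths.
touchdowns = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
widths = [3.0, 3.0, 1.0, 1.0]

unweighted_radius = standard_distance(touchdowns)
weighted_radius = standard_distance(touchdowns, widths)
print(unweighted_radius, weighted_radius)
```

A smaller radius means the points (or the heavily weighted points) are more tightly concentrated around the center, which is how the two time periods are compared below.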
     Lastly, the standard deviation of tornado occurrences by county was found. The standard deviation shows how far values are spread around the mean. A county well above the mean in standard deviations has many more tornadoes than average, and a county well below the mean has many fewer.

Results

The mean center and weighted mean center of the 1995-2006 data show that the mean center is farther north than the weighted mean center (Figure 1). This means that larger tornadoes tend to occur in more southerly locations. For 2007-2012, the weighted mean center is also farther south, and farther east, than the mean center (Figure 2). This means that larger tornadoes in the south and east pulled the weighted mean center in that direction compared with the mean center of tornado locations alone. Comparing 1995-2006 and 2007-2012, both weighted mean centers are the farthest south (Figure 3). The mean center for 2007-2012 is farther north than all of the other weighted and non-weighted mean centers, meaning that tornadoes were more frequent farther north in 2007-2012 but were not as large.



Figure 1
Figure 2
Figure 3

The standard distance was mapped for the two time periods, 1995-2006 (Figure 4) and 2007-2012 (Figure 5). These maps show 1 standard deviation around the mean center weighted by tornado width. Comparing the two standard distances (Figure 6) shows that 2007-2012 has a smaller radius than 1995-2006. This means that the 2007-2012 data are more concentrated around the weighted mean center than the 1995-2006 data. In 2007-2012, the widths and locations of tornadoes show two concentrations, one starting in the north and running through the weighted mean center and one running through the south side of the standard distance circle. These concentrations are both near the weighted mean center, and there are not many tornadoes outside the standard distance. In 1995-2006 there is a much higher number of tornadoes farther from the mean center, and they are much more spread across the states. The standard distance has to be larger for 1995-2006 to account for the larger number of tornadoes occurring on the edges of the states.



Figure 4

Figure 5

Figure 6
The standard deviation of tornado counts by county for 2007-2012 was also found (Figure 7). This map shows where each county falls within a normal distribution, revealing which counties have more or fewer tornadoes than average.



Figure 7

    Statistics of the data were also calculated. The Z-scores based on the number of tornadoes per county are 4.88 for Russell County, KS, 2.09 for Caddo County, OK, and .23 for Alfalfa County. The average number of tornadoes per county is 4 and the standard deviation is 4.3. Russell County has a very high Z-score of 4.88, which means that it is 4.88 standard deviations above the mean; that county has many more tornadoes than the average. Alfalfa County, on the other hand, with a Z-score of .23, is close to the average number of tornadoes because it is within 1 standard deviation.
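The Z-score arithmetic can be checked with a few lines of Python. The mean of 4 and standard deviation of 4.3 come from the text; the tornado counts of 25, 13, and 5 are not stated in this post and are back-calculated here from the reported Z-scores, so treat them as assumptions.

```python
MEAN_TORNADOES = 4.0  # average tornadoes per county (from the text)
SD_TORNADOES = 4.3    # standard deviation (from the text)

def z_score(count, mean=MEAN_TORNADOES, sd=SD_TORNADOES):
    """Number of standard deviations a county's count is from the mean."""
    return (count - mean) / sd

# Assumed counts, back-calculated from the reported Z-scores.
assumed_counts = {
    "Russell County, KS": 25,
    "Caddo County, OK": 13,
    "Alfalfa County": 5,
}

z_scores = {county: round(z_score(count), 2)
            for county, count in assumed_counts.items()}
print(z_scores)
```

Under those assumed counts the rounded results reproduce the 4.88, 2.09, and .23 reported above.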
    If the patterns hold true over the next five years in OK and KS, the number of tornadoes per county that will be exceeded 70% of the time is 1.764 (a z-score of -0.52). The number that will be exceeded only 20% of the time is 7.612 (a z-score of 0.84).
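These two thresholds can be reproduced by looking up the z-score for each probability and converting it back to a tornado count with count = mean + z * standard deviation. The Python sketch below uses the standard library's normal distribution in place of a printed z-table and rounds z to two decimals, as a table lookup would; it assumes, as the calculation above does, that county counts are normally distributed.

```python
from statistics import NormalDist

MEAN_TORNADOES = 4.0  # from the text
SD_TORNADOES = 4.3    # from the text

def count_exceeded(probability):
    """Tornado count exceeded with the given probability."""
    # Exceeded p of the time means (1 - p) of the distribution lies below.
    z = round(NormalDist().inv_cdf(1 - probability), 2)
    return MEAN_TORNADOES + z * SD_TORNADOES

exceeded_70 = round(count_exceeded(0.70), 3)
exceeded_20 = round(count_exceeded(0.20), 3)
print(exceeded_70, exceeded_20)  # 1.764 7.612
```

The 70% threshold sits below the mean because most counties exceed a low count, while the 20% threshold sits above it.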

Conclusions

The weighted mean centers for both time periods shift to the south, which means that width plays a role: more southern locations have larger tornadoes. The standard distance radius is larger for 1995-2006 because more tornadoes are spread along the outer edges of the states, whereas in 2007-2012 tornadoes are more concentrated around the weighted mean center. The standard deviations of counties show that there are patterns of more and fewer occurrences of tornadoes by county. The z-scores show a large difference between counties in the frequency of tornadoes, which means that some counties would benefit more from shelters than others. Both time periods lean towards the south for larger tornadoes, so if shelters were to be put in, Oklahoma would benefit the most. Looking at the graduated symbols of tornadoes across both states, a large number of tornadoes occur almost everywhere, so as a safety precaution I would suggest that shelters are a necessity, especially around the weighted mean center.