Pearson's correlation coefficient can be calculated using the formula:
$r = \dfrac{\mathrm{cov}_{xy}}{s_x s_y} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(N-1)\, s_x s_y}$
where $s_x$ is the standard deviation of the first variable and $s_y$ is the standard deviation of the second variable. This formula standardizes the covariance between the two variables, yielding a coefficient (r) between -1 and +1. A coefficient of +1 indicates a perfect positive correlation, such that an increase in one variable is associated with a proportionate increase in the other. Conversely, a coefficient of -1 indicates a perfect negative correlation, such that an increase in one variable is associated with a proportionate decrease in the other. A correlation of 0 indicates no linear relationship, such that an increase or decrease in one variable is associated with no change in the other.
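As a minimal illustration of this formula, the sketch below computes r directly from the covariance definition and compares it with scipy's built-in function; the data values are hypothetical and scipy is assumed to be available.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations for two continuous variables
x = np.array([2.1, 3.4, 4.0, 5.6, 6.3, 7.8, 8.1])
y = np.array([1.9, 3.0, 4.4, 5.1, 6.9, 7.2, 8.5])

# Direct computation: r = sum((x_i - x_bar)(y_i - y_bar)) / ((N - 1) * s_x * s_y)
n = len(x)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

# Built-in computation for comparison
r_scipy, p_value = stats.pearsonr(x, y)

print(round(r_manual, 4), round(r_scipy, 4))  # the two values should agree
```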
There are five statistical assumptions of Pearson's correlation coefficient. They include:
- Random selection of the samples. As Corty (2007) observes, although this assumption is robust to violation, the results of a study may have limited generalizability.
- Both variables (the independent and the dependent variable) are measured at the interval or ratio level. This assumption is not robust to violation (Corty, 2007).
- Normal distribution of both variables. This assumption is robust to violation if n is large enough (at least 30) and there are no major problems with skewness or kurtosis (Corty, 2007).
- Homoscedasticity of the variables. This assumption can be assessed by inspecting a scatterplot to check the spread of one variable around the other. It is robust to violation if n is large enough (at least 30) (Corty, 2007).
- There is a linear relationship between the two variables. This assumption is usually evident in a scatterplot (a brief scatterplot sketch follows this list); variables showing a nonlinear relationship will be transformed accordingly. This assumption is not robust to violation (Corty, 2007).
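A simple scatterplot with a least-squares line is one way to screen the linearity and homoscedasticity assumptions listed above. The sketch below is illustrative only; the data are hypothetical and matplotlib is assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired observations (same form as the earlier example)
x = np.array([2.1, 3.4, 4.0, 5.6, 6.3, 7.8, 8.1])
y = np.array([1.9, 3.0, 4.4, 5.1, 6.9, 7.2, 8.5])

# Scatterplot with a least-squares line to eyeball linearity and whether
# the spread of y around the line is roughly constant (homoscedasticity).
slope, intercept = np.polyfit(x, y, 1)
plt.scatter(x, y)
plt.plot(x, intercept + slope * x)
plt.xlabel("Variable X")
plt.ylabel("Variable Y")
plt.title("Scatterplot for linearity / homoscedasticity screening")
plt.show()
```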
Multiple linear regression analysis
The relationship between one continuous dependent variable and multiple independent variables (usually continuous, but possibly categorical) will be examined using multiple linear regression analysis. Polit and Beck (2012) describe the independent variables as predictor variables.
The basic equation for multiple linear regression is $Y' = a + b_1x_1 + b_2x_2 + \dots + b_kx_k$,
where $Y'$ is the predicted value of the dependent variable Y; a is the intercept; k is the number of predictor variables; $b_1$ to $b_k$ are the regression coefficients; and $x_1$ to $x_k$ are the values of the predictor variables (Polit & Beck, 2012).
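To make the equation concrete, the sketch below fits a model with two hypothetical predictors using statsmodels; the intercept in the output corresponds to a and the slopes to $b_1$ and $b_2$. The data are simulated and the package is assumed to be available.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: one continuous DV and two continuous predictors
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "x1": rng.normal(50, 10, 100),
    "x2": rng.normal(30, 5, 100),
})
data["y"] = 2.0 + 0.5 * data["x1"] + 1.2 * data["x2"] + rng.normal(0, 3, 100)

# Fit Y' = a + b1*x1 + b2*x2
X = sm.add_constant(data[["x1", "x2"]])
model = sm.OLS(data["y"], X).fit()
print(model.params)    # const = a (intercept), x1 = b1, x2 = b2
print(model.summary())
```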
Assumptions for multiple linear regression analysis
Linearity, homoscedasticity, multicollinearity, normality, influential data points, and independence of residuals form the statistical assumptions of multiple regression, and they define the screening process. If these assumptions are not met, Type I or Type II errors, or overestimation or underestimation of the significance of the effect size, might occur (Osborne & Waters, 2002). In addition to assumption testing, the principal investigator will carry out preliminary data screening for outliers. The analysis will be done with and without any outlier so that the results can be compared; otherwise, outliers are difficult to interpret in an exploratory descriptive study, and considerable caution is required before removing them.
Normality
The normality assumption can be examined with either graphical or statistical methods. One approach is to inspect skewness and kurtosis in the descriptive statistics. Skewness relates to the symmetry of a distribution, while kurtosis relates to its peakedness (Tabachnick & Fidell, 2007). For a normally distributed variable, skewness and kurtosis both equal zero (Corty, 2007). Visual inspection of histograms and normal probability plots should show an approximately normal distribution, with no marked indication of skewness (a bow shape) or kurtosis (a sharply curved S shape) on the probability plot. Outliers should also be monitored closely.
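A minimal sketch of this descriptive screening, using scipy on a hypothetical variable, is shown below; values near zero for both statistics are consistent with normality.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of a continuous variable
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=200)

# Skewness (symmetry) and excess kurtosis (peakedness); both are ~0 for a normal distribution
print("skewness:", stats.skew(sample))
print("kurtosis:", stats.kurtosis(sample))  # Fisher definition: normal -> 0
```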
In addition to graphical assessment, the Kolmogorov-Smirnov (K-S) test is a statistical technique for testing normality. A nonsignificant result (p > 0.05) indicates acceptance of the null hypothesis, i.e., that the data are normally distributed, whereas a significant result (p < 0.05) indicates rejection of the null hypothesis and a non-normal distribution. With large samples, however, tests of normality, skewness, and kurtosis tend to be significant even for trivial deviations, so they should be interpreted with caution (Field, 2013). If the distribution is non-normal, transformation of the IVs should be considered according to the degree of deviation from normality (Tabachnick & Fidell, 2007), and the cause of the non-normality should be identified so that appropriate corrective action can be taken.
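A hedged sketch of the K-S check with scipy follows; the sample is hypothetical, and because the normal parameters are estimated from the same sample, the standard K-S test is conservative (Lilliefors' correction or the Shapiro-Wilk test are common alternatives).

```python
import numpy as np
from scipy import stats

# Hypothetical sample of a continuous variable
rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=200)

# K-S test against a normal distribution with the sample's mean and SD
stat, p = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))
print(f"K-S statistic = {stat:.3f}, p = {p:.3f}")  # p > .05 -> no evidence against normality
```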
Linearity
The linearity assumption concerns the linear relationship between an independent variable and the dependent variable. It can be examined with a scatterplot of observed versus predicted values or with a partial regression plot of one IV against the DV. In this study, partial regression plots will be created and best-fit lines applied; a curvilinear pattern indicates a nonlinear relationship. If this assumption is violated, the risk of Type I errors (overestimation) for other independent variables increases (Hair, Black, Babin, Anderson, & Tatham, 2006). If nonlinearity is present, a polynomial model or a transformation is required to improve linearity.
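One way such partial regression plots might be produced is with statsmodels, as sketched below; the two-predictor model is the same hypothetical one used earlier and is refit here so the snippet stands alone.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Refit the hypothetical two-predictor model from the earlier sketch
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(50, 10, 100), "x2": rng.normal(30, 5, 100)})
df["y"] = 2 + 0.5 * df["x1"] + 1.2 * df["x2"] + rng.normal(0, 3, 100)
model = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# One partial regression plot per predictor; a curvilinear pattern suggests nonlinearity
fig = plt.figure(figsize=(8, 4))
sm.graphics.plot_partregress_grid(model, fig=fig)
plt.show()
```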
Homoscedasticity (constant error of variance)
The assumption of homoscedasticity implies that the variance of the errors is equal across all levels of the independent variables (Osborne & Waters, 2002). It can be assessed with a plot of the standardized residuals against the predicted values: the residuals should be randomly distributed around zero (Osborne & Waters, 2002), whereas an uneven distribution, such as a fan or butterfly shape, indicates heteroscedasticity. Violation of this assumption weakens the overall analysis and its statistical power and can increase the risk of Type I errors. One remedy is transformation of the DV scores (Tabachnick & Fidell, 2007).
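A minimal sketch of this residual plot, again refitting the hypothetical two-predictor model, is shown below.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Refit the hypothetical two-predictor model from the earlier sketch
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(50, 10, 100), "x2": rng.normal(30, 5, 100)})
df["y"] = 2 + 0.5 * df["x1"] + 1.2 * df["x2"] + rng.normal(0, 3, 100)
model = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# Standardized residuals against fitted values: an even band around zero supports
# homoscedasticity; a fan or butterfly shape suggests heteroscedasticity.
standardized = model.get_influence().resid_studentized_internal
plt.scatter(model.fittedvalues, standardized)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Standardized residuals")
plt.show()
```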
Independence of residual errors
The assumption of independence of residual errors implies that the errors in the model are independent of one another (Field, 2013). The Durbin-Watson test can be used to detect autocorrelation in the residuals; its statistic ranges from 0 to 4. A value of approximately 2 indicates that the residuals are independent, a value greater than 2 indicates negative autocorrelation, and a value less than 2 indicates positive autocorrelation (Field, 2013). Violation of this assumption invalidates the confidence intervals and significance tests and increases the risk of Type I error. If the assumption is not met, transformation or multilevel linear models can be used.
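The Durbin-Watson statistic is available in statsmodels; a brief sketch on the same hypothetical model follows.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Refit the hypothetical two-predictor model from the earlier sketch
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(50, 10, 100), "x2": rng.normal(30, 5, 100)})
df["y"] = 2 + 0.5 * df["x1"] + 1.2 * df["x2"] + rng.normal(0, 3, 100)
model = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# Durbin-Watson ranges from 0 to 4: ~2 suggests independent residuals,
# < 2 positive autocorrelation, > 2 negative autocorrelation.
print("Durbin-Watson:", durbin_watson(model.resid))
```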
Multicollinearity
Multicollinearity arises when there is a strong correlation between two or more independent variables. It can be assessed by examining the correlation matrix of the IVs to detect highly correlated variables (r = .9 and above). In addition, the collinearity diagnostics should be evaluated using three components: the variance inflation factor (VIF), the tolerance, and the condition index. The VIF assesses whether a predictor has a strong linear relationship with the other predictors; a VIF greater than 10 indicates multicollinearity. Tolerance measures the proportion of a predictor's variance that is not explained by the other independent variables; a tolerance below 0.1 indicates a multicollinearity problem. The condition index denotes one variable's dependence on the others (Tabachnick & Fidell, 2007, p. 90); a high condition index is likely to be accompanied by high variance proportions, and a condition index of 30 or higher indicates multicollinearity. Multicollinearity can be handled in several ways, including removing the highly correlated variables, combining highly correlated independent variables, or performing a principal components analysis (Tabachnick & Fidell, 2007, p. 91).
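A sketch of the correlation-matrix, VIF, and tolerance checks with statsmodels is shown below; the predictors are hypothetical, with one pair deliberately made nearly redundant to illustrate the flags.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is deliberately almost redundant with x2
rng = np.random.default_rng(7)
X = pd.DataFrame({"x1": rng.normal(50, 10, 100), "x2": rng.normal(30, 5, 100)})
X["x3"] = X["x2"] * 0.9 + rng.normal(0, 1, 100)

print(X.corr().round(2))  # flag any pair with r = .9 or above

Xc = sm.add_constant(X)
for i, name in enumerate(Xc.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(Xc.values, i)
    # VIF > 10 or tolerance < 0.1 indicates a multicollinearity problem
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")
```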
Influential data points
Influential data points are outliers that have a significant effect on the slope (gradient) of the regression line. Cook's distance can be used to identify them: when the distance exceeds 1, the outlier is considered influential. Deleting an influential outlier will produce a substantial change in at least one of the regression coefficients. Therefore, the analysis should be run with and without the influential data point so that the results can be compared (Tabachnick & Fidell, 2007, p. 75).
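Cook's distances can be obtained from a fitted statsmodels model, as sketched below on the same hypothetical data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Refit the hypothetical two-predictor model from the earlier sketch
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(50, 10, 100), "x2": rng.normal(30, 5, 100)})
df["y"] = 2 + 0.5 * df["x1"] + 1.2 * df["x2"] + rng.normal(0, 3, 100)
model = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# Cook's distance for every case; values > 1 flag potentially influential points
cooks_d, _ = model.get_influence().cooks_distance
influential = np.where(cooks_d > 1)[0]
print("Cases with Cook's distance > 1:", influential)
```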
Logistic regression (Field, 2013)
The assumptions of logistic regression include the following (a brief sketch of fitting and checking a logistic model follows this list):
- The independent variables (predictors) are continuous, while the dependent variable (outcome) is dichotomous or binary.
- Linearity must exist between the independent variables (predictors) and the logit of the outcome variable. This assumption is met when the interaction term between a predictor and its log transformation is non-significant. Solutions include dummy coding the independent variable or transformation.
- Independence of errors: responses must be independent, without duplication. Correlated outcome data produce correlated errors, which results in overdispersion. If the ratio of the chi-square goodness-of-fit statistic to its degrees of freedom is greater than 1, overdispersion is present. Rescaling the confidence intervals and standard errors with the dispersion parameter can substantially reduce the effects of overdispersion.
- No multicollinearity.
- No strongly influential outliers.
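The sketch below fits a logistic model with statsmodels on hypothetical data and includes a Box-Tidwell-style interaction term (predictor multiplied by its own log) to check linearity of the logit, as described above; all data and variable names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: one continuous predictor and a binary outcome
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)                    # strictly positive so log(x) is defined
p = 1 / (1 + np.exp(-(-2 + 0.6 * x)))
y = rng.binomial(1, p)

# Box-Tidwell-style check for linearity of the logit:
# include the predictor and its interaction with its own log.
df = pd.DataFrame({"x": x, "x_log_x": x * np.log(x), "y": y})
logit = sm.Logit(df["y"], sm.add_constant(df[["x", "x_log_x"]])).fit(disp=False)
print(logit.summary())  # a non-significant x_log_x term supports linearity of the logit
```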
Data Transformation
Data transformation is the process of applying a mathematical function to a variable's values. Transformations can remedy outliers and violations of normality, linearity, or homoscedasticity. The type of transformation depends on the direction and degree of skewness. When the data are negatively skewed, the variable must be reflected before an appropriate transformation is applied: 1 is added to the largest value in the distribution to obtain a constant, and each value is then subtracted from that constant to create a new variable, turning the negative skew into a positive one. The distribution should always be rechecked after transformation. The most common types of data transformation are described below, followed by a brief sketch.
- Square root transformation is appropriate when data distribution differs moderately from normal and the data are counts.
- Log transformation is appropriate when data distribution substantially differs from normal and the variance increases with mean.
- Inverse transformation is appropriate when the data distribution differs severely from normal.
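A minimal numpy sketch of these transformations, including reflection for negative skew, follows; the data are hypothetical and adding 1 before the square root, log, and inverse is only one common way to avoid problems with zero values.

```python
import numpy as np

# Hypothetical positively and negatively skewed variables
rng = np.random.default_rng(5)
pos_skewed = rng.exponential(scale=2.0, size=200)
neg_skewed = 10 - rng.exponential(scale=2.0, size=200)

# Common transformations for positive skew (adding 1 avoids log/sqrt/inverse of zero)
sqrt_t = np.sqrt(pos_skewed + 1)   # moderate deviation from normal
log_t = np.log(pos_skewed + 1)     # substantial deviation from normal
inv_t = 1 / (pos_skewed + 1)       # severe deviation from normal

# Reflection for negative skew: constant = largest value + 1, then subtract each value
constant = neg_skewed.max() + 1
reflected = constant - neg_skewed  # now positively skewed; apply sqrt/log/inverse as needed
```

After any transformation, the distribution of the new variable should be checked again, as noted above.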