
Allocating Data Analytics Tasks Assignment Sample

Subject Name : Statistics

Case Study: Tass Paper Mill

Introduction:

TPM is a subsidiary of Pinnon Paper Industries, which has a long history of manufacturing paper products. The firm serves both local and international markets, and its financial performance has been strong for the past two decades. However, the firm has seen a shift in its business environment with the increasing reach of online interactions and social media services. The company wants to improve its management practices and grow its customer base, and is conducting surveys to learn customers’ views of its performance on a range of factors. An in-depth analysis of these parameters should yield new insights and help the company capture new markets.

Overview of Data variables:

The data file contains variables describing brand status, distribution channel, quality, quantity produced and product line, along with customers’ perceptions on various other parameters. These parameters are rating-scale questions scored from 1 to 10. The variable recording a contract with TPM is binary, taking the values 1 and 0. This is the ultimate response variable of interest to the company: the aim is to understand what would increase the number of 1s and so bring in more business.

Statistical Techniques used:

The statistical techniques used in the data analysis are as follows:

Descriptive statistics
Data visualization using graphs and other appropriate tools
Univariate analysis and checking for interaction effects
Regression analysis to find the variables that best explain the response
Logistic regression
Time series analysis and fitting of the best possible model for forecasting

Data Analysis

The order quantity denotes production units measured in tonnes. On average, the supply per company is around 7.7 tonnes, estimated with a margin of error of 6.3%. Of the 200 potential customers listed, 101 have signed a contract with TPM, while the remaining 99 have not.

The company has therefore engaged a market researcher to model the components that need immediate intervention, so that in due course the firm can pick up business and improve its performance.

Correlations and Regressions:

The choice between correlation analysis and regression analysis depends on the data set and on the objective of the study. Correlation analysis quantifies the degree to which two variables are related: the correlation coefficient tells you how much one variable changes when the other one does. It measures the strength of the linear relationship between two variables.

Regression analysis is a related technique for assessing the relationship between an outcome variable and one or more risk factors or confounding variables. The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors or independent variables. Regression analysis is useful when you need to identify the impact of a unit change in the independent variable (x) on the estimated variable (y).

Let us evaluate the relationship between the quantity ordered and the other variables. If a significant relationship exists, we can incorporate those variables into a model and base the study on it.

	Quality	SM_Presence	Advert	Prdct_Line
Order_Qty	0.433371527	0.235189092	0.237037785	0.462329463

	Brand_Image	Comp_Pricing	Order_Fulfillment	Flex_Price	Shipping_Speed	Shipping_Cost
Order_Qty	0.33800493	-0.217712219	0.314590653	-0.002839655	0.425081995	0.504412715

From this correlation matrix we observe that Quality, Prdct_Line, Brand_Image, Shipping_Speed and Shipping_Cost show a moderate positive relationship with Order_Qty, with correlations of roughly 0.34 to 0.50.
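Each entry in the correlation table is a Pearson correlation coefficient. The sketch below shows how such a coefficient is computed; the survey data are not reproduced in this document, so the two short lists are hypothetical values used only for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical quality ratings (1-10) and order quantities (tonnes).
quality = [6, 7, 8, 9, 5, 7, 8]
order_qty = [6.5, 7.0, 8.1, 8.4, 6.0, 7.7, 7.9]

r = pearson_r(quality, order_qty)
print(f"r = {r:.3f}")
```

Applied to the real Order_Qty column and each rating column in turn, the same function would give one row of the table above.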

Regression Model to estimate the Order Quantity:

Taking this further, let us build a model in which the response variable Order_Qty depends on independent variables such as Quality, Prdct_Line, Brand_Image, Shipping_Speed and Shipping_Cost.

However, there is a strong prior expectation that the quantity ordered is greatly influenced by Brand_Image and the Quality of the product. To substantiate this with evidence from the data, let us carry out a univariate analysis and check whether these variables act on Order_Qty independently or whether there is a combined (interaction) effect when both are taken together.

A regression analysis with the explanatory variables and Order_Qty as the response variable was carried out.

Summary Output

Regression Statistics
Multiple R	0.692495298
R Square	0.479549738
Adjusted R Square	0.452012687
Standard Error	0.661225775
Observations	200

ANOVA
	df	SS	MS	F	Significance F
Regression	10	76.14050963	7.614050963	17.41471	3.22025E-22
Residual	189	82.63449037	0.437219526
Total	199	158.775

	Coefficients	Standard Error	t Stat	P-value
Intercept	2.618318138	0.873861782	2.996261185	0.0031
Quality	0.26763316	0.043862847	6.101591176	0.00
SM_Presence	-0.153049874	0.100559372	-1.521985179	0.129684
Advert	-0.029137289	0.054738262	-0.532302049	0.595142
Prdct_Line	0.234070222	0.216578954	1.080761623	0.28118
Brand_Image	0.344528252	0.079344709	4.342170454	0.00
Comp_Pricing	-0.042500107	0.037690334	-1.127612902	0.260913
Order_Fulfillment	-0.148402699	0.083245614	-1.782708915	0.076239
Flex_Price	0.252908454	0.224441457	1.126834843	0.261241
Shipping_Speed	-0.281426833	0.430113409	-0.654308439	0.513709
Shipping_Cost	0.246930459	0.076137882	3.243201067	0.001397

This table shows that Quality, Brand_Image and Shipping_Cost are significant at the 5% level. None of the other variables contributes significantly to Order_Qty.
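The regression statistics in the summary output follow directly from the ANOVA rows. As a check, the sketch below recomputes R Square, Adjusted R Square, F and the standard error from the sums of squares and degrees of freedom printed above; the function name is illustrative and not part of any library.

```python
def regression_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Recompute headline regression statistics from an ANOVA table."""
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual
    r_square = ss_regression / ss_total
    # Adjusted R-square penalises the number of predictors.
    adj_r_square = 1 - (1 - r_square) * df_total / df_residual
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "r_square": r_square,
        "adj_r_square": adj_r_square,
        "f": ms_regression / ms_residual,
        "standard_error": ms_residual ** 0.5,
    }


# Values copied from the ANOVA table of the ten-predictor regression.
full_model = regression_summary(76.14050963, 82.63449037, 10, 189)
for name, value in full_model.items():
    print(f"{name}: {value:.6f}")
```

The results agree with the printed summary: about 48% of the variation in Order_Qty is explained by the ten predictors together.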

Hence let us take as our basic model under consideration Order_Qty as the response variable, influenced by Quality and Brand_Image.

The basic regression model equation, using the intercept and coefficients from the regression above, is:

Order_Qty (Y) = β₀ + β₁*(Quality) + β₂*(Brand_Image) + ε

= 2.618 + 0.268*Quality + 0.345*Brand_Image + ε

Under this model we assume that Quality and Brand_Image act as independent variables. However, it is possible that they are related and that their interaction also affects Order_Qty. We therefore construct a new model with a third term, the interaction between these two explanatory variables.

A univariate analysis that includes this interaction gives the following results.

Summary Output

Regression Statistics
Multiple R	0.59575666
R Square	0.354925998
Adjusted R Square	0.345052416
Standard Error	0.722882639
Observations	200

ANOVA
	df	SS	MS	F	Significance F
Regression	3	56.35337527	18.78446	35.94704	1.48068E-18
Residual	196	102.4216247	0.522559
Total	199	158.775

	Coefficients	Standard Error	t Stat	P-value
Intercept	0.501086903	1.536769758	0.326065	0.744723
Quality	0.691114214	0.187122412	3.69338	0.000287
Brand_Image	0.864327126	0.269458518	3.207644	0.001563
Quality *Brand_Image	-0.068555382	0.032933539	-2.08163	0.038675

Model 1:

Order_Qty (Y) = β₀ + β₁*(Quality) + β₂*(Brand_Image) + β₃*(Quality*Brand_Image) + ε

= 0.5011 + 0.6911*(Quality) + 0.8643*(Brand_Image) - 0.0686*(Quality*Brand_Image) + ε

The p-value of the interaction term Quality*Brand_Image is 0.0387, which is below 0.05. Hence the interaction is significant at the 5% level.
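With an interaction term in the model, the effect of Quality on Order_Qty is no longer a single number: it depends on the level of Brand_Image. The sketch below evaluates the Model 1 equation with the coefficients from the summary table; the function names and the example ratings are illustrative assumptions.

```python
# Model 1 coefficients, copied from the summary output above.
INTERCEPT = 0.501086903
B_QUALITY = 0.691114214
B_BRAND_IMAGE = 0.864327126
B_INTERACTION = -0.068555382


def predict_order_qty(quality, brand_image):
    """Fitted Order_Qty (tonnes) for given Quality and Brand_Image ratings."""
    return (INTERCEPT
            + B_QUALITY * quality
            + B_BRAND_IMAGE * brand_image
            + B_INTERACTION * quality * brand_image)


def quality_slope(brand_image):
    """Change in fitted Order_Qty per one-point rise in Quality."""
    return B_QUALITY + B_INTERACTION * brand_image


# Hypothetical rating levels, chosen only to show how the slope changes.
for brand_image in (3, 5, 8):
    print(brand_image, round(quality_slope(brand_image), 4))
```

Because the interaction coefficient is negative, the slope for Quality shrinks as Brand_Image increases, and the same holds for Brand_Image as Quality increases.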

Model 2:

Another model was fitted with Order_Qty estimated from Quality, Advert and the interaction between them. In the coefficient table below, the Quality*Advert interaction has a p-value of 0.0519, just above 0.05, so it is not significant at the 5% level, although the main effects of Quality and Advert are.

The Model 2 regression equation is:

Order_Qty (Y) = β₀ + β₁*(Quality) + β₂*(Advert) + β₃*(Quality*Advert) + ε

= 2.2467 + 0.5766*Quality + 0.7477*Advert - 0.0679*(Quality*Advert) + ε

	Coefficients	Standard Error	t Stat	P-value
Intercept	2.246679738	1.240870879	1.810567	0.071739
Quality	0.576640373	0.152258512	3.787246	0.000203
Advert	0.747654494	0.28229155	2.648519	0.008743
Quality*Advert	-0.067871229	0.034697097	-1.95611	0.051873

Model 3:

Another model was fitted with Order_Qty estimated from Quality, Prdct_Line and the interaction between them. Neither the main effects nor the Quality*Prdct_Line interaction contributed significantly: apart from the intercept, every p-value is greater than 0.05.

	Coefficients	Standard Error	t Stat	P-value
Intercept	4.019192703	1.754678883	2.290557	0.023053
Prdct_Line	0.39784377	0.309059317	1.287273	0.199517
Quality	0.300369716	0.225546863	1.33174	0.184492
Quality*Prdct_Line	-0.022183249	0.03833725	-0.57863	0.5635

The Model 3 regression equation is:

Order_Qty (Y) = β₀ + β₁*(Quality) + β₂*(Prdct_Line) + β₃*(Quality*Prdct_Line) + ε

= 4.0192 + 0.3004*(Quality) + 0.3978*(Prdct_Line) - 0.0222*(Quality*Prdct_Line) + ε

Since only the Quality*Brand_Image interaction is significant, the revised final regression model is Model 1:

Order_Qty (Y) = β₀ + β₁*(Quality) + β₂*(Brand_Image) + β₃*(Quality*Brand_Image) + ε

= 0.5011 + 0.691*Quality + 0.864*Brand_Image - 0.069*(Quality*Brand_Image) + ε

Logistic Regression:

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

We conducted a logistic regression analysis in which the first model takes Quality, Prdct_Line, Brand_Image and Flex_Price as the explanatory variables. The p-value for Brand_Image (0.0584) is above 0.05, so the variable is not significant and is treated as redundant in the model.

Logistic Model:

Probability of event: Model equation

t = -8.9743 + 0.3186 Quality + 0.2915 Prdct_Line + 0.1812 Brand_Image + 0.2274 Flex_Price

Here t is the log-odds of a contract being signed, so the probability of the event is 1/(1 + exp(-t)).

	Coeff	SE	z-stat	lower z_0.025	upper z_0.975	p-value
b₀	-8.9743	1.2686	-7.0741	-11.4607	-6.4878	0.00
Quality	0.3186	0.1003	3.1754	0.1219	0.5153	0.0015
Prdct_Line	0.2915	0.1032	2.8248	0.08925	0.4937	0.00473
Brand_Image	0.1812	0.09571	1.8929	-0.006422	0.3687	0.05838
Flex_Price	0.2274	0.1107	2.0539	0.01039	0.4443	0.03999

b₀ = -8.9743 is the log-odds of a contract being signed when every predictor in the model equals zero. The corresponding odds are exp(-8.9743) ≈ 0.00013, so the baseline probability of signing is very low before the predictors are taken into account. For Quality, exp(0.3186) ≈ 1.375, meaning that each one-point increase in the quality rating multiplies the odds of TPM being selected by about 1.375, holding the other variables constant. Similarly, the product line, brand image and flexible price coefficients are positive and contribute to the firm being selected by its target customers. Because every coefficient in the model is positive, higher ratings on any of these variables correspond to a higher probability that the company is selected by its potential customers.
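Exponentiating a logistic regression coefficient gives an odds ratio: the factor by which the odds of the event are multiplied for a one-point increase in that predictor, with the others held fixed. The short sketch below does this for the four coefficients of the first logistic model and converts the intercept into baseline odds and a baseline probability; the variable names are illustrative.

```python
import math

# Coefficients of the first logistic model, copied from the table above.
intercept = -8.9743
coefficients = {
    "Quality": 0.3186,
    "Prdct_Line": 0.2915,
    "Brand_Image": 0.1812,
    "Flex_Price": 0.2274,
}

# exp(coefficient) = multiplicative change in the odds per one-point increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

# Odds and probability of a contract when every predictor equals zero.
baseline_odds = math.exp(intercept)
baseline_probability = baseline_odds / (1 + baseline_odds)

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
print(f"baseline odds {baseline_odds:.5f}, probability {baseline_probability:.5f}")
```

An odds ratio above 1 means the predictor raises the odds of a contract; it is not itself a probability.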

When sample sizes are large, it is much easier to accept (or at least harder to reject) more complex models, because chi-square test statistics are designed to detect any departure between a model and the observed data. That is, adding extra terms to a model will always improve the fit, but with a large sample it becomes harder to distinguish a “real” improvement in fit from a substantively trivial one. Likelihood-ratio tests therefore frequently lead to the rejection of acceptable models, and models become less parsimonious than they ought to be.

The higher the R² value, the better the fit to the data. For logistic regression we use the McFadden pseudo R², which compares the log-likelihood of the fitted model with that of an intercept-only model; a value close to 1 indicates the best possible fit. In our case the value is between 0.55 and 0.60, hence it is a moderately well-fitting model. Note that the McFadden R² cannot decrease when explanatory variables are added, so it should not be used on its own to justify a larger model.
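The McFadden pseudo R² is defined as 1 minus the ratio of the fitted model's log-likelihood to that of an intercept-only (null) model. The sketch below computes it and also derives the null log-likelihood from the contract counts reported earlier (101 signed, 99 not signed). It assumes all 200 customers entered the logistic fit, which the text does not state, and the fitted log-likelihood used in the example is hypothetical because the source does not report one.

```python
import math


def null_loglik(n_events, n_non_events):
    """Log-likelihood of an intercept-only logistic model."""
    n = n_events + n_non_events
    p = n_events / n
    return n_events * math.log(p) + n_non_events * math.log(1 - p)


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden pseudo R-square: 1 - lnL(model) / lnL(null)."""
    if loglik_null == 0:
        raise ValueError("null log-likelihood must be non-zero")
    return 1 - loglik_model / loglik_null


# Counts from the data description; assumes no customers were dropped.
ll_null = null_loglik(101, 99)

# Hypothetical fitted log-likelihood, for illustration only.
ll_model = -100.0
print(f"null log-likelihood: {ll_null:.4f}")
print(f"McFadden R-square:   {mcfadden_r2(ll_model, ll_null):.4f}")
```

A model that does no better than the intercept alone gives a value of 0, and the value rises towards 1 as the fitted log-likelihood approaches 0.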

The model was then refitted without Brand_Image, because that variable was not significant.

Probability of the event: Model equation

t = -8.4575 + 0.3218 Quality + 0.3254 Prdct_Line + 0.2795 Flex_Price

The McFadden R² (R²_McF) reported for this model equals 0.04739. A value this close to 0 indicates that the predictors explain only a small part of the variation in the outcome, although all three remaining predictors are individually significant at the 5% level.

Hence this is the final logistic model recommended for prediction.

Summary table:
	Coeff	SE	z-stat	lower z_0.025	upper z_0.975	p-value
b₀	-8.4575	1.2442	-6.7977	-10.896	-6.019	0.00
Quality	0.3218	0.1001	3.2129	0.1255	0.518	0.00131
Prdct_Line	0.3254	0.1003	3.2442	0.1288	0.5219	0.00118
Flex_Price	0.2795	0.1077	2.5948	0.06838	0.4906	0.00946
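The final model equation gives the log-odds t, and the logistic function 1/(1 + exp(-t)) turns it into a predicted probability of a contract. The sketch below applies this with the coefficients from the summary table; the function name and the example ratings are illustrative assumptions.

```python
import math

# Final logistic model coefficients, copied from the summary table.
FINAL_MODEL = {
    "intercept": -8.4575,
    "Quality": 0.3218,
    "Prdct_Line": 0.3254,
    "Flex_Price": 0.2795,
}


def contract_probability(quality, prdct_line, flex_price):
    """Predicted probability that a customer signs a contract with TPM."""
    t = (FINAL_MODEL["intercept"]
         + FINAL_MODEL["Quality"] * quality
         + FINAL_MODEL["Prdct_Line"] * prdct_line
         + FINAL_MODEL["Flex_Price"] * flex_price)
    return 1 / (1 + math.exp(-t))


# Hypothetical customers rating all three factors equally.
for rating in (5, 8, 10):
    p = contract_probability(rating, rating, rating)
    print(f"ratings of {rating}: p = {p:.3f}")
```

A customer would typically be classified as likely to sign when the predicted probability exceeds a chosen cut-off such as 0.5; the source does not say which cut-off, if any, was used.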

Time series analysis and forecasting:

A time series is a sequence of numerical data points in successive order. In investing, a time series tracks the movement of the chosen data points, such as a security’s price, over a specified period of time, with data points recorded at regular intervals. There is no minimum or maximum amount of time that must be included, so the data can be gathered in whatever way provides the information sought by the investor or analyst examining the activity.

A time series consists of values taken by a variable over time (such as daily sales revenue, weekly orders, monthly overheads or yearly income), tabulated or plotted as chronologically ordered data points. To yield valid statistical inferences, these values must be measured repeatedly, often over a four to five year period. A time series includes four components: (1) seasonal variations that repeat over a particular period such as a day, week, month or season, (2) trend variations that move up or down in a reasonably predictable pattern, (3) cyclical variations that correspond with business or economic 'boom-bust' cycles or follow their own peculiar cycles, and (4) random variations that do not fall under any of the above three classifications.

In our dataset we have quarterly data for the years 2010 to 2019. We want to build a time series model for these data. We shall fit several types of time series model and pick the one that is optimal and produces reliable forecasts.

Choosing the correct regression model can be difficult, and trying to model with only a sample does not make it any easier. Here we review some common statistical methods for choosing models and the complications that can arise, and offer some practical advice for selecting the most effective regression model.

It starts when a researcher wants to describe mathematically the relationship between some predictors and the response variable. The research team typically measures many variables but includes only some of them in the model. The analysts try to eliminate the variables that are not related and include only those with a real relationship. Along the way, the analysts consider many possible models.

Assumptions involved in a Time series analysis:

Linear trend analysis forecasts by assuming a linear relationship between the dependent and independent variable; it assumes that this relationship has not changed over a long period of time. Polynomial regression is used when the data show large fluctuations. The exponential model is used when there is heavy emphasis on the time variable, and its slope indicates the rate at which changes have been recorded between the variables. To choose the value of the smoothing constant(s) objectively, you search for values that are best in some sense. The software looks for the values that minimize a measure of the combined forecast errors over the currently available series. Three ways of summarizing the amount of error in the forecasts are available: the mean squared error (MSE), the mean absolute error (MAE) and the mean absolute percent error (MAPE).
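The three error measures, and the idea of choosing a smoothing constant by minimizing one of them, can be sketched in a few lines. For brevity this example uses simple (single) exponential smoothing rather than the triple smoothing applied in this study, a coarse grid of candidate constants, and a hypothetical series, since the quarterly data are not reproduced here.

```python
def mse(actual, forecast):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percent error (actual values must be non-zero)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def one_step_forecasts(series, alpha):
    """Simple exponential smoothing: forecast each point from earlier data."""
    level = series[0]
    forecasts = []
    for value in series[1:]:
        forecasts.append(level)          # forecast made before seeing value
        level = alpha * value + (1 - alpha) * level
    return forecasts


def best_alpha(series, candidates):
    """Candidate smoothing constant with the smallest one-step MSE."""
    return min(candidates,
               key=lambda a: mse(series[1:], one_step_forecasts(series, a)))


# Hypothetical quarterly values, for illustration only.
series = [1120, 1135, 1150, 1142, 1170, 1185, 1200, 1192]
alphas = [i / 10 for i in range(1, 10)]
alpha = best_alpha(series, alphas)
fitted = one_step_forecasts(series, alpha)
print(alpha, round(mse(series[1:], fitted), 2),
      round(mae(series[1:], fitted), 2), round(mape(series[1:], fitted), 2))
```

Forecasting packages usually replace the grid with a numerical optimizer and, for triple smoothing, search over the level, trend and seasonal constants together.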

Let us have a look at the output of the various types of time series model:

Linear trend: y = 16.391x + 1111.4, R² = 0.5398
Polynomial trend: y = -0.2334x² + 25.727x + 1047.6, R² = 0.5509

Forecasts for the year 2020:

Quarters	Linear Trending	Polynomial Trending	Exponential Triple Smoothing	Moving Average
Q1	1767.04	2450.12	1689.482637	1699.40
Q2	1783.431	2494.7524	1705.578948	1714.72
Q3	1799.822	2539.8516	1721.675259	1656.55
Q4	1816.213	2585.4176	1737.77157	1688.71

Comparing the R² values, the polynomial trend (R² = 0.5509) improves only marginally on the linear trend (R² = 0.5398), so adding further terms does not change the fit significantly. On this basis exponential triple smoothing is taken as the preferred model; its forecasts are treated as the best available predictions, and we proceed with them.

Conclusion:

The variables have been explored as thoroughly as the data allow. The relationships between the variables, and the factors influencing them, have been checked. The variables that may contribute to interaction effects have also been evaluated and substantiated using appropriate models and analysis. Further, a logistic regression model has been obtained for the target variable, whether a contract was obtained by the organization, which is categorical in nature. The best possible logistic model was fitted after checking the basic model and eliminating the redundant variable. The second logistic model is taken as the best fit, as its predictions agree reasonably well with the observed data.

The company’s quarterly data were then analysed using time series forecasting. Four forecasting methods were applied in order to find the best fit: linear trend, polynomial trend, exponential triple smoothing and moving average. Exponential triple smoothing was chosen as the best available fit for the data, since the linear and polynomial trends have similar R² values (0.5398 and 0.5509) and the additional polynomial term adds little.
