  • Subject Name : Statistics

Case Study: Tass Paper Mill

Introduction:

TPM is a subsidiary of Pinnon Paper Industries, which has a long history in paper manufacturing. The firm caters to both local and international markets, and its financial performance has been phenomenal for the past two decades. However, the business environment has shifted with the increasing reach of online interactions and social media. The company is therefore looking to optimize its management practices and expand its customer base by conducting surveys to learn customers' viewpoints on the company's performance across many factors. An in-depth analysis of these parameters will produce new insights and paradigm shifts for the company, which in turn will help it capture new markets and reach new heights in business.

Overview of Data variables:

The data file consists of variables that provide information on brand status, distribution channel, quality, quantity produced and product line, along with customers' perceptions of various other parameters. These perception variables are rating-scale questions marked on a scale from 1 to 10. The variable "contract with TPM" is a binary variable taking values 1 and 0, and it is the ultimate response variable the company is interested in: we would like to know whether the number of 1s will eventually increase and bring in big business for the company.

Statistical Techniques used:

The various statistical techniques used in the data analysis are as follows:

  • Descriptive statistics

  • Data visualization using graphs and other appropriate tools

  • Univariate analysis and checking for interaction effects

  • Regression analysis and identification of the variables exhibiting the strongest relationships

  • Logistic regression

  • Time series analysis and fitting of best possible model for forecasting

Data Analysis

The order quantity denotes the production units measured in tonnes. On average, the supply per customer is around 7.7 tonnes, estimated with a margin of error of 6.3%. From the list of 200 potential customers, 101 have signed a contract with TPM, while the remaining 99 have not yet signed a contract with the organization.
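As an aside on how such a margin of error is computed: the minimal Python sketch below uses synthetic order quantities (the real data file is not reproduced here; only the sample size of 200 and the 7.7-tonne average come from the case study).

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic stand-in for the 200 recorded order quantities (tonnes),
# centred on the 7.7-tonne average mentioned in the text
qty = rng.normal(7.7, 3.5, 200)

mean = qty.mean()
# 95% margin of error for the mean: 1.96 * s / sqrt(n)
moe = 1.96 * qty.std(ddof=1) / np.sqrt(qty.size)
print(round(mean, 2), f"margin of error ~ {moe / mean:.1%} of the mean")
```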

The company has therefore engaged a market researcher to model the components that need immediate intervention, so that in due course the firm will pick up business and perform phenomenally.

Correlations and Regressions:

The choice between correlation analysis and regression analysis depends on the data set and the objective of the study. Correlation analysis quantifies the degree to which two variables are linearly related: it tells you how much one variable changes when the other one does.

Regression analysis is a related technique used to assess the relationship between an outcome variable and one or more risk factors or confounding variables. The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors or independent variables. Regression analysis is useful when you need to identify the impact of a unit change in a known variable (x) on an estimated variable (y).

Let us evaluate the relationship between the quantity ordered and the other variables. If a significant relationship exists, we can incorporate those variables into a model and base the study on that model.
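The pairwise correlations below were presumably produced in a spreadsheet; a minimal Python sketch of the same computation, using hypothetical ratings with the case study's column names (the real data file is not included here), might look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Hypothetical survey ratings; illustrative only
quality = rng.uniform(1, 10, n)
brand_image = 0.5 * quality + rng.normal(0, 2, n)
order_qty = 0.3 * quality + 0.3 * brand_image + rng.normal(0, 1, n)

df = pd.DataFrame({"Order_Qty": order_qty,
                   "Quality": quality,
                   "Brand_Image": brand_image})

# Pearson correlation of each predictor with Order_Qty
corr_with_qty = df.corr()["Order_Qty"].drop("Order_Qty")
print(corr_with_qty.round(3))
```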

 

Correlation of each variable with Order_Qty:

| Order_Qty vs. | Quality | SM_Presence | Advert | Prdct_Line |
|---|---|---|---|---|
| Correlation | 0.4334 | 0.2352 | 0.2370 | 0.4623 |

| Order_Qty vs. | Brand_Image | Comp_Pricing | Order_Fulfillment | Flex_Price | Shipping_Speed | Shipping_Cost |
|---|---|---|---|---|---|---|
| Correlation | 0.3380 | -0.2177 | 0.3146 | -0.0028 | 0.4251 | 0.5044 |

From this correlation matrix we observe that Quality, Prdct_Line, Brand_Image, Shipping_Speed and Shipping_Cost show a fairly strong relationship with Order_Qty.

Regression Model to estimate the Order Quantity:

Taking this further, let us build a model in which the Order_Qty response variable depends on independent variables such as Quality, Prdct_Line, Brand_Image, Shipping_Speed and Shipping_Cost.

However, there is a strong intuition that the quantity ordered is greatly influenced by Brand_Image and the Quality of the product. To substantiate this and provide evidence from the data, let us carry out a univariate analysis and check whether these variables are directly related to the response individually, or whether there is a combined effect of the two variables taken together.

A regression analysis with the necessary explanatory variables and Order_Qty as the response variable was carried out.

Summary Output

Regression Statistics

| Multiple R | 0.6925 |
|---|---|
| R Square | 0.4795 |
| Adjusted R Square | 0.4520 |
| Standard Error | 0.6612 |
| Observations | 200 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 10 | 76.1405 | 7.6141 | 17.4147 | 3.22E-22 |
| Residual | 189 | 82.6345 | 0.4372 | | |
| Total | 199 | 158.775 | | | |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 2.6183 | 0.8739 | 2.9963 | 0.0031 |
| Quality | 0.2676 | 0.0439 | 6.1016 | 0.0000 |
| SM_Presence | -0.1530 | 0.1006 | -1.5220 | 0.1297 |
| Advert | -0.0291 | 0.0547 | -0.5323 | 0.5951 |
| Prdct_Line | 0.2341 | 0.2166 | 1.0808 | 0.2812 |
| Brand_Image | 0.3445 | 0.0793 | 4.3422 | 0.0000 |
| Comp_Pricing | -0.0425 | 0.0377 | -1.1276 | 0.2609 |
| Order_Fulfillment | -0.1484 | 0.0832 | -1.7827 | 0.0762 |
| Flex_Price | 0.2529 | 0.2244 | 1.1268 | 0.2612 |
| Shipping_Speed | -0.2814 | 0.4301 | -0.6543 | 0.5137 |
| Shipping_Cost | 0.2469 | 0.0761 | 3.2432 | 0.0014 |

 

This table implies that Quality, Brand_Image and Shipping_Cost are significant; none of the other variables contribute significantly to Order_Qty.

Hence let us take Order_Qty as the response variable, significantly influenced by Quality and Brand_Image, as our basic model under consideration.

The basic regression model equation is:

Order_Qty (Y) = β0 + β1*(Quality) + β2*(Brand_Image) + ε
              = 2.618 + 0.267*Quality + 0.345*Brand_Image + ε

Under this model we assume that Quality and Brand_Image act as independent variables. However, there is a possibility that they are related, and that their combined (interaction) effect matters as well. We therefore construct a new model with a third term for the interaction between these two explanatory variables.
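A minimal sketch of fitting such an interaction model with NumPy's least squares (a spreadsheet or a statistics package would produce the full summary tables). The data are synthetic, and the generating coefficients merely echo the fitted values reported in this case study:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
quality = rng.uniform(1, 10, n)
brand = rng.uniform(1, 10, n)
# Synthetic response built with an interaction term (illustrative only)
y = 0.5 + 0.69 * quality + 0.86 * brand - 0.07 * quality * brand \
    + rng.normal(0, 0.2, n)

# Design matrix: intercept, Quality, Brand_Image and their product
X = np.column_stack([np.ones(n), quality, brand, quality * brand])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, b in zip(["Intercept", "Quality", "Brand_Image",
                    "Quality*Brand_Image"], beta):
    print(f"{name}: {b:.3f}")
```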

A univariate analysis taking this interaction into account reveals the following details.

Summary Output

Regression Statistics

| Multiple R | 0.5958 |
|---|---|
| R Square | 0.3549 |
| Adjusted R Square | 0.3451 |
| Standard Error | 0.7229 |
| Observations | 200 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 56.3534 | 18.7845 | 35.9470 | 1.48E-18 |
| Residual | 196 | 102.4216 | 0.5226 | | |
| Total | 199 | 158.775 | | | |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 0.5011 | 1.5368 | 0.3261 | 0.7447 |
| Quality | 0.6911 | 0.1871 | 3.6934 | 0.0003 |
| Brand_Image | 0.8643 | 0.2695 | 3.2076 | 0.0016 |
| Quality*Brand_Image | -0.0686 | 0.0329 | -2.0816 | 0.0387 |

 

Model 1:

Order_Qty (Y) = β0 + β1*(Quality) + β2*(Brand_Image) + β3*(Quality*Brand_Image) + ε
              = 0.5011 + 0.6911*(Quality) + 0.8643*(Brand_Image) - 0.0686*(Quality*Brand_Image) + ε

We observe that the p-value for the interaction term Quality*Brand_Image is below 0.05, hence the interaction is significant.

Model 2:

Another model estimating the quantity from Quality, Advert and the interaction between these variables was fitted. The estimates indicate that the interaction between Quality and Advert does not contribute significantly at the 5% level (p = 0.0519), so this specification adds little over the main effects.

The regression model equation is:

Order_Qty (Y) = β0 + β1*(Quality) + β2*(Advert) + β3*(Quality*Advert) + ε
              = 2.2467 + 0.5766*Quality + 0.7477*Advert - 0.0679*(Quality*Advert) + ε

 

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 2.2467 | 1.2409 | 1.8106 | 0.0717 |
| Quality | 0.5766 | 0.1523 | 3.7872 | 0.0002 |
| Advert | 0.7477 | 0.2823 | 2.6485 | 0.0087 |
| Quality*Advert | -0.0679 | 0.0347 | -1.9561 | 0.0519 |


Model 3:

Another model estimating the quantity from Quality, Prdct_Line and the interaction between these variables was fitted. The estimates indicate that Quality and the interaction between Quality and Prdct_Line do not contribute significantly: the p-values in the summary are greater than 0.05, indicating insignificant terms altogether.

 

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 4.0192 | 1.7547 | 2.2906 | 0.0231 |
| Prdct_Line | 0.3978 | 0.3091 | 1.2873 | 0.1995 |
| Quality | 0.3004 | 0.2255 | 1.3317 | 0.1845 |
| Quality*Prdct_Line | -0.0222 | 0.0383 | -0.5786 | 0.5635 |

The regression model equation is:

Order_Qty (Y) = β0 + β1*(Quality) + β2*(Prdct_line) + β3*(Quality*Prdct_line) + ε
              = 4.0192 + 0.3004*(Quality) + 0.3978*(Prdct_line) - 0.0222*(Quality*Prdct_line) + ε

Hence the revised final regression model equation is that of Model 1:

Order_Qty (Y) = β0 + β1*(Quality) + β2*(Brand_Image) + β3*(Quality*Brand_Image) + ε
              = 0.5011 + 0.6911*Quality + 0.8643*Brand_Image - 0.0686*(Quality*Brand_Image) + ε

Logistic Regression:

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

We conducted a logistic regression analysis in which the first model takes Quality, Prdct_Line, Brand_Image and Flex_Price as the explanatory variables. We observe that the p-value for Brand_Image is insignificant and hence treat that variable as redundant in the model.
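A logistic fit of this kind can be reproduced in a few lines. The sketch below implements the standard Newton-Raphson (iteratively reweighted least squares) algorithm in plain NumPy on synthetic data, since the real survey file is not included; the predictor names and coefficient magnitudes are only illustrative.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Logistic regression via iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current probabilities
        W = p * (1 - p)                       # IRLS weights
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(2)
n = 1000
quality = rng.uniform(1, 10, n)
flex_price = rng.uniform(1, 10, n)
# Simulate a binary "contract with TPM" outcome from a known linear predictor
lin = -3.0 + 0.30 * quality + 0.25 * flex_price
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-lin))).astype(float)

X = np.column_stack([np.ones(n), quality, flex_price])
beta = fit_logit(X, y)
print(np.round(beta, 3))  # [intercept, Quality, Flex_Price]
```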

Logistic Model:

Probability of the event: model equation

t = -8.9743 + 0.3186 Quality + 0.2915 Prdct_Line + 0.1812 Brand_Image + 0.2274 Flex_Price

 

| | Coeff | SE | z-stat | lower z0.025 | upper z0.975 | p-value |
|---|---|---|---|---|---|---|
| b0 | -8.9743 | 1.2686 | -7.0741 | -11.4607 | -6.4878 | 0.0000 |
| Quality | 0.3186 | 0.1003 | 3.1754 | 0.1219 | 0.5153 | 0.0015 |
| Prdct_Line | 0.2915 | 0.1032 | 2.8248 | 0.0893 | 0.4937 | 0.0047 |
| Brand_Image | 0.1812 | 0.0957 | 1.8929 | -0.0064 | 0.3687 | 0.0584 |
| Flex_Price | 0.2274 | 0.1107 | 2.0539 | 0.0104 | 0.4443 | 0.0400 |


b0 = -8.9743 implies that the baseline odds of a contract being signed with TPM, with the other factors outside this model held at zero, are very low: approximately exp(-8.9743) = 0.00013. The odds of the firm being selected are multiplied by exp(0.3186) = 1.375 for each unit increase in Quality. Similarly, Brand_Image and Flex_Price have positive coefficients and contribute their share towards the firm being selected among the target individuals. Every variable in the model has a positive coefficient, so higher ratings correspond to a higher probability of the company being selected by its potential customers.

When sample sizes are large, it is much easier to end up accepting more complex models (or at least more difficult to reject them), because chi-square test statistics are designed to detect any departure between a model and the observed data. Adding extra terms to a model will always improve the fit, but with a large sample it becomes harder to distinguish a real improvement in fit from a substantively trivial one. Likelihood-ratio tests therefore frequently lead to the rejection of parsimonious but adequate models, and the final models become less parsimonious than they need to be.

The higher the pseudo-R² value, the better the fit of the data. For McFadden's pseudo-R², a value close to 1 indicates the best possible fit. In our case the value lies between 0.55 and 0.60, hence the model fits moderately well. Note that, unlike the ordinary R², McFadden's measure does not automatically climb towards 1 as explanatory variables are added, and adjusted versions can even fall when weak variables are included.
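McFadden's pseudo-R² is simply one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. A small self-contained illustration (the outcomes and probabilities below are made up, not taken from the TPM fit):

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood; clipping avoids log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy outcomes and fitted probabilities (illustrative only)
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1], dtype=float)
p_model = np.array([0.2, 0.1, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2, 0.8, 0.7])
p_null = np.full_like(y, y.mean())   # intercept-only model predicts the mean

r2_mcfadden = 1 - log_likelihood(y, p_model) / log_likelihood(y, p_null)
print(round(r2_mcfadden, 3))  # 0.601
```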

The model was then refined by dropping the insignificant Brand_Image variable.

Probability of the event: model equation

t = -8.4575 + 0.3218 Quality + 0.3254 Prdct_Line + 0.2795 Flex_Price

For this model, McFadden's R² (R²McF) equals 0.04739. McFadden's measure is scaled differently from the ordinary R² and typically takes much smaller values; even so, a value this small indicates that the model captures only a modest share of the variation in whether TPM is selected or rejected, so its forecasts should be read as broad tendencies of the real-time scenario rather than precise probabilities.

Hence this is the final logistic Model recommended to be used for prediction analysis.

Summary table:

| | Coeff | SE | z-stat | lower z0.025 | upper z0.975 | p-value |
|---|---|---|---|---|---|---|
| b0 | -8.4575 | 1.2442 | -6.7977 | -10.8960 | -6.0190 | 0.0000 |
| Quality | 0.3218 | 0.1001 | 3.2129 | 0.1255 | 0.5180 | 0.0013 |
| Prdct_Line | 0.3254 | 0.1003 | 3.2442 | 0.1288 | 0.5219 | 0.0012 |
| Flex_Price | 0.2795 | 0.1077 | 2.5948 | 0.0684 | 0.4906 | 0.0095 |

Time series analysis and forecasting:

A time series is a sequence of numerical data points in successive order. In investing, a time series tracks the movement of chosen data points, such as a security's price, over a specified period of time, with data points recorded at regular intervals. There is no minimum or maximum amount of time that must be included, allowing the data to be gathered in a way that provides the information sought by the investor or analyst examining the activity.

A time series consists of values taken by a variable over time (such as daily sales revenue, weekly orders, monthly overheads, yearly income), tabulated or plotted as chronologically ordered numbers or data points. To yield valid statistical inferences, these values are often measured repeatedly over a four-to-five-year period. Time series include four components: (1) seasonal variations that repeat over a particular period such as a day, week, month or season; (2) trend variations that move up or down in a reasonably predictable pattern; (3) cyclical variations that correspond with business or economic 'boom-bust' cycles or follow their own peculiar cycles; and (4) random variations that do not fit any of the above three classifications.

Our dataset contains quarterly data for the years 2010 to 2019. We will build time series models for these data, try several varieties of model, and pick the one that is optimal and produces reliable forecasts.

Choosing the right regression model can be difficult, and having only a sample does not make it any easier. Below we review some common statistical methods for choosing models, complications you may face, and some practical advice for selecting the most effective regression model.

Model selection starts when a researcher wants to mathematically describe the relationship between some predictors and the response variable. The research team typically measures many variables but includes only some of them in the model. The analysts attempt to eliminate the variables that are not related and include only those with a real relationship, considering many possible models along the way.

Assumptions involved in a Time series analysis:

The linear trend model forecasts by assuming a linear relationship between the dependent variable and time, under the assumption that this relationship has not changed over a long period. Polynomial regression is used when the data show large fluctuations. The exponential model is used when strong emphasis is laid on the time variable; its slope indicates the rate at which changes have been recorded between the variables. To choose the value of the smoothing constant(s) objectively, you search for values that are best in some sense: the procedure looks for the values that minimize a measure of the combined forecast errors over the currently available series. Three ways of summarizing the amount of error in the forecasts are available: the mean square error (MSE), the mean absolute error (MAE), and the mean absolute percent error (MAPE).
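These three error measures are straightforward to compute; a short sketch with illustrative quarterly numbers (not the TPM series):

```python
import numpy as np

def forecast_errors(actual, forecast):
    """Return MSE, MAE and MAPE, the criteria used to compare models."""
    actual = np.asarray(actual, dtype=float)
    e = actual - np.asarray(forecast, dtype=float)
    mse = np.mean(e ** 2)
    mae = np.mean(np.abs(e))
    mape = np.mean(np.abs(e) / np.abs(actual)) * 100
    return mse, mae, mape

# Illustrative actual vs. fitted quarterly values
actual = [1600, 1650, 1700, 1750]
fitted = [1610, 1640, 1710, 1745]
mse, mae, mape = forecast_errors(actual, fitted)
print(round(mse, 2), round(mae, 2), round(mape, 3))  # 81.25 8.75 0.526
```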

Let us have a look at the output of the various types of time series models.

Linear trend: y = 16.391x + 1111.4, R² = 0.5398

| Year 2020 | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Linear trend forecast | 1767.04 | 1783.43 | 1799.82 | 1816.21 |

Polynomial trend: y = -0.2334x² + 25.727x + 1047.6, R² = 0.5509

| Year 2020 | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Polynomial trend forecast | 2450.12 | 2494.75 | 2539.85 | 2585.42 |

| Year 2020 | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Exponential triple smoothing | 1689.48 | 1705.58 | 1721.68 | 1737.77 |
| Moving average | 1699.40 | 1714.72 | 1656.55 | 1688.71 |

Looking at the R² values, the exponential smoothing model appears ideal: adding further terms does not change the adjusted R² significantly. Hence the forecasts made by this model can be treated as the best available predictions, and we shall proceed with the forecasts it generates.
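For completeness, the linear-trend forecasts above can be reproduced mechanically: fit y = a*t + b over the 40 historical quarters and extrapolate. The series below is synthetic noise around the reported trend line, so the fitted slope and intercept only approximate the reported 16.391 and 1111.4.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(1, 41)                      # 40 quarters: 2010 Q1 .. 2019 Q4
# Synthetic stand-in for the quarterly series: reported trend plus noise
sales = 1111.4 + 16.391 * t + rng.normal(0, 60, t.size)

# Fit the linear trend and extrapolate the four quarters of 2020
a, b = np.polyfit(t, sales, 1)
future = np.arange(41, 45)
forecast = a * future + b
print(np.round(forecast, 1))
```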

Conclusion: 

The variables have been explored to the best of our knowledge. Thorough checks of the relationships between the variables, along with their influencing factors, have been carried out. The variables possibly contributing to an interaction have been evaluated, and the findings substantiated using appropriate models and analysis. Further, a logistic regression model has been obtained for the target variable, the number of contracts obtained by the organization, which is categorical in nature. The best possible logistic model was fitted after checking the basic model and eliminating the redundant variable. The second logistic model obtained is the preferred fit, as its predictions satisfy the realistic data to a considerable extent.

Later, the company's quarterly data were accessed and processed using time series forecasting. Three types of forecast were produced in order to find the best fit. Triple exponential smoothing was concluded to be the best possible fit for the data, as it keeps the forecast errors in check without inflating the model; the other models did not offer a comparable balance of fit and parsimony, making them unfit as a selection choice.
