Data Science Practice

Table of Contents

1.0 Advancement of Data Science

2.0 Spark in Data Science

3.0 Machine Learning Implementation

3.1 Dataset

3.2 Collaborative filtering

3.3 Logistic Regression

References

1.0 Advancement of Data Science

Data Science is among the fastest emerging and developing domains today. Statistical methods and number crunching have been known for ages, but the exponentially growing unstructured data around us is managed by data science, which has emerged as a solution to many real-life problems [Aalst, 2016].

Recent advances include the potential to change the way we interact with our surroundings. One example is the understanding, generation, and processing of natural language [Provost and Fawcett, 2013]. NLP algorithms are improving day by day, and our digital ecosystem is becoming capable of understanding our needs in an unsupervised way. A few advanced uses of NLP that are reshaping traditional approaches are:

  • Personal voice-based assistants used in cell phones, car cruise control, and home appliances
  • IVR-based routing of customer calls by bots to a specific team, and pre-emptive topic modelling of previous calls and complaints by the same customer
  • Skill-based categorization of résumés to assist recruiters
  • Sentiment analysis of customer remarks to help customer support teams
  • Use of social media analytics to identify wilful defaulters
  • Chat-based assistance to check the status of bookings, the running status of trains, etc.

Although the demand for data scientists is great, anyone who wants to explore and work in this field has to keep learning its many dimensions continuously.

In the past two years, data analysis has become a major part of business intelligence. It includes analysis of future business growth, customer requirements, and much more. The process consists of the following steps:

  • Defining Objectives: A business study begins with a clear understanding of the objectives. Most decisions depend on properly clarifying the objectives of the business, and a well-stated objective always helps developers build the correct analytic tools.
  • Posing Questions: Posing the question is very important for constructing the dataset. The right question helps to generate more accurate training data.
  • Data Collection: The data relevant to the posed question should be collected from the individual sources. The source may be data from a reputed laboratory, patient details from a hospital, sales-related data from a business, student data from academic institutions, or disease data from different pathology labs. When a survey is used to collect the data, a questionnaire needs to be placed with the different sources, and its questions must be properly modelled for the statistical analysis being used.
  • Data Wrangling: Raw data can arrive in various formats, so the collected data must be cleaned and converted so that the data analysis algorithm can import it. For example, DMV accident reports may be received as text files, insurance claims may sit in a relational database, and patient details of a hospital may be exposed as an API. The data analyst must aggregate these different forms and convert them into a form suitable for the analysis tool (see the sketch after this list).
  • Data Analysis: In this step the aggregated and cleaned data is imported into the data analysis tool. These tools help to explore the data, find specific patterns within it, and classify it properly. Statistical methods are then used to take a particular decision based on the collected data.
  • Drawing Conclusions and Making Predictions: In this step, conclusions may be drawn and predictions made from the data after detailed analysis. These predictions and conclusions may then be reported in summarized form to the end users.
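As a minimal illustration of the wrangling step, the hedged PySpark sketch below reads two hypothetical sources and converts them into DataFrames that an analysis tool can consume. The file names, delimiters, and column names are assumptions for illustration only, not part of the assignment data.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, trim

spark = SparkSession.builder.appName("WranglingSketch").getOrCreate()

# Hypothetical source 1: pipe-delimited accident reports, e.g. "id|date|severity"
raw_reports = spark.read.text("accident_reports.txt")
fields = split(col("value"), r"\|")
reports = raw_reports.select(
    fields.getItem(0).alias("record_id"),
    fields.getItem(1).alias("report_date"),
    trim(fields.getItem(2)).alias("severity"),
).dropna()

# Hypothetical source 2: a CSV export of insurance claims with a header row
claims = spark.read.csv("claims.csv", header=True, inferSchema=True)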

2.0 Spark in Data Science

a. Apache Spark [Zaharia et al, 2016]

The creators of Spark founded the company Databricks, which summarizes this functionality in its introductory eBook on Apache Spark. Apache Spark is an example of a unified computing engine: it includes a set of libraries for parallel data processing on computer clusters. Spark is an open-source engine for this task and has become a de facto tool for data scientists and Big Data developers. Programming languages such as Scala, Java, R, and Python are supported by Spark.

b. PySpark [Lovric et al, 2019]

The three main strengths for which Apache Spark is considered a leader are described below. These motivate most big companies to adopt Apache Spark into their stacks for working with large amounts of unstructured data.

1. Spark includes most of the functionality needed to work with Big Data, covering tasks such as data loading, learning from data, and streaming computation, which makes it suitable for real-world data analytics end to end.

2. Spark optimizes its core engine for computational efficiency. Spark can be used with an extensive variety of storage systems, including Amazon S3, Azure Storage, Apache Hadoop, Apache Cassandra, and Apache Kafka.

3. Spark’s libraries provide a variety of functionalities, and its standard libraries are among the most widely used parts of the open-source project. While the core engine has changed little since Spark was first released, the libraries have grown to provide ever more types of functionality. Spark includes libraries for machine learning (MLlib), SQL queries (Spark SQL), and graph analytics (GraphX).
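As a minimal sketch of this unified design, the snippet below shows one SparkSession driving both DataFrame/SQL work and an MLlib feature transformer; the application name and the in-memory data are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

# One SparkSession is the single entry point to SQL, DataFrames, and MLlib
spark = SparkSession.builder.appName("UnifiedEngineSketch").getOrCreate()

# Illustrative in-memory data; in practice this could come from S3, HDFS, etc.
df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "score"])
df.createOrReplaceTempView("scores")
spark.sql("SELECT AVG(score) FROM scores").show()   # Spark SQL

# MLlib feature stage running in the same session
assembler = VectorAssembler(inputCols=["score"], outputCol="features")
assembler.transform(df).show()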

3.0 Machine Learning Implementation

3.1 Dataset

Two datasets are used:

1. Titanic Dataset link: https://www.kaggle.com/c/titanic/data

For the machine learning model, the training set should be used. The outcome, or “ground truth”, is provided for each passenger in the training set. It is appropriate for a model based on features such as passenger class and gender, and feature engineering can also be used to create new features (see the sketch below).
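A minimal hedged loading sketch follows; the local path "train.csv" is an assumption, while the Name, Pclass, Sex, and Survived columns are part of the Kaggle Titanic data. It also shows one feature-engineering example: extracting the honorific from each passenger's name.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.appName("TitanicSketch").getOrCreate()

# Assumed local path to the Kaggle training file
titanic = spark.read.csv("train.csv", header=True, inferSchema=True)

# Feature engineering example: extract the honorific ("Mr", "Mrs", ...)
# from the Name column into a new "Title" feature
titanic = titanic.withColumn("Title", regexp_extract(col("Name"), r" ([A-Za-z]+)\.", 1))
titanic.select("Pclass", "Sex", "Title", "Survived").show(5)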

2. MovieLens Dataset link: https://grouplens.org/datasets/movielens/1m/

This dataset contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users. The file "ratings.dat" stores all the ratings.
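A minimal hedged loading sketch is shown below. The local path is an assumption; in the 1M release each line has the "::"-separated form UserID::MovieID::Rating::Timestamp, so the file is read as text and split, since Spark's CSV reader expects a single-character delimiter.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("MovieLensSketch").getOrCreate()

# ratings.dat lines look like "1::1193::5::978300760"
lines = spark.read.text("ml-1m/ratings.dat")
parts = split(col("value"), "::")
ratings = lines.select(
    parts.getItem(0).cast("int").alias("userId"),
    parts.getItem(1).cast("int").alias("movieId"),
    parts.getItem(2).cast("float").alias("rating"),
    parts.getItem(3).cast("long").alias("timestamp"),
)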

For data analysis, some PySpark modules are used. The modules are:

from pyspark.sql import SparkSession

from pyspark.ml import Pipeline

from pyspark.sql.functions import mean, col, split, regexp_extract, when, lit

from pyspark.ml.feature import StringIndexer

from pyspark.ml.feature import VectorAssembler

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.ml.feature import QuantileDiscretizer

All algorithms are called from the modules above; these make the data analysis part easier.
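To show how these modules fit together, here is a hedged sketch that indexes the categorical Sex column, assembles a feature vector, and splits the data. It continues from the Titanic loading sketch above, and the 80/20 split ratio and seed are illustrative assumptions.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Drop rows with missing values in the columns we use
# (VectorAssembler cannot handle nulls)
titanic = titanic.dropna(subset=["Pclass", "Sex", "Survived"])

# Index the categorical Sex column, then assemble the numeric inputs into
# the single "features" vector column that Spark ML estimators expect
sex_indexer = StringIndexer(inputCol="Sex", outputCol="SexIndex")
assembler = VectorAssembler(inputCols=["Pclass", "SexIndex"], outputCol="features")

pipeline = Pipeline(stages=[sex_indexer, assembler])
prepared = pipeline.fit(titanic).transform(titanic)

# Assumed 80/20 train/test split
trainingData, testData = prepared.randomSplit([0.8, 0.2], seed=42)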

3.2 Collaborative Filtering

The features of collaborative filtering are:

1. It is the predictive process behind recommendation engines, used to analyse information about users with similar features.

2. It uses algorithms to filter data from user reviews in order to personalize recommendations for users with similar likings.

3. It is also used to personalize advertisements on social media for each individual.

The key imports are:

from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml.recommendation import ALS

from pyspark.sql import Row
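Building on the MovieLens loading sketch in Section 3.1 (which produced the ratings DataFrame) and the imports above, a minimal hedged ALS example follows. The hyperparameters and the 80/20 split are illustrative assumptions, and coldStartStrategy="drop" discards test users or movies unseen in training so that the RMSE is well defined.

# Split the MovieLens ratings (from the loading sketch in Section 3.1)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# Assumed hyperparameters for the ALS matrix-factorization model
als = ALS(maxIter=10, regParam=0.1, userCol="userId", itemCol="movieId",
          ratingCol="rating", coldStartStrategy="drop")
model = als.fit(train)

# Predict held-out ratings and measure root-mean-square error
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("RMSE:", evaluator.evaluate(predictions))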

3.3 Logistic Regression

The Logistic Regression algorithm [Bartlett et al, 2020] is used for classification. The features of logistic regression are:

  • Cases are independent.
  • The independent and dependent variables need not have a linear relationship.
  • It is not required to satisfy homogeneity of variance.
  • Errors need not be normally distributed, but they must be independent.

The key code is:

from pyspark.ml.classification import LogisticRegression

# Build the estimator: "Survived" is the label, "features" the assembled vector
lr = LogisticRegression(labelCol="Survived", featuresCol="features")

# Train on the training split (trainingData and testData come from the
# pipeline sketch in Section 3.1)
lrModel = lr.fit(trainingData)

# Score the held-out test split and inspect a few predictions
lr_prediction = lrModel.transform(testData)
lr_prediction.select("prediction", "Survived", "features").show()

# Measure classification accuracy on the test predictions
evaluator = MulticlassClassificationEvaluator(labelCol="Survived", predictionCol="prediction", metricName="accuracy")
print("Accuracy:", evaluator.evaluate(lr_prediction))

References

Foster Provost, Tom Fawcett, “Data Science and its Relationship to Big Data and Data-Driven Decision Making”, Big Data, vol. 1, no. 1, 2013.

Mario Lovric, Jose Manuel Molero, Roman Kern, “PySpark and RDKit: Moving towards Big Data in Cheminformatics”, Molecular Informatics, vol. 38, no. 6, 2019.

Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, “Apache Spark: a unified engine for big data processing”, Communications of the ACM, vol. 59, no. 11, 2016.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler, “Benign overfitting in linear regression”, Proceedings of the National Academy of Sciences of the United States of America, doi:10.1073/pnas.1907378117, 2020.

Wil van der Aalst, “Data Science in Action”, in Process Mining, pp. 3-23, Springer, ISBN 978-3-662-49851-4, 2016.
