Table of Contents
1.0 Advancement of Data Science
2.0 Spark in Data Science
3.0 Machine Learning Implementation
3.2 Collaborative filtering
3.3 Logistic Regression
Data science is among the fastest-growing fields today. Statistical methods and number crunching have been known for ages. Data science manages the exponentially growing unstructured data around us and has emerged as a solution to many real-life problems [Aalst, 2016].
Recent advances have the potential to change the way we interact with our surroundings. One example is the understanding, generation and processing of natural language [Provost and Fawcett, 2013]. NLP algorithms are improving day by day, and our digital ecosystem is becoming capable of understanding our needs in an unsupervised way. A few of the advanced uses of NLP that are reforming our traditional approach are:
Although the demand for data scientists is great, anyone who wants to explore and work in this field has to keep learning continuously across its many dimensions.
In the past two years, data analysis has become a major part of business intelligence. It includes analysis of future business growth, customer requirement analysis and much more. The process consists of the following steps:
The creators of Spark, who founded the company Databricks, have summarized its functionality in their introduction to Apache Spark eBook. Apache Spark is a unified computing engine with a set of libraries for parallel data processing on computer clusters. It is open source and has become the de facto tool for data scientists and Big Data developers. Spark supports programming languages such as Java, R, and Python.
The three main characteristics that make Apache Spark a leader are described below. They motivate most big companies to adopt Apache Spark into their stack to work with large amounts of unstructured data.
1. Spark includes most of the functionality needed to work with Big Data. It covers tasks such as data loading, learning from data, and streaming computation, all of which occur in real-world data analytics.
2. Spark optimizes its core engine for computational efficiency. It can be used with a wide variety of storage systems, including Amazon S3, Azure Storage, Apache Hadoop, Apache Cassandra, and Apache Kafka.
3. Spark’s libraries provide a variety of functionality. The standard libraries are the most widely used part of the open source project. Since its first release, the Spark core engine has changed little, whereas the libraries have grown to provide more and more types of functionality. Spark includes libraries for structured data (Spark SQL), machine learning (MLlib) and graph analytics (GraphX).
Two datasets are used-
1. Titanic Dataset link - https://www.kaggle.com/c/titanic/data
The training set should be used to build the machine learning model. The outcome, or “ground truth”, is provided for each passenger in the training set. The model can be based on features such as the passenger's class and gender. Feature engineering can also be used to create new features.
2. MovieLens Dataset link - https://grouplens.org/datasets/movielens/1m/
This dataset contains 1,000,209 anonymous ratings of 3,900 movies made by 6,040 MovieLens users. The file "ratings.csv" stores all the ratings.
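As a rough sketch of how such a ratings file might be read and summarized, the plain Python below parses a few rows and averages the ratings per movie. The column names (userId, movieId, rating, timestamp), the comma-separated layout and the sample rows are assumptions made for illustration; the actual file may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the header and column order are assumptions.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,5,1000\n"
    "1,20,3,1001\n"
    "2,10,4,1002\n"
)

def load_ratings(handle):
    """Read (user, movie, rating) tuples from a CSV file handle."""
    reader = csv.DictReader(handle)
    return [(int(r["userId"]), int(r["movieId"]), float(r["rating"]))
            for r in reader]

def mean_rating_per_movie(ratings):
    """Average the ratings each movie received."""
    totals = defaultdict(lambda: [0.0, 0])  # movie -> [sum, count]
    for _, movie, rating in ratings:
        totals[movie][0] += rating
        totals[movie][1] += 1
    return {movie: s / n for movie, (s, n) in totals.items()}

ratings = load_ratings(sample)
print(len(ratings), mean_rating_per_movie(ratings))
```

For the real file, the same functions could be called on an opened file handle instead of the in-memory sample.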
Several PySpark modules are used for the data analysis. The modules are-
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.functions import mean, col, split, regexp_extract, when, lit
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import QuantileDiscretizer
All the algorithms are called from the modules above, which makes the data analysis easier.
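To illustrate the kind of feature engineering mentioned for the Titanic data, the sketch below derives a title feature from a passenger name with a regular expression. It uses Python's standard re module as a stand-in for Spark's regexp_extract, and it assumes names follow the "Surname, Title. Given names" layout; the names shown are invented.

```python
import re

# Assumed pattern: the title is the first run of letters followed by a period.
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")

def extract_title(name):
    """Return the title found in a name, or an empty string if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

# Invented example names, not rows from the dataset.
names = ["Doe, Mr. John", "Roe, Miss. Jane", "Unknown"]
titles = [extract_title(n) for n in names]
print(titles)
```

Returning an empty string when nothing matches mirrors how regexp_extract behaves, so a later step can treat the empty title as a missing value.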
The features of Collaborative filtering are-
1. It is the predictive process behind recommendation engines, which analyse information about users with similar features.
2. It uses algorithms to filter data from user reviews in order to personalize recommendations for users with similar preferences.
3. It is also used to target social media advertising at individual users.
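The idea of recommending from users with similar likings can be sketched in plain Python. The example below is a simple user-based neighbourhood method (cosine similarity plus a similarity-weighted average), not the ALS matrix factorization that Spark provides, and the users, movies and ratings are invented for illustration.

```python
import math

# Invented ratings: user -> {movie: rating on a 1-5 scale}
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m4": 4},
    "carol": {"m1": 1, "m3": 5, "m4": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def predict_rating(user, movie, ratings):
    """Similarity-weighted average of other users' ratings for the movie."""
    num = 0.0
    den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        num += sim * their_ratings[movie]
        den += sim
    return num / den if den else None

predicted = predict_rating("alice", "m4", ratings)
print(predicted)
```

Because alice's ratings resemble bob's more than carol's, the prediction for the unseen movie leans towards bob's rating.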
The key parameters are-
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
The logistic regression algorithm [Bartlett et al., 2020] is used for classification. The features of logistic regression are-
The key parameters are-
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="Survived", featuresCol="features")
lrModel = lr.fit(trainingData)
lr_prediction = lrModel.transform(testData)
lr_prediction.select("prediction", "Survived", "features").show()
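evaluator = MulticlassClassificationEvaluator(labelCol="Survived", predictionCol="prediction", metricName="accuracy")
lr_accuracy = evaluator.evaluate(lr_prediction)
print("Logistic regression accuracy:", lr_accuracy)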
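As a rough illustration of what a fitted logistic regression model computes, the sketch below passes a weighted sum of the features through the sigmoid function, thresholds the probability at 0.5, and measures accuracy as the share of correct predictions. The weights, bias, feature rows and labels are hypothetical values chosen for the example, not results from the Titanic data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature row."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(weights, bias, features, threshold=0.5):
    """Class label: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical model for two features: [is_female, passenger_class]
weights = [2.5, -1.0]
bias = 0.75
rows = [[1, 1], [0, 3], [1, 3], [0, 1]]
labels = [1, 0, 0, 0]  # made-up "Survived" values

predictions = [predict(weights, bias, row) for row in rows]
print(predictions, accuracy(predictions, labels))
```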
Foster Provost, Tom Fawcett, “Data Science and its Relationship to Big Data and Data-Driven Decision Making”, in Big Data, vol. 1, no. 1, 2013.
Mario Lovric, Jose Manuel Molero, Roman Kern, “PySpark and RDKit: Moving towards Big Data in Cheminformatics”, in Molecular Informatics, vol. 38, no. 6, 2019.
Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, “Apache Spark: A Unified Engine for Big Data Processing”, in Communications of the ACM, vol. 59, no. 11, 2016.
Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler, “Benign Overfitting in Linear Regression”, in Proceedings of the National Academy of Sciences of the United States of America, doi.org/10.1073/pnas.1907378117, 2020.
Wil van der Aalst, “Data Science in Action”, in Process Mining, pp. 3-23, ISBN 978-3-662-49851-4, 2016.