Introduction to Data Science - Task A

  1. Decompress the file. How big is it (bytes)?

Decompression of the data and its size.

unzip -Zt FacebookNews.zip

Size = 28382357 Bytes

There are 10 files in total in the zip file, and the uncompressed size is 93607150 bytes, which is approximately 93.6 MB. The command above reports these totals for the archive.
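As a quick check on the unit conversion, the byte count can be converted with awk. This is a minimal sketch; the helper name bytes_to_mb is hypothetical and not part of the original answer.

```shell
# Convert a byte count to decimal megabytes (MB) and binary mebibytes (MiB).
bytes_to_mb() {
  awk -v b="$1" 'BEGIN { printf "%.1f MB (%.1f MiB)\n", b / 1000000, b / 1048576 }'
}

# Uncompressed size taken from the answer above.
bytes_to_mb 93607150
```

For 93607150 bytes this prints 93.6 MB (89.3 MiB).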

  1. What delimiter is used to separate the columns in the file and how many columns are there?

Delimiter description:

Since the file is in CSV (comma-separated values) format, it uses the delimiter ‘,’ to separate the columns, and we used ‘,’ when reading the file. We can see that there are 22 columns in the processed-FacebookNews file.

unzip -p FacebookNews.zip | awk -F',' '{print NF; exit}'

  1. The first column is the unique identifier for each article. What are the other columns?

The first row of each file is a header that names each column, and the values under each header hold the data for that column. For example, under the likes_count header every row shows the number of likes for that post.

The command below reports a count of approximately 135376.

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr
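To see what the other columns are called, the header row can be split into one name per line. This is a sketch only: the helper name list_columns and the sample header are hypothetical, and it assumes a plain comma-separated header with no quoted commas.

```shell
# Print each column name from the header row, prefixed with its position.
list_columns() {
  awk -F',' 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$1"
}

# Demo on a small hypothetical file.
sample=$(mktemp)
printf 'post_id,title,likes_count\n1,Example,10\n' > "$sample"
list_columns "$sample"
rm -f "$sample"
```

Run against the real file, the same function would list all column names in order.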

  1. How many articles are there in the file?

The number of articles is the number of records in the particular file, which contains Facebook data about the 2016 US election.

The number of articles is as follows:

awk -F '\t' '{print $2}' * | sort | uniq -c
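A simpler way to count the articles is to count the data rows, skipping the header. This sketch assumes one record per line (no line breaks inside fields); the helper name count_records is hypothetical.

```shell
# Count data rows in a CSV file, excluding the header row.
count_records() {
  awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Demo on a small hypothetical file with three records.
sample=$(mktemp)
printf 'post_id,title\n1,A\n2,B\n3,C\n' > "$sample"
count_records "$sample"
rm -f "$sample"
```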

  1. What is the date range for the articles in this file? (Assume that the data is in order.)

The date range in the data set is from 09-08-2012 to 31-07-2016. All the files store dates in this format, and the data includes all kinds of posts shared on Facebook. The date range for a particular data file is found as follows:

find . -newermt "MM-DD-YYYY" ! -newermt "MM-DD-YYYY"
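Note that find -newermt filters files by their modification time. As an alternative that reads the dates stored inside the file, the sketch below prints the first and last values of a date column, relying on the question's assumption that the data is in order. The helper name date_range, the column position and the sample dates are hypothetical.

```shell
# Print the first and last value of a date column, assuming rows are in date order.
# $1 = CSV file, $2 = 1-based index of the date column (an assumption).
date_range() {
  awk -F',' -v col="$2" '
    NR == 2 { first = $col }   # first data row
    NR > 1  { last = $col }    # overwritten until the final row
    END     { print first " to " last }
  ' "$1"
}

# Demo on a small hypothetical file.
sample=$(mktemp)
printf 'post_id,date\n1,2012-08-09\n2,2014-01-15\n3,2016-07-31\n' > "$sample"
date_range "$sample" 2
rm -f "$sample"
```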

  1. How many unique titles are there?

Unique titles are used in the files to differentiate between regular posts and posts created for a special purpose. Such posts need a unique title so that they can be recognized easily. The number of unique titles is found as follows:

ls -i | awk -F- '{print $1}' | sort --unique
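Counting distinct values in the title column itself can be sketched with awk. The helper name count_unique and the column position are hypothetical, and the sketch assumes titles contain no embedded commas.

```shell
# Count distinct values in one column of a CSV file, ignoring the header.
# $1 = CSV file, $2 = 1-based index of the title column (an assumption).
count_unique() {
  awk -F',' -v col="$2" 'NR > 1 && !seen[$col]++ { n++ } END { print n + 0 }' "$1"
}

# Demo: four records, three distinct titles.
sample=$(mktemp)
printf 'post_id,title\n1,Alpha\n2,Beta\n3,Alpha\n4,Gamma\n' > "$sample"
count_unique "$sample" 2
rm -f "$sample"
```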

  1. How many articles don’t have a title?

Articles that have no title show NULL as the title, and only the description of the article is shown. For example, for the article described as “chief justice Roberts responds to judicial ethics critics” the title is NULL, so NULL is returned when we access the title.

cat abc-news-86680728811.csv | (while read -r line; do day=$(date -d "$(echo "$line" | cut -d, -f3)" +%a); echo "$line,$day"; done) | grep ',Sun$' | cut -d, -f1-3

Articles without titles in a file:
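Counting rows whose title is empty or the literal string NULL can be sketched as follows. The helper name count_missing and the column position are hypothetical, and fields are assumed to contain no embedded commas.

```shell
# Count rows whose value in the given column is empty or the literal NULL.
# $1 = CSV file, $2 = 1-based index of the title column (an assumption).
count_missing() {
  awk -F',' -v col="$2" \
    'NR > 1 && ($col == "" || $col == "NULL") { n++ } END { print n + 0 }' "$1"
}

# Demo: two of the four records have no title.
sample=$(mktemp)
printf 'post_id,title\n1,Alpha\n2,NULL\n3,\n4,Gamma\n' > "$sample"
count_missing "$sample" 2
rm -f "$sample"
```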

  1. When was the first mention in the files regarding “Italian food” and what was the title of the post?

To find the first occurrence of an Italian food, for example ‘pizza’, we extract the matching rows from the given CSV data set into a separate CSV file and then look at its head. The following command extracts the rows and creates the separate file:

grep -i "pizza" abc-news-86680728811.csv >> pizza.csv

Then simply run head on pizza.csv:

head pizza.csv

The occurrence of ‘Italian food’;

  1. How many times is “Hillary Clinton” mentioned in the articles? How did you find this? (Do not ignore the case)

To see how often a particular term occurs in a data set, we use the grep command with the name of the .CSV file. Note that grep -c counts the lines that contain the term, not the individual occurrences. For example, to check the frequency of the term “Hillary Clinton” in each data set, we simply run the following code:

grep -c "Hillary Clinton" name_of_file.csv

The occurrence of “Hillary Clinton” is;
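Because a single line can mention a name more than once, counting matching lines and counting individual mentions can give different numbers. The sketch below shows both; the helper names are hypothetical, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
# Count lines that contain the phrase (case-sensitive).
count_lines() {
  grep -c -- "$1" "$2"
}

# Count every individual occurrence of the phrase (case-sensitive).
count_occurrences() {
  grep -o -- "$1" "$2" | wc -l | tr -d ' '
}

# Demo: two matching lines, three occurrences, one lowercase line that is ignored.
sample=$(mktemp)
printf 'Hillary Clinton met Hillary Clinton\nHillary Clinton spoke\nhillary clinton\n' > "$sample"
count_lines "Hillary Clinton" "$sample"
count_occurrences "Hillary Clinton" "$sample"
rm -f "$sample"
```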

  1. What about “Donald Trump”? Who is the focus on more articles, Clinton or Trump? (Do not ignore the case)

As in the previous case of Hillary Clinton, we apply the same one-line command to the particular .CSV file to see the frequency of occurrence. Donald Trump occurs less often than Hillary Clinton, so the posts are mostly focused on Hillary Clinton.

  1. Select the posts where “Trump” (ignore the case) is mentioned in the post content and the number of likes for those posts is greater than 100. Generate a new file with post_id and the sorted like_count and name it “trump.txt”. (In the output, you need to show the headers as well.) [Hint: Find Trump in the message column, i.e., a specific column.] Then copy and paste the first 5 lines of the txt file in your answer.

To find the posts where Trump is mentioned, we first run the same kind of command that we used previously for Italian food, and then run head on the trump.txt file in exactly the same way.

grep "Donald Trump" name_of_file.csv > trump.txt

head trump.txt

First extract the data from the root CSV file and then count the occurrences of Trump in the file.
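The column-specific filter the question asks for (case-insensitive match in the message column, likes above 100, header kept, output sorted by likes) can be sketched with awk and sort. The helper name make_trump_file and the column positions (post_id in column 1, message in column 2, likes_count in column 3) are assumptions made for illustration, and fields are assumed to contain no embedded commas.

```shell
# Write post_id and likes_count for posts whose message mentions "trump"
# (any case) and whose likes exceed 100, sorted by likes, with a header.
# $1 = input CSV, $2 = output file. Column positions are assumptions.
make_trump_file() {
  {
    printf 'post_id,likes_count\n'
    awk -F',' 'NR > 1 && (tolower($2) ~ /trump/) && ($3 + 0 > 100) { print $1 "," $3 }' "$1" |
      sort -t',' -k2,2n
  } > "$2"
}

# Demo on a small hypothetical file.
sample=$(mktemp)
printf 'post_id,message,likes_count\np1,Trump rally,500\np2,Clinton speech,900\np3,TRUMP tower,150\np4,trump note,50\n' > "$sample"
make_trump_file "$sample" trump_demo.txt
head -n 5 trump_demo.txt
rm -f "$sample" trump_demo.txt
```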

  1. Find the total number of love_count and angry_count for “Donald Trump” and “Hillary Clinton” separately. Who has more positive feeling among people? Justify your answer. [Hint 1: You will need to search online to find how to sum a column of numbers using awk. Hint 2: You will need to consider both love and angry count when justifying your answer.]

For this question we use the awk command to select the matching rows and then sum the particular column.

awk 'BEGIN {FS = ","} /Hillary Clinton/ {sum += $12} END {print sum}' name_of_file.csv

Total number of angry and love reactions for Trump and Clinton:
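Summing a reaction column separately for each name can be sketched with a small awk helper. The name sum_for, the column index and the sample rows are hypothetical, and the sketch assumes fields contain no embedded commas.

```shell
# Sum a numeric column over the rows that contain a given phrase.
# $1 = phrase, $2 = 1-based column index to sum, $3 = CSV file.
sum_for() {
  awk -F',' -v phrase="$1" -v col="$2" \
    'index($0, phrase) { sum += $col } END { print sum + 0 }' "$3"
}

# Demo: love_count is column 3 in this hypothetical file.
sample=$(mktemp)
printf 'id,message,love_count\n1,Hillary Clinton wins,10\n2,Donald Trump wins,7\n3,Hillary Clinton speaks,5\n' > "$sample"
sum_for "Hillary Clinton" 3 "$sample"
sum_for "Donald Trump" 3 "$sample"
rm -f "$sample"
```

Running the same helper on the love and angry columns for each name gives the four totals needed for the comparison.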

  1. How many articles discussed Trump and Putin? How many discussed Trump but not Clinton?

For this, simply run the grep command for Trump and Putin with the name of the .CSV file, and we will get the output accordingly.

grep -c "trump" name_of_file.csv

Articles that discussed Trump but not Clinton:
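Combining two conditions can be sketched by piping one grep into another: a second grep keeps lines that also match, while grep -v drops lines that match. The helper names are hypothetical, and matching here is case-sensitive and line-based.

```shell
# Count lines that mention both terms.
# $1 = first term, $2 = second term, $3 = file.
count_both() {
  grep -- "$1" "$3" | grep -c -- "$2"
}

# Count lines that mention the first term but not the second.
count_first_only() {
  grep -- "$1" "$3" | grep -vc -- "$2"
}

# Demo on a small hypothetical file.
sample=$(mktemp)
printf 'Trump meets Putin\nTrump and Clinton debate\nTrump rally\nPutin speaks\n' > "$sample"
count_both "Trump" "Putin" "$sample"
count_first_only "Trump" "Clinton" "$sample"
rm -f "$sample"
```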

  1. For each publication in trump.txt, find out which month had the most articles about Trump. Try to do this without using grep. Months which had most publications in Trump?

grep -c "Trump" trump.txt

The answer is July, with 698 occurrences of Trump.
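Since the question asks for a solution without grep, the per-publication month count can be sketched in awk alone. This sketch assumes a hypothetical layout with the publication in column 1 and a date of the form YYYY-MM-DD in column 2; the real file's columns and date format may differ. Ties between months are resolved arbitrarily.

```shell
# For each publication, print the month with the most rows and its count.
# Assumes column 1 = publication, column 2 = date as YYYY-MM-DD.
top_month() {
  awk -F',' '
    NR > 1 { count[$1 SUBSEP substr($2, 1, 7)]++ }   # key: publication + YYYY-MM
    END {
      for (key in count) {
        split(key, part, SUBSEP)
        if (count[key] > best[part[1]]) {
          best[part[1]] = count[key]
          month[part[1]] = part[2]
        }
      }
      for (pub in month) print pub "," month[pub] "," best[pub]
    }
  ' "$1" | sort
}

# Demo on a small hypothetical file.
sample=$(mktemp)
printf 'publication,date\ncnn,2016-07-01\ncnn,2016-07-15\ncnn,2016-06-30\nfox-news,2016-05-02\nfox-news,2016-05-09\nfox-news,2016-07-04\n' > "$sample"
top_month "$sample"
rm -f "$sample"
```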

Introduction to Data Science - Task B

  1. How many times does the term ‘Trump’ appear in the post message? (Use the Unix shell to answer this question.)

In the Unix shell (bash), the following command gives the frequency of occurrence of Trump.

grep -c "Trump" the-new-york-times-5281959998.csv

The occurrence of term “Trump” in the posts;

  1. We want to consider how the amount of discussion regarding Donald Trump varies over the time period covered by the data file. To answer this question, you will need to extract the timestamps for all posts referring to Trump using shell. You will then need to read them into R and generate a histogram. [Hint: To read the data into R, first generate a file containing only the timestamp column as text, then read the file into R as a CSV.]

R will not recognise the strings as timestamps automatically, so you’ll need to convert them from text values using the strptime() function. Instructions on how to use the function are available here: https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/strptime. You will need to write a format string, starting with “%a %b”, to tell the function how to parse the particular date/time format in your file. What format string do you need to use?

The IDE used for this question is RStudio with an R script, and the library used is ggplot, which draws the histogram showing how many times Trump appeared in the dataset.

ggplot(data_fox_news, aes(x = shares_count, color = shares_count ))+geom_histogram()

Histogram of the timestamps at which Trump occurred most often.

Here we can see that the news pages on Facebook mostly used the term “Trump” rather than the full name “Donald Trump”, which is how they referred to Trump but not to Hillary Clinton.
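The shell step described in the question's hint, writing only the timestamp column for posts that mention Trump, can be sketched as follows. The helper name extract_timestamps and the column positions are hypothetical, the match is case-sensitive, and fields are assumed to contain no embedded commas.

```shell
# Print the timestamp column (with its header) for rows whose message
# column contains "Trump".
# $1 = input CSV, $2 = message column index, $3 = timestamp column index.
extract_timestamps() {
  awk -F',' -v m="$2" -v t="$3" '
    NR == 1       { print $t; next }   # keep the header name
    $m ~ /Trump/  { print $t }
  ' "$1"
}

# Typical use (file name and column indices are placeholders):
#   extract_timestamps posts.csv 2 3 > trump_timestamps.csv
```

The resulting one-column file can then be read into R as a CSV and converted with strptime().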

  1. In this question, we want to investigate the Facebook posts of a few top media sources. To answer this question, you will need to extract the Facebook posts made on the pages of "abc-news", "cnn" and "fox-news" from your original Facebook dataset.

(i) Use the Unix shell to first generate a file containing all the records belonging to "abc-news", "cnn" and "fox-news" only. Then read the resulting file in R.

To see the first few records of the Facebook dataset, we can simply run the head (or tail) command with the name of the particular .CSV file, which shows the first few posts of each channel.

$ head abc-news-86680728811.csv
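Generating a single file from the three pages' CSV files can be sketched by concatenating them while keeping only the first header row. The helper name combine_csv and the file names in the usage comment are placeholders, and the inputs are assumed to share the same header.

```shell
# Concatenate CSV files that share a header, keeping the header only once.
# $1 = output file; remaining arguments = input CSV files.
combine_csv() {
  out="$1"
  shift
  # FNR == 1 is the first line of each file; NR != 1 skips it for all but the first file.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$out"
}

# Typical use (file names are placeholders):
#   combine_csv top_media.csv abc-news-*.csv cnn-*.csv fox-news-*.csv
```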

(ii) Background: We now want to see if any relationship exists between the numbers of times a post is shared on Facebook and the number of likes it generates. Task: Use appropriate R code to generate a plot showing the relationship between the number of shares and the number of likes in your dataset. Do you see any relationship?

ggplot(data_fox_news, aes(x = shares_count,y = likes_count, color = page_id ))+geom_point()

Using the above R script, we can see how the number of shares relates to the number of likes for the posts of each page id.

Number of times when the post is shared;

(iii) Fit a linear regression model using R to the above data (i.e., shares_count and likes_count) and plot the linear fit. Does it look like a good fit to you?

The linear regression plot shown below gives the relationship between likes_count and shares_count, and we can clearly see that the likes count is much higher than the shares count on a given Facebook post.

ggplot(data_fox_news, aes(x = shares_count, y = likes_count, color = page_id)) + geom_point() + geom_smooth(method = "lm")

Linear regression plot between likes_count and shares_count:

(iv) Use the linear fit to predict the number of likes a post will generate if it is shared 0 times, 100 times, 1000 times, 10000 times and 100000 times on Facebook.

To check how the like and share counts of an uploaded post vary with the type of post, we run the following script.

ggplot(data_fox_news, aes(x = shares_count,y = post_type, color = page_id ))+geom_point()

A post gets the most likes when it is a photo and the fewest when it is a link. Hence people mostly believe pictures and videos, regardless of whether they are authentic or doctored.
