Question Paper: Big Data Analytics Question Paper - May 19 - Information Technology (Semester 8) - Mumbai University (MU)

## Big Data Analytics - May 19

### Information Technology (Semester 8)

Total marks: 80

Total time: 3 Hours

INSTRUCTIONS

(1) Question 1 is compulsory.

(2) Attempt any **three** from the remaining questions.

(3) Draw neat diagrams wherever necessary.

**1.a** Explain the Bloom filter for stream data mining.

**1.b** Find the Jaccard distance and cosine distance between the following pair of sets: X=(0,1,2,4,5,3) and Y=(5,6,7,9,10,8)
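A hand-worked answer to 1.b can be checked with a short Python sketch such as the one below. It rests on two assumptions that the question does not state: the Jaccard distance is taken over X and Y as sets (1 minus intersection size over union size), and the cosine distance is taken as the angle between X and Y read as ordered vectors exactly as written. The function names are illustrative only.

```python
import math

X = (0, 1, 2, 4, 5, 3)
Y = (5, 6, 7, 9, 10, 8)

def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|, treating both inputs as sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def cosine_distance_degrees(u, v):
    """Angle in degrees between u and v, read as ordered vectors (assumption)."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))

jd = jaccard_distance(X, Y)          # intersection {5}, union of 11 elements
cd = cosine_distance_degrees(X, Y)   # dot product 130, norms sqrt(55) and sqrt(355)
print(f"Jaccard distance = {jd:.4f}")
print(f"Cosine distance  = {cd:.2f} degrees")
```

Under these assumptions the Jaccard distance is 10/11. If the cosine distance is instead meant over 0/1 membership vectors of the two sets, the angle will differ, so the interpretation should be stated in the answer.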

**1.c** Explain the steps of the HITS algorithm.

**1.d** Explain the "Shuffle & Sort" phase and the "Reducer" phase in MapReduce.

**2.a** Write MapReduce pseudocode to multiply two matrices. Illustrate with an example showing all the steps.

**2.b** Explain the Hadoop ecosystem with its core components. Explain its physical architecture. State the limitations of Hadoop.

**3.a** Suppose a data stream consists of the integers 1,3,2,1,2,3,4,3,1,2,3,1. Let the hash function used be h(x) = (6x+1) mod 5. Estimate the number of distinct elements in this stream using the Flajolet-Martin algorithm.
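The computation asked for in 3.a can be sketched in Python as below, as a check on a hand-worked answer rather than a model solution. It assumes the single-hash form of Flajolet-Martin: hash each element, take the tail length (number of trailing zero bits) of the hash value, and estimate the distinct count as 2 raised to the maximum tail length. It also assumes that a hash value of 0 is given tail length 0; textbooks differ on this convention. The names `trailing_zeros` and `fm_estimate` are illustrative.

```python
def trailing_zeros(n):
    """Number of trailing zero bits in n; 0 is treated as tail length 0 (assumed convention)."""
    if n == 0:
        return 0
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream, h):
    """Single-hash Flajolet-Martin estimate: 2 ** (maximum tail length seen)."""
    max_tail = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** max_tail

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
h = lambda x: (6 * x + 1) % 5

# Hash values: h(1)=2 (binary 010), h(2)=3 (011), h(3)=4 (100), h(4)=0 (000)
for x in sorted(set(stream)):
    print(x, h(x), format(h(x), "03b"), trailing_zeros(h(x)))

estimate = fm_estimate(stream, h)
print("Estimated distinct elements:", estimate)
```

With these conventions the maximum tail length is 2 (from h(3) = 4), so the estimate is 4.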

**3.b.i** Distinguish between the PCY and Multistage algorithms.

**3.b.ii** Distinguish between a document data store and a column-family data store.

**4.a** Give two applications for counting the number of 1's in a long stream of binary values. Using a stream of binary digits, illustrate how DGIM will find the number of 1's.

**4.b** For the given graph, show how the clique percolation method will find cliques.

**5.a** Consider the web graph given below, with six pages (A, B, C, D, E, F) and directed links as follows.

A-> B,C

A-> A,D,E,F

C-> A,F

Assume that the PageRank value for any page m at iteration 0 is PR(m) = 1 and that the teleportation factor for all iterations is $\beta$ = 0.85. Perform the PageRank algorithm and determine the rank of every page at iteration 2.

**5.b** Explain clearly how the SON partition-based algorithm helps to perform frequent itemset mining for large data sets. How does this algorithm avoid false negatives?

**6.a** Explain the collaborative filtering system. How is it different from a content-based system?

**6.b** Clearly explain how the CURE algorithm can be used to cluster big data sets.