A recommendation algorithm predicts which items a user is most likely to enjoy and recommends them to that user. Such algorithms are widely used on e-commerce websites. The most mature approach in current use is the Collaborative Filtering (CF) recommendation algorithm. The basic idea of CF is to recommend items to a user based on that user's previous preferences and on the choices of other users with similar interests.
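To make that idea concrete, here is a minimal sketch of user-based CF on a made-up ratings matrix: find the user most similar to a target user, then suggest items that this neighbor rated highly but the target has not rated yet.

```python
import numpy as np

# rows = users, cols = items; 0 means "not rated" (toy data)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 0],
    [1, 0, 5, 4],
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0
# similarity of the target user to every other user
sims = [cosine(R[target], R[u]) if u != target else -1.0 for u in range(len(R))]
neighbor = int(np.argmax(sims))

# recommend items the neighbor rated highly that the target has not rated
recs = [i for i in range(R.shape[1]) if R[target, i] == 0 and R[neighbor, i] >= 4]
print(neighbor, recs)  # -> 1 [2]
```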


Intuition Behind Matrix Factorization

To get an intuitive understanding of matrix factorization, let us consider an example. Suppose we have a user-movie matrix of ratings, where each entry represents the rating given by a user to a movie.

In such a matrix, some of the ratings are typically missing, and we would like to devise a method to predict these missing ratings. The intuition behind using matrix factorization to solve this problem is that there should be some latent features that determine how a user rates a movie. …
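To ground that intuition, here is a minimal sketch assuming a toy ratings matrix: each user and each movie gets a small latent-feature vector, and stochastic gradient descent fits their dot products to the observed ratings only (the matrix, number of features, learning rate, and regularization strength are all illustrative choices).

```python
import numpy as np

R = np.array([            # toy user-movie ratings; 0 marks a missing rating
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k = 2                                          # number of latent features
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # movie latent factors

lr, reg = 0.01, 0.02                           # learning rate, L2 penalty
for _ in range(5000):
    for u, i in zip(*R.nonzero()):             # loop over observed ratings only
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 2))                    # full matrix, missing cells filled
```

Each predicted rating is the dot product of a user vector and a movie vector, which is exactly the latent-feature intuition described above.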


The objective of time series comparison methods is to produce a distance metric between two input time series. The similarity or dissimilarity of two time series is typically calculated by converting the data into vectors and computing the Euclidean distance between those points in vector space.

Dynamic time warping is a seminal time series comparison technique that has been used for speech and word recognition since the 1970s, with sound waves as the source; an often-cited paper is Dynamic time warping for isolated word recognition based on ordered graph searching techniques.
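To make the warping idea concrete, here is a minimal sketch of the classic DTW dynamic program on two made-up sequences (the second is the first with its start stretched in time):

```python
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j] = cost of best alignment
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of match, insertion, or deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 2, 3, 2, 1, 0]   # same shape, stretched at the start
print(dtw_distance(x, y))       # -> 0.0: warping absorbs the time shift
```

Note that a direct point-by-point Euclidean comparison is not even defined here, since the sequences differ in length; warping aligns the shapes and reports a distance of 0.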

Background

This technique can be used not only for pattern…


The LIBSVM data format is given by:

<label> <index1>:<value1> <index2>:<value2> …

As you can see, this format encodes a matrix of LineCount rows by (IndexCount + 1) columns; more precisely, a sparse matrix.

If you specify a value for every index, you have a dense matrix, but if you only specify a few indices, as in <label> 5:<value> 8:<value>, then only indices 5 and 8 (and, of course, the label) carry explicit values; all other values are set to 0. This is done for notational simplicity and to save space, since datasets can be huge.
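To illustrate, here is a minimal sketch that expands one such line into a label and a dense feature vector (the sample line and the feature count are made up; note that LIBSVM indices are conventionally 1-based):

```python
import numpy as np

def parse_libsvm_line(line, n_features):
    parts = line.split()
    label = float(parts[0])
    x = np.zeros(n_features)            # unspecified indices default to 0
    for token in parts[1:]:
        idx, value = token.split(":")
        x[int(idx) - 1] = float(value)  # LIBSVM indices are 1-based
    return label, x

label, x = parse_libsvm_line("1 5:0.7 8:1.2", n_features=10)
print(label, x)
```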

For the meaning of the tags…


Hive Internal Tables vs. External Tables

Fundamentally, Hive supports two different types of tables: internal tables and external tables. An internal table is also known as a managed table.

We can identify whether a table is internal or external using the DESCRIBE FORMATTED table_name statement in Hive, which displays either MANAGED_TABLE or EXTERNAL_TABLE depending on the table type.
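As one way to run this check programmatically, here is a minimal sketch using PySpark with Hive support (the table name my_table is hypothetical, and the exact wording of the type row can vary between Hive and Spark):

```python
from pyspark.sql import SparkSession

# Requires a Spark build with Hive support and access to the Hive metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# DESCRIBE FORMATTED returns rows of (col_name, data_type, comment);
# the row whose name mentions "Type" reports managed/internal vs. external.
for row in spark.sql("DESCRIBE FORMATTED my_table").collect():
    if "Type" in row.col_name:
        print(row.col_name, row.data_type)
```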

1. Hive Internal Table

Hive owns the data for the internal tables.

It is the default table type in Hive. When a user creates a table in Hive without specifying it as external, an internal table is created by default in a specific location in HDFS.

By default, an internal table…


Regularizers allow you to apply penalties on layer parameters or layer activity during optimization. These penalties are summed into the loss function that the network optimizes.

Regularization penalties are applied on a per-layer basis. The exact API will depend on the layer, but many layers (e.g. Dense, Conv1D, Conv2D and Conv3D) have a unified API.

These layers expose 3 keyword arguments, illustrated in the sketch after this list:

  • kernel_regularizer: Regularizer to apply a penalty on the layer's kernel
  • bias_regularizer: Regularizer to apply a penalty on the layer's bias
  • activity_regularizer: Regularizer to apply a penalty on the layer's output
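Here is a minimal sketch of all three arguments on a Dense layer (the layer width and the penalty factors are illustrative choices, not recommendations):

```python
from tensorflow.keras import layers, regularizers

dense = layers.Dense(
    units=64,
    kernel_regularizer=regularizers.l2(1e-4),    # penalty on the layer's weights
    bias_regularizer=regularizers.l2(1e-4),      # penalty on the layer's bias
    activity_regularizer=regularizers.l1(1e-5),  # penalty on the layer's output
)
```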

Suppose the loss function is given as:

loss…

Hadoop YARN

Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop, responsible for resource allocation and job scheduling. Introduced in Hadoop 2.0, YARN sits between HDFS and MapReduce in the Hadoop architecture.

The elements of YARN include:

  • ResourceManager (one per cluster)
  • ApplicationMaster (one per application)
  • NodeManagers (one per node)

Resource Manager

The ResourceManager manages resource allocation in the cluster: it tracks how many resources are available cluster-wide and what each NodeManager contributes. It has two main components:

  1. Scheduler: Allocating resources to various running applications and scheduling…
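Because the ResourceManager tracks cluster-wide resource availability, one concrete way to observe it is the ResourceManager REST API; here is a minimal sketch, assuming an RM reachable at the hypothetical host rm-host on the default web port 8088:

```python
import json
import urllib.request

# Cluster metrics endpoint of the ResourceManager REST API
# (the host name is a placeholder for your cluster's RM address).
url = "http://rm-host:8088/ws/v1/cluster/metrics"
with urllib.request.urlopen(url) as resp:
    metrics = json.load(resp)["clusterMetrics"]

# total vs. available memory, and containers currently allocated
print(metrics["totalMB"], metrics["availableMB"], metrics["containersAllocated"])
```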

Overview

The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer. Data is housed on multiple servers: files are divided into fixed-size blocks, and these blocks are then distributed and stored across slave machines.

HDFS in the Hadoop architecture divides large data sets into blocks. By default, each block holds up to 128 MB of data and is replicated three times (see the arithmetic sketch after the rules below). Replication operates under two rules:

  1. Two identical blocks cannot be placed on the same DataNode
  2. When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack
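As a quick back-of-the-envelope check of these numbers, the sketch below computes the block count and raw cluster storage for a hypothetical 500 MB file:

```python
import math

BLOCK_MB = 128        # default HDFS block size
REPLICATION = 3       # default replication factor

file_mb = 500         # hypothetical file size

blocks = math.ceil(file_mb / BLOCK_MB)   # 4 blocks: three full, one partial
raw_mb = file_mb * REPLICATION           # 1500 MB of raw cluster storage
print(blocks, raw_mb)
```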

In this example, blocks A, B, C, and D are…


Big Data and Its Challenges

As the years went by and data generation increased, data arrived in higher volumes and more formats. Hence, multiple processors were needed to process the data in acceptable time. However, a single shared storage unit became the bottleneck due to the network overhead it generated. This led to giving each processor its own distributed storage unit, which made data access easier. This method is known as parallel processing with distributed storage: various computers run the processes on various storage units.

Big Data refers to massive amounts of data that cannot be stored, processed, or analyzed using traditional methods.

The main elements of…

