Data Analytics

                       

What is Data Analytics?

Data analytics is the process of analyzing raw data to find trends and answer questions. This definition captures the broad scope of the field, which includes many techniques with many different goals.
The data analytics process has some key components that are needed for any initiative. By combining these components, a successful data analytics initiative will provide a clear picture of where you are, where you have been and where you should go.
  • Generally, this process begins with descriptive analytics. This is the process of describing historical trends in data. Descriptive analytics aims to answer the question “what happened?” This often involves measuring traditional indicators such as return on investment (ROI). The indicators used will be different for each industry. Descriptive analytics does not make predictions or directly inform decisions. It focuses on summarizing data in a meaningful and descriptive way.
  • The next essential part of data analytics is advanced analytics. This part of data science takes advantage of advanced tools to extract data, make predictions and discover trends. These tools include classical statistics as well as machine learning. Machine learning technologies such as neural networks, natural language processing, sentiment analysis and more enable advanced analytics. This information provides new insight from data. Advanced analytics addresses “what if?” questions.
  • The availability of machine learning techniques, massive data sets, and cheap computing power has enabled the use of these techniques in many industries. The collection of big data sets is instrumental in enabling these techniques. Big data analytics enables businesses to draw meaningful conclusions from complex and varied data sources, which has been made possible by advances in parallel processing and cheap computational power.
https://www.mastersindatascience.org


Types of Data Analytics

There are 4 main types of data analytics.
  1. Descriptive analytics describes what has happened over a given period of time. Has the number of views gone up? Are sales stronger this month than last?
  2. Diagnostic analytics focuses more on why something happened. This involves more diverse data inputs and a bit of hypothesizing. Did the weather affect beer sales? Did that latest marketing campaign impact sales?
  3. Predictive analytics moves to what is likely going to happen in the near term. What happened to sales the last time we had a hot summer? How many weather models predict a hot summer this year?
  4. Prescriptive analytics suggests a course of action. If the likelihood of a hot summer, measured as the average of these five weather models, is above 58%, we should add an evening shift to the brewery and rent an additional tank to increase output.
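
The prescriptive rule in the last item can be written down as a few lines of Python. This is only a minimal sketch: the function name, the returned strings and the five forecast values are made up for illustration, and only the 58% threshold comes from the example above.

```python
from statistics import mean

HOT_SUMMER_THRESHOLD = 0.58  # the 58% cut-off from the example above


def recommend_action(model_probabilities):
    """Return a recommended action from per-model hot-summer probabilities."""
    likelihood = mean(model_probabilities)
    if likelihood > HOT_SUMMER_THRESHOLD:
        return "add an evening shift and rent an additional tank"
    return "keep the current schedule"


# Hypothetical forecasts from five weather models (made-up numbers).
forecasts = [0.55, 0.62, 0.70, 0.58, 0.65]
print(recommend_action(forecasts))
```

The point is that prescriptive analytics ends in a decision, not just a number: the prediction is compared against a threshold and mapped to an action.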

Types of Big Data Analytics

Big data analytics can also be divided into 4 categories.

  1. Interactive Analytics - We have the data and let the user interact with it to get some output.
  2. Real-time Analytics - We receive data but don't store it. The data is processed in real time, and the queries are pre-defined.
  3. Predictive Analytics - We have some information about a subject, and based on that information we predict the future.
  4. Batch Analytics - We store data in databases and retrieve/process it using SQL queries.

In this post I'm going to demonstrate how to use Batch Analytics and Real-time Analytics on a given dataset.

The dataset I used is the Uber and Lyft Dataset Boston, USA, which contains about 7M records.




First I loaded the dataset into a Jupyter Notebook and did some data preprocessing, such as:
  • Removed unnecessary columns
  • Removed NaN data
  • Saved the result into a new CSV file
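
The three preprocessing steps above can be sketched with nothing but Python's standard `csv` module. This is not the notebook code I actually ran; the function name `clean_csv`, the column names and the sample rows are assumptions used only for illustration.

```python
import csv
import io


def clean_csv(src, dst, keep_columns):
    """Copy rows from src to dst, keeping only keep_columns and
    skipping rows where any kept value is missing."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        values = {col: (row.get(col) or "").strip() for col in keep_columns}
        # Treat empty cells and the usual NaN/NA markers as missing.
        if any(v == "" or v.lower() in ("nan", "na") for v in values.values()):
            continue
        writer.writerow(values)
        kept += 1
    return kept


# Tiny in-memory sample; the column names and values are illustrative only.
raw = io.StringIO(
    "id,cab_type,price,timezone\n"
    "1,Uber,10.5,America/New_York\n"
    "2,Lyft,,America/New_York\n"
    "3,Lyft,7.0,America/New_York\n"
)
cleaned = io.StringIO()
rows_kept = clean_csv(raw, cleaned, ["cab_type", "price"])
print(rows_kept)
print(cleaned.getvalue())
```

For real files, pass handles opened with `open(path, newline="")` instead of the `io.StringIO` objects. Because the rows are streamed one at a time, the whole file never has to fit in memory.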




1. Batch Analytics


I used Databricks for the visualizations.

This is how I created my table and imported the dataset into Databricks.



After creating my table, I used it to write queries for the visualizations.
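
To give an idea of what such a batch query looks like, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for Databricks. The table name `rides`, the `price` column and the sample rows are assumptions, and this is not one of the queries I actually ran.

```python
import sqlite3

# In-memory SQLite database standing in for the Databricks table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (cab_type TEXT, source TEXT, price REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        ("Uber", "Back Bay", 10.0),
        ("Uber", "Fenway", 14.0),
        ("Lyft", "Back Bay", 9.0),
        ("Lyft", "Fenway", 11.0),
        ("Lyft", "Fenway", 13.0),
    ],
)

# Batch-style aggregation: number of rides and average fare per cab type.
query = """
    SELECT cab_type, COUNT(*) AS rides, AVG(price) AS avg_price
    FROM rides
    GROUP BY cab_type
    ORDER BY cab_type
"""
summary = conn.execute(query).fetchall()
for cab_type, rides, avg_price in summary:
    print(cab_type, rides, avg_price)
```

The data is stored first and queried afterwards, which is what makes this batch analytics rather than real-time analytics.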

Here are some of the visualizations I got.




This is the query for the visualization.


     2.


This is the query.

    3.


This is the query.


     4.


This is the query.


    5.

This is the query.






Problems that I faced

  • You can see that I used the same colors for each graph. That is because these are the only colors available in Databricks.
  • The visualizations are not clear, because we can't customize them. We can only apply variables. We can't even change the names on the X and Y axes.
  • We can't modify the output designs as we wish.
  • We could use Data Studio instead of Databricks. It makes it easy to get beautiful visualizations, with no need for SQL queries. You just drag and drop to get the visualizations.
  • However, my dataset is about 342 MB, and about 108 MB after preprocessing. Data Studio can't import datasets larger than 100 MB. If I removed some data, it would affect my analysis, because every single byte of data is valuable. Therefore I used Databricks.

2. Real-time Analytics


For the Real-time analytics I used Siddhi Stream Processor.

What is Siddhi?

Siddhi is a cloud native Streaming and Complex Event Processing engine that understands Streaming SQL queries in order to capture events from diverse data sources, process them, detect complex conditions, and publish output to various endpoints in real time.

Further information

After all the configuration was done, I started creating a new app. Basically, we need to define:

  • Input stream - Here we define the inputs that we are going to give to the query, and also what kind of data source we are going to use. In my case it is like this.

             Here I uploaded my dataset (a CSV file) from my laptop. Therefore I changed the type to 'file'. If it is a URL, then the type is 'http'.

  • Output stream - We define this to represent the outputs of the query.


  • Database connection - I used this because my source is a file, so the data has to be added to the database from time to time, as if it were arriving live.

  • Define a table 

  • Your query

This is my database.




Here are 5 scenarios that I tried to visualize using the same dataset. I used Power BI and PHP with Morris.js scripts for the visualizations.
  1. Total number of rides over the last 1 minute for each product_id.
  2. Average taxi fare over the last 1 minute for each product_id, for rides taken between the same source and destination.
  3. Highest taxi fare over the last 1 minute for each cab_type.
  4. Top 3 taxi fares of each minute for each cab_type.
  5. The source location with the fewest rides in each 1-minute interval.
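
To make the idea of "over the last 1 minute" concrete, here is a rough plain-Python sketch of the first scenario. It is not Siddhi code. It assumes that events arrive in timestamp order, and the class name, the product_id values and the timestamps are all made up for illustration.

```python
from collections import Counter, deque

WINDOW_SECONDS = 60


class SlidingRideCounter:
    """Count rides per product_id over the last WINDOW_SECONDS seconds."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, product_id), oldest first
        self.counts = Counter()

    def add(self, timestamp, product_id):
        """Record one ride and return the current per-product counts."""
        self.events.append((timestamp, product_id))
        self.counts[product_id] += 1
        self._expire(timestamp)
        return dict(self.counts)

    def _expire(self, now):
        # Drop events that fell out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_product = self.events.popleft()
            self.counts[old_product] -= 1
            if self.counts[old_product] == 0:
                del self.counts[old_product]


# Hypothetical events: (seconds since start, product_id).
counter = SlidingRideCounter()
counter.add(0, "lyft_line")
counter.add(30, "uber_x")
counter.add(45, "lyft_line")
latest = counter.add(70, "uber_x")
print(latest)
```

Each new event updates the counts and pushes out events older than one minute, so the result always describes the most recent minute only. A stream processor does this kind of bookkeeping for you when you declare a time window in the query.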


1.
    The Siddhi code :



    Visualization :



2.
   The Siddhi code :



    Visualization :



3.
  The Siddhi code :



    Visualization :





4.
   The Siddhi code :




    Visualization :


5.
   The Siddhi code :


    
   Visualization :


I used PHP and Morris.js for the last 2 visualizations because I needed more detailed visualizations for my 4th and 5th scenarios. The 4th scenario requires the top 3 fares for each cab type, and in Power BI I couldn't represent that level of detail. The same applies to the 5th scenario. Therefore I chose PHP with Morris.js for these visualizations.
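
To show why the 4th scenario produces more detailed output than a single value per group, here is a small plain-Python sketch of "top 3 fares of each minute for each cab type". It is not the Siddhi code from this post; the function name and the sample events are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def top_fares_per_minute(events, n=3):
    """Group (timestamp_seconds, cab_type, fare) events into one-minute
    buckets and return the n highest fares per (minute, cab_type)."""
    buckets = defaultdict(list)
    for timestamp, cab_type, fare in events:
        minute = int(timestamp // 60)
        buckets[(minute, cab_type)].append(fare)
    return {key: heapq.nlargest(n, fares) for key, fares in buckets.items()}


# Hypothetical events: (seconds since start, cab_type, fare).
events = [
    (5, "Uber", 12.0),
    (12, "Uber", 30.5),
    (20, "Lyft", 8.0),
    (33, "Uber", 7.5),
    (50, "Uber", 22.0),
    (65, "Lyft", 16.0),
]
result = top_fares_per_minute(events)
print(result)
```

Every (minute, cab type) pair yields up to three values instead of one, which is the kind of nested result that needs a more flexible chart than a simple bar per group.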


So that's it for today. See you soon.



