Analytics Redefined

What is Big Data? | An Introduction to Big Data Analytics


What is Big Data? We hear this question every now and then. The most common answers are along the lines of “any data that cannot fit into a single machine” or “data larger than 1 TB is considered big data”. But are these the right answers to one of the most important questions of this data-driven era?

The simple answer is no. You cannot define Big Data simply by putting a number on it. Other factors also characterize Big Data: the famous V’s of Big Data.

Gartner defines Big Data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation”.

Let's look at this definition for a moment and try to understand what Big Data is.

It says that data can be considered Big Data if its volume exceeds what traditional systems such as an RDBMS can handle. It is generated at high velocity from different sources, and it can take any form, e.g., images, text, videos, or JSON.

But the main point is the last part: Big Data processing demands a cost-effective and innovative approach. Otherwise, it would not be accessible to everybody. Processing such data demands either vertical scaling of existing hardware or an innovative approach that makes Big Data processing possible even on regular commodity hardware like your own PC. Vertical scaling is not feasible for such volume and variety, as there is always a limit to how far a single system's hardware can be upgraded. So the solution was distributed, parallel computing with frameworks like Hadoop MapReduce, Spark, etc.
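
The divide-and-combine idea behind MapReduce can be sketched in plain Python. This is only an illustration of the shape of the computation, not how Hadoop or Spark are implemented: the function names are made up for this example, and a thread pool on one machine stands in for the worker nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks, workers=4):
    # Each chunk stands in for a block of a large file held by a different machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input, split into three chunks.
chunks = [
    "big data is a problem",
    "big data needs distributed processing",
    "distributed processing runs on commodity hardware",
]
counts = word_count(chunks)
print(counts["big"], counts["distributed"], counts["hardware"])
```

Because every chunk is mapped independently, adding more machines (horizontal scaling) lets the same code handle more chunks, which is exactly what vertical scaling cannot offer.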

Now, it is important to know why we are suddenly seeing such huge demand for Big Data Analytics in the digital world.

Facts related to Big Data
  • Over 2.5 quintillion bytes of data are generated worldwide every day
  • In 2016, 90% of the world’s data had been created in the previous two years
  • Trillions of sensors monitor, track and communicate with each other, feeding the Internet of Things with real-time data
  • 294 billion emails are sent every day
  • Over 1 billion Google searches are made every day
  • 30+ petabytes of user-generated data are stored, processed and analyzed at Facebook
  • 230+ million tweets are sent each day on Twitter

There is a lot more than this! The facts above should give you an idea of how important data is going to be in the coming years. Big Data Analytics is a rapidly growing field, but the expertise and technologies available to explore ever-growing data are limited. This is what is called the “Information Gap”. The gap exists because mankind now doubles the volume of data it produces every 2 years, but processes, analyzes and comprehends only a part of that data.
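
To get a feel for the 2.5 quintillion bytes per day quoted in the facts above, a few lines of arithmetic help. This is a rough sketch using decimal (SI) units; the 1 TB disk is a hypothetical yardstick chosen only to make the number tangible.

```python
BYTES_PER_DAY = 2.5e18          # 2.5 quintillion bytes, the figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

EXABYTE = 10**18   # decimal (SI) units
TERABYTE = 10**12

exabytes_per_day = BYTES_PER_DAY / EXABYTE
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / TERABYTE

# Hypothetical yardstick: how many 1 TB disks would this fill each day?
DISK_BYTES = 1 * TERABYTE
disks_per_day = BYTES_PER_DAY / DISK_BYTES

print(f"{exabytes_per_day:.1f} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")
print(f"{disks_per_day:,.0f} one-terabyte disks per day")
```

In other words, 2.5 quintillion bytes is 2.5 exabytes, or roughly 29 terabytes arriving every second.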

Characteristics of Big Data - 5 V’s of Big Data

[Infographic: the 5 V's of Big Data]

Big Data has 5 important characteristics, as shown in the infographic above. These characteristics are generally called the V’s of Big Data.

A few years back, there were only 3 V’s of Big Data: Volume, Velocity and Variety. Later, 2 more V’s were added to the list: Veracity and Value. Let's understand more about these 5 V’s of Big Data.

  1. Volume: This refers to the amount of data being generated. It is huge, and most people interpret Big Data only in terms of this property.

    Over 2.5 quintillion bytes of data are generated worldwide every day. We have better access to the internet, and the smartphone industry is growing rapidly alongside the rise of online retailers like Amazon, Flipkart, eBay, etc.

    All these developments have led to the generation of a huge amount of data. Twitter users send over 500,000 tweets every minute. We now have IoT, which brings sensor-collected data from almost any device or machine you can think of. Industries are adopting IoT with Big Data Analytics to help grow their business.

    This data is generated in such volume that it cannot be handled by traditional data processing systems. We need outputs faster, and traditional processes would take a long time to process even 10% of this data. This is where we need to think of Big Data frameworks and technologies.
  2. Velocity: This refers to the rate at which data is generated by different sources. This is one of the main characteristics of Big Data. Data is being generated from new sources every day, and the speed at which it is generated keeps increasing. Social media giants like Facebook and Twitter have to process millions of posts and tweets per day, or even per minute!

    In 2019, every minute,
    • Twitter users sent more than 500,000 tweets
    • Instagram users posted over 250,000 stories
    • Twitch users viewed 1 million videos
    • Tinder users swiped 1.4 million times

    The Internet of Things (IoT) is very popular these days, so it is one of the best examples for explaining Velocity. With the advancements in IoT, businesses have found innovative ways to collect data on their processes so that they can analyze it and improve their systems. Sensors are being attached to machinery, employees, athletes, etc. to analyze their performance.

    This data is generated continually. This is what we call streaming data: data that is generated continually, in real time or near real time. Until a few years back, we didn't have any mature system to analyze such streaming or real-time data.

    So, nobody used such data. We were limited to analyzing only historical data in batch-processing fashion. But the potential of streaming data is immense in any industry.

    Velocity is one of the most important factors in Big Data processing. Active research is under way, and frameworks are being developed using various methodologies to reduce execution time. If an organization cannot keep up with the velocity of its data, it will probably have to rethink its data processing strategy.
  3. Variety: This refers to the types of data being generated. We can categorize data into 3 main types:
    • Structured: Any data that can be organised into a table format is called structured data. Analysis of structured data is predominant in the traditional DBMS approach.
    • Unstructured: This is data that cannot be organised into a table-like format. These are complex entities like images, tweets, emails, videos, etc.
    • Semi-Structured: This lies between the structured and unstructured forms of data. This kind of data is not completely unstructured; it is defined in specific formats with tags. Examples include JSON, XML, etc.

    An interesting fact: it is commonly estimated that over 80% of the data in the digital universe is unstructured. Various industries, governments and research organisations are making use of Big Data frameworks and methodologies to harness useful information out of it.

    The ability to harness insights out of unstructured data has been a game changer over the past few years. Every industry, from retail giants to governments, is trying to make use of this data.
  4. Veracity: This refers to the trustworthiness and quality of the data. Before you start to process and analyze data, you should always check whether it is trustworthy.

    Suppose you are analyzing Twitter data and have to do a sentiment analysis of a particular Twitter hashtag. Before jumping into processing, you should make sure that the data provided to you is trustworthy. Is the data of good quality? You have to check whether it is credible enough to provide some value.

    After all, what is the point of wasting your time and effort analyzing something that is fake or corrupt? You may find a lot of tweets that carry your desired hashtag but are not actually related to it!

    Let's take another example. Suppose you are given the car GPS data of a group of individuals and are supposed to compute some metrics. You then find that a lot of data is missing because GPS connectivity was lost in some remote areas. These kinds of conditions are where the Veracity of data comes into the picture.
  5. Value: This refers to how much value processing and analyzing the data provides to you or your business. What is the use of analyzing large data sets if doing so doesn't give you any useful insights?

    There is an immense quantity of data out there, but choosing the right data is important. Always assess the value a data set can provide before you start processing it.
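
The difference between batch and streaming processing described under Velocity can be sketched with a Python generator. This is a minimal illustration, not a real streaming framework: `sensor_stream` is a hypothetical stand-in for an IoT sensor, and the readings are made up.

```python
from collections import deque


def rolling_average(readings, window_size=3):
    """Yield the average of the most recent `window_size` readings.

    `readings` can be any iterable, including an endless stream; each value
    is handled as it arrives instead of waiting for the full data set.
    """
    window = deque(maxlen=window_size)  # old readings fall out automatically
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)


def sensor_stream():
    """Hypothetical temperature sensor; a real one would never stop emitting."""
    for value in [20.0, 21.0, 25.0, 30.0, 22.0]:
        yield value


averages = list(rolling_average(sensor_stream(), window_size=3))
for avg in averages:
    print(round(avg, 2))
```

A batch job would wait until all readings were collected and then compute one result. Here an updated answer is available after every single reading, and memory use stays bounded by the window size no matter how long the stream runs.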

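The GPS example from the Veracity discussion can be turned into a simple quality check that runs before any analysis. This is a sketch under assumptions: the record layout and field names (`car_id`, `lat`, `lon`) are invented for illustration, and real data would need checks suited to its own format.

```python
def is_valid_fix(record):
    """Return True when a GPS record has plausible coordinates."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return False  # signal was lost, so no position was recorded
    return -90 <= lat <= 90 and -180 <= lon <= 180


def quality_report(records):
    """Split records into usable and rejected, and report the usable share."""
    valid = [r for r in records if is_valid_fix(r)]
    rejected = [r for r in records if not is_valid_fix(r)]
    share = len(valid) / len(records) if records else 0.0
    return valid, rejected, share


# Hypothetical sample; field names are assumptions for illustration only.
records = [
    {"car_id": "A", "lat": 12.97, "lon": 77.59},
    {"car_id": "A", "lat": None, "lon": None},    # connectivity lost
    {"car_id": "B", "lat": 95.00, "lon": 10.00},  # impossible latitude
    {"car_id": "B", "lat": 28.61, "lon": 77.21},
]

valid, rejected, share = quality_report(records)
print(f"{len(valid)} usable, {len(rejected)} rejected, {share:.0%} usable")
```

If the usable share turns out to be low, that is a signal to question the data source before spending time on metrics computed from it.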

Let's summarize a few key points from our discussion!

  • Big Data is not just about size of data
  • There are 3 main types of data 
    • Structured - Data that can be confined to tables
    • Semi-Structured - JSON, XML
    • Unstructured - Images, Videos, Social Media, IoT
  • 5 important characteristics, called the 5 V's of Big Data, identify a Big Data problem
    • Volume
    • Variety - Structured, Unstructured, Semi-Structured
    • Velocity
    • Veracity - Trustworthiness / Quality
    • Value
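
The three data types in the summary can be compared side by side by looking at how each one is read in code. This is a small sketch using only the Python standard library; the sample values are made up.

```python
import csv
import io
import json
import re

# Structured: fixed columns, so every row can be read the same way.
csv_text = "user,city\nasha,Pune\n"
structured = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: tagged fields, but records may nest or omit keys.
json_text = '{"user": "asha", "location": {"city": "Pune"}, "tags": ["iot"]}'
semi_structured = json.loads(json_text)

# Unstructured: free text with no schema; we have to search for what we need.
tweet = "Just landed in Pune, the sensors at the airport are everywhere #iot"
hashtags = re.findall(r"#(\w+)", tweet)

print(structured[0]["city"])
print(semi_structured["location"]["city"])
print(hashtags)
```

The further the data moves from a table, the more work the program has to do to find the same piece of information, which is why unstructured data needed new tools.
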
Now, one important point you should always remember is that Big Data is a problem rather than a technology. It is a big problem faced by the industry. Until a few years back, we didn't have the resources and budget to process this data.

It was like having a treasure chest in front of you but not knowing how to open it! Have you ever wondered why there is a sudden explosion in the Analytics domain now? It's because we now have the resources and capabilities to innovate, process and analyze. 8 GB of RAM is now standard in almost all laptops. If we go back around 5 to 10 years, we were running our laptops and PCs with 512 MB of RAM. Then came the semiconductor revolution, and here we are now.

There are endless possibilities in the field of Big Data Analytics. Whoever has access to data is king. As a Data Engineer / Analyst, your role is to help organizations find useful insights in their data. In the coming articles, we will discuss the various tools and frameworks used in the industry for Big Data Analytics.

I hope I was able to give you an understanding of Big Data and an answer to the question “What is Big Data?”. Let me know your views on Big Data in the comments below. Keep learning and keep innovating!
