Data-intensive computing is a class of parallel computing that focuses on applications dealing with large amounts of data. The volume of data processed can reach terabytes and petabytes; data of this kind is also referred to as big data.
Data-intensive computing is used in many applications, ranging from social networking to computational science, where large amounts of data need to be accessed, stored, indexed, and analyzed.
The task becomes more challenging as data keeps accumulating over time and the rate at which it is generated also increases. Distributed computing helps in such cases because it provides a more stable and scalable storage architecture and better computation and processing performance.
Parallel and distributed computing techniques give an advantage in handling data of this size, but challenges such as data representation, scalable infrastructure, and more efficient algorithms still need to be addressed.
In this article, we will focus on the definition of data-intensive computing and the challenges companies face in managing and storing big data.
What is Data-Intensive Computing?
Data-intensive computing is concerned with the production, manipulation, and analysis of large-scale data ranging from megabytes to petabytes. You may have heard of MapReduce, the most popular programming model for creating data-intensive applications and deploying them in the cloud.
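To make the model concrete, here is a minimal sketch of the MapReduce idea in plain Python (a toy word count; the function names and sample documents are invented for illustration, and a real deployment would run the map and reduce phases in parallel across a cluster rather than in a single process):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input chunk.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each word across all chunks.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    documents = [
        "big data needs big storage",
        "data intensive computing processes big data",
    ]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(intermediate))
```

The key design point is that the map and reduce functions are independent of how the data is partitioned, which is what lets a framework spread the work across many machines.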
Dataset is the most important term in any discussion of data-intensive computing. It identifies a collection of related elements that are relevant to one or more applications; the descriptive information attached to a dataset is known as metadata.
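As a toy illustration (the field names and values are invented for this example), a dataset and the metadata describing it might be represented like this:

```python
# A toy dataset: a collection of related elements used by an application.
dataset = [
    {"station": "A", "temperature_c": 21.4},
    {"station": "B", "temperature_c": 19.8},
]

# Metadata describes the dataset itself rather than its individual elements.
metadata = {
    "name": "daily_temperatures",
    "format": "list of JSON records",
    "source": "weather stations",
    "record_count": len(dataset),
}
```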
Computational science is one of the most important domains for data science and data-intensive computing. Scientists conducting simulations and experiments often need to produce, manage, and analyze huge amounts of data.
Every second, hundreds of gigabytes of data are produced by telescopes mapping the sky, and by the end of the year the accumulated data reaches petabytes.
Other applications, such as bioinformatics and earthquake simulations, also produce and process massive amounts of data, and data-intensive computing makes this work easier.
Data-Intensive Computing Applications
Besides scientific computing, several IT industry sectors require the handling and management of big data. The customer data held by a telecom company may range from 10 to 100 terabytes, and the same data is processed repeatedly, not only to generate billing information but also to identify trends, scenarios, and patterns.
Google, one of the biggest technology giants, has reported that it processes 24 petabytes of data every day.
Facebook's search crawls about 150 terabytes of data, and the volume of uncompressed data reaches 36 petabytes. Data-intensive computing is most commonly used in applications such as bioinformatics, weather forecasting, and the processing of telescope images.
Characterizing Data-Intensive Computations
Data-intensive applications often exhibit compute-intensive properties in addition to dealing with huge volumes of data. These applications are designed to handle datasets on the scale of multiple terabytes and petabytes.
Datasets can be stored in different formats and at different geographic locations. The data is processed using multistep analytical pipelines that include fusion and transformation stages. These applications rely on parallel processing and need efficient mechanisms for data management, filtering, and fusion, along with distribution and efficient querying. Managing and storing data is a genuine challenge, and the ever-increasing rate at which data is generated will only make it a bigger problem in the future.
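As a rough sketch of what such a pipeline can look like (the stage names, record format, and threshold below are assumptions made for illustration, not a standard API), the following Python snippet filters and transforms records in parallel across worker processes and then fuses the results into a single aggregate:

```python
from multiprocessing import Pool

def filter_stage(record):
    # Filtering stage: keep only records that pass a simple quality check.
    return record if record["value"] >= 0 else None

def transform_stage(record):
    # Transformation stage: normalize the value field.
    record["value"] = record["value"] / 100.0
    return record

def process(record):
    # One pipeline step per record: filter, then transform.
    kept = filter_stage(record)
    return transform_stage(kept) if kept else None

if __name__ == "__main__":
    records = [{"id": i, "value": v} for i, v in enumerate([50, -3, 120, 75])]
    with Pool(processes=4) as pool:  # parallel processing across worker processes
        processed = [r for r in pool.map(process, records) if r is not None]
    # Fusion stage: combine the per-record results into a single aggregate.
    total = sum(r["value"] for r in processed)
    print(processed, total)
```

In a real data-intensive system the same stages would be distributed over many machines and fed from storage spread across locations, but the pipeline structure stays the same.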