The rapid increase in data volume has led to an upsurge in research – ‘Big Data’ is one of the most talked-about terms in today’s world. The influence of big data pervades science, business, industry, government, society and beyond.

Processing of big data includes collection, storage, transportation and exploitation. The term is commonly defined by four V’s – Volume, Velocity, Veracity and Variety. Volume is the sheer size of the data that must be processed. Velocity refers to streaming data that arrives so rapidly it may be too fast for traditional algorithms and systems to handle. Veracity concerns the quality of the data; in practice, quality tends to decrease as datasets grow larger. Variety refers to the different data types and modalities available for a given object. Nowadays, traditional process-oriented companies are trying to become knowledge-based companies driven by data rather than by process.

Data analytics comprises various machine learning techniques, and the field is surrounded by some misconceptions. Below are the misleading arguments commonly heard in this big data era.

Models are irrelevant now
The argument over whether models from the small data era suit big data is still going on, as many think sophisticated models are not fit for big data. But careful observation of empirical results does not support this. We cannot conclude that the worst-performing model on small data is the simplest one, or vice versa; nor is there any proof that the simplest model on small data will achieve the best performance on big data.
With the rapid advance of computational techniques, sophisticated models are, if anything, more favored in the big data era, since simple models are usually incapable of handling large amounts of data due to memory issues.

Correlation is enough
Some books claim that finding correlations in big data is enough. But the role of causality can never be replaced by correlation. Discovering a valid correlation can sometimes provide helpful information, but ignoring the importance of causality and treating its replacement by correlation as a feature of the big data era can be dangerous.

Previous Methodologies are useless
Many claim that previous research methodologies were designed for small data and hence cannot work on big data as expected. But ‘big’ is a relative term: what we call small data today may be as large as the big data of an earlier time, and research on handling ever-larger data has been going on continuously since the early years. In fact, high-performance computing, parallel and distributed computing, high-efficiency storage and related techniques will remain very relevant in the future as well.

Opportunities and Challenges
We need to consider the ever-increasing data size, because performance measures such as AUC require repeated scans of the entire dataset. So the question arises: can we identify the valuable subsets within our original dataset?
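
As an illustration of why such measures force full passes over the data, here is a minimal sketch of computing AUC from scratch via the rank-sum (Mann–Whitney) statistic; the function name and toy data are illustrative only, and ties are not handled:

```python
# Minimal AUC computation via the rank-sum (Mann-Whitney) statistic.
# Every evaluation requires a full scan (and sort) of all scores,
# which is what makes repeated evaluations costly on huge datasets.

def auc(labels, scores):
    """labels: 0/1 values; scores: model outputs. Returns the AUC."""
    pairs = sorted(zip(scores, labels))                      # full scan + sort
    rank_sum = sum(rank for rank, (_, y) in enumerate(pairs, start=1) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Mann-Whitney U, normalized by the number of (positive, negative) pairs
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```

Identifying a small, representative subset on which such a measure could be estimated cheaply is exactly the open question raised above.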

Again, we can ask whether it is possible to develop a parameter-tuning guide so that the current exhaustive search can be replaced. Another noteworthy area in big data is statistical hypothesis testing, which lets us verify that what we have done is correct; the same goal can be served by deriving interpretable models. Rule extraction and visualization are other important approaches for improving the comprehensibility of models. Another open problem in this big data era is whether we can truly avoid violating privacy, a long-standing unresolved issue.
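
To make the exhaustive-search concern concrete, the following sketch contrasts exhaustive grid search with a fixed-budget random search over a toy parameter space; the parameter names and the toy scoring function are assumptions for illustration only:

```python
import itertools
import random

def grid_search(evaluate, grid):
    """Exhaustive search: evaluates every combination in the grid."""
    best = None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

def random_search(evaluate, grid, n_iter=10, seed=0):
    """Samples a fixed budget of combinations instead of trying all."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in grid.items()}
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Toy objective preferring lr near 0.1 and depth near 6 (illustrative only).
def evaluate(p):
    return -abs(p["lr"] - 0.1) - 0.01 * abs(p["depth"] - 6)

grid = {"lr": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 6, 8]}
print(grid_search(evaluate, grid))    # 16 evaluations
print(random_search(evaluate, grid))  # only 10 evaluations
```

The budgeted search trades a small risk of missing the optimum for a large reduction in evaluations, which matters when each evaluation means a pass over a huge dataset.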

Author: Xaltius

This content is not for distribution. Any use of the content without intimation to its owner will be considered as violation.


In the past, mining big data with scalable algorithms that leverage parallel and distributed architectures was the focus of numerous conferences and workshops. Volume is now plainly evident: datasets once measured in terabytes are now measured in petabytes, as data is collected from individuals, business transactions and scientific experiments. Veracity, the aspect of big data that focuses on data quality, has drawn attention because of complexity, noise, missing values and imbalanced datasets. Another problem velocity presents is building streaming algorithms that can also handle these data-quality issues; this field is still at an early, inception stage. Variety – structured and unstructured data such as social media, images, video and audio – presents an opportunity for the data mining community, and a fundamental challenge is integrating these modalities into a single feature-vector representation. The last decade has seen social media grow rapidly, contributing further to the big data field.

From Data to Knowledge to Discovery to Action
Alternative insights or hypotheses lead to scenarios, which can also be weighted. Data-driven companies have seen productivity increases of 5–6%, according to a report by Brynjolfsson and colleagues that studied 179 companies.
Healthcare is another area in which big data applications have seen tremendous growth. For example, United Healthcare is using big data to mine customer attitudes, identifying customer sentiment and satisfaction. Such analytics can lead to quantifiable and actionable insights.

Another area where big data offers both opportunities and challenges is global optimization, where the objective is to optimize over a large number of decision variables. Metaheuristic algorithms have been successfully applied to complex and large-scale systems, and to reengineering and reconstructing organic networks.
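
As a concrete, simplified example of the kind of metaheuristic referred to here, a minimal simulated-annealing sketch for a one-dimensional toy objective might look like the following; the cooling schedule, step size and objective are illustrative assumptions, not the methods cited above:

```python
import math
import random

def simulated_annealing(objective, x0, steps=2000, temp=1.0, seed=0):
    """A minimal metaheuristic sketch (simulated annealing) for a
    one-dimensional objective. Illustrative only, not tuned."""
    rng = random.Random(seed)
    x, fx = x0, objective(x0)
    best_x, best_f = x, fx
    for t in range(1, steps + 1):
        T = temp / t                    # simple cooling schedule
        cand = x + rng.gauss(0, 0.5)    # random neighbour of current point
        fc = objective(cand)
        # Always accept improvements; accept worse moves with a
        # temperature-dependent probability to escape local optima.
        if fc < fx or rng.random() < math.exp((fx - fc) / T):
            x, fx = cand, fc
            if fc < best_f:
                best_x, best_f = cand, fc
    return best_x, best_f

# Toy multimodal objective (quadratic bowl plus a sinusoidal ripple).
def objective(x):
    return (x - 3) ** 2 + math.sin(5 * x)

best_x, best_f = simulated_annealing(objective, x0=0.0)
print(best_x, best_f)
```

Real applications operate over thousands of variables and expensive evaluations, which is precisely where the scalability questions below arise.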

Global Optimization of Complex Systems
The key idea is to group decision variables with high correlation together, placing independent variables in separate groups. Multi-objective optimization approaches include:
Weighted aggregation based methods,
Pareto dominance based approaches, and
Performance indicator based algorithms.
However, the problem with the above-mentioned approaches is that they become inefficient when the number of objectives is larger than three.
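
To ground the terminology, Pareto dominance (here for minimization) can be sketched as follows; the point set is purely illustrative:

```python
def dominates(a, b):
    """a Pareto-dominates b (minimization): a is no worse than b in
    every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the nondominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (2, 2), (4, 1), (3, 3)]
print(pareto_front(pts))  # (3, 3) is dominated by (2, 2)
```

With many objectives, almost every point becomes nondominated under this definition, which is one intuition for why the dominance-based approaches above degrade beyond three objectives.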

Big Data in Optimization
Analyzing and mining such data is also challenging, since the data is huge, may be stored in different forms, and may contain noise. In short, such data is a perfect example of the four V’s of big data.

In this discussion, emphasis is laid on future practices rather than on presenting a scientific point of view.

Decentralizing Big Data
“Crowdsourced” data, i.e. the data collected by tech giants like Google, Facebook, Apple etc., can be used to deliver new insights. It is a two-way deal in which users, knowingly or unknowingly, deposit their data in exchange for better services. This data is also made available to others, sometimes on purpose and sometimes inappropriately.
With data stored on centralized servers, concerns about its privacy and security are being raised, and many governments are introducing new regulatory regimes for internet companies. There is a felt need for privacy-friendly platforms in which data is stored with the user rather than in the cloud.
Decentralizing data would give rise to the problem of extreme data distribution. The internet companies will face three challenges:
• Scale of the data
• Timeliness of model building
• Migration from centralized to decentralized

Scaled Down Targeted Sub-Models
It is always preferable to use the entire data and avoid the need for sampling. Most research, however, focuses on the algorithms themselves; now that data is so massive, the challenges of data collection, storage, management, cleansing and transformation cannot be ignored. A possible solution is to massively partition the data into many overlapping subsets representing many subspaces, so that we can accurately uncover the multitude of behavioral patterns. A community of thousands of micro models can then operate as an ensemble over the entire population.
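
A minimal sketch of this idea – overlapping random partitions, each fitted by a tiny “micro model”, then aggregated as an ensemble – might look like the following; the mean-predicting model is an illustrative stand-in for real learners:

```python
import random

def overlapping_partitions(data, n_parts, overlap, seed=0):
    """Split data into n_parts random subsets, each covering roughly
    (1/n_parts + overlap) of the data, so that the subsets overlap."""
    rng = random.Random(seed)
    size = int(len(data) * (1 / n_parts + overlap))
    return [rng.sample(data, min(size, len(data))) for _ in range(n_parts)]

class MeanModel:
    """Trivial 'micro model': predicts the mean of its partition."""
    def fit(self, subset):
        self.value = sum(subset) / len(subset)
        return self
    def predict(self):
        return self.value

data = list(range(100))
models = [MeanModel().fit(part)
          for part in overlapping_partitions(data, n_parts=5, overlap=0.1)]
# Ensemble prediction: average the micro models' outputs.
print(sum(m.predict() for m in models) / len(models))
```

In a real system each partition would correspond to a behavioral subspace and each micro model to a genuine learner, but the structure – many small overlapping models combined over the whole population – is the same.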

Right Time, Real Time Online Analytics
Organizations cannot afford to spend months building models. There is a clear need for models that are built in real time, respond in real time, and learn and change their behavior in real time – dynamic, agile models working online.
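
One minimal sketch of such online behavior is a single-pass stochastic-gradient update, where each example adjusts the model immediately and is then discarded; the one-dimensional linear model here is purely illustrative:

```python
def online_sgd(stream, lr=0.1):
    """Single-pass online update of a 1-D linear model y = w * x.
    Each (x, y) example updates the model immediately and is discarded,
    so no dataset ever needs to be stored or re-scanned."""
    w = 0.0
    for x, y in stream:
        error = w * x - y
        w -= lr * error * x   # gradient step on the squared error
    return w

# The true relation here is y = 2x; the model adapts example by example.
stream = [(x, 2 * x) for x in [1.0, 0.5, 1.5, 1.0, 0.8]] * 20
print(online_sgd(stream))  # converges toward 2.0
```

Because the model state is tiny and updates are constant-time, the same pattern scales to streams that could never be batch-processed within the required response time.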

Extreme Data Distribution: Privacy and Ownership
Aggregating data centrally is potentially dangerous, as it poses great privacy and security risks: centralized data is a single point of failure, where one flaw or one breach may lead to devastating consequences at massive scale. Personal data needs to move back to the individual to whom it belongs. Instead of centralizing all computation, the computation should be brought to intelligent agents running on personal devices that communicate with the service providers. Personal data can be kept encrypted, with only the owner’s smart devices able to decrypt it. The smart devices interact with each other over a mesh network, and insights can be found without ever collecting the data on a central server.
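
A toy sketch of this decentralized pattern – devices compute local parameters and share only those, never raw records – could look like the following simplified, federated-averaging-style scheme (all names are illustrative; a real system would add encryption and secure aggregation):

```python
# Sketch of decentralized learning: each device fits a model on its own
# data and shares only model parameters (here, a single mean statistic).
# The server aggregates parameters, never personal records.

def local_update(personal_data):
    """Runs on the user's device; raw data never leaves it."""
    return sum(personal_data) / len(personal_data)   # local parameter

def aggregate(local_params, sizes):
    """Server-side step: combine parameters weighted by local data size."""
    total = sum(sizes)
    return sum(p * n for p, n in zip(local_params, sizes)) / total

devices = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]
params = [local_update(d) for d in devices]
global_param = aggregate(params, [len(d) for d in devices])
print(global_param)  # equals the mean over all data, 26/6
```

The weighted aggregate reproduces the global statistic exactly, while the server only ever sees one number per device – the core promise of keeping computation at the edge.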

Opportunities and Challenges
Once data is decentralized, the next problem is tackling extreme data distribution. With data distributed so widely, we will be challenged to provide the services we have come to expect from massive centralized data storage. The challenges will then be:
• Appropriately partitioning data to identify behavioral groups within which we can model and learn, down to the individual level.
• Refocusing on learning algorithms that self-learn online, in real time.
