The sharp increase in data volume has led to an upsurge in research, and ‘Big Data’ has become one of the most widely used terms today. Its influence pervades science, business, industry, government, society and more.
Processing big data involves collection, storage, transportation and exploitation. The term is commonly defined by four V’s: Volume, Velocity, Veracity and Variety. Volume is the size of the data that algorithms must process. Velocity is the rate at which data streams in, which is growing rapidly and may be too fast for traditional algorithms and systems to handle. Veracity concerns the quality of the data; in practice, quality tends to drop as the data grows. Variety refers to the different data types and modalities that can describe a given object. Nowadays, many traditional process-oriented companies are trying to become knowledge-based companies driven by data rather than by process.
MACHINE LEARNING WITH BIG DATA:
Data analytics comprises various machine learning techniques, and several misconceptions surround them. The points below describe misleading arguments that are common in the big data era.
Models are irrelevant now
There is an ongoing argument about whether models from the small-data era should be used for big data, since many think that sophisticated models are not a good fit for big data. A careful look at the empirical results shows that this is not the case. We cannot conclude that the worst-performing model on small data is the simplest one, or vice versa. Nor is there any proof that the simplest model on small data will achieve the best performance on big data.
With the rapid advance of computational techniques, it is reasonable to expect that sophisticated models will become more favored in the big data era, since simple models are often unable to handle large amounts of data, for instance because of memory limits.
Correlation is enough
Some books claim that finding correlations in big data is enough. But causality can never be replaced by correlation. Discovering a valid correlation can sometimes provide helpful information, but ignoring the importance of causality and treating the replacement of causality by correlation as a feature of the big data era can be dangerous.
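The gap between correlation and causality can be illustrated with a small simulation. This is only a sketch on a toy model that is assumed here for illustration: a hidden variable z drives both x and y, while x never influences y. The function names `pearson` and `simulate` and all the numbers are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, seed, intervene):
    """Toy model (an assumption): z drives both x and y; x never affects y."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)              # hidden common cause
        if intervene:
            x = rng.gauss(0, 1)          # x is set by hand, independently of z
        else:
            x = z + rng.gauss(0, 0.5)    # x is only observed, and is driven by z
        y = z + rng.gauss(0, 0.5)        # y depends on z only
        xs.append(x)
        ys.append(y)
    return xs, ys


observed_r = pearson(*simulate(5000, seed=0, intervene=False))
intervened_r = pearson(*simulate(5000, seed=0, intervene=True))
print(f"correlation when x is only observed: {observed_r:.2f}")
print(f"correlation when x is set by hand:   {intervened_r:.2f}")
```

In the observed data x and y are strongly correlated, yet setting x by hand leaves y unchanged, so the correlation alone says nothing about what would happen if we acted on x.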
Previous Methodologies are useless
Many claim that previous research methodologies were designed for small data and therefore cannot work on big data as expected. But "small" is a relative term: what we call small data today may have been the big data of its time. Research on handling big data has therefore been going on continuously since the early years. In fact, high-performance computing, parallel and distributed computing, high-efficiency storage, etc., will remain very popular in the future as well.
Opportunities and Challenges
We need to take the growing data size into account because computing performance measures such as AUC involves repeated scans of the entire dataset. So the question is: can we identify valuable subsets of the original dataset?
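As a rough illustration of the subset question, the sketch below compares AUC computed on a full synthetic dataset with AUC computed on a uniform random sample of it. Uniform sampling is the simplest possible choice and is used here only as an assumption; it is not a method for finding the most valuable subset. The data, the sample size and the helper `auc` are all hypothetical.

```python
import random


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); tied scores get their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1       # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Synthetic scores: positives are shifted up by one standard deviation.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(20000)]
scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

# Evaluate on the full data and on a 10% uniform random sample.
sample_idx = rng.sample(range(len(labels)), 2000)
full_auc = auc(scores, labels)
sample_auc = auc([scores[i] for i in sample_idx], [labels[i] for i in sample_idx])
print(f"AUC on all 20000 rows:  {full_auc:.3f}")
print(f"AUC on 2000-row sample: {sample_auc:.3f}")
```

On data like this the sampled estimate stays close to the full one at a fraction of the cost. Whether that holds for a real dataset, and how to pick a better subset than a random one, is exactly the open question.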
We can also ask whether a parameter tuning guide could be developed to replace the current exhaustive search. Another area related to big data is statistical hypothesis testing, which lets us verify that what we have done is correct. Deriving interpretable models serves the same purpose. Rule extraction and visualization are other important approaches to improving the comprehensibility of models. Another open problem in the big data era is whether we can really avoid violating privacy, which remains a long-standing unresolved issue.
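On the parameter tuning question, one widely used alternative to exhaustive search is random search. It is not a tuning guide, and it is named here as a substitute technique rather than something the text proposes. The sketch below compares the two on a made-up loss surface: `validation_loss`, the parameter names and the search ranges are hypothetical stand-ins for training and scoring a real model.

```python
import itertools
import math
import random


def validation_loss(learning_rate, reg_strength):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    return ((math.log10(learning_rate) + 2) ** 2
            + 0.1 * (math.log10(reg_strength) + 4) ** 2)


def grid_search(points_per_axis):
    """Exhaustive search: evaluate every combination on a log-spaced grid."""
    steps = [4 * i / (points_per_axis - 1) for i in range(points_per_axis)]
    lr_grid = [10 ** (-4 + s) for s in steps]     # 1e-4 .. 1
    reg_grid = [10 ** (-6 + s) for s in steps]    # 1e-6 .. 1e-2
    trials = [(validation_loss(lr, reg), lr, reg)
              for lr, reg in itertools.product(lr_grid, reg_grid)]
    return min(trials), len(trials)


def random_search(n_trials, seed):
    """Random search: draw each parameter log-uniformly from the same ranges."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)
        reg = 10 ** rng.uniform(-6, -2)
        trials.append((validation_loss(lr, reg), lr, reg))
    return min(trials), len(trials)


grid_best, grid_evals = grid_search(points_per_axis=20)
rand_best, rand_evals = random_search(n_trials=60, seed=0)
print(f"grid search:   loss {grid_best[0]:.3f} after {grid_evals} evaluations")
print(f"random search: loss {rand_best[0]:.3f} after {rand_evals} evaluations")
```

Random search uses far fewer evaluations here, usually at the price of a somewhat worse best loss. How that trade-off plays out depends on the real loss surface, which this toy function does not represent.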