In the past, mining big data with algorithms that scale by leveraging parallel and distributed architectures was the focus of numerous conferences and workshops. Volume, the third important aspect of big data, is now readily available: datasets that were once measured in terabytes are now measured in petabytes, as data is collected from individuals, business transactions, and scientific experiments. Veracity, the fourth aspect of big data, focuses on data quality, which has become a concern because of complexity, noise, missing values, and imbalanced datasets. Another problem that velocity presents is building streaming algorithms that can handle these data-quality issues; this field is still at an inception stage. Variety of data, both structured and unstructured, such as social media, images, video, and audio, presents an opportunity for the data mining community. A fundamental challenge is integrating these modalities into a single feature-vector representation. The last decade has seen social media grow, contributing further to the big data field.
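To make the feature-integration challenge concrete, here is a minimal sketch that fuses placeholder text, image, and audio feature blocks into one vector by normalizing each modality and concatenating; the extractors and dimensions are illustrative assumptions, not a prescribed pipeline.

```python
# Minimal sketch: fusing heterogeneous modalities into a single feature
# vector by normalizing each block and concatenating. The blocks below are
# placeholders for real text/image/audio feature extractors (assumed).
import numpy as np

def l2_normalize(v):
    return v / (np.linalg.norm(v) + 1e-12)        # guard against zero vectors

rng = np.random.default_rng(0)
text_feats  = rng.normal(size=300)                # e.g., a text embedding (assumed)
image_feats = rng.normal(size=512)                # e.g., CNN image features (assumed)
audio_feats = rng.normal(size=128)                # e.g., audio features (assumed)

# Normalize per modality so no single block dominates, then concatenate.
fused = np.concatenate([l2_normalize(b) for b in (text_feats, image_feats, audio_feats)])
print(fused.shape)                                # one vector per record: (940,)
```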
From Data to Knowledge to Discovery to Action
Alternative insights or hypotheses lead to scenarios, which can also be weighted. Data-driven companies have seen a productivity increase of 5-6%, according to a report by Brynjolfsson that studied about 179 companies.
Healthcare is another area in which big data applications have seen tremendous growth. For example, UnitedHealthcare is using big data to mine customer attitudes and identify customer sentiment and satisfaction. Such analytics can lead to quantifiable and actionable insights.
GLOBAL OPTIMIZATION WITH BIG DATA
Another area where big data offers opportunities and challenges is global optimization. The objective is to find the values of the decision variables that maximize (or minimize) an objective function. Metaheuristic algorithms have been successfully applied to complex and large-scale systems, such as the reengineering and reconstruction of biological networks.
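As an illustration of the metaheuristic family mentioned above, here is a minimal simulated-annealing sketch for a generic continuous problem (minimization here; maximization is symmetric). The Rastrigin objective, bounds, and cooling schedule are illustrative assumptions.

```python
# Minimal sketch of a metaheuristic (simulated annealing) for global
# optimization. Objective, bounds, and schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    # Rastrigin function: a standard multimodal test problem (minimize).
    return np.sum(x**2 - 10 * np.cos(2 * np.pi * x) + 10)

x = rng.uniform(-5.12, 5.12, size=5)              # decision variables
fx = objective(x)
best_x, best_f = x, fx
T = 1.0
for _ in range(20_000):
    cand = np.clip(x + rng.normal(scale=0.1, size=5), -5.12, 5.12)
    fc = objective(cand)
    # Always accept improvements; accept worse moves with probability exp((fx - fc) / T)
    # so the search can escape local optima while the temperature is high.
    if fc < fx or rng.random() < np.exp((fx - fc) / T):
        x, fx = cand, fc
        if fc < best_f:
            best_x, best_f = cand, fc
    T *= 0.9995                                   # geometric cooling schedule

print("best objective found:", round(best_f, 3))
```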
Global Optimization of Complex Systems
The key aspect is to group decision variables with high correlation together and place independent variables in another group. Approaches to multi-objective optimization include:
• Weighted aggregation based methods,
• Pareto dominance based approaches, and
• Performance indicator based algorithms.
However, the problem with the above-mentioned approaches is that they become inefficient when the number of objectives grows larger than three.
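To make the first family concrete, the sketch below scalarizes two objectives into a weighted sum and searches it with plain random search; the objectives, weights, and search budget are illustrative assumptions. Sweeping the weights traces out different Pareto trade-offs, which hints at why the approach scales poorly as the number of objectives grows.

```python
# Minimal sketch of weighted-aggregation multi-objective optimization:
# collapse two objectives into one weighted sum and search it.
import numpy as np

rng = np.random.default_rng(0)

def f1(x):                                        # objective 1 (minimize)
    return np.sum(x**2)

def f2(x):                                        # objective 2 (minimize)
    return np.sum((x - 2)**2)

w1, w2 = 0.7, 0.3                                 # preference weights, sum to 1
best_x, best_val = None, np.inf
for _ in range(10_000):                           # plain random search for brevity
    x = rng.uniform(-5, 5, size=3)
    val = w1 * f1(x) + w2 * f2(x)                 # the scalarized objective
    if val < best_val:
        best_x, best_val = x, val

print("trade-off solution:", np.round(best_x, 2))
# Each choice of (w1, w2) yields one point on the Pareto front, so many
# weight vectors (and runs) are needed as the number of objectives grows.
```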
Big Data in Optimization
Analysis and mining are also challenging, since the data is huge, may be stored in different forms, and may contain noise. In short, such data is a perfect example of the four V's of big data.
INDUSTRY, GOVERNMENT AND SOCIETY WITH BIG DATA
In this discussion, emphasis is placed on future practices rather than on presenting a scientific point of view.
Decentralizing Big Data
“Crowdsourced” data, i.e. the data collected by tech giants like Google, Facebook, and Apple, can be used to deliver new insights. It is a two-way deal in which users, knowingly or unknowingly, deposit their data in exchange for better services. This data is also made available to others, on purpose and/or inappropriately.
With the data being stored on centralized servers, concerns about its privacy and security are being raised. Many governments are introducing new regulatory regimes for internet companies. There is a felt need for privacy-friendly platforms in which the data is stored with the user rather than on the cloud.
Decentralizing data would give rise to the problem of extreme data distribution. Internet companies will then face three challenges:
• Scale of the data
• Timeliness of model building
• Migration from centralized to decentralized
Scaled Down Targeted Sub-Models
It is always preferable to use the entire dataset and avoid the need for sampling. A lot of research generally focuses on the algorithms. Since data is now so massive, the challenges presented by data collection, storage, management, cleansing, and transformation cannot be ignored. A possible solution is to massively partition the data into many overlapping subsets representing many subspaces, so that the multitude of behavioral patterns can be uncovered accurately. A community of thousands of micro models can then be used as an ensemble operating over the entire population.
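A minimal sketch of this micro-model idea, assuming a generic tabular classification task: many small, overlapping random subsets each train a tiny model, and their predictions are averaged as an ensemble. The subset rule, data, and model class are placeholders, not the article's prescribed method.

```python
# Minimal sketch: an ensemble of "micro models", each trained on an
# overlapping partition of the data (illustrative setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                  # placeholder feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # placeholder labels

n_models, subset_frac = 50, 0.05                  # many small, overlapping subsets
models = []
for _ in range(n_models):
    idx = rng.choice(len(X), size=int(subset_frac * len(X)), replace=False)
    models.append(LogisticRegression().fit(X[idx], y[idx]))

# Ensemble prediction: average the probabilities of all micro models.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print("ensemble accuracy:", ((proba > 0.5) == y).mean())
```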
Right Time, Real Time Online Analytics
Organizations cannot afford to spend months building models. There is a clear need for models that are built in real time, respond in real time, and learn and change their behavior in real time; in other words, for dynamic, agile models that work online.
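A minimal sketch of such online learning: a logistic model updated one observation at a time with stochastic gradient descent, so it adapts as the stream arrives. The synthetic stream, dimensions, and learning rate are illustrative assumptions.

```python
# Minimal sketch of online (real-time) learning: a logistic model updated
# one observation at a time with stochastic gradient descent.
import numpy as np

rng = np.random.default_rng(1)
d = 3                                             # feature dimension (assumed)
w, lr = np.zeros(d), 0.1

def stream():
    while True:                                   # stand-in for a live event stream
        x = rng.normal(size=d)
        yield x, int(x @ np.array([1.0, -2.0, 0.5]) > 0)

for t, (x, y) in zip(range(5_000), stream()):
    p = 1 / (1 + np.exp(-w @ x))                  # predict before updating
    w -= lr * (p - y) * x                         # single-example gradient step

print("learned weights:", np.round(w, 2))
```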
Extreme Data Distribution: Privacy and Ownership
Aggregating data centrally is potentially dangerous, as it poses great privacy and security risks. Centralized data is a single point of failure: one flaw or one breach may lead to devastating consequences on a massive scale. Personal data needs to move back to the individual to whom it belongs. Instead of centralizing all the computation, there is a need to bring the computation to intelligent agents running on personal devices and communicating with the service providers. Personal data can be encrypted so that only the smart devices can decrypt it. The smart devices interact with each other over a mesh network, and insights can be found without having to collect the data on a central server.
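One concrete pattern that matches this description is federated averaging, sketched minimally below: each device fits a model on data that never leaves it and shares only the parameters, which a coordinator averages. The linear model and synthetic device data are illustrative assumptions, not the article's prescribed protocol.

```python
# Minimal sketch of federated averaging: each device fits a local model on
# its own (private) data and shares only parameters; raw data never leaves
# the device. Model class and device data are illustrative.
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([2.0, -1.0])                    # assumed ground truth for the demo

def local_fit(n):
    """One device: least-squares fit on data that stays on the device."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return np.linalg.lstsq(X, y, rcond=None)[0]   # only parameters leave the device

device_weights = [local_fit(rng.integers(50, 200)) for _ in range(10)]
global_w = np.mean(device_weights, axis=0)        # coordinator sees parameters, not data
print("aggregated model:", np.round(global_w, 2))
```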
Opportunities and Challenges
Once data is decentralized, the remaining problem is to tackle extreme data distribution. With the data distributed so widely, we will be challenged to provide the services we have come to expect from massive centralized data storage. The challenges then are:
• Appropriately partitioning data to identify behavioral groups within which we can learn and model at the group level, and to learn at an individual level.
• Refocusing on delivering learning algorithms that self-learn in real time and operate online.
Author: Xaltius
This content is not for distribution. Any use of the content without notifying its owner will be considered a violation.