CLUSTERING: AN UNSUPERVISED MACHINE LEARNING ALGORITHM

Machine Learning is generally divided into two types, Supervised and Unsupervised Learning. Unsupervised Learning is further divided into two main types. These are:

  • Clustering: A clustering problem is one where we need to bring out the inherent groupings in data, e.g. grouping customers by their purchasing behavior.
  • Association: An association rule learning problem is one where we want to discover rules that describe large portions of our data, e.g. the recommendations on most online shopping websites and social networking sites of the type “People that buy X also tend to buy Y.”

In this article we will learn more about clustering and how it is used!

Clustering is a form of Unsupervised Learning, in which there is only input data (X) and no corresponding output variable (a dependent variable y to be predicted). The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it. It is called unsupervised learning because, unlike supervised learning (as used in regression and classification), there are no correct answers and no teacher in the form of a labelled training set. The algorithms are left to their own devices to discover and present the interesting structure in the data.
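
To make this concrete, here is a minimal sketch of clustering with scikit-learn's KMeans. The synthetic “customer” features and the choice of three clusters are assumptions made purely for illustration.

```python
# Minimal clustering sketch using scikit-learn's KMeans.
# The synthetic "customer" features and the choice of 3 clusters
# are illustrative assumptions, not values from the article.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Only input data X is available -- there is no target variable y.
X = rng.normal(size=(300, 2))          # e.g. [annual spend, visit frequency], already scaled

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)          # cluster index assigned to each row

print(labels[:10])                     # first few cluster assignments
print(model.cluster_centers_)          # coordinates of the discovered group centres
```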

Clustering Methods

Clustering methods are broadly classified into the following categories:

  • Partitioning Method – It partitions a set of ‘n’ objects into ‘k’ groups (partitions) of data.
  • Hierarchical Method – It creates a hierarchical decomposition of the given set of data objects.
  • Density-based Method – The basic idea of this approach is to keep growing a cluster for as long as the density in its neighborhood exceeds some fixed threshold (a minimal sketch of one such algorithm follows this list).
  • Grid-Based Method – Here, the object space is quantized into a finite number of cells that together form a grid.
  • Model-Based Method – Here, a model is hypothesized for each cluster to find the best fit of the data to a given model.
  • Constraint-based Method – Here, clustering is performed by incorporating user- or application-oriented constraints.
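
As promised above, here is a minimal sketch of a density-based method using scikit-learn's DBSCAN. The eps and min_samples values and the synthetic data are illustrative assumptions; in practice they are tuned to the scale and density of the data.

```python
# Minimal sketch of a density-based method (DBSCAN) from scikit-learn.
# The eps and min_samples values are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points that should end up as noise.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(100, 2)),
    rng.uniform(low=-2, high=7, size=(10, 2)),
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks points labelled as noise/outliers
```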

It would not be an exaggeration to say that clustering quietly assists us in many walks of daily life. Clustering finds many uses in industry. Some of them are:

  • It can help marketing managers discover distinct groups and sub-groups among their customers based on similarities such as age group, car ownership and average expense, which in turn helps in devising tactics for better sales.
  • Cluster analysis is broadly used in market research, pattern recognition, image processing, and data analysis.
  • Identification of areas of similar land use in Earth observation databases, which is also applied to identifying groups of houses in a city based on house type, value, and geographic location.
  • In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar and dissimilar functionality, and gain insight into structures inherent in populations.
  • Taxi services such as Uber and Ola process large amounts of valuable data with clustering, around traffic, transit time, peak pickup localities, and more.
  • Classifying documents on the web for information discovery, as a search engine does.
  • Outlier detection applications, e.g. detection of credit card fraud.
  • Clustering is also helpful in identifying crime localities that require special attention from the police.
  • The broadest and most extensive use of clustering is in data mining, where different data elements are classified and put into related groups.
  • Call Detail Record (CDR) analysis – the information captured by telecom companies worldwide during calls, SMS, and data usage – can be clustered to segment customers by their usage patterns.

To conclude, Unsupervised Learning plays an extraordinary role in surfacing facts and patterns that cannot be seen by the human eye. The information it produces is useful not just for an individual company, but across a broad range of industries.

BEST PRACTICES IN DATA VISUALIZATION

Our world is progressively filling up with data; companies of every size, from large multinationals to young startups, are stockpiling massive amounts of raw data and looking for ways to analyse it and turn it into information that makes sense. Data visualisation represents data in pictorial form so that decision makers, such as marketing managers, can understand complex findings.

By one estimate, 3.5 trillion e-mails are sent every day for the promotion of companies; companies prepare ads and stockpile resources to deliver them to as many users as they can. With a little analysis, a considerable portion of recipients, those with a meagre conversion rate, can be cut from the list. Doing so not only reduces the waste of resources, but also lets companies concentrate on the people with a higher conversion rate, increasing the chances of the product being sold. Doing this well requires strong data visualisation.

Data visualisation can take everyone by surprise. It is here that a seemingly meaningless pile of data starts making sense and delivers a specific result as per the needs of the end user or developer. It takes shape with the combined effort of one's creativity, attention, knowledge, and thinking. Data visualisation can be useful as well as harmful. To help your cause and avoid misleading visualisations, here are some of the best practices for making your visualisation clear, useful and productive.

A. Plan your Resources
Create a sequence of steps by gathering your requirements, your raw data, and any other factors that might affect the result. Choosing the right method for visualising your data requires a data scientist's knowledge and experience. Planning the resources can be very helpful, as it leads to greater efficiency for the same workload.

B. Know your Audience
The most essential and unavoidable step in creating great visualisations is knowing what to deliver. Focus on the audience: their mindsets, their queries, their perceptions and beliefs; then plan accordingly. Not all viewers will receive the information in the same way. For example, a probability density graph means something different to an HR manager than to a chief sales executive. So it is vital that you know your target audience and prepare visualisations from their perspective.

C: Predict after-effects
Predicting the effect on end users can further your cause. There may be periods where no action is needed because everything is trending positively, while a downturn in a particular area may require immediate action.

D: Classify your Dashboard
There are three main types of dashboards – strategic, analytical and operational. The descriptions below will help you decide which dashboard suits your needs best.

  • Strategic Dashboard: It presents a top-level view of a line of inquiry that is reviewed on a routine basis, and shows KPIs in a minimally interactive way.
  • Analytical Dashboard: It provides a range of investigative approaches to a central specific topic.
  • Operational dashboard: It provides a regularly updated answer to a line of enquiry based on response to events.

E: Identify the Type of data

  • Data is of three types: categorical, ordinal and quantitative. Different types of visualisation work better with different kinds of data: a single quantitative series works best with a line plot, while two related quantitative variables work better with a scatter plot (a minimal sketch follows this list). A brief description of each type of data is given below:
    • Quantitative: Numerical data that can be measured or counted.
    • Ordinal: Categorical data with a natural order, e.g. medals – Gold, Silver and Bronze.
    • Categorical: Data that falls into distinct, unordered groups, e.g. gender – male, female and other.
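
As a rough illustration of matching chart type to data type, here is a minimal matplotlib sketch; all sample values are invented purely for the example.

```python
# Minimal sketch: matching chart type to data type with matplotlib.
# All sample values are invented purely for illustration.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Quantitative data over a single dimension -> line plot.
months = range(1, 13)
revenue = [10, 12, 11, 15, 18, 17, 21, 20, 24, 23, 27, 30]
axes[0].plot(months, revenue)
axes[0].set_title("Quantitative: line plot")

# Two related quantitative variables -> scatter plot.
spend = [5, 9, 12, 20, 25, 31, 40]
visits = [1, 2, 2, 4, 5, 6, 8]
axes[1].scatter(spend, visits)
axes[1].set_title("Two variables: scatter plot")

# Categorical data -> bar chart.
categories = ["A", "B", "C"]
counts = [34, 21, 45]
axes[2].bar(categories, counts)
axes[2].set_title("Categorical: bar chart")

plt.tight_layout()
plt.show()
```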

F: Use of Visual Features

  • Having done the above, a well-judged choice of colour, hue and saturation can elevate your visualisation; it is presence of mind that draws attention.
  • Using the wrong hue and saturation can ruin all your efforts. A good set of visual features gives the final touch to your data visualisation (a minimal colour-choice sketch follows this list).
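
As one possible illustration of colour choice, here is a minimal matplotlib sketch; the data are synthetic, and the preference for a perceptually uniform colormap such as "viridis" is a general guideline rather than a rule from the article.

```python
# Minimal sketch: colour choices in matplotlib.
# The data and palette choices are illustrative assumptions; the point is
# to prefer a perceptually uniform colormap (e.g. "viridis") when colour
# encodes magnitude, rather than an over-saturated one.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x, y = rng.random(200), rng.random(200)
value = x + y  # the magnitude we want colour to encode

plt.scatter(x, y, c=value, cmap="viridis")  # uniform hue/lightness ramp
plt.colorbar(label="encoded value")
plt.title("Colour encodes magnitude")
plt.show()
```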

To conclude, modern technologies like machine learning and AI would, by themselves, be of little use to businesses without data visualisation. Data visualisation has become a field of study in its own right and is important in every aspect of analysing data.

WAREHOUSE OPTIMIZATION USING AI @ DHL

Artificial Intelligence is creating waves of disruption across many industries, be it manufacturing or human resources (HR). One of the major industries AI has penetrated today is supply chain and logistics. Large logistics and supply chain organizations have begun embarking on such journeys, with Machine Learning and AI helping them across different use cases in the industry. Experts say that by 2020, AI could completely transform warehouse operations, with improvements in efficiency, profits and targets. A warehouse powered by AI would become more responsive and dynamic.

With the above in mind, we began our journey with DHL and were called in by the operations department to help them understand how AI can help them power their warehouse operations.

We conducted a half-day workshop to help them understand Supply Chain 4.0 and the various types of AI and Machine Learning. We walked them through several algorithms (such as neural networks and genetic algorithms) and machine learning models trained on real DHL warehouse data, the interpretation of their results, and how those results can be effective in optimizing their operations multi-fold.

The participants and the organizing team found this to be very useful and implementable in their day to day operations!

Get in touch with us for such customized trainings and to embark on the data science and AI journey!

PYTHON VS R – THE BURNING QUESTION

R and Python are both open-source programming languages with a large community. They are very popular among data analysts. New libraries or tools are added continuously to their respective catalog. R is mainly used for statistical analysis while Python provides a more general approach to data science.

While Python is often praised for being a general-purpose language with an easy-to-understand syntax, R's functionality is developed with statisticians in mind, giving it field-specific advantages such as great features for data visualization. Both R and Python are state-of-the-art programming languages oriented towards data science, so learning both is of course the ideal solution. But learning both requires a time investment, a luxury not everyone can afford.

Let us see how these two programming languages relate to each other, by exploring the strengths of R over Python and vice versa and indulging in basic comparison between these two.

Python can do almost all the tasks that R can: data wrangling, engineering, feature selection, web scraping and so on. But Python is known as a tool to deploy and implement machine learning at scale, as Python code is easier to maintain and more robust than R. The language is kept up to date with many data analysis and machine learning libraries and provides APIs for machine learning and AI. Python is also usually the first choice when the results of an analysis need to be used in an application or a website.
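
To make the deployment point concrete, here is a minimal sketch of the train-then-serve pattern, assuming scikit-learn and joblib are available; the file name and the synthetic dataset are illustrative only.

```python
# Minimal sketch of the train-then-deploy pattern that makes Python popular
# for putting models into applications. File name and data are assumptions.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")        # persist the trained model

# ...later, inside a web service or batch job...
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:3]))              # serve predictions from the saved model
```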

R has been developed by academics and statisticians over more than two decades and now has one of the richest ecosystems for data analysis. Around 12,000 packages are available in CRAN (its open-source repository). A rich variety of libraries can be found for almost any analysis one needs to perform, making R the first choice for statistical analysis, especially for specialized analytical work.

One major difference between R and other statistical tools or languages is the output. Most other tools offer very good facilities for communicating results and presenting findings. In R, RStudio ships with the knitr library, which helps with this, but beyond that R lacks flexibility for presentation.

R and Python Comparison

Parameter | R | Python
Objective | Data analysis and statistics | Deployment and production
Primary users | Scholars and R&D | Programmers and developers
Flexibility | Easy to use available libraries | Easy to construct new models from scratch, i.e. matrix computation and optimization
Learning curve | Difficult at the beginning | Linear and smooth
Popularity of programming language (percentage change) | 4.23% in 2018 | 21.69% in 2018
Average salary | $99,000 | $100,000
Integration | Runs locally | Well integrated with apps
Task | Easy to get primary results | Good for deploying algorithms
Database size | Handles huge sizes | Handles huge sizes
IDE | RStudio | Spyder, IPython Notebook
Important packages and libraries | tidyverse, ggplot2, caret, zoo | pandas, scipy, scikit-learn, TensorFlow, caret
Disadvantages | Slow; high learning curve; dependencies between libraries | Not as many libraries as R
Advantages | Graphs are made to talk, and R makes them beautiful; large catalog for data analysis; GitHub interface; RMarkdown; Shiny | Jupyter Notebook (notebooks help to share work with colleagues); mathematical computation; deployment; code readability; speed; functions in Python

Source: https://www.guru99.com/r-vs-python.html

The Usage

As mentioned before, Python has powerful libraries for math, statistics and Artificial Intelligence. While Python is the best tool for machine learning integration and deployment, the same cannot be said for business analytics.

R, on the other hand, is designed by experts to answer statistical problems. It can also solve problems in machine learning and data science. R is preferred for data science due to its powerful communication libraries. It is also equipped with numerous packages for time series analysis, panel data analysis and data mining. But R is known to have a steep learning curve and is therefore not recommended for beginners.

As a beginner in data science with the necessary statistical knowledge, it might be easier to use Python: first learn how to build a model from scratch, then switch to the functions from the machine learning libraries (a minimal sketch of this path follows). R can be the first choice if the focus is going to be on statistics.
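
As a rough illustration of that learning path, here is a minimal sketch assuming NumPy and scikit-learn; the data are invented for the example.

```python
# Minimal sketch of the "from scratch first, library later" path suggested
# above: fit a simple linear regression with NumPy's least squares, then
# reproduce it with scikit-learn. The data are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

# From scratch: solve the least-squares problem for slope and intercept.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print("numpy  :", slope, intercept)

# Same model via the library function.
lr = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn:", lr.coef_[0], lr.intercept_)
```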

In conclusion, one needs to pick the programming language based on the requirements and available resources. The decision should be made based on what kind of problem is to be solved, or the kind of tools that are available in the field.

CAN AI SAVE MARINE LIFE?

In June 2018, a rescue attempt to save a pilot whale who was struggling to swim and breathe failed in Thailand. The autopsy revealed that the whale had more than 17 pounds of plastic bags and other trash clogging its stomach, which had made it difficult for the animal to digest food. Such examples have become fairly common nowadays.

According to UNESCO’s Facts and figures on marine biodiversity, by 2100, without major changes, more than half of all marine species will be at risk of extinction. Oceans contain the greatest diversity of life on Earth and protecting them has become one of the world’s most important challenges. One of the latest powers to take charge in this field is Artificial Intelligence.

How does Artificial Intelligence play a role?

AI can be, and is being, used to complete tasks usually done manually by researchers, from identifying individual animals in photos for population studies to categorizing the many millions of camera-trap photos gathered by field scientists. The big data gathered through these advanced technologies can then be analyzed with machine learning for various purposes. New advances in satellite observation, open data and machine learning now allow us to process the massive amounts of data being produced. Here are some examples where smart technology is making a difference.

In 2014, researchers at the University of California, San Diego (UCSD) built an artificially intelligent camera intended to monitor and track endangered species. The submersible system records video when it hears a vocalization made by a marine animal and produces data relevant to biological monitoring. The SphereCam is powered by an Intel Edison compute module, which allows for flexibility, energy efficiency and easy programming. In addition, the module can run on batteries for up to a week and fits inside the camera, which keeps it from getting wet.

Global Fishing Watch uses satellite-based monitoring to track fishing vessels in real time to protect fisheries around the world. They collect satellite imagery and analyze boats' movements with a specially designed form of machine learning to determine whether they are fishing vessels or merely seafaring ones. They then publish all fishing-boat data on their open website, so that researchers, law enforcement agencies and the public can watch key trends such as fishing frequency and monitor whether any fishing boats venture into protected waters.

Google has teamed up with Queensland University to create a detector powered by machine learning, which can automatically identify manatees (or sea cows) in ocean images. This detector can save the time, energy and resources that researchers spend while sifting through thousands of aerial photos to spot the animals, as the image recognition system will do this work for them.

How do all these devices and information help?

Information such as this helps conservationists track populations, identify the results of human interventions in manatee habitats and can play a key role in managing the future of this endangered species. Similar software can also be developed to track other marine life as well, such as humpback whales and other ocean mammals.

Researchers are also using computational sustainability — the ability to analyze Big Data and make predictions based on trends — to understand marine life and find solutions.

In partnership with Microsoft, The Nature Conservancy has combined traditional academic research with cloud and AI technologies to map ocean wealth in high resolution. By evaluating the economic value of ocean ecosystem services, such as carbon storage and fishing activity, it makes better conservation and planning decisions possible.

IBM recently announced a new AI-powered microscope capable of monitoring plankton behavior in order to predict the health of the environment. Within a few years, IBM anticipates that these small autonomous AI microscopes will be networked in the cloud and deployed around the world, continually collecting information that will provide tremendous insight into the health and operation of the ocean's complex ecosystem.

Combining AI and robotic technology could also reduce illegal activity by poachers through tracking as well as reduce emissions from cargo ships by suggesting the best routes. “Not only will a healthy ocean benefit its inhabitants, but the entire human race. AI is thus the next step”, writes Megan Ray from the Women of Green.

While AI is disrupting numerous industries and transforming lifestyle, it could also be a smart way to save marine life.

This content is not for distribution. Any use of the content without intimation to its owner will be considered a violation.
