SQL EVERY DATA SCIENTIST SHOULD KNOW

SQL (Structured Query Language) is the standard language for creating, maintaining, and querying relational databases. It makes retrieving, manipulating, and updating data swift and easy. Developed in the 1970s, SQL has become a very important tool in a data scientist’s toolbox. Here are some reasons why.

 

  • Easy to Learn – SQL doesn’t require a deep conceptual background or heavy memorization. It is known for its ease of use: queries are written as declarative statements structured in simple English. Since data science is, at its core, about extracting data and working with it, there is a constant need for a tool that can fetch data from large databases easily, and SQL is very handy at this.
  • Understanding the dataset – As a data scientist, you must thoroughly understand the dataset you are working with. Learning SQL gives you an edge over others with less knowledge in the field: it helps you investigate your dataset, visualize it, identify its structure, and see how it actually looks.
  • Full Integration with Scripting Languages – As a data scientist, you will need to meticulously present your data in a way that is easily understood by your team or organization. SQL integrates very well with scripting languages like R and Python.
  • Manages Large Data Warehouses – Data science in most cases involves huge volumes of data stored in relational databases. As dataset volumes increase, spreadsheets become untenable, and SQL is the best solution for dealing with data at that scale.

Here are some handy tips that a data scientist should follow to improve their SQL experience:

1) Data Modeling – Understanding relational data models is foundational to both effective analysis and using SQL. An effective data scientist should know how to model one-to-one, one-to-many, and many-to-many relationships. On top of that, they should be familiar with denormalized data models such as the star and snowflake schemas.
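
As a brief illustration, a many-to-many relationship is usually modelled through a junction table. This is a minimal sketch using Python’s built-in sqlite3 module and a hypothetical students/courses schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
        -- The junction table turns one many-to-many link into two one-to-many links.
        CREATE TABLE enrollments (
            student_id INTEGER REFERENCES students(id),
            course_id  INTEGER REFERENCES courses(id),
            PRIMARY KEY (student_id, course_id)
        );
    """)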

2) Aggregations – Data analysis is all about aggregations. Understanding how the GROUP BY clause interacts with joins, and how to use the HAVING clause to filter aggregated results, is foundational when working with large data sets.
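
A minimal sketch of how GROUP BY, a join, and HAVING work together, again using Python’s sqlite3 module and a hypothetical customers/orders schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 40.0);
    """)

    # Total spend per customer, keeping only customers who spent more than 100.
    query = """
        SELECT c.name, SUM(o.amount) AS total_spend
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        HAVING SUM(o.amount) > 100;
    """
    for name, total in conn.execute(query):
        print(name, total)  # prints: Ada 200.0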

3) Window Functions – Some of the most powerful functions within SQL, these unlock the ability to calculate moving averages, cumulative sums, and much more.
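
For example, a running total and a two-row moving average can be computed with SUM and AVG as window functions. This sketch assumes the SQLite library bundled with your Python build is version 3.25 or newer, which added window-function support:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (day TEXT, amount REAL);
        INSERT INTO sales VALUES ('2024-01-01', 10), ('2024-01-02', 25), ('2024-01-03', 5);
    """)

    query = """
        SELECT day,
               amount,
               SUM(amount) OVER (ORDER BY day) AS running_total,
               AVG(amount) OVER (ORDER BY day
                                 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_avg
        FROM sales;
    """
    for row in conn.execute(query):
        print(row)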

4) ‘IN’ Considered Harmful – Almost every query that uses the IN operator with a subquery can be rewritten as a join, which many engines optimize better. Reaching for IN is often lazy query writing and should be avoided where a join expresses the intent more clearly.
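
As a sketch, here is the same filter written both ways against the hypothetical customers/orders schema above; whether the join is actually faster depends on the engine and its optimizer, but the join form is usually easier to extend:

    # Filter written with IN and a subquery.
    in_version = """
        SELECT name FROM customers
        WHERE id IN (SELECT customer_id FROM orders WHERE amount > 100);
    """

    # Equivalent filter written as a join.
    join_version = """
        SELECT DISTINCT c.name
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        WHERE o.amount > 100;
    """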

5) Navigating Metadata – Learn to inspect table structures, column data types, index cardinality, and other metadata directly from SQL. This is very useful if you spend a lot of time in a SQL terminal.
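
How you query metadata varies by engine (information_schema in PostgreSQL and MySQL, system catalogs elsewhere). As one concrete sketch, SQLite exposes it through sqlite_master and PRAGMA statements:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT, payload TEXT)")

    # List the tables in the database ...
    print(conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall())
    # ... and the columns and types of a specific table.
    print(conn.execute("PRAGMA table_info(events)").fetchall())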

Considering the scope of SQL in data science and other industries, it is an essential skill for a data scientist. For many data science jobs, proficiency in SQL ranks higher than proficiency in other programming languages. The ability to store, update, control access to, and manipulate datasets is valuable for every data scientist, and SQL is a natural first stepping stone on the data science learning path.

SQL programming skills are also highly marketable compared to many other programming languages.

BEST PRACTICES IN PYTHON

Python is a versatile language that has attracted a broad base of users in recent years, primarily due to the simplicity and speed of development it offers. Python’s popularity grew exponentially during the last decade; by some estimates, the past five years have produced more new Python developers than conventional Java/C++ programmers.

Why does Python have an edge over the other programming languages? Let’s find out!

  • Everything is an object in Python
  • Support for Object Oriented Programming – including multiple inheritance, instance methods, and class methods
  • Attribute access customization
  • List, dictionary and set comprehensions
  • Generator expressions and generator functions (lazy iteration)
  • Standard library support for queues, fixed-precision decimals, and rational numbers
  • Wide-ranging standard library including OS access, Internet access, cryptography, and much more.
  • Strict nested scoping rules
  • Support for modules and packages
  • Support for machine & deep learning
  • Parallel Programming

As a Python developer, you should know some basic techniques and practices that keep your workflow smooth. Some of these are listed below.

Create Readable Documentation

In Python, readable documentation is a core best practice. You may find it a little burdensome, but it leads to clean code. For this purpose, you can use Markdown, reStructuredText, Sphinx, or docstrings. reStructuredText and Markdown are markup languages with plain-text formatting syntax that make it easy to mark up text and convert it into formats like HTML or PDF. Sphinx is a tool that builds intelligent and beautiful documentation from those sources, while docstrings let you document code in-line; the result can then be exported to formats like HTML.
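
As a small sketch, a docstring written in reStructuredText style documents a function in-line; it can later be rendered by Sphinx or read interactively via help():

    def moving_average(values, window):
        """Return the simple moving averages of ``values``.

        :param values: sequence of numbers to average
        :param window: number of consecutive points in each average
        :return: list with one average per full window
        """
        return [sum(values[i - window:i]) / window
                for i in range(window, len(values) + 1)]


    print(moving_average([1, 2, 3, 4], 2))   # -> [1.5, 2.5, 3.5]
    print(moving_average.__doc__)            # the same text help() would show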

Follow Style Guidelines

Python follows a system of community-generated proposals known as Python Enhancement Proposals (PEPs), which provide a basic set of guidelines and standards for a wide variety of topics in Python development. One of the most widely referenced PEPs is PEP 8, often called the “Python community Bible” for properly styling your code.
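
A few of PEP 8’s conventions in one small, hypothetical snippet: snake_case names for functions and variables, UPPER_CASE for constants, four-space indentation, and spaces around operators:

    MAX_RETRIES = 3  # module-level constant in UPPER_CASE


    def average_response_time(samples, default=0.0):
        """Return the mean of ``samples``, or ``default`` if the sequence is empty."""
        if not samples:
            return default
        return sum(samples) / len(samples)


    print(average_response_time([0.2, 0.4, 0.6]))  # -> roughly 0.4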

Immediately Correct your Code

When building a Python application, it is almost always more beneficial in the long term to acknowledge and repair broken code immediately.

Give Preference to PyPI over Manual Coding

The practices above help you write clean and elegant code. One of the best ways to improve your use of Python, however, is the huge package repository known as the Python Package Index (PyPI). Whatever your level of experience as a Python developer, this repository will be very useful: most projects begin by building on existing packages from PyPI, which hosts over 10,000 projects at the time of writing. There is almost certainly some code that will fulfil your project’s needs.
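
For instance, instead of hand-writing an HTTP client, you would typically install one from PyPI and build on it. A minimal sketch, assuming the third-party requests package has been installed with "pip install requests":

    import requests  # third-party package from PyPI

    response = requests.get("https://pypi.org/")
    print(response.status_code)  # 200 on success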

Watch out for Exceptions

Developers should watch out for exceptions: they creep in from anywhere and can be difficult to debug.

Example: one of the most annoying is the KeyError exception. To handle it, a programmer should first check whether the key exists in the dictionary, or supply a default, as in the sketch below.
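
A minimal sketch of the two usual ways to guard against a missing key:

    user = {"name": "Ada"}

    # Option 1: supply a default with dict.get, so no exception is raised.
    email = user.get("email", "unknown@example.com")

    # Option 2: catch the KeyError explicitly.
    try:
        email = user["email"]
    except KeyError:
        email = "unknown@example.com"

    print(email)  # -> unknown@example.com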

Write Modular and non-repetitive Code

Define a function or class for any operation that needs to be performed multiple times. This shortens your code, increases readability, and reduces debugging time.
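
For example, rather than copy-pasting the same cleaning step for every column, pull it into one function and reuse it (the numbers here are purely illustrative):

    def normalize(values):
        """Scale a list of numbers into the 0-1 range (assumes values are not all equal)."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    # Reused for every column instead of repeating the same arithmetic inline.
    heights = normalize([150, 160, 180])
    weights = normalize([50, 65, 90])
    print(heights, weights)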

Use the right data structures

The benefits of the different data structures are well known: choosing the right one yields faster code, lower storage requirements, and higher overall efficiency.
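
A quick illustration: membership tests against a set are constant time on average, while the same test against a list scans every element, which matters once the data grows:

    import timeit

    items_list = list(range(100_000))
    items_set = set(items_list)

    # Looking up the last element: the list scans all 100,000 entries, the set does not.
    print(timeit.timeit(lambda: 99_999 in items_list, number=1_000))
    print(timeit.timeit(lambda: 99_999 in items_set, number=1_000))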

These are some of the points every Python developer should follow for a smooth experience with the language. Python is a growing language, and its increasing use in data analytics and machine learning has proved very useful for developers. In the coming years Python looks set to have a very bright future, and programmers who are proficient in it will have an advantage.

RENOWNED DATA SCIENCE PERSONALITIES

With the advancement of big data and artificial intelligence came the need to use them efficiently and ethically. Before the AI boom, companies focused mainly on solutions for data storage and management. With the arrival of various frameworks, the focus has shifted to data processing and analytics, a process popularly known today as Data Science. A few names stand out whenever data science comes up, largely because of their contributions to the field. Let us talk about some of these researchers and scientists who have devoted their lives and work to advancing it.


Andrew Ng

Andrew Ng is one of the most prominent names among leaders in AI and data science. He is an adjunct professor at Stanford University and a co-founder of Coursera. Formerly, he was head of the AI unit at Baidu. He is also a prolific researcher, having authored or co-authored around 100 research papers on machine learning, AI, deep learning, robotics, and related fields. He is highly appreciated by new practitioners and researchers in data science, has worked in close collaboration with Google on the Google Brain project, and is one of the most widely followed data scientists on social media and other channels.

DJ Patil

The Data Science Man, DJ Patil, needs no introduction. He is one of the most influential personalities, not just in data science but in technology at large. He co-coined the term “data scientist” and served as Chief Data Scientist at the White House. He has also been Head of Data Products, Chief Scientist, and Chief Security Officer at LinkedIn, and before that Director of Strategy, Analytics, and Product / Distinguished Research Scientist at eBay Inc. The list just goes on.

DJ Patil is inarguably one of the top data scientists in the world. He received his PhD in Applied Mathematics from the University of Maryland, College Park.


Kirk Borne

Kirk Borne has been the chief data scientist and a leading executive advisor at Booz Allen Hamilton since 2015. A former NASA astrophysicist, he was part of many major projects, and after the 9/11 attack on the WTC he was called upon by the US President to analyse data in an attempt to help prevent further attacks. He is among the top data science and big data influencers, with more than 250K followers on Twitter.

Geoffrey Hinton

Geoffrey Hinton is known for his pioneering work on artificial neural networks, and he was one of the key figures behind the backpropagation algorithm used to train deep neural networks. He currently divides his time between Google’s AI research team and the Computer Science department at the University of Toronto. His research group has done groundbreaking work that drove the resurgence of neural networks and deep learning.

Geoff coined the term ‘Dark Knowledge’.

Yoshua Bengio

Having worked as a machine learning researcher at AT&T and MIT, Yoshua holds a PhD in Computer Science from McGill University in Montreal. He currently heads the Montreal Institute for Learning Algorithms (MILA) and has been a professor at the Université de Montréal for the past 24 years.

Yann LeCun

Director of AI Research at Facebook, Yann holds 14 registered US patents. He received his PhD in Computer Science from Pierre and Marie Curie University, and he is a professor of computer science and neural science at New York University, where he is also the founding director of the Center for Data Science.

Peter Norvig

Peter is a co-author of ‘Artificial Intelligence: A Modern Approach’ and ‘Paradigms of AI Programming: Case Studies in Common Lisp’, two influential books on artificial intelligence and programming. He has close to 45 publications to his name. Currently a Director of Engineering at Google, he previously spent three years in various computational-science roles at NASA. Peter received his PhD in Computer Science from the University of California.

Alex “Sandy” Pentland

Named the ‘World’s Most Powerful Data Scientist’ by Forbes, Alex has been a professor at MIT for the past 31 years. He has also been a chief advisor at Nissan and Telefonica. Over the years he has co-founded many companies, including Home, Sense Networks, and Cogito Corp. He currently sits on the board of directors of the UN Global Partnership for Sustainable Development Data.

These are just a few leaders from a vast community. There are many others, unnamed here, whose work is the reason we have recommender systems, advanced neural networks, fraud-detection algorithms, and the many other intelligent systems we rely on for our daily needs.

‘ARTIFICIAL INTELLIGENCE’ VS ‘MACHINE LEARNING’ VS ‘DEEP LEARNING’

Artificial Intelligence, Machine Learning and Deep Learning are terms that are often used interchangeably. But are they really the same? Let’s find out.

The easiest way to think of the relationship between these terms is as concentric circles: AI, the idea that came first, is the largest; machine learning, which blossomed later, sits inside it; and deep learning, which is driving today’s AI explosion, fits inside both.

Pictured as three concentric circles, Deep Learning is a subset of ML, which in turn is a subset of AI. AI is the all-encompassing concept that emerged first; it was followed by ML, which thrived later; and lastly came Deep Learning, which promises to escalate the advances of AI to another level.

Starting with AI, let us have a more in-depth insight into the following terms.

Artificial Intelligence

Intelligence, as defined by Wikipedia, is “perceiving information from various sources, retaining it as knowledge, and applying it to real-life challenges.”

Machines built on AI are of two types: General AI and Narrow AI.

General AI refers to machines that would exhibit the full range of human senses and cognitive abilities. We have seen such General AI in sci-fi movies like The Terminator. In reality, a great deal of work has gone into developing these machines, but much more research is needed before they exist.

What we can do today falls under “Narrow AI”: technologies that perform specific tasks as well as, or better than, humans can. Examples include classifying emails as spam or not spam, and facial recognition on Facebook. These technologies exhibit some facets of human intelligence.

Where does that intelligence come from? That brings us to our next term: Machine Learning.

Machine Learning

Learning, as defined by Wikipedia, is “acquiring information and finding a pattern between the inputs and the outcome from a given set of examples.” ML aims to enable machines to learn by themselves from the provided data and make accurate predictions. Machine Learning is a subset of AI; more specifically, it is a method of training algorithms so that they learn to make decisions.

Training in machine learning requires a lot of data to be fed to the machine, which then allows the model to learn from the processed information.
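
A minimal sketch of that idea, assuming scikit-learn is installed: a classifier is fitted on a handful of made-up labelled examples and then asked to predict new inputs.

    from sklearn.linear_model import LogisticRegression

    # Toy, made-up data: hours studied -> passed the exam (1) or failed (0).
    X = [[1], [2], [3], [8], [9], [10]]
    y = [0, 0, 0, 1, 1, 1]

    model = LogisticRegression().fit(X, y)   # "training" on the examples
    print(model.predict([[2.5], [8.5]]))     # predicted labels for new inputs
    print(model.predict_proba([[8.5]]))      # how confident the model is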

Deep Learning 

Deep Learning grew out of one of the early machine-learning community’s algorithmic approaches: neural networks, which form the base of Deep Learning and are inspired by our understanding of the biology of the human brain. However, unlike a biological brain, where any neuron can connect to any other neuron within a certain physical distance, artificial neural networks (ANNs) have discrete layers, connections, and directions of data propagation.

For a system designed to recognize a STOP sign, a neural network model produces a “probability score” for each possible answer, a highly educated guess based on what it has learned. In this example, the system might be 86% confident the image is a stop sign, 7% convinced it is a speed limit sign, 5% that it is a kite stuck in a tree, and so on.
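
Under the hood, a classification network typically turns raw scores into such a probability distribution with a softmax. A toy sketch with made-up scores for the three classes above:

    import math

    def softmax(scores):
        """Turn raw scores into probabilities that sum to 1."""
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    # Made-up raw scores for (stop sign, speed limit sign, kite in a tree).
    print(softmax([4.0, 1.5, 1.2]))  # roughly [0.87, 0.07, 0.05]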

A trained neural network is one whose parameters have been tuned on millions of samples until it gets the answer right practically every time.

Deep Learning can automatically discover the features to be used for classification, whereas classical machine learning requires these features to be provided manually. In contrast to traditional machine learning, however, Deep Learning needs high-end machines and significantly larger amounts of training data to deliver accurate results.

Wrapping up: AI has a bright future, especially given the progress in deep learning. At the current pace we can expect driverless vehicles, better recommender systems, and more in the years ahead. AI, ML, and Deep Learning (DL) are closely related, but they are not the same.

THE BURNING QUESTION: TABLEAU OR POWER BI?

The concept of using pictures to understand patterns in data has been around for centuries, from graphs and maps in the 17th century to the invention of the pie chart in the early 1800s. The 19th century produced one of the most cited examples of data visualisation, when Charles Minard mapped Napoleon’s invasion of Russia. The map depicted the size of Napoleon’s army along with the path of his retreat from Moscow, and tied that information to temperature and time scales for a deeper understanding of the event.

Read more about data Visualisation in our previous blog – Practices on Data Visualisation.

In the modern world, the search for a Business Intelligence (BI) or data visualisation tool quickly leads to two front runners: Power BI and Tableau. Both products come with a set of handy features such as drag-and-drop and data preparation, among many others. Although similar, each has its particular strengths and weaknesses. Let us understand the differences between the two tools.

The tools will be compared on the following grounds:

  • Cost
  • Licensing
  • Visualisation
  • Integrations
  • Implementation
  • Data Analysis
  • Functionality

Cost
Cost remains a significant parameter when these products are compared: Power BI is priced at around $100 a year, while Tableau can be rather expensive, at up to $1,000 a year. Power BI is more affordable and economical than Tableau and suits small businesses, whereas Tableau is built for data analysts and offers deeper insight features.

Licensing
In practice, the final choice depends on whether one is willing to pay an upfront cost for the software. If so, Tableau should be the first choice; otherwise, Power BI is the better option.

Visualisation
When it comes to visualisation, both products have their strengths. Power BI is the better choice if you want to move from raw data to clear, elegant visuals quickly: it lets you upload datasets easily and produces clean output. However, if visualisation is the prime focus, Tableau leads by a fair margin; it performs better with more massive datasets and gives users efficient drill-down features.

Integrations
Power BI has API access and pre-built dashboards for speedy insights into some of the most widely used technologies and tools, such as Salesforce, Google Analytics, and Microsoft products. Tableau, for its part, has invested heavily in integrations and widely used connections; users can view all of the included connections as soon as they log into the tool.

Implementation
This parameter depends primarily on factors such as the size of the company and the number of users. Power BI is fairly straightforward to implement and requires a low level of expertise. Tableau, although a little more complex, offers more variety and provides quick-start applications for deploying small-scale applications.

Data Analysis
Power BI offers speed and efficiency and establishes relationships between data sources, while Tableau provides more extensive features and helps the user explore and test hypotheses about the data.

Functionality
For the foreseeable future, any organisation whose users spend more than an hour or two per day in their BI tool might want to go with Tableau, which offers a depth of features and minor details that is unmatched.

Feature | Power BI | Tableau
Date Established | 2013 | 2003
Best Use Case | Dashboards & Ad-hoc Analysis | Dashboards & Ad-hoc Analysis
Best Users | Average Joe/Jane | Analysts
Licensing | Subscription | Subscription
Desktop Version | Free | $70/user/month
Investment Required | Low | High
Overall Functionality | Very Good | Very Good
Visualisations | Good | Very Good
Performance With Large Datasets | Good | Very Good
Support Level | Low (or through partner) | High

It all depends on who will be using these tools. Microsoft’s Power BI is built for the everyday business stakeholder, not necessarily a data analyst. The interface relies on drag-and-drop and intuitive features to help teams develop their visualisations. It is a great addition to any organisation that needs data analysis without requiring a degree in it, or to any organisation with a smaller budget.

Tableau is more powerful, but its interface is not quite as intuitive, which makes it more challenging to learn and use. It takes some experience and practice to gain control of the product, but once that is achieved, Tableau can prove much more powerful for data analytics in the long run.

Learn Power BI & Tableau today!

As an organization, we believe these technologies are of the utmost importance, and every other organization should have its employees upskilled in them. If you are looking to have your employees trained, we can help you with that! Contact us today and we will be happy to chat with you and understand your requirements.
