What Is Data Wrangling?

Data wrangling is the transformation of raw data into a format that is easier to use. But what exactly does it involve? In this post, we find out.

Manipulation is at the core of data analytics. We don’t mean the sneaky kind, of course, but the data kind! Scraping data from the web, carrying out statistical analyses, creating dashboards and visualizations—all these tasks involve manipulating data in one way or another. But before we can do any of these things, we need to ensure that our data are in a format we can use. This is where the most important form of data manipulation comes in: data wrangling.

If you’d like to get started handling data, check out this free data analytics short course.

In this post, we explore data wrangling in detail. When you’ve finished reading, you’ll be able to answer:

  1. What is data wrangling (and why is it important)?
  2. Data wrangling vs. data cleaning: what’s the difference?
  3. What is the data wrangling process?
  4. What tools do data wranglers use?

First up…

1. What is data wrangling and why is it important?

Data wrangling is a term often used to describe the early stages of the data analytics process. It involves transforming and mapping data from one format into another.

The aim is to make data more accessible for things like business analytics or machine learning. The data wrangling process can involve a variety of tasks. These include things like data collection, exploratory analysis, data cleansing, creating data structures, and storage.

Data wrangling is time-consuming. In fact, it can take up to about 80% of a data analyst’s time. This is partly because the process is fluid, i.e. there aren’t always clear steps to follow from start to finish. However, it’s also because the process is iterative and the activities involved are labor-intensive. What you need to do depends on things like the source (or sources) of the data, their quality, your organization’s data architecture, and what you intend to do with the data once you’ve finished wrangling it.

Why is data wrangling important?

Insights gained during the data wrangling process can be invaluable. They will likely affect the future course of a project.

Skipping or rushing this step will result in poor data models that impact an organization’s decision-making and reputation. So, if you ever hear someone suggesting that data wrangling isn’t that important, you have our express permission to tell them otherwise!

Unfortunately, because data wrangling is sometimes poorly understood, its significance can be overlooked. High-level decision-makers who prefer quick results may be surprised by how long it takes to get data into a usable format.

Unlike the results of data analysis (which often provide flashy and exciting insights), there’s little to show for your efforts during the data wrangling phase. And as businesses face budget and time pressures, this makes a data wrangler’s job all the more difficult. The job involves careful management of expectations, as well as technical know-how.

2. Data wrangling vs. data cleaning: what is the difference?

Some people use the terms ‘data wrangling’ and ‘data cleaning’ interchangeably. This is because both are processes for converting data into a more useful format. It's also because they share some common attributes. But there are some important differences between them:

Data wrangling refers to the process of collecting raw data, cleaning it, mapping it, and storing it in a useful format. To confuse matters (and because data wrangling is not always well understood) the term is often used to describe each of these steps individually, as well as in combination.

Data cleaning, meanwhile, is a single aspect of the data wrangling process. A complex process in itself, data cleaning involves sanitizing a data set by removing unwanted observations and outliers, fixing structural errors and typos, standardizing units of measure, validating values, and so on. Data cleaning tends to follow more precise steps than data wrangling, albeit not always in a very precise order! You can learn more about the data cleaning process in this post.

The distinction between data wrangling and data cleaning is not always clear-cut. However, you can generally think of data wrangling as an umbrella task. Data cleaning falls under this umbrella, alongside a range of other activities.

These can involve planning which data you want to collect, scraping those data, carrying out exploratory analysis, cleansing and mapping the data, creating data structures, and storing the data for future use.


3. What is the data wrangling process?

The exact tasks required in data wrangling depend on what transformations you need to carry out to get a dataset into better shape. For instance, if your source data is already in a database, this will remove many of the structural tasks. But if it’s unstructured data (which is much more common) then you’ll have more to do.

The following steps are often applied during data wrangling. But the process is an iterative one. Some of the steps may not be necessary, others may need repeating, and they will rarely occur in the same order. But you still need to know what they all are!

Extracting the data

Not everybody considers data extraction part of the data wrangling process. But in our opinion, it’s a vital aspect of it. You can’t transform data without first collecting it. This stage requires planning. You’ll need to decide which data you need and where to collect them from. You’ll then pull the data in a raw format from its source. This could be a website, a third-party repository, or some other location. If it’s raw, unstructured data, roll your sleeves up, because there’s work to do! You can learn how to scrape data from the web in this post.

Carrying out exploratory data analysis (EDA)

EDA involves determining a dataset’s structure and summarizing its main features. Whether you do this immediately, or wait until later in the process, depends on the state of the dataset and how much work it requires.

Ultimately, EDA means familiarizing yourself with the data so you know how to proceed. You can learn more about exploratory data analysis in this post.

Structuring the data

Freshly collected data are usually in an unstructured format. This means they lack an existing model and are completely disorganized.

Unstructured data are often text-heavy but may contain things like ID codes, dates, numbers, and so on. To structure your dataset, you’ll usually need to parse it. In this context, parsing means extracting relevant information. For instance, you might parse HTML code scraped from a website, pulling out what you need and discarding the rest.

The result might be a more user-friendly spreadsheet containing the useful data with columns, headings, classes, and so on.
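
To make this concrete, here's a minimal sketch of parsing scraped HTML into a structured table using Python's BeautifulSoup and pandas libraries. The page markup and column names are hypothetical, and the approach assumes the data sits in a simple HTML table:

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical snippet of HTML scraped from a web page
html = """
<table id="products">
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("#products tr")[1:]:               # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({"product": cells[0], "price": float(cells[1])})

df = pd.DataFrame(rows)   # a structured table with named columns
print(df)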

Cleaning the data

Once your dataset has some structure, you can start applying algorithms to tidy it up. You can automate a range of algorithmic tasks using tools like Python and R. They can be used to identify outliers, delete duplicate values, standardize systems of measurement, and so on. You can learn about the data cleaning process in detail in this post.
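
As a small illustration (the DataFrame and thresholds below are made up for the example), a few lines of pandas can automate typical cleaning tasks:

import pandas as pd

# Hypothetical raw data with a duplicate, a missing value, and an outlier
df = pd.DataFrame({
    "order_id": [101, 101, 102, 103, 104],
    "amount":   [10.0, 10.0, 12.5, None, 9_999_999.0],
})

df = df.drop_duplicates()                                  # delete duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())  # fill missing values
df = df[df["amount"] < 1_000_000]                          # drop an obvious outlier
print(df)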

Enriching the data

Once your dataset is in good shape, you’ll need to check if it’s ready to meet your requirements. At this stage, you may want to enrich it.

Data enrichment involves combining your dataset with data from other sources. This might include internal systems or third-party providers. Your goal could be to accumulate a greater number of data points (to improve the accuracy of an analysis). Or it could simply be to fill in gaps. Say, by combining two databases of customer info where one contains telephone numbers and the other doesn't.
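
In pandas, that kind of enrichment is often just a join. A minimal sketch, assuming two hypothetical customer tables where only one holds phone numbers:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ada", "Ben", "Cleo"]})
phones    = pd.DataFrame({"customer_id": [1, 3], "phone": ["555-0100", "555-0199"]})

# Left join keeps every customer and fills in phone numbers where available
enriched = customers.merge(phones, on="customer_id", how="left")
print(enriched)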

Validating the data

Validating your data means checking it for consistency, quality, and accuracy. We can do this using pre-programmed scripts that check the data’s attributes against defined rules.

This is also a good example of an overlap between data wrangling and data cleaning—validation is key to both. Because you’ll likely find errors, you may need to repeat this step several times.
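
A validation script can be as simple as a set of rules checked against each column. Here's a minimal sketch in Python, with made-up column names and rules:

import pandas as pd

df = pd.DataFrame({
    "age":   [34, -2, 51],
    "email": ["a@example.com", "b@example.com", "not-an-email"],
})

rules = {
    "age is non-negative":      df["age"].ge(0),
    "email contains an @ sign": df["email"].str.contains("@"),
}

for name, passed in rules.items():
    print(f"{name}: {(~passed).sum()} failing row(s)")  # count rule violations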

Publishing the data

Last but not least, it’s time to publish your data. This means making the data accessible by depositing them into a new database or architecture.

End-users might include data analysts, engineers, or data scientists. They may use the data to create business reports and other insights. Or they might further process it to build more complex data structures, e.g. data warehouses. After this stage, the possibilities are endless!

4. What tools do data wranglers use?

Data wranglers use many of the same tools applied in data cleaning. These include programming languages like Python and R, software like MS Excel, and open-source data analytics platforms like KNIME. Programming languages can take time to master, but they're a vital skill for any data analyst. Python, in particular, is relatively easy to learn and allows you to write scripts for very specific tasks. We share some tips for learning Python in this article.

There are also visual data wrangling tools out there. The general aim of these is to make data wrangling easier for non-programmers and to speed up the process for experienced ones. Tools like Trifacta and OpenRefine can help you transform data into clean, well-structured formats.

A word of caution, though. While visual tools are more intuitive, they are sometimes less flexible. Because their functionality is more generic, they don't always work as well on complex datasets. As a rule, the larger and more unstructured a dataset, the less effective these tools will be. Beginners should aim to combine programming expertise (for scripting) with visual tools (for high-level wrangling). We've rounded up some of the best data wrangling tools in our guide.

Final thoughts

Data wrangling is vital to the early stages of the data analytics process. Before carrying out a detailed analysis, your data needs to be in a usable format. And that’s where data wrangling comes in. In this post, we’ve learned that:

  • Data wrangling involves transforming and mapping data from a raw form into a more useful, structured format.
  • Data wrangling can be used to prepare data for everything from business analytics to ingestion by machine learning algorithms.
  • The terms ‘data wrangling’ and ‘data cleaning’ are often used interchangeably—but the latter is a subset of the former.
  • While the data wrangling process is loosely defined, it involves tasks like data extraction, exploratory analysis, building data structures, cleaning, enriching, validating, and storing data in a usable format.
  • Data wranglers use a combination of visual tools like OpenRefine, Trifacta, or KNIME, and programming languages and software like Python, R, and MS Excel.

The best way to learn about data wrangling is to dive in and have a go. For a hands-on introduction to some of these techniques, why not try out our free, 5-day data analytics short course? To learn more about data analytics, check out the following:

Video: How I'd Learn Data Analytics If I Had to Start Over

What better way to plan how to become a data analyst this year than to hear from an experienced data scientist with a time machine?

Okay, so maybe we don’t have an actual time machine, but that didn’t stop us from asking CareerFoundry’s Senior Data Scientist Tom Gadsby if he could do it all again…what would he change?

The data analytics world of 2023 comes with more online content, bigger professional communities, more sectors, and more machine learning opportunities. In this video Tom explains how he would take advantage of these changes if he were to become a data analyst from scratch again—as well as learning from his mistakes.

So whether it’s having more confidence, not worrying so much about whether you love maths or not, and how to plan a roadmap of your own learning, you’ll learn it all in Tom’s chat.

Want to learn more about breaking into data analytics?

Try the free data analytics short course or speak to one of our program advisors.

Otherwise, make sure to check out our other guides:

What is Cluster Analysis? A Complete Beginner's Guide

When conducting data analysis on large datasets, you're likely to be overwhelmed by the amount of information they provide.

In such scenarios, it helps to start by separating the data points into groups based on their similarities.

Have you ever heard of cluster analysis?

It’s an essential way of identifying discrete groups in data among many data professionals, yet many beginners remain in the dark about what cluster analysis is and how it works.

In this blog post, we’ll introduce you to the concept of cluster analysis, its advantages, common algorithms, how they can be evaluated, as well as some real-world applications.

We’ll cover the following:

  1. Cluster analysis: What it is and how it works
  2. What are the advantages of cluster analysis?
  3. Clustering algorithms: Which one to use?
  4. Evaluation metrics for cluster analysis
  5. Real-world applications of cluster analysis
  6. Key takeaways

Join us as we dive into the basics of cluster analysis to help you get started. 

1. Cluster analysis: What it is and how it works

To help you better understand cluster analysis, let’s go over the definition of what it is first.

What is cluster analysis?

[Figure: Example of a cluster analysis graph. Source: Wikimedia Commons]

Cluster analysis is a statistical technique that organizes and classifies different objects, data points, or observations into groups or clusters based on similarities or patterns.

You can think of cluster analysis as finding natural groupings in data.

How does cluster analysis work?

Cluster analysis involves analyzing a set of data and grouping similar observations into distinct clusters, thereby identifying underlying patterns and relationships in the data.

Cluster analysis is widely used in data analytics across various fields, such as marketing, biology, sociology, and image and pattern recognition.

Cluster analysis varies by the type of clustering algorithm used.

2. What are the advantages of cluster analysis?

The concept of cluster analysis sounds great—but what are its actual advantages?

Here’s a list of them:

Identifying groups and relationships

Cluster analysis can help to identify groups and relationships in large datasets that may not be readily apparent.

This allows for a deeper understanding of the underlying structure of the data.

Perhaps the biggest benefit of cluster analysis is that finding similarities and differences in large datasets can help identify new trends and opportunities for further research.

Reducing data complexity

Cluster analysis can be used to reduce the complexity of large datasets, making it easier to analyze and interpret the data.

For example, by grouping similar objects together, the number of dimensions in the data can be reduced. This can make analysis faster and simpler.

Clustering may also help rule out irrelevant data that do not have similarities. You’ll have a more streamlined analysis process as a result.

Improving visual representation

Cluster analysis often results in data visualizations of clusters, such as scatterplots or dendrograms.

These visualizations can be powerful tools for communicating complex information. Since cluster plots are simple for most people to interpret and understand, they can be a good choice to include in presentations.

3. Clustering algorithms: Which one to use?

As mentioned, when starting a cluster analysis, you'll need to select an appropriate clustering algorithm.

There are quite a few types of clustering algorithms out there, and each of them is used differently.

Here are the five most common types of clustering algorithms you’ll find:

1. Centroid-based clustering

Centroid-based clustering is a type of clustering method that partitions or splits a data set into similar groups based on the distance between their centroids.

Each cluster’s centroid, or center, is determined mathematically as either the mean or median of all the points in the cluster.

[Figure: An example of a centroid-based cluster graph. Source: Chire, own work, CC BY-SA 3.0]

The k-means clustering algorithm is one commonly used centroid-based clustering technique. This method assumes that the centroid of each cluster is representative of all the points assigned to it.

It aims to find the optimal k clusters in a given data set by iteratively minimizing the total distance between each point and its assigned cluster centroid.

Other centroid-based clustering methods include fuzzy c-means.
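
For a sense of how little code this takes in practice, here's a minimal k-means sketch using scikit-learn on synthetic data (the number of clusters, k=4, is assumed purely for the example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy 2-D data

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)          # cluster assignment for each point
print(kmeans.cluster_centers_)          # the learned centroids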

2. Connectivity-based clustering

Connectivity-based clustering, also known as hierarchical clustering, groups data points together based on the proximity and connectivity of their attributes.

Simply put, this method determines clusters based on how close data points are to each other. The idea is that objects that are nearer to one another are more closely related than those that are far apart.

To implement connectivity-based clustering, you’ll need to determine which data points to use and measure their similarity or dissimilarity using a distance metric.

After that, a connectivity measure (such as a graph or a network) is constructed to establish the relationships between the data points.

Finally, the clustering algorithm uses this connectivity information to group the data points into clusters that reflect their underlying similarities.

This is typically visualized in a dendrogram, which looks like a hierarchy tree (hence the name!).
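
As a rough illustration, SciPy can build the hierarchy and draw the dendrogram in a few lines (synthetic data and an arbitrary cut into three clusters are assumed here):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="ward")                      # build the hierarchy bottom-up
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters

dendrogram(Z)                                      # the tree-like plot described above
plt.show()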

3. Distribution-based clustering

Distribution-based clustering groups together data points based on their probability distribution.

Different from centroid-based clustering, it makes use of statistical patterns to identify clusters within the data.

Some common algorithms used in distribution-based clustering are:

  • Gaussian mixture model (GMM)
  • Expectation maximization (EM)

In a Gaussian mixture model (GMM), the data is modeled as a mix of several Gaussian distributions, and clusters are formed from data points that appear to come from the same distribution.

However, distribution-based clustering is prone to overfitting, where the clustering is too closely tied to the particular dataset and doesn't generalize well to new data.
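
A minimal Gaussian mixture sketch with scikit-learn might look like this (three components are assumed purely for the example; note the soft, probability-based assignments):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, random_state=1)  # fitted with the EM algorithm
labels = gmm.fit_predict(X)                            # hard cluster assignments
probs = gmm.predict_proba(X)                           # soft (probabilistic) memberships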

4. Density-based clustering

Density-based clustering is a powerful unsupervised machine learning technique that allows us to discover dense clusters of data points in a data set.

Unlike other clustering algorithms, such as K-means and hierarchical clustering, density-based clustering can discover clusters of any shape, size, or density.

Density-based clustering is especially useful when working with datasets with noise or outliers or when we don’t have prior knowledge about the number of clusters in the data.

Here are some of its key features:

  • Can discover clusters of arbitrary shape and size
  • Can handle noise and outliers
  • Does not require specifying the number of clusters beforehand
  • Can handle non-linear, non-parametric datasets

Here’s a list of some common density-based clustering algorithms:

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • OPTICS (Ordering Points To Identify the Clustering Structure)
  • HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)
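
To show the idea, here's a minimal DBSCAN sketch using scikit-learn on a toy "two moons" dataset; the eps and min_samples values are assumptions you'd normally tune:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # non-spherical clusters

db = DBSCAN(eps=0.2, min_samples=5)   # no need to specify the number of clusters
labels = db.fit_predict(X)            # points labeled -1 are treated as noise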

5. Grid-based clustering

Grid-based clustering partitions a high-dimensional data set into cells (non-overlapping sub-regions).

Each cell is assigned a unique identifier called a cell ID, and all data points falling within a cell are considered part of the same cluster.

Grid-based clustering is an efficient algorithm for analyzing large multidimensional datasets as it reduces the time needed to search for nearest neighbors, which is a common step in many clustering methods.

4. Evaluation metrics for cluster analysis

There are several evaluation metrics for cluster analysis, and the selection of the appropriate metric depends on the type of clustering algorithm used and the understanding of the data.

Evaluation metrics can be generally split into two main categories:

  1. Extrinsic measures
  2. Intrinsic measures

Here are some common evaluation metrics for cluster analysis:

1. Extrinsic measures

Extrinsic measures use ground truth or external information to evaluate the clustering algorithm’s performance.

Ground truth data is the label data that confirms the class or cluster to which each data point belongs.

Extrinsic measures can be used when we know the true labels and want to evaluate how well the clustering algorithm is performing.

Common extrinsic measures include the following (a short example follows the list):

  • F-measure/F-score: This metric determines the accuracy of the clustering algorithm by looking at precision and recall.
  • Purity: This metric measures the extent to which each cluster contains data points from a single class, calculated as the fraction of points assigned to the majority class of their cluster.
  • Rand index: This is a measure of the similarity between the true and predicted labels of the clustering algorithm, ranging from 0 to 1. A higher value indicates a better clustering performance.
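
As a quick illustration of an extrinsic measure, here's scikit-learn's adjusted Rand index (a chance-corrected variant of the Rand index) comparing made-up true and predicted labels:

from sklearn.metrics import adjusted_rand_score

true_labels      = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth classes
predicted_labels = [1, 1, 0, 0, 2, 2]   # clusters found by some algorithm

# 1.0 here: the partitions match perfectly, even though the label ids differ
print(adjusted_rand_score(true_labels, predicted_labels))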

2. Intrinsic measures

Intrinsic measures are evaluation metrics for cluster analysis that only use the information within the data set.

In other words, they measure the quality of the clustering algorithm based on the data points’ relationships within the data set. They can be used when we do not have prior knowledge or labels of the data.

Common intrinsic measures include the following (a short example follows the list):

  • Silhouette score: This metric measures how similar each data point is to its own cluster compared with all other clusters.
  • Davies-Bouldin index: This metric calculates the ratio of the within-cluster distance to the between-cluster distance. The lower the index score, the better the clustering performance.
  • Calinski–Harabasz index: Also known as the Variance Ratio Criterion, this measures the ratio of between-cluster variance and within-cluster variance. The higher the Calinski-Harabasz ratio, the more defined a cluster is.
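
Since intrinsic measures need only the data and the cluster labels, they're easy to compute with scikit-learn. A minimal sketch on synthetic data (the k-means setup is just an example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print(silhouette_score(X, labels))         # closer to 1 is better
print(davies_bouldin_score(X, labels))     # lower is better
print(calinski_harabasz_score(X, labels))  # higher is better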

These evaluation metrics can help us compare the performance of different clustering algorithms and models, optimize clustering parameters, and validate the accuracy and quality of the clustering results.

It's always recommended to use multiple evaluation metrics, so you can confirm a clustering algorithm's effectiveness and make robust decisions about your cluster analysis.

5. Real-world applications of cluster analysis

Cluster analysis is a powerful unsupervised learning technique that is widely used in several industries and fields for data analysis. Here are some real-world applications of cluster analysis:

1. Market segmentation

Companies leverage cluster analysis to segment their customer base into different groups.

Different customer attributes are analyzed, such as:

  • age
  • gender
  • buying behavior
  • location

By analyzing these attributes, businesses can better understand their customer base and design targeted marketing strategies to meet customers' requirements.

2. Image segmentation in healthcare

Medical practitioners use clustering techniques to segment images of infected tissues into different groups based on certain biomarkers like size, shape, and color.

This technique enables clinicians to detect early signs of cancer or other diseases.

3. Recommendation engines

Large companies like Netflix, Spotify, and YouTube utilize clustering algorithms to analyze user data and recommend movies or products.

This technique examines user behavior data like clicks, duration on specific content, and the number of replays.

These data points can be clustered to find insights into user preferences and improve existing recommendations to users.

4. Risk analysis in insurance

Insurance companies utilize clustering analysis to segment various policies and customers’ risk levels.

By applying clustering techniques, an insurance company can better quantify the risk on their insurance policies and charge premiums based on potential risk.

5. Social media analysis

Social media apps can collect huge amounts of data from their users. The recent discussions around apps like TikTok or Meta's Twitter-like Threads are good reminders of this.

By clustering and examining users' social interactions, these apps can segment users based on age, demographics, or purchasing behavior, leading to more targeted ads and improving the overall engagement of ad placements.

6. Key takeaways

As you can see, cluster analysis is a powerful unsupervised learning technique.

To recap, here are some key takeaways:

  • It brings many advantages when analyzing data, like streamlining analysis and representing data through visualizations.
  • Clustering algorithms must be carefully selected according to their type for the best results.
  • Extrinsic and intrinsic measures must be assessed to determine the effectiveness of your clustering.
  • Cluster analysis can be applied to different industries.

What’s next? To get started with some practical work in data analytics, try out CareerFoundry’s, free 5-day data analytics course, or better still, talk to one of our program advisors to see how a career in data could be for you.

For more related reading on other areas within data analytics, check out the following:

What is a Business Systems Analyst?

What job titles come to mind when you think about roles in analytics?

Everyone knows about data scientists, data analysts, machine learning engineers, and data engineers as common career paths.

However, a career in data can take you in surprising directions, as there are many more roles that fly under the radar. We've previously explored roles such as big data engineer, healthcare data analyst, and operations analyst.

Today, we’re demystifying the business systems analyst, a uniquely cross-functional position that plays a critical role in improving a company’s processes for better operational efficiency. We’ll answer all your questions about what exactly is business systems analysis, and share a guide on how to become one, including:

  1. What is systems analysis, and what does a business systems analyst do?
  2. How much can you earn as a business systems analyst?
  3. Business systems analyst job description examples
  4. How to become a business systems analyst
  5. Wrap-up

You can also use the clickable menu to skip ahead to any section. Well, let’s get started!

1. What is systems analysis, and what does a business systems analyst do?

Systems analysis might seem like an intimidating and abstract concept, but at its core, it’s really just a helpful problem-solving framework.

In the business world, this means making use of technology to create solutions to improve business operations and outcomes. It evolved from a need to correctly identify solutions for businesses with large and complex internal processes, or IT architectures, while also ensuring that any proposed solution is feasible and aligned to actually solve the identified problem. 

These problems can vary, but they generally fall under the scope of anything related to the company's internal processes, which can be client-facing or not. These analysts work to improve operational efficiency, cut work duplication, and strategize how processes can be streamlined to reduce cost, improve revenue, and boost other key performance indicators (KPIs) relevant to their sector.

A business systems analyst plays a critical role in being the link between cross-functional teams (which often include business stakeholders, IT staff,  project managers, developers, solution architects, and user experience designers) to troubleshoot problems and jointly develop a solution. They typically work closely with the business team to understand the problem, define it, narrow its scope, and then specify the requirements for a solution.

Then, they meet with technical developers to architect a solution. They may also perform quality assurance (QA), an important step in software testing which ensures that the proposed solution meets predefined expectations.


2. How much can you earn as a business systems analyst? 

Learning about whether a role is the right fit is incomplete without a discussion about the salary.

What can you expect to earn as a business systems analyst? While this is a hard question to answer without a full consideration of prior (and relevant) work experience, location, and seniority, we can still get a sense of what to expect from taking a look at job postings online. 

We browsed through dozens of postings and found that, as a business systems analyst is a specialized role requiring a technical skillset, you can expect to earn a fairly high salary right out of the gate. Pay generally starts at around $40 an hour for an entry-level analyst, rising to around $75 an hour for more senior roles.

The upper end of the pay bracket can vary greatly by industry too. For example, this "IT Business Systems Analyst" posting quotes an annual salary between $140,000 and $160,000 in San Francisco, likely because it's in healthcare and requires that the candidate have strong domain experience in healthcare analytics.

There are different ways to get into this role than you might think. Just ask CareerFoundry graduate Brittany who, when we spoke to her, was using her new qualification from the Data Analytics Program to transition into the Business Intelligence department at her healthcare firm, seeking to use data for social good.

A typical posting for entry-level roles quotes between $80,000 to $120,000 a year, which often does not include additional benefits and hybrid or remote work opportunities. 

For a better understanding of salaries in the data field, check out our salary guides for data scientists and for different industries hiring data analysts.

3. Business systems analyst job description examples

As business systems analysis is a rather abstract concept, it might be helpful to take a closer look at two different example job descriptions to get a more concrete sense of what different types of business systems analysts actually do on a day to day basis. 

Data Business Systems Analyst

A data business systems analyst leads any process change that involves a data or analytics solution.

This includes using their knowledge of SQL to code queries for analysis, programming automated tests, and conducting end-user testing, otherwise known as user acceptance testing (UAT). They start by extracting and narrowing down any requirements that match the business goal, which can involve interviews with internal stakeholders to get an accurate sense of current business processes.

Once they have created a plan, they work with technical teams to architect a solution, and remain actively involved throughout the development process by playing a key role in rigorously testing the solution for edge cases. They often bring strong experience in data analytics tools or frameworks and between three and five years of prior experience, with an average annual salary of $90,000 to $125,000.

Senior Business Systems Analyst 

You can attain the senior business systems analyst title after about 7 to 10 years of prior work experience.

With this seniority comes greater responsibility: they partner with high-level stakeholders and domain specialists across all functions of the business to provide solutions to identified problems.

They do so by managing projects of fairly high complexity, interviewing end users to identify business requirements, and translating that into solutions. This can include anything from everyday process improvements to larger, and more systematic, policy changes.

As project managers, they often are in charge of maintaining documentation, testing, and providing regular reports to senior stakeholders on the progress of integrating the solution into the business. The average salary for senior business systems analysts is quite high, with many earning well over $150,000.

4. How to become a business systems analyst 

If this all sounds good so far, it may be time to brush off that resume and take a look at what you need when applying to job postings!

When it comes to applying to business systems analyst roles, you’ll likely achieve quick success in the field with a bachelor’s degree. As a business systems analyst is generally a more complex role, an advanced degree won’t hurt either. Employers favor candidates with bachelor’s degrees in computer science, information systems, business administration, or related fields.

Advanced degrees typically feature more specialized domain knowledge related to the industry you are applying to. For example, a business systems analyst in healthcare might also have a graduate degree in life sciences or biotechnology.

You may also find it useful to gain a certification in data analytics in order to master some of the tools used in the field.

As the role is technical at its core, evidence of strong information technology skills, whether with hardware or software, is important. Some fundamentals include knowledge of the Microsoft Office Suite (Excel, Word, PowerPoint). Employers also look for experience with a programming language and knowledge of popular enterprise tools such as Salesforce, Tableau, or Power BI.

Some want experience working with enterprise resource planning (ERP) platforms or implementing content management systems like Adobe Experience Manager. Others are interested in expertise on managing commerce systems including Shopify Plus and Salesforce Commerce Cloud. 

5. Wrap-up

In the analytics world, lesser known titles like the business systems analyst often go under the radar, despite their great prospects and potential for career growth. In this post, we’ve demystified what the role entails, what systems thinking is, and the steps you can take to get your first job in it.

As the title can span many different technologies, it’s best to take a look at job postings on platforms like Indeed and Upwork to get a sense of how you can best tailor your resume to the specific requirements for this unique role.

Interested in learning more about analytics roles and the field of data analytics in general? Why not try out this free, self-paced data analytics course? You may also be interested in the following articles:

15 of the Best Free Open Data Sources for 2023

The rise of open data has been critical to improving access to data for analysts working on their own projects, government officials crafting policy, and academics conducting cutting-edge research across a vast array of fields.

Because anyone with a computer and some programming skills can download and access these high quality datasets, open data represents a radical shift towards the democratization of knowledge. 

In this article, we’ll explain what differentiates open data from other data sources, why it’s important to consider using it in your personal or work projects, and take you through 15 high-quality and well-regarded open data sources you can explore when you need some inspiration!

If you’d like to start working with open data right away, why not try this free 5-day data analytics course for beginners.

Here’s what we’ll cover:

  1. What is open data?
  2. Why are open data sources important? 
  3. The best free open data sources
  4. Summary

1. What is open data?

Open data simply means that the data can be used by anyone for any purpose. This allows anyone to transform, augment, share, and build both non-commercial and commercial applications on it.

Open data emerged alongside a broader drive in tech towards open source software and hardware. Many companies, academic institutions, think tanks, non-profits, and individual researchers have come together to share their data freely. 

2. Why are open data sources important?

It’s important to use data that you have the right to use, and publish, especially if you are making your work public, or creating something for commercial use. Whether you’re writing for a business, academic, or non-expert audience, your readers will be interested in knowing where your data originated from, or how the data in a dataset was collected and obtained.

Most proprietary datasets prohibit the use of data for commercial purposes, which means you can’t use them if you’re looking to sell something based on that data, without obtaining express permission from the creator. As it can take quite a long time to obtain permission, it’s almost always better to go with one of the many open datasets available online. 

In fact, there has been a push in recent years for governments and non-profit entities to publish their datasets online to increase transparency and accountability. In the U.S., for example, the OPEN Government Data Act was enacted to encourage more evidence-based policy making. 

3. The best free open data sources

Open data sources: Journalism and research

1. FiveThirtyEight

FiveThirtyEight is a news site well known for its memorable visualizations, with their signature style and formatting.

They have published some of the data and code that go into their graphics. These are hosted on GitHub and are ideal datasets for beginners to work with, as they have been cleaned for easier analysis.

Their datasets range from sports (NFL predictions) and politics (political donations) to culture (the Bechdel test applied to movies).

2. The New York Times

As one of the most popular news sites in the world, the New York Times needs no introduction. On their developer portal, they make it easy for you to work with one of their ten APIs, which let you access article metadata, best sellers lists, top stories, and more. Data is returned as JSON files, so you’ll need to have a decent grasp of programming fundamentals before trying this out.

3. The Pew Research Center

The Pew Research Center is a well-regarded think tank that regularly runs public opinion polls, among other research functions that are primarily data-driven and follow rigorous methodological standards. They work on a broad range of topics and often go beyond the U.S. in their analysis. For example, they conduct cross-national studies through the international Global Attitudes survey, and they created Data Labs to establish new ways of obtaining data to improve their current collection.

Open data sources: Government 

4. The U.S. Government

The U.S. Government has published over 335,221 datasets, which you can filter by format, geospatial boundaries, categories, and organizations. The datasets available here span a broad range of categories: agriculture, climate, energy, local government, maritime, ocean, and older adults' health. They are currently highlighting a dataset on rivers included in the Inland Electronic Navigation Chart (IENC) program, which covers thousands of miles of navigable waterways.

5. Ontario

The Canadian province of Ontario wants data to be “open by default”; this means you have access to a rich source of more than 2,700 listed datasets across categories like justice and public safety, environment and natural resources, and infrastructure and transportation. Although not all of them are ready for public access yet, it’s worth bookmarking this tab to keep an eye on when new datasets get released.

6. India’s Open Government

India’s Open Government Data Portal contains 4,738 items in its catalog of datasets. You can explore datasets by sector (Census, Water and Sanitation, Finance, Animal Husbandry), groups, state, or API. If you’re not sure where to get started, the homepage offers some useful highlights that can inspire your next project. Under the visualization carousell, you can take a look at the most viewed visualizations. Or, you can check out what the “high value dataset” currently is.

7. Singaporean Open Datasets

The Singaporean open dataset homepage looks like a dashboard because it is partially one: you can examine visualizations under “Singapore at a glance” to look at national statistics, which might give you an idea for your project. More advanced analysts will appreciate their developer resources page which explains how you can get access to one of their fourteen real-time datasets, including taxi availability, ultraviolet index, the weather forecast, and the pollutant standards index.

8. City of London

The City of London in the United Kingdom has published 1,101 datasets ranging from sport, to planning, to art and culture. These can be downloaded in a wide range of formats, and can be filtered by the level of geographical boundary (e.g. local authority, borough, or ward) and source publisher. A particularly interesting dataset tracks daily reservoir levels in London from 1989 to the present day.

Open data sources: Science and technology

9. Open Science Data Cloud

If you’re interested in using the same data that researchers work with across fields and disciplines, head on over to the Open Science Data Cloud. This platform enables the scientific community to share their extremely large datasets–think terabyte and petabyte-size, which requires more advanced programming knowledge of how to handle and train big datasets.

10. NASA

NASA publishes its datasets from its science missions; you can check out the handy visualization here for an overview of what you can access, including everything from national geospatial data assets, to ocean chemistry, to snowmelt timing maps.

There are also two other NASA data sites worth checking out: the Planetary Data System and the Earth Observing System Data and Information System (EOSDIS). These are great datasets for any project with an environmental focus.

11. CERN

The European Organization for Nuclear Research, more commonly known as CERN, has published more than three petabytes of data from its research on particle physics. Their Open Data portal contains data from the Large Hadron Collider (LHC), the world's largest and most powerful particle accelerator. For instance, you can use data from ATLAS, one of the LHC's particle physics experiments.

12. International Energy Agency

What if you have a project involving energy production and consumption? The International Energy Agency (IEA) hosts the Atlas of Energy site to share time-series statistics on energy from 1973 to the present day.

This is part of a wider IEA ecosystem of analytics tools, including country-level data, databases, and a unique energy balance flow visualized as a Sankey diagram. There are many datasets available: you can obtain data on global CO2 emissions per capita, renewables production, and electricity generation.

Open data sources: International organizations

13. European Commission

European public sector datasets have been collected and published via the European Commission's data.europa site. The site spans more than 1.5 million datasets across 36 countries, making it one of the largest data repositories online.

You can verify the quality of a certain dataset by checking its metadata quality, which helpfully grades data based on indicators such as interoperability, reusability, and contextuality.

14. World Bank

The World Bank publishes open datasets on global development. This means you can browse their datasets by country or indicator (for example, GDP or population). Their site goes beyond providing a catalog of datasets. Check out their DataBank tool, a web application that lets you perform quick analysis and easy visualization using their time-series data, right on their site, and export or share the charts or tables you create.

Some of their interesting datasets include debt flows for 120 developing countries, the World Bank’s own financial reporting, and the Living Standards Measurement Study that collects microdata on households to better quantify household behavior.

15. The WHO

The World Health Organization invites you to use its Global Health Observatory, which features a comprehensive range of health data for many countries. Their data is grouped into a long list of themes, including: assistive technology, immunization, neglected tropical diseases, and tobacco control.

If you’re looking for inspiration, check out one of their featured dashboards at the bottom of the page. The Triple Billion dashboard tracks the improvement of health of billions of people by 2023 based on a few key indicators.

4. Summary

We’ve taken you through a tour of some of the best open data sources available for use, for free, right now. Let’s quickly review what you need to know when you’re embarking on your next project in search of a dataset to use:

  • Consider using an open data source: It's important to use data that you have the right to use and publish, and open data generally includes a license you can use and cite in your own work. You also increasingly have access to a growing pool of free and open datasets online, which can only enrich your analysis.
  • Obtain high-quality data from credible organizations: In our article, we’ve provided a helpful guide to fifteen of the best sources you can use in journalism, research, the public sector, science and technology, and international trends.  

Has this piqued your interest in learning more about analytics roles and the field of data analytics in general? Why not try out this free, self-paced data analytics course? You may also be interested in the following articles:

What is SQL? The Complete Guide

Although it was created in the 1970s, SQL remains one of the most popular languages today, given how central it is to the way massive datasets are managed, stored, and retrieved. Learning SQL is also an excellent entry point into more advanced programming in data science and machine learning. So, what is SQL exactly?

In this article, we’ll cover the basics of SQL and help you understand why being able to write SQL queries, even basic ones, is an incredibly useful and marketable skill to have in the field of data analytics. Feel free to use the clickable menu to skip to any section:

  1. What is SQL?
  2. What is SQL used for?
  3. What are the benefits of using SQL?
  4. How can I learn SQL?
  5. What are the best courses for learning SQL?
  6. Key takeaways

Let’s get started!

1. What is SQL?

Before we dive into the world of SQL, let's first understand how it fits into the lifecycle of data. When data is generated, it needs to be stored somewhere. Most companies use databases such as Oracle, MySQL, or Microsoft Access. But raw data in a database does not lend itself easily to analysis without first being transformed. Enter SQL (pronounced "sequel"), which stands for structured query language, a language created at IBM in 1974 to query data in relational database management systems (RDBMS). At its simplest, this means enabling users to retrieve subsets of data, run aggregations, clean data, and more.

You can also use SQL to write data to a database, but the most common use case by data analysts is to read and retrieve the specific data they need for their analysis. 

In the decades since, it has become the standard for data queries. You might encounter slightly different flavors of SQL syntax depending on the database you use, whether it's MySQL or PostgreSQL. Most tutorials cover SQL commands such as "Select", "Where", "Order By", and "Insert Into", as they are used universally across databases.

2. What is SQL used for?

Compared with other programming languages, SQL is much easier to learn, but difficult to master. The syntax is distinct and highly structured; a simple query looks like this:

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s)

But it can quickly incorporate much more complex logic through window functions. This snippet depicts how you can integrate aggregation functions within a database query:

SELECT column_name(s),
       duration_seconds,
       SUM(duration_seconds) OVER
         (PARTITION BY column_name(s) ORDER BY start_time)
         AS running_total,
       COUNT(duration_seconds) OVER
         (PARTITION BY column_name(s) ORDER BY start_time)
         AS running_count,
       AVG(duration_seconds) OVER
         (PARTITION BY column_name(s) ORDER BY start_time)
         AS running_avg
  FROM table
 WHERE start_time < 'date'

Some queries can run into more than a thousand lines of code! These examples demonstrate why SQL is so essential to the data analytics profession. If you are already familiar with performing data transformations and aggregations from knowledge of DataFrames in Python or in R, the logic behind SQL will come easily to you.  
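
For readers coming from pandas, here's a rough, hypothetical analogue of a simple SELECT ... GROUP BY query; the sessions table and column names are made up for the comparison:

import pandas as pd

# Roughly: SELECT user_id, SUM(duration_seconds) AS total
#          FROM sessions GROUP BY user_id ORDER BY user_id
sessions = pd.DataFrame({
    "user_id":          [1, 1, 2, 2, 3],
    "duration_seconds": [30, 45, 10, 60, 25],
})

totals = (
    sessions.groupby("user_id")["duration_seconds"]
            .sum()
            .sort_index()
)
print(totals)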

Related reading: Exploring the ISNULL() SQL Function

3. What are the benefits of using SQL?

On a technical level, SQL is a language designed to let you query the data directly. This has important implications: you will experience much better performance than using other languages to achieve the same outcome. Oftentimes, you can write one line in SQL that would take many more lines of code in another language like Python. This saves analysts a lot of time, letting them focus on the analysis itself instead of the ins and outs of a general-purpose language.

Related watching: SQL vs. Python vs. R

Having SQL in your back pocket is also beneficial for practical reasons. The vast majority of companies use it to access their massive relational databases. It is a highly portable analytical skill relative to other enterprise BI tools or languages. It doesn't matter if your company uses Python or R, Power BI or Tableau: the odds are high that they use SQL and are hiring for at least basic knowledge of it. Hence, learning SQL tremendously increases your competitiveness in the market. Even if SQL doesn't feature highly in the job description, many companies choose to screen candidates with a SQL coding challenge, as it is a useful way to test for coding ability.

4. How can I learn SQL?

As a beginner, the quickest way to start learning is through online programming question banks. There is little to no developer environment setup when you try coding puzzles on platforms like LeetCode, Codewars, or HackerRank. You can select questions by difficulty and work your way up to writing increasingly advanced queries. If you’re stuck, you can find answers online and reverse engineer these solutions to understand the query logic better before trying again. 

The added benefit of practicing through question banks is that you’re likely to see questions that will appear in an interview’s coding screen too, so you’ll gain familiarity with what to expect, both in style and difficulty level, well before you start applying for roles that require SQL. 

If you are a complete beginner to programming and would like a gentler approach to learning the basics of SQL, check out our guide to SQL to learn some beginner queries. You could also check out some popular tutorials such as SQL Zoo and W3 Schools. These interactive tutorials invite you to write SQL queries in a guided walkthrough of the basics.

If you want more direct guidance with learning SQL, there’s a bevy of SQL certifications out there to really deep-dive into the world of this extremely useful programming language. 

The bottom line is that the best way to learn SQL is by writing it and becoming familiar with conceptualizing queries. Interacting directly with databases will help you improve both the query logic and query optimisation, which are increasingly important things to know when you need to retrieve very large datasets. 

5. What are the best courses for learning SQL?

A quick Google search will lead you to many SQL courses and tutorials online, with free and paid options. We recommend that you take a look at this short list of courses that we have taken a closer look at for quality and depth of coverage:

The CareerFoundry Data Analytics Program

Our flagship 8 month data analytics course features a strong focus on databases and SQL. Check out the curriculum on this page (scroll down and click on “Data Immersion” then click on “Achievement 3”) to see how it quickly covers the basics (filtering, summarizing, cleaning, table joins) before moving on to more complex, industry-level commands such as subqueries and common table expressions (CTEs). It finishes with a section on how to best present your SQL results to colleagues and stakeholders. This comprehensive overview enables you to be interview ready when the time comes to job search. 


The Springboard Data Analytics Bootcamp

This 6 month online bootcamp is for professionals with prior experience working with programming tools, and best for those looking to switch roles or dive into more advanced material more quickly. The SQL component here covers the basics: best practices, database tools, the differences between structured and unstructured databases, and includes more advanced material with Mode SQL. 

The General Assembly Data Analytics Course

This shorter one week course (or ten weeks part-time) provides a lighter introduction to data analytics. Its beginner-friendly curriculum provides hands-on experience with the popular BI tool Tableau as well as some SQL knowledge. Students work to integrate their newfound knowledge into creating a capstone project that can be added to their analytics portfolio. As this program is relatively short, it’s best for students who are curious about data analytics, or for professionals looking to upskill and learn new tools to apply to their work. 

6. Key takeaways

What is SQL? We’ve taken a look at the history behind SQL and how its important role in querying relational databases has led to its immense popularity in the decades since. The best way to learn SQL is through hands-on experience, but it can be overwhelming to know where to get started given the wide range of resources online.

We hope the resources suggested here can kickstart your journey into the world of SQL, whether your interest stems from a career change into analytics, or if you would like to use your newfound skills to add more value at work. 

Has this piqued your interest in learning more about analytics roles and the field of data analytics in general? Why not try out this free, self-paced data analytics course? You may also be interested in the following articles:

What Is Spatial Analysis, and How Does It Work?

There are few fields within data analytics that are more exciting than spatial analysis.

Even if you haven’t yet worked on a project involving geodata, it’s likely that you’ve already interacted with applications that have deeply incorporated spatial analysis into their core functionality. 

In recent years, spatial analytics has become more and more important with the rise of big data and the integration of the Internet of Things into everyday life. The growth of open-source and enterprise tools has also made it easier for beginners to incorporate spatial attributes into data science projects. While this all sounds great, it can be a bit of a challenge figuring out where to start learning more about spatial data.

In this handy introductory article, we’ll break down the complexity of spatial analysis and help you understand what it is, why it matters, how to perform your own spatial analysis, and take a look at real-world use cases to understand its potential. 

If you’d like to get your hands into some data analytics, check out this free 5-day data short course.

Here’s what we’ll cover:

  1. What is spatial analysis?
  2. How does spatial analysis work?
  3. The spatial analysis process
  4. Examples of spatial analysis
  5. Summary and next steps

1. What is spatial analysis?

Spatial analysis, sometimes known as spatial data science, works on transforming geographical features into usable data points for quantitative analysis. This can include using characteristics like distance between places, location, boundaries, and networks in statistical analysis or machine learning.

Analysts who know how to use spatial analytical tools can enrich their analysis by making use of geospatial data, which are often accessible through open source tools like OpenStreetMap, and through private APIs such as Mapbox. Geospatial data includes things like longitude and latitude, satellite images, and zip codes. 

As a subset of data science, spatial analysis features all the classic interlocking components of analytics. You can create complex models and simulations and produce visualizations that incorporate spatial data. Seeing how patterns surface across a given space, whether it’s as small as a county or as large as a continent, can broaden the value of your insights.

Understanding how space is used and interacted with in any setting is a key part of any analysis. For example, spatial data can help triage resources in search and rescue operations, or identify and predict the best location to extend public transit based on projected population growth in a city. If you’ve used ride-sharing applications like Uber or Lyft, you’ve benefited from algorithms that connect drivers to customers at a price determined by demand across a city, and that plan the most efficient routes to get you to your destination.

If you’re not yet convinced, think of the enormous and valuable datasets generated by smartphones tracking your health performance indicators as you move throughout the places you live in and travel to. It presents an unprecedented opportunity for analysts to bring new data to existing questions in fields like healthcare, science, business, and urban planning. 

2. How does spatial analysis work?

Spatial analysis differs most from other kinds of analysis in the types of data it uses. If you’re new to this, you might be unfamiliar with the wide range of data types and formats used to capture geospatial attributes. These include vector and raster data, shapefiles, GeoTIFFs, GeoJSON files, and more. If you’ve only worked with CSV files and Python dataframes, these might sound a bit intimidating, but they’re really quite straightforward to understand and use.

We’ll start with how spatial data can be divided into two categories: geometric and geographic data. Geometric data is the more straightforward of the two, as it simply relates to a two-dimensional mapping system; you’ve encountered it if you’ve ever used applications like Apple or Google Maps. Geographic data is determined by its location on a sphere, like the planet we live on! You use latitude and longitude to pinpoint a specific location or draw boundaries. We obtain this data through satellites and global positioning systems (GPS), and it has many use cases.

How data is stored looks different in spatial analysis, too. The two most common data formats you might encounter are vector and raster files. Vector files simply use points (coordinate pairs), lines (connected points), and polygons (connected lines). This lets you move from individual locations (points) to grouped boundaries (polygons) that can represent political entities (countries) or administrative jurisdictions (different school boards). Vector data is stored in formats that can be proprietary (e.g. Esri shapefiles and file geodatabases) or open source (e.g. GeoJSON).
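
To make the vector idea more concrete, here’s a minimal sketch in plain Python of how a polygon boundary is typically represented in an open format like GeoJSON. The coordinates and property names below are made up purely for illustration.

{% highlight python %}
import json

# A minimal GeoJSON Feature: a polygon (connected lines) built from
# coordinate pairs, with properties describing the boundary it represents.
# All values here are illustrative, not real boundaries.
school_district = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-79.40, 43.65],
            [-79.38, 43.65],
            [-79.38, 43.67],
            [-79.40, 43.67],
            [-79.40, 43.65],  # a polygon closes by repeating its first point
        ]],
    },
    "properties": {"name": "Example school board", "id": 1},
}

# GeoJSON is just structured text, so it can be written out like any JSON file
print(json.dumps(school_district, indent=2))
{% endhighlight %}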

Raster data is more commonly seen in satellite imagery and in geographic information systems (GIS). Each pixel within a raster holds a value that corresponds to a patch of geographic space. This value can be a unit of measurement, such as temperature, elevation, or population density, which means analysts can manipulate these variables to see variations through time and space.
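
As a rough illustration of the raster idea, here’s a tiny grid of made-up temperature values held in a NumPy array; each cell plays the role of a pixel covering a patch of geographic space.

{% highlight python %}
import numpy as np

# A tiny "raster": a 3x4 grid where each pixel stores a value for its
# cell of geographic space; here, made-up temperatures in °C.
temperature_grid = np.array([
    [21.5, 22.0, 22.4, 23.1],
    [20.9, 21.7, 22.2, 22.8],
    [20.4, 21.1, 21.9, 22.5],
])

# Analysts can manipulate these values like any other array,
# e.g. averaging over the whole area or comparing two dates.
print(temperature_grid.mean())                          # mean temperature across the grid
print(temperature_grid.max() - temperature_grid.min())  # range across the grid
{% endhighlight %}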

3. The spatial analysis process

As spatial analysis involves industry-specific data formats and spatial tools, much of it is likely unfamiliar to the average data analyst unaccustomed to working with maps and mapping data. We’ll take you through each step in a typical spatial analytical workflow to demystify the process!

Data collection & transformation

As always, we begin with data collection. You likely have a question in mind, whether it’s one from your own research or one a client has tasked you with answering. If you don’t have access to spatial data, there are many open-source spatial datasets you can download to enrich your existing datasets. For example, you can look at the Natural Earth website for public domain vector and raster map datasets, or the Earthdata repository of data compiled by NASA. We’ve also written an article on ten great places to find free and open datasets.

If you do have access to the data you need, you can use SQL to read it and create dataframes for analysis. As a final step at this stage, remember to carry out data preprocessing: clean your data, standardize units of analysis, merge multiple datasets, and impute any missing data.
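
As a rough sketch of what that preprocessing step might look like in Python, the snippet below assumes a hypothetical boundary file (regions.geojson) and indicator CSV (household_income.csv) that share a region_code column; the file names and columns are purely illustrative.

{% highlight python %}
import geopandas as gpd
import pandas as pd

# Hypothetical inputs: a boundary file downloaded from an open data portal
# and a CSV of indicators keyed by the same (made-up) region code.
regions = gpd.read_file("regions.geojson")          # vector boundaries
indicators = pd.read_csv("household_income.csv")    # non-spatial attributes

# Basic preprocessing: standardize the join key, merge, and handle gaps.
regions["region_code"] = regions["region_code"].str.strip().str.upper()
indicators["region_code"] = indicators["region_code"].str.strip().str.upper()

merged = regions.merge(indicators, on="region_code", how="left")
merged["median_income"] = merged["median_income"].fillna(
    merged["median_income"].median()   # simple imputation for missing values
)
{% endhighlight %}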

Data exploration & visualization

Next, it’s time to get a sense of what your data looks like. It’s especially important to do this with mapping data, where trends (or the absence of them) are much more easily seen when you overlay a map with the attributes you’re interested in. How you proceed next depends on what tools you have access to. If you have a license for ArcGIS Pro, an industry leader in GIS software, its user-friendly interface makes it easy for beginners to start creating maps for visualization and exploration without advanced coding ability.

For the rest of us who would like to learn more about geospatial analysis, it’s far more likely that we’ll need to learn how to use open-source Python packages such as GeoPandas and Datashader to manipulate and make sense of our spatial data.

These are just a few examples of the many fantastic tools in the Python ecosystem. Between them, they support vector and raster formats, datasets big and small, and can even deploy visualizations as interactive applications that your end users can explore for their own analysis.
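
To give a flavor of that exploration step, here’s a hedged sketch that assumes a hypothetical merged_regions.geojson file with a median_income column, and uses GeoPandas’ plotting (which builds on matplotlib) to draw a quick choropleth.

{% highlight python %}
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical file: region boundaries already enriched with an attribute column.
merged = gpd.read_file("merged_regions.geojson")

# A choropleth map is often the quickest way to spot spatial patterns.
ax = merged.plot(column="median_income", legend=True, cmap="viridis")
ax.set_title("Median household income by region")
ax.set_axis_off()
plt.show()
{% endhighlight %}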

Data modeling & prediction

We now turn to every analyst’s favorite step in the typical data science workflow: building a model for prediction.

In spatial analysis, the focus is on modeling relationships between geographic locations to predict a target variable of choice, given the features in your model. These features can be both geographic and non-geographic; in fact, you’re encouraged to use both, as combining them is a powerful way to carry out multi-dimensional analysis for better results.

There are a few things to take note of when conducting spatial analysis. It can be useful to gain a theoretical grounding in how to use spatial autocorrelation, how to account for heterogeneity of features over a geographic region, how to perform regression analysis with spatial features, and how to create new variables by interacting geographic and non-geographic variables.
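
To make that last point more tangible, here’s a minimal modeling sketch. The locations.csv file, its column names, and the sales target are all hypothetical; it simply shows geographic and non-geographic features, plus one interaction term, feeding a scikit-learn regressor.

{% highlight python %}
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical modeling table: one row per location, mixing geographic
# features (latitude, longitude, distance to city center) with
# non-geographic ones (income, population density).
df = pd.read_csv("locations.csv")

features = ["latitude", "longitude", "dist_to_center_km",
            "median_income", "pop_density"]

# A simple interaction between a geographic and a non-geographic variable
df["density_x_dist"] = df["pop_density"] * df["dist_to_center_km"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features + ["density_x_dist"]], df["sales"], random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
{% endhighlight %}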

Learn more: What is data modeling?

4. Examples of spatial analysis

Now that we have a solid understanding of the nuts and bolts of where to access spatial data and how to analyze different data formats, let’s turn to some real-world examples to see how powerful spatial tools can be when put to use across industries.

Competitive retail analysis

Imagine a new fast food chain looking to start expanding into more locations nationwide. This can involve considerable resources and financial risk.

How do they decide where the first franchise should be located? What methods should they use to choose between different cities? This is a hard problem, especially when we consider how important the quality of a location is for a fast food chain, which gets most of its revenue from being near dense population centers.

Here, spatial analysis can offer some insight and a way in which to think through the problem. We can construct a dataset with information about each city’s retail competitors, city gross domestic product, age distribution, median household income, and more. If this is a fast food chain that primarily sells hot food, we can imagine it might do better in colder regions, and we can add data on weather variation to capture this. 

Once the model has been built and visualizations created, we can then gain a holistic picture of each city’s competitive profile as a whole, instead of narrowly focusing on non-spatial attributes and missing key factors that would drive a successful first launch. 

Urban planning & design

Urban planners have long been leaders in adopting geospatial frameworks to better understand and serve the cities they make long-term plans for. In fact, many advanced academic programs combine training in leading geospatial tools such as GIS with a traditional curriculum on urban planning.

Oftentimes, urban planners are faced with competing options for how to develop or revitalize an economically depressed region or area of a city, and they need to determine which project has the best potential return on the city’s investment. This is where geospatial analysis shines: planners can take a dataset with historical information about the geography of the region’s economic sectors over time and enrich it with statistics such as population density, housing developments, weather, and household income.

They can then create a model to predict what will work in the future, for a certain plot of land, based on what has worked in the past. Presenting their visualization and conclusions through a methodologically sound approach can help persuade key political decision makers to make evidence-based decisions for a better outcome. 

5. Summary and next steps

Spatial analysis offers so much potential in taking your analytics project to the next level.

If you’re interested in learning more about career pathways in data analytics, consider gaining an understanding of spatial analysis. As more and more companies embrace cutting-edge techniques and tools, having spatial analytics in your data toolbelt will set you apart from the competition. To summarize the core concepts, we recommend keeping these steps in mind when you embark on your first project:

  • Data collection & transformation: A good analyst remembers the mantra: garbage in, garbage out. You’ll want to make sure your data was collected appropriately for the question you’ve been tasked to answer. Make use of the public domain datasets on offer, as they can often enrich the datasets you already have, allowing you to include a spatial component in your analysis.
  • Data exploration & visualization: With mapping data, you’ll want to know what your collected data looks like. Depending on your skill level, or your access to enterprise tools, you have many options for plotting data during exploration. For more advanced programmers, there are many Python and R libraries that can read different data formats, handle large datasets, and even let you deploy an application for end-user interaction.
  • Data modeling & prediction: Now that you have a sense of some trends in your data, you can model relationships between geographic locations to predict your target variable of choice. This is where spatial analysis shines: using both geographic and non-geographic features enables multi-dimensional analysis for better results.

Has this piqued your interest in learning more about analytics roles and the field of data analytics in general? Why not try out this free, self-paced data analytics course? You may also be interested in the following articles:

SQL Interview Questions (and How To Answer Them) https://careerfoundry.com/en/blog/data-analytics/sql-interview-questions/ Mon, 12 Sep 2022 15:03:05 +0000 https://careerfoundry.com/en/?p=15195 A quick look at job postings for data analysts will reveal how knowledge of SQL is listed as a requirement for most of them. As interviews tend to be a stressful experience, this blog post aims to help demystify different aspects of SQL interviews, and how you can best prepare ahead of time to ace them. We’ll take a look at specific types of SQL interview questions and share some tips on how to study for them!

  1. What should I expect from an SQL interview?
  2. Types of SQL interview questions for data analysts
  3. How to ace your SQL interview
  4. Summary and next steps
Related watching: What tools do data analysts use?

1. What should I expect from an SQL interview?

Whiteboard test

A whiteboard test is one where you solve a technical coding challenge in real time, in front of an audience, writing code or drawing conceptual diagrams on a whiteboard. These tests typically evaluate not just your technical acumen but your non-technical skills as well, as interviewers look at how you communicate your solutions, think through the challenge in a systematic way, and handle questions on the fly.

Oftentimes, interviewers are not looking for you to write a perfect SQL query; being able to explain how it would work through pseudocode can be good enough. The emphasis is less on whether you remember SQL syntax perfectly and more on whether you understand how SQL works and how you would retrieve the required data from a database.

Live coding

Most companies will introduce a live coding test (or two) as part of the technical screen. This typically involves an in-person or virtual exercise completed live, with the interviewer(s) watching how you tackle the question and code the solution. Whether in person or remote, you will generally do this using your preferred code editor, such as PyCharm or Visual Studio Code.

As you can type your solutions on a computer, the bar is a little higher than in a whiteboard test. You’ll be expected to have stronger knowledge of SQL syntax, since you can run your queries live to verify whether they return the correct tables or whether any syntax errors pop up.

Take-home task

The take-home assignment, a staple of data science interviews, is less commonly seen in SQL interviews. That said, it can come up in companies that value the opportunity to look more in-depth at your SQL skills within the context of a larger analytical question. The focus here is less on your ability to solve a niche, abstract SQL question, and more on how you use SQL with the end goal of extracting business insights. 

Companies typically present you with synthetic data that matches the distribution of data they deal with daily and a few business questions to answer. At a minimum, the solution should include clean, optimized SQL syntax, since you have ample time to work on it and run queries to check for errors. You may be asked to present your results in a Jupyter notebook or through a short slide deck presentation.

2. Types of SQL interview questions for data analysts

The types of SQL interview questions that data analysts get generally fall into these three categories of increasing difficulty.

SQL interview questions: Defining SQL terms

Sometimes, the initial interview screen will involve conceptual questions to test whether you have sufficient knowledge of SQL and how it relates to databases. This may involve questions as basic as listing the types of joins in SQL, explaining what a common table expression (CTE) is, how window functions work, or what an index is, as well as trickier questions like the difference between similar clauses (e.g. HAVING versus WHERE).

Try not to overlook basic questions like these when preparing for SQL interviews, as companies want to ensure that your knowledge of SQL is not surface-level or overly focused on writing queries without understanding how it fits into the world of database management. 
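
If you’d like a quick refresher on one of those classic “similar clauses” questions, here’s a small, self-contained sketch using Python’s built-in sqlite3 module and a made-up orders table to show how WHERE and HAVING behave differently.

{% highlight python %}
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (city TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Berlin", 120), ("Berlin", 30), ("Paris", 60), ("Paris", 60), ("London", 40), ("London", 40)],
)

# WHERE filters individual rows *before* any grouping happens...
cur.execute("""
    SELECT city, SUM(amount)
    FROM orders
    WHERE amount > 50
    GROUP BY city
    ORDER BY city
""")
print(cur.fetchall())   # [('Berlin', 120.0), ('Paris', 120.0)]

# ...whereas HAVING filters the grouped results *after* aggregation.
cur.execute("""
    SELECT city, SUM(amount)
    FROM orders
    GROUP BY city
    HAVING SUM(amount) > 100
    ORDER BY city
""")
print(cur.fetchall())   # [('Berlin', 150.0), ('Paris', 120.0)]
{% endhighlight %}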

SQL interview questions: Clarifying SQL queries

The next level of difficulty involves testing your knowledge of how SQL queries actually work. Interviewers will present you with a pre-written query and ask you a few questions related to its purpose and structure. 

They may ask you to find any errors and describe how you would rewrite the query correctly. How would you modify the query to return a result that better illuminates the more profitable lines of business? What SQL syntax would you use to transform the data type of a certain column? A more challenging question in this category might ask you to predict what the query returns, especially if the query is complex, with multiple GROUP BYs and aggregations.

Related reading: SQL Cheatsheet: Learn Your First 8 Commands

SQL interview questions: Writing SQL queries

The hardest level involves writing a query for a stated problem. You will need to have a strong grasp of the basics for this stage, such as knowing the correct order of SQL statements and being fluent in common keywords like SELECT, FROM, WHERE, ORDER BY, and AS. You will also need to know how to use aggregation functions to count or find minimum and maximum values, grouped by a category or date, and to join as many tables as necessary to retrieve the correct subset of data.

To show off your SQL skills, learn how to use more advanced techniques such as window functions. These are functions that define a subset of data (a partition) over which you can run aggregations. Other techniques include common table expressions (named intermediate results you can reference later in the query) and subqueries (queries within queries).
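
To illustrate what those techniques look like in practice, here’s a compact sketch combining a CTE with a window function. It uses Python’s built-in sqlite3 module, a made-up sales table, and assumes a SQLite build recent enough (3.25+) to support window functions.

{% highlight python %}
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, month TEXT, revenue REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", "2023-01", 100), ("North", "2023-02", 150),
    ("South", "2023-01", 90),  ("South", "2023-02", 60),
])

# The CTE (WITH ...) builds an intermediate monthly summary; the window
# function then ranks each month's revenue within its region partition.
cur.execute("""
    WITH monthly AS (
        SELECT region, month, SUM(revenue) AS total
        FROM sales
        GROUP BY region, month
    )
    SELECT region, month, total,
           RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rank_in_region
    FROM monthly
    ORDER BY region, rank_in_region
""")
for row in cur.fetchall():
    print(row)   # e.g. ('North', '2023-02', 150.0, 1)
{% endhighlight %}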

Another way to earn extra points here is to pay attention to query performance. This simply means optimizing the query to run quickly when dealing with very large datasets. Even if the sample dataset you’re presented with is just a small, toy dataset, it’s worth signaling your awareness of this concept. Ensure that your query includes limits or makes use of indexed columns, and mention how you would schedule expensive queries to run off-peak to reduce computation cost.
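
As a rough example of what signaling performance awareness can look like, the sketch below (again using sqlite3 and a made-up events table) creates an index, adds a LIMIT, and uses SQLite’s EXPLAIN QUERY PLAN to check that the index is actually used.

{% highlight python %}
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, ts TEXT)")

# Without an index, filtering on user_id scans the whole table;
# with one, the engine can seek directly to the matching rows.
cur.execute("CREATE INDEX idx_events_user ON events (user_id)")

# EXPLAIN QUERY PLAN is SQLite's way of showing how a query will run.
cur.execute("EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42 LIMIT 10")
print(cur.fetchall())   # should mention the index (exact wording varies by SQLite version)
{% endhighlight %}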

3. How to ace your SQL interview

The best way to ace an interview is to prepare, prepare, prepare. This is especially true for the SQL interview, where so much depends on technical knowledge that can only be gained through consistent practice and familiarity.

With whiteboard tests, it helps to replicate the interview environment by purchasing your own whiteboard to practice on. Ask a friend to roleplay as the interviewer so you can get comfortable writing code in front of an audience, without technological aids like GitHub Copilot or searching Stack Overflow for answers.

Brush up on your SQL syntax by practicing on as many questions as you can. Live coding questions are especially tough as not many are used to being watched while coding. To help with the nerves, it’s critical to get exposure to as many types of SQL questions as possible. These questions vary greatly in complexity and can be tricky to prepare for. Many aspiring programmers turn to online programming question banks like LeetCode and HackerRank, which have a wide range of SQL coding challenges. 

Other users often post their solutions online, and reverse-engineering better solutions is a great teaching tool as well. The great thing about practicing on these well-regarded platforms is that many companies draw their questions from there, so you’ll likely come across questions that you have worked on previously. 

With take-home tests, you can exceed expectations and really stand out as a candidate by demonstrating strong verbal and written communication skills. This means including comments and section headers in your Jupyter notebook solution, and connecting the technical results to the business questions posed.

As most candidates focus on developing strong technical skills, you’ll make an impression if you always make sure that the SQL code is not only clean and optimized, but also situated within the larger and more important context of how the results impact business outcomes. 

Related reading: The common data analytics interview questions you’ll be asked

4. Summary and next steps

Interviews are tough, and tech interviews are known for being especially grueling, with many rounds of technical screens in different formats that are hard to prepare for. We’ve given you a broad overview of what to expect in a typical SQL interview, the types of SQL interview questions you might receive, and our best strategies for preparing for them, whether it’s a live whiteboard test or a remote take-home assignment.

Always remember to draw the technical results back to the big picture of business outcomes, as companies appreciate that you understand SQL is ultimately one of many available tools used by analysts to drive strategic impact. 

Interested in learning more about SQL and the world of data analytics as a whole? Check out this list of SQL certifications to brush up on your knowledge. Or why not try out our free, self-paced data analytics short course?

You may also find yourself interested in the following articles:

What Are CRUD Operations? https://careerfoundry.com/en/blog/data-analytics/crud-operations/ Wed, 07 Dec 2022 10:43:53 +0000 https://careerfoundry.com/en/?p=20873 Create, read, update, and delete—or CRUD—stand for the four fundamental operations in computer programming. They help structure storage processes and management in basic computer applications.

As CRUD is a foundational concept in programming, it’s essential to gain a thorough understanding of what these operations are and how to perform each one appropriately. In data analytics, if that’s the field you’re looking to get into, they serve as the main way in which you interact with databases.

While this sounds important, it might not be clear how CRUD operations are used in everyday analytics workflows. What do CRUD operations look like? How can we apply the four operations to the analytics process?

In this beginner’s guide, we’ll take you through the basics of CRUD, including:

  1. What are CRUD operations?
  2. CRUD operations: Create
  3. CRUD operations: Read
  4. CRUD operations: Update
  5. CRUD operations: Delete
  6. Applications of CRUD operations
  7. Next steps

Ready to learn more about CRUD operations? Let’s dive in!

1. What are CRUD operations?

Before we can understand the importance of CRUD operations, we first need to take a step back and look at why they’re necessary in the first place, and how this relates to data storage.

Most companies need to store their data in a way that remains accessible at a later time, even after the power has been turned off. This is known as persistent storage: simply storing generated data or documents in a saved file. Without persistent storage, users cannot retrieve their data for later analysis. So, when data is created, it needs to be stored somewhere, and that somewhere is typically a hard drive.

Once we have a place to store data, we will need to keep it organized for easier retrieval and capacity management. Enter the data analyst’s favorite helpers: the relational database and its tools for reading and transforming the data. At its simplest, a database consists of tables with rows and columns. Depending on the tools you feel more comfortable with, you can either use a GUI or programming language to execute CRUD operations on the database. 

After the data has been stored, you might want to change it to update records with new data, or delete records that have been deemed no longer necessary. As analysts, our day-to-day work consists of cleaning data, transforming variables to create new columns, performing aggregations across tables, and inputting missing records. If we have new data being generated every hour, we might want to create a data pipeline to make sure our tables continuously update to reflect these new additions. None of these actions would be possible without CRUD operations. 

This brings us back to what CRUD is all about. It helps us make use of the power of persistent storage and relational databases. At a high level, we can see how CRUD operations inform and carry out database management and design. They also make it easier for database engineers and analysts to work with databases and to ensure appropriate security controls. In relational databases, CRUD operations are most commonly carried out with SQL, one of our favorite database languages. If you haven’t yet encountered SQL, do check out our complete beginner’s guide that shows you how to write SQL queries, and you can reference our handy cheat sheet of the eight most important SQL commands you’ll need.

Next, we’ll look at the individual elements of CRUD in detail: create, read, update, delete.

2. CRUD operations: Create

The first step of CRUD operations is Create, which does exactly what it implies: it creates an entry. Adding new rows to a table can be done with the Create command.

As with any programming method, you have multiple options for doing so. In SQL, the Create operation is carried out with the Insert command.

The Insert Into command lets you add values for specific columns:

INSERT INTO housing_table (price, city, num_bedrooms, type)

VALUES (500000, 'Toronto', 4, 'detached')

You can add values directly without referencing the specific column. This is useful when you are adding values for all available columns:

INSERT INTO housing_table

VALUES (400000, 'New York', 1, 'studio')

You can also add data from another table to your table through a more complex SQL statement. This is a recommended way to import a large dataset from one table to another. 

INSERT INTO housing_table (price, city, num_bedrooms, type)

SELECT price, city, num_bedrooms, type

FROM sales_table

The important thing to keep in mind here is that users cannot create new columns, only new rows. To add new columns, you might need to request special permission from the database administrator. 

3. CRUD operations: Read

Retrieving the data you need can be done with the read function, which refers to the Select command.

This is likely the first way that most of us encounter SQL queries. Select is the way by which we retrieve the records we need in a table’s rows and columns. 

A simple Select statement to retrieve the full table looks like this (the asterisk is shorthand for all of a table’s columns; with no filter applied, every row is returned):

SELECT * 

FROM housing_table

We’ll more likely want to look at a filtered version of the table by a given criterion or set of criteria. In this next statement, we only want to look at a table of two columns, ordered by city name, and limiting the number of retrieved rows to 10. 

SELECT price, city

FROM housing_table

ORDER BY city

LIMIT 10

There are many further clauses available for use with the Select statement, which allow you to add window expressions, filter rows based on a value or condition, and perform GROUP BY aggregations.

4. CRUD operations: Update

We can use the Update command to edit existing data quickly. In this next example, we’ll update a record at id number 249 with the following new data points:

UPDATE housing_table

SET city = 'Shanghai', type = 'condo', price = 350000

WHERE id = 249

Hence, you need to specify the columns to be updated together with their new values. It’s also suggested that you limit the number of rows affected with a precise Where clause, as updating too many rows at once can create concurrency issues: conflicting versions of your data that arise when different processes try to update and read it at the same time.

5. CRUD operations: Delete

Finally, we turn to the delete function and command. This allows you to remove records that meet specified conditions. A simple statement removing records that match a condition (in this case, entries belonging to the city of Tokyo) looks like this:

DELETE FROM housing_table

WHERE city = 'Tokyo'

You can also remove entire columns:

ALTER TABLE housing_table

DROP COLUMN type

If you want to empty the table, you can run Delete without a Where clause, which removes every row (to remove the table itself, you would use the Drop Table command instead):

DELETE FROM housing_table

Unsurprisingly, the delete function can lead to catastrophic outcomes if tables or important columns are accidentally deleted due to a coding or communication error. This is where extra caution must be taken: confirm with multiple parties that the action you’re about to take is correct before proceeding.

It’s also helpful to verify whether users have access to a hard or soft delete: a hard delete permanently removes records, while a soft delete only flags a record as deleted (for example, by updating a status column), leaving the underlying data in place.
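
To tie the four operations together, here’s a compact, hedged run-through using Python’s built-in sqlite3 module. It reuses the hypothetical housing_table from the examples above (with an added is_deleted column) and shows the soft-versus-hard delete distinction at the end.

{% highlight python %}
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE housing_table (
    id INTEGER PRIMARY KEY, price INTEGER, city TEXT,
    num_bedrooms INTEGER, type TEXT, is_deleted INTEGER DEFAULT 0)""")

# Create: add a new row
cur.execute("INSERT INTO housing_table (price, city, num_bedrooms, type) VALUES (?, ?, ?, ?)",
            (500000, "Toronto", 4, "detached"))

# Read: retrieve a filtered view of the table
cur.execute("SELECT price, city FROM housing_table ORDER BY city LIMIT 10")
print(cur.fetchall())

# Update: change existing values for a specific row
cur.execute("UPDATE housing_table SET city = ?, type = ?, price = ? WHERE id = ?",
            ("Shanghai", "condo", 350000, 1))

# Delete: a soft delete flags the row instead of removing it...
cur.execute("UPDATE housing_table SET is_deleted = 1 WHERE city = ?", ("Shanghai",))
# ...while a hard delete removes it permanently.
cur.execute("DELETE FROM housing_table WHERE is_deleted = 1")

conn.commit()
conn.close()
{% endhighlight %}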

6. Applications of CRUD operations

It’s easy to imagine how CRUD operations work in practice, since virtually every company in today’s business environment deals with data in some shape or form. We’ll take a look at one example of how you might interact with or implement CRUD operations in a business setting.

With a better understanding of practical applications, you could even start to assess new ways of better integrating them into existing data analytics workflows or propose projects that can yield new insights. 

Airline industry

Airlines not only deal with massive amounts of data (flight prices, schedules, ticket sales, staffing levels, and travel locations), but they also need to ensure that this data stays up to date, given that flight prices and schedules can fluctuate from one second to the next due to market demand and scheduling changes.

We can imagine that the internal IT team manages hundreds or even thousands of relational databases connecting these data to each other, for access by many internal teams including sales, marketing, customer services, and human resources:

  • A flight prices table that contains departure and arrival location, dynamic pricing determined by an algorithm, number of available seats, number of seats sold, historical sales figures for the flight route, plane ID, and more.
  • A schedule table that contains information about the departure and arrival airports, time of departure, time of arrival, estimated travel time, historical cancellations, flight ID, staffing levels, and number of seats sold.
  • A staffing table that includes employee names, internal staffing identification number, home country, citizenship, contract status, length of employment, seniority status, and upcoming scheduled shifts.

If the airline decides to expand to offer flights to Southeast Asia, they can create new records to update the flight prices and schedule tables to reflect the new locations. When existing air stewards or stewardesses resign or retire, the human resources department will delete their records from the database. 

If the business analytics team wants to do a study of unprofitable routes, they can read the flight prices table to retrieve historical sales figures for flights that have continuously undersold their capacity. Finally, the data team can create scripts that update scheduling changes in the schedules table once new information arrives about delays or early arrivals. 

7. Next steps

We hope that this article has demystified CRUD operations. Together with databases, the four operations form the backbone of modern analytical workflows.

Without CRUD operations, we wouldn’t be able to organize our data, retrieve the subsets that we need, transform existing records, or remove outdated and unimportant records. 

Let’s quickly review the features that make up CRUD operations as they just might come in handy in your next analytics project: 

  • Create: The Insert Into command lets you add new rows to a table; combined with Select, it’s also the recommended way to transfer large datasets from one table to another.
  • Read: The Select function enables you to retrieve data, whether it’s the entire table or a table filtered by certain conditions. 
  • Update: Update allows you to change existing data with newer values. 
  • Delete: When columns, records or entire tables are no longer needed, perform the Delete command to do a soft or hard deletion. 

You may also be interested in learning about CASE statements, which are often used when running CRUD operations.

Has this piqued your interest in learning more about analytics roles and the field of data analytics in general? Why not try out this free, self-paced data analytics course? You may also be interested in the following articles:

What Exactly Is Poisson Distribution? An Expert Explains https://careerfoundry.com/en/blog/data-analytics/what-is-poisson-distribution/ https://careerfoundry.com/en/blog/data-analytics/what-is-poisson-distribution/#respond Tue, 28 Apr 2020 07:00:00 +0000 https://careerfoundry.com/en/blog/uncategorized/what-is-poisson-distribution/ If you’re just getting started with data analytics, you’ll be getting to grips with some relatively complex statistical concepts.

One such concept is probability distribution—a mathematical function that tells us the probabilities of occurrence of different possible outcomes in an experiment. There are six main types of distribution, but today we’ll be focusing on just one: the Poisson distribution.

By the end of this article, you’ll have a clear understanding of what the Poisson distribution is and what it’s used for in data analytics and data science. If you’d like a hands-on introduction to this world in general, why not try out CareerFoundry’s free 5-day data course?

I’ve divided our guide as follows:

  1. What is the Poisson process?
  2. What is the Poisson distribution?
  3. What is the Poisson distribution used for?
  4. Key takeaways

So, what exactly is a Poisson distribution? Allow me to explain!

1. What is the Poisson process?

Before we talk about the Poisson distribution itself and its applications, let’s first introduce the Poisson process.

In short, the Poisson process is a model for a series of discrete events where the average time between events is known, but the exact timing of events is random. The occurrence of an event is also purely independent of the one that happened before.

So let’s bring this theory to life with a real-world example. We all get frustrated when our internet connection is unstable. If we assume that one failure doesn’t influence the probability of the next one, we might say that it follows the Poisson process, where the event in question is “internet failure”. All we need to know is the average time between these failures. However, there is a set of criteria that needs to be met:

  1. The events of such a process are independent of each other.
  2. The average rate of event occurrences per unit of time (e.g. per month) is constant.
  3. Two events cannot occur at exactly the same instant (e.g. two internet failures cannot happen at precisely the same moment).

In our internet example, we assume that the events are independent and unrelated; that is, one instance of internet failure doesn’t affect the probability of the next instance. But sometimes, this might not be the case.

Another frequently given example of a Poisson process is bus or ride-hailing arrivals (think Uber pickups). However, these are not a true Poisson process, because the arrivals are not completely independent of one another. Even for buses that do not run on time, we cannot be sure that one late arrival doesn’t affect the arrival time of the next bus.

On the other hand, cases such as customers calling a help center or visitors landing on a website are more likely to be independent and would probably be considered a more solid example of the Poisson process.

2. What is the Poisson distribution?

While the Poisson process is the model we use to describe events that occur independently of each other, the Poisson distribution allows us to turn these “descriptions” into meaningful insights. So, let’s now explain exactly what the Poisson distribution is.

The Poisson distribution is a discrete probability distribution

As you might have already guessed, the Poisson distribution is a discrete probability distribution which indicates how many times an event is likely to occur within a specific time period. But what is a discrete probability distribution?

Right, let’s first align on the concepts! A probability distribution is a mathematical function that gives the probabilities of possible outcomes happening in an experiment. As you might already know, probability distributions are used to define different types of random variables. These variables can be either discrete or continuous. When talking about Poisson distribution, we’re looking at discrete variables, which may take on only a countable number of distinct values, such as internet failures (to go back to our earlier example).

Given all that, Poisson distribution is used to model a discrete random variable, which we can represent by the letter “k”. As in the Poisson process, our Poisson distribution only applies to independent events which occur at a consistent rate within a period of time. In other words, this distribution can be used to estimate the probability of something happening a certain amount of times based on its event rate.

For example, if the average number of people who visit an exhibition on Saturday evening is 210, we can ask ourselves a question like “What is the probability that 300 people will visit the exhibition next week?”

Getting hands-on with Poisson distribution

So far, I’ve covered lots of theory. Now it’s time to delve into the mathematical side of Poisson distribution.

First, let’s consider the formula used to calculate our probabilities. Discrete probability distributions are defined by probability mass functions, also referred to as pmf. In statistics, a probability mass function is a function that gives you the probability that a discrete random variable (i.e., “k”) is exactly equal to some value. So, Poisson distribution pmf with a discrete random variable “k” is written as follows:

P(k events in interval) = e^(-λ) * λ^k / k!

Hang on, don’t run away just yet! Let’s break it down:

  • P(k events in interval) stands for “the probability of observing k events in a given interval”; that’s what we’re trying to find out.
  • “e” is Euler’s number, a mathematical constant with an approximate value of 2.71828.
  • “λ” (lambda) represents the expected number of occurrences. It is also sometimes called the rate parameter or event rate, and is calculated as: events/time * time period.
  • “!” is the symbol used to represent the factorial function. Factorials are products of each whole number from 1 to k. So, in terms of the formula above, the factorial function tells us to multiply all whole numbers from our chosen number down to 1. For example, if “k” is 4, then k! = 1 * 2 * 3 * 4 = 24.

To get a better grasp of how it works, let’s apply the formula to the following example.

The average number of internet failures in a household is 2 per week (“λ”). What is the probability of 3 (“k”) internet failures happening next week? Assuming that these are independent events that occur at a constant average rate and cannot happen simultaneously, let’s fill in the data we have:

P(k; λ) = e^(-λ) * λ^k / k!

= 2.71828^(-2) * 2^3 / 3!

= 0.13534 * 8 / 6

≈ 0.18

Seems like the probability of 3 internet failures happening next week is around 18%, which is not that high.

Calculating formulas manually can be a rather tedious process, and, as a data analyst or a data scientist, it’s highly unlikely that you’ll ever do it as we have above! There are certain tools and computer languages that enable you to analyze your data without having to go through such formulas manually.

One such language is Python, a programming language which is used to create algorithms (or sets of instructions) that can be read and implemented by a computer. We won’t go into detail about Python here; for the purpose of this post, you just need to know that it can be used to simplify the process of calculating a Poisson distribution for a given set of data.

If you’d like to learn more about what Python is, we’ve covered it in detail in this article: What is Python? A Complete Guide.

With that in mind, we’re now going to do the following:

  1. Generate some random Poisson-distributed data with Python
  2. Visualize our data

Generating and visualizing a Poisson distribution with Python

Below, you’ll see a snippet of code which will allow you to generate a Poisson distribution with the provided parameters (mu, which corresponds to λ, and size). In the code snippet itself, you’ll find explanations after the # sign, which is how comments are written in Python.

You can run this code either in your shell after installing Python to your local machine or simply by using the built-in shell at the official Python website.

{% highlight python linenos %}

# import the poisson functionality from the scipy package
from scipy.stats import poisson

# generates a Poisson-distributed discrete random variable
data_poisson = poisson.rvs(mu=2, size=1000)  # mu is λ (lambda)

# will display the size we provided – 1000
len(data_poisson)

# will display the data – [2, 1, 3, 1, 5, … ]
print(data_poisson)
{% endhighlight %}

Now let’s consider how our Poisson distribution might look in visual form. We can plot our data using seaborn, a Python data visualization library based on matplotlib. You can learn more about Python’s various libraries and what they’re used for in this guide.

{% highlight python linenos %}
import seaborn as sns

# creates a histogram-like plot of our data points
ax = sns.distplot(data_poisson, norm_hist=True)
ax.set(xlabel='Poisson Distribution', ylabel='P(k events in interval)')
{% endhighlight %}

A Poisson distribution plot created using seaborn

Here we can see the frequencies of an internet failure happening with event rate λ = 2.

We can also draw the probabilities. Below we see the probabilities of internet failures happening during the week. As we have already calculated, the probability of 3 internet failures happening next week is only 18%.

A probability distribution graph

If you would like to generate your own probability plot and experiment with values and plot parameters, here is the code block below. If you find it difficult to follow, just check out the comments starting with #.

{% highlight python linenos %}
from scipy.stats import poisson
import matplotlib.pyplot as plt

probabilities = []

# defines a distribution object with λ = 2
rv = poisson(2)

# gets probabilities for a number of internet failures
# from 0 to 9 (excl. 10)
for num in range(0, 10):
    probabilities.append(rv.pmf(num))

plt.plot(probabilities, linewidth=2.0)

# adds a point on the plot marking the probability of 3 internet failures
probability_of_3_failures = rv.pmf(3)
plt.plot([3], [probability_of_3_failures], marker='o', markersize=6, color="y")

# formatting
plt.grid(False)
plt.ylabel('P(k events in interval λ)')
plt.xlabel('Number of internet failures')
plt.title('Probability Distribution Curve')

plt.show()
{% endhighlight %}

3. What is the Poisson distribution used for?

Now we know what the Poisson distribution is and what it looks like in action, it’s time to zoom out again and see where the Poisson distribution fits into the bigger picture.

As you know, data analytics is all about drawing meaningful insights from raw data; insights which can be used to make smart decisions. Poisson distributions are commonly used to find the probability that an event might happen a specific amount of times based on how often it usually occurs. Based on these insights and future predictions, organizations can plan accordingly.

For example, an insurance company might use Poisson distribution to calculate the probability of a number of car accidents happening in the next six months, which in turn will inform how they price the cost of car insurance.

Likewise, a call center might use Poisson distribution to predict how many incoming calls they’re most likely to receive throughout the week based on an already known event rate. This could help them to decide how many people to employ for the call center, or how many hours to allocate to each employee.
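
As a rough sketch of that call-center scenario, the snippet below uses SciPy and a made-up event rate of 180 calls per hour to estimate how often the team would face an unusually busy hour.

{% highlight python %}
from scipy.stats import poisson

# Hypothetical planning question: calls arrive at an average rate of
# 180 per hour (λ). How likely is the team to face more than 200 calls
# in a given hour?
rate = 180
prob_over_200 = poisson.sf(200, mu=rate)   # survival function: P(X > 200)
print(round(prob_over_200, 4))

# And the probability of exactly 180 calls, i.e. a "typical" hour:
print(round(poisson.pmf(180, mu=rate), 4))
{% endhighlight %}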

As you can see, the Poisson distribution has many real-world uses, making it an important part of the data analyst’s toolkit.

4. Key takeaways

I’ve now covered a complete introduction to the Poisson distribution. There is certainly a lot more to be explored and plenty more exciting problems to solve, but hopefully this has given you a good starting point from which to continue your journey of discovery!

Before we finish, let’s summarize the main properties of Poisson distribution and the key takeaways from what I’ve covered:

  • Poisson distributions are used to find the probability that an event might happen a definite number of times based on how often it usually occurs.
  • The average number of outcomes per specific time interval is represented by λ and is called an event rate.
  • The events are independent, meaning the number of events that occur in any interval of time is independent of the number of events that occur in any other interval.
  • The probability of an event is proportional to the length of time in question (e.g. a week or a month).
  • The probability of an event in a particular time duration is the same for all equivalent time durations.

To learn more about Poisson distribution and its application in Python, I can recommend Will Koehrsen’s use of the Poisson process to simulate impacts of near-Earth asteroids. For a hands-on introduction to the field of data in general, it’s also worth trying out CareerFoundry’s free five-day data analytics short course.

And, if you’d like to learn more about discrete probability distributions, check out this beginner’s guide to Bernoulli distribution.

You’ll find further articles on the techniques and tools used by data analysts here:
