Waterfall Chart

A waterfall chart (also called a bridge chart) shows graphically the cumulative effect of several variables on a primary metric of interest. For example, suppose your primary metric is the revenue generated by the company (measured in USD). In 2015 the revenue was $200,000, and in 2016 it was $230,000. We are interested in understanding what caused the increase in revenue between the two years. Let’s say our variable of interest is the region (North, South, East, and West). The revenue for each of the 4 regions for 2015 and 2016 is shown in the table below.

fig1

One way to show this data visually is a bar chart comparing the two years, 2015 and 2016, for each region. The resulting graph is shown in the figure below.

waterfall2

A second way to show the same information is using a waterfall or a bridge chart. In this chart, we start with the revenue generated in 2015 and then show the incremental revenue generated by each region. The reformulated table looks as follows:

waterfall3

The resulting waterfall chart is shown below. Regions that add to the revenue (compared to last year) are shown in green while the regions that have performed worse (compared to last year) are shown in red. The starting bar shows the 2015 revenue and the ending bar shows the 2016 revenue.

waterfall4

The benefit of using a waterfall chart is that it can clearly highlight the areas that have done well and areas that need focused improvement.
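
The bar positions in a waterfall chart follow from a running total. The sketch below is one possible way to compute them in Python; the regional changes are hypothetical values chosen only so that they add up to the 2015 and 2016 totals given in the text, and the function name `waterfall_bars` is an assumption made for illustration.

```python
def waterfall_bars(start_label, start_value, deltas, end_label):
    """Return (label, bottom, height, color) for each bar of a waterfall chart."""
    bars = [(start_label, 0, start_value, "gray")]
    running = start_value
    for label, delta in deltas:
        # A floating bar spans from the lower to the higher of the
        # running totals before and after this step.
        bottom = min(running, running + delta)
        color = "green" if delta >= 0 else "red"
        bars.append((label, bottom, abs(delta), color))
        running += delta
    bars.append((end_label, 0, running, "gray"))
    return bars


# Hypothetical regional changes (USD); only the 2015 and 2016 totals
# come from the text.
deltas = [("North", 20_000), ("South", -5_000), ("East", 25_000), ("West", -10_000)]
bars = waterfall_bars("2015", 200_000, deltas, "2016")
for bar in bars:
    print(bar)
```

Each tuple gives the bottom and height of one bar, so it can be passed to any plotting tool that accepts a per-bar offset (for example, the `bottom` argument of Matplotlib's `bar`).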

A waterfall chart can also be used to study the impact of several variables on the primary metric. For example, the following chart shows the impact of several variables on company profitability. Can you determine which areas of the company have gotten worse and need additional focus to improve profitability?

waterfall5

Using Sigma Magic Software

Here are the steps to create the waterfall chart within Sigma Magic software. First add the waterfall chart template to your workbook by clicking on Graph > Waterfall Chart. Next enter the data for the plot as shown in the figure below on the Excel worksheet. There are currently no options that need to be specified for this analysis. Once all the data is entered, click on Compute Outputs to generate the graph.

waterfall6

For more details about this software, please visit http://www.sigmamagic.com/.

Changeover Time Reduction (SMED)

Background

SMED stands for Single Minute Exchange of Dies. It is one of the methodologies of Lean to reduce the changeover time. The name SMED comes from the automotive industry, where dies are used to make car body parts. These machines use large hydraulic forces to press a pair of dies onto sheet metal in order to impart a specific shape to it, such as the front hood of a car. Due to the large forces involved, some of these machines can be quite large, and changing the dies from one shape to another requires significant effort and downtime. In the past, the changeover time could be several days.

If the changeover time is large, every changeover is non-value-added time because the equipment is not productive while it is being changed over. Hence, the number of changeovers is usually minimized, which means these machines end up making hundreds of parts of one type before a changeover to another die is initiated. More parts are made than are immediately required, causing excessive inventory on the factory floor. One of the root causes of large inventories is therefore the high changeover time required to change dies. Taiichi Ohno, the father of Lean, along with Shigeo Shingo came up with a method to reduce the changeover time to less than 10 minutes, that is, to a single-digit number of minutes. Hence, this methodology was named Single Minute Exchange of Dies (SMED).

The techniques developed are not only applicable to dies but can be used anywhere a changeover reduction is required. Don’t get confused by the name either. A successful SMED program could reduce the changeover time from 1 day to less than 1 hour (it does not always have to be in minutes). In some cases, the changeover time could be reduced to 5-10 seconds or less (as with a NASCAR pit crew). This concept does not apply to manufacturing alone. For example, the time between an aircraft landing at an airport and taking off for its next flight can be considered a changeover time. During this changeover, the aircraft has to be refueled and cleaned, all the necessary checks have to be performed, the luggage of the arriving passengers has to be removed, and that of the departing passengers has to be loaded. The longer this changeover takes, the lower the efficiency of the entire process, as the plane is not productive and not generating revenue during the changeover.

Why is a large changeover time bad?

First of all, changeover time is not value adding, because no product is being produced during a changeover. The changeover does help us make the right product for the customer, but if we can reduce the changeover time, we can still accomplish what the customer wants without impacting production too much. Hence, from this perspective, a smaller changeover time is preferable. More importantly, a large changeover time implies that the organization will avoid frequent changeovers in order to maintain its productivity numbers. This means that even if a changeover is required, it may be delayed to maintain efficiency numbers. Running the process for longer than required means that the inventory produced is in excess of what is required; this is one of the wastes of Lean (over-production). In addition, this excess inventory needs to be stored in some location, causing transportation waste, and there is a chance of obsolescence and rework of the parts. Hence, it is clear that a large changeover time is not beneficial for any process. If the changeover time can be reduced, then inventories can be reduced and thus we reduce the waste in the process. Smaller inventory in the process also implies a smaller lead-time and hence a faster reaction to what the customer expects.

Changeover reduction can facilitate the following benefits:

  • Lower cost (since there are fewer NVA activities and lower inventory levels)
  • Smaller lot sizes (since we can have more frequent changeover)
  • Faster response to customer demand

SMED Methodology

We can use the SMED methodology to reduce the changeover time. A typical SMED methodology consists of the following steps:

Step 1: Identify all the activities that are currently being performed during a changeover, including the time for each activity. You may want to videotape the changeover with an on-screen timer to ensure that all activities are captured. In addition, you may want to observe multiple changeovers to capture the variation and to document all possible activities. One thing to watch out for is that the changeover is not altered by the videotaping itself: you should try to capture the as-is situation as close to reality as possible.

Step 2: Classify each activity as either an internal activity or an external activity. Activities that are performed while the equipment is stopped are called internal activities. Activities that are performed while the machine is running are called external activities. For example, if you need to change the wheels of a car, the car has to be stopped while you do it, so that would be classified as an internal activity. Checking the engine temperature or other parameters could possibly be done by the driver while the car is still in motion and hence could be an external activity. The most significant way to reduce changeover time is to convert internal activities to external activities. During a changeover, a number of activities are performed such as:

    1. Allowing the machine tool to cool down (if required)
    2. Getting the required tooling & instruments to do the changeover
    3. Removing the old die (which may be attached using screws, fixtures etc.)
    4. Clean-up of the area as required
    5. Putting the new die (including fixtures, screws etc.)
    6. Calibration of the machine tool as appropriate
    7. Running the machine tool for the first few parts to ensure good quality output
    8. Removing the tools & instruments from the work area

By moving activities that are performed when the machine is stopped to when the machine is still in production, the overall changeover time can be reduced. For example, you could get all the required tooling & instruments for the changeover while the machine tool is still in production. Before a flight lands at an airport, the flight attendants could clean up the aircraft as much as possible while the plane is still in the air; this would be an example of an internal activity that is converted to an external activity. Of course, not all internal activities can be moved to external activities, but we should try to move as many as possible.
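
The effect of converting internal activities to external ones can be illustrated with a small calculation. In the Python sketch below, the activity names follow the list above, but every duration is a hypothetical value, and the choice of which activities can be moved is an assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str
    minutes: float
    internal: bool  # True if the machine must be stopped


def downtime(activities):
    """Changeover downtime = total time of the internal activities."""
    return sum(a.minutes for a in activities if a.internal)


# Hypothetical durations; before SMED every activity is done with the
# machine stopped.
before = [
    Activity("Allow machine tool to cool down", 15, True),
    Activity("Get tooling and instruments", 10, True),
    Activity("Remove old die", 20, True),
    Activity("Clean up area", 5, True),
    Activity("Install new die", 20, True),
    Activity("Calibrate machine tool", 10, True),
    Activity("Run first parts to check quality", 10, True),
    Activity("Remove tools and instruments", 5, True),
]

# Assumed result of Step 2: these can be done while the machine runs.
movable = {"Get tooling and instruments", "Remove tools and instruments"}
after = [
    Activity(a.name, a.minutes, a.internal and a.name not in movable)
    for a in before
]

print("Downtime before:", downtime(before), "min")
print("Downtime after: ", downtime(after), "min")
```

The total work content is unchanged; only the portion done while the machine is stopped shrinks, which is the point of Step 2.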

Step 3: Review all internal activities that are left over to see if they can either be eliminated, made simpler or combined with other steps in order to reduce the total changeover time. Techniques that can be used here are:

  • Eliminate bolts & screws using quick release mechanisms
  • Eliminate adjustments and use standardized settings
  • Eliminate motion by reorganizing the work area
  • Modularize equipment to reduce changeover time.

When performing SMED, don’t ignore the people side. In addition to the above technical elements, significant improvements can also be made by paying attention to the people element. Examples include having clear roles and responsibilities, coaching team members on the concepts of kaizen and continuous improvement so that they always look out for improvement opportunities, and standardizing the work instructions so that work gets done in a consistent and repeatable fashion.

A typical SMED event may last 1-3 days. This should not be seen as a one-time activity. Once a SMED activity is completed and the changeover time has been reduced, we should ensure that the new process is followed for a period of time to standardize the new way of working. In the meantime, new ideas and suggestions may come from team members to further refine and improve the process. Once the process has been stabilized, this would be a good time to initiate a subsequent SMED activity to further reduce the changeover time. Remember that the ideal changeover time is zero.

It is also important to stress that once the changeover time has been reduced, the related processes should also be streamlined; for example, you may need to recalculate the appropriate inventory levels and stock replenishment strategies. Likewise, there is no benefit from reducing the changeover time for a flight if the aircraft then sits idle at the terminal and we don’t capitalize on the reduced changeover time. The shorter the changeover time, the more changeovers you need to plan in your process so that the benefits of reduced changeover times can really be leveraged by the organization.

SMED Software

You can use software such as Sigma Magic to facilitate your changeover analysis. The screenshot below shows the list of activities and classifies them as internal & external. You can perform a time study to capture the changeover time before (with the existing process) and after (after making the improvements to the process using the 3 steps described above).

SMED_Fig1

The output of the analysis software is shown below. It contains the summary of the data you have collected and the improvements you have made to the process along with a control chart which highlights the reduction in changeover time along with information on whether the new process is in control.

SMED_Fig2

References:

[1] Sigma Magic Software: http://www.sigmamagic.com.

Workload Balancing

Workload balance refers to all steps in a process having roughly the same workload so that the flow of work through these process steps is balanced.

Example

Let’s take a simple example of a three-step loan approval process. The first step is done by Jack, who enters all the customer information from a paper document into the computer. He takes on average 10 minutes to do this activity for each customer. He has to enter information such as name, address, phone numbers, social security number, credit details etc. The second step in the process is done by Jill. She has to check the credit rating for each customer using an online portal and enter the credit score in the customer record. She takes roughly 4 minutes to do this activity for each customer. The final step is done by Ron, who decides whether to approve or reject the loan application based on the customer details and then sends a notification to the customer. He takes roughly 6 minutes to do this activity. Let’s also assume that each person does this activity immediately and there is no buffer or work-in-process between steps. If a customer calls in with the details, then after roughly 20 minutes he or she gets a notification of whether the loan is approved or rejected. The total cycle time (20 min) is the sum of the cycle times for each step in the process (10 + 4 + 6) when there is no work-in-process or buffer between steps.

wb_fig1

The jobs processed by each step in the process are shown in the figure below.

wb_fig2

We would say that this process is not balanced, because each step in the process takes a different amount of time. The first step takes 10 minutes, and while Jack is working, Jill has to wait for her activity, so we do not have full utilization of the second resource. Once the first step is done, Jill can start working on her activity. Let’s assume a TAKT time of 10 minutes, which means that on average a new customer calls in every 10 minutes with a loan approval request. In this scenario, Jack is always busy since his activity takes 10 minutes. As soon as he finishes one job, a new customer calls and he has to repeat his activity again, with no rest or break! Jill does her job in only 4 minutes, so she works for 4 minutes on each job and then waits 6 minutes for the next one, a much easier situation than Jack’s. We are not fully utilizing this resource: her utilization is at best 40% (4 min of work in every 10 min). In addition, Jack may feel that he is being overworked while other people in the organization are having it “easy”.
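
The utilization figures can be checked with a few lines of Python. This is a minimal sketch using the cycle times and TAKT time from the example; it takes utilization as cycle time divided by TAKT time, which is how the 40% figure for the second step is obtained.

```python
cycle_times = {"Jack": 10, "Jill": 4, "Ron": 6}  # minutes per application
takt_time = 10  # minutes between incoming loan requests

# With no buffer between steps, the total cycle time is the sum of the steps.
total_cycle_time = sum(cycle_times.values())

# Share of each TAKT interval that a person spends working, and the
# time left over waiting for the next job.
utilization = {name: ct / takt_time for name, ct in cycle_times.items()}
waiting = {name: takt_time - ct for name, ct in cycle_times.items()}

print(f"Total cycle time: {total_cycle_time} min")
for name in cycle_times:
    print(f"{name}: utilization {utilization[name]:.0%}, "
          f"waits {waiting[name]} min per job")
```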

A sample Cycle Time – TAKT Time chart is shown below.

wb_fig3

The red line corresponds to a TAKT time of 10 min. From this chart it can be seen that the cycle time for Jack is equal to the TAKT time at 10 min, while the cycle times for Jill and Ron are less than 10 min. Thus both Jill and Ron have waiting time while Jack is fully busy with his activities. Jack’s process is called the bottleneck process since his cycle time is the largest and this process controls the overall throughput of the system. So, in an hour we can expect that this process will process at most 6 applications (60 min / 10 min per application). The throughput of this entire process can only be improved if the cycle time of the bottleneck process is reduced.

wb_fig4

If all steps of the process consist of value-added activities, then one way to improve throughput is to rearrange the workload between the three steps so that the bottleneck cycle time is reduced. If we take some of the actions being performed by Jack and distribute them to Jill and Ron so that the workload for each step in the process is similar, we call such a process balanced. With the 20 minutes of work split evenly, each step takes about 6.67 minutes, and the throughput of the entire process can increase to 9 applications per hour (60 min / 6.67 min per application). Redistributing the workload to other steps is therefore one option for workload balancing. Other techniques that can be used to reduce the cycle time are SMED activities to reduce the cycle time of the bottleneck process, Kaizen to reduce the cycle times, and visual aids and Standard Operating Procedures (SOP) to minimize variation.
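
The throughput numbers before and after balancing can be reproduced as follows. This Python sketch assumes the idealized case in which the 20 minutes of work can be divided evenly among the three people; in practice the split is limited by how the individual tasks can be reassigned.

```python
def throughput_per_hour(cycle_times):
    """Applications per hour, limited by the slowest (bottleneck) step."""
    return 60 / max(cycle_times.values())


current = {"Jack": 10, "Jill": 4, "Ron": 6}  # minutes per application
bottleneck = max(current, key=current.get)

# Idealized balanced case: the same total work split evenly.
even_share = sum(current.values()) / len(current)
balanced = {name: even_share for name in current}

print(f"Bottleneck: {bottleneck}")
print(f"Current throughput:  {throughput_per_hour(current):.1f} per hour")
print(f"Balanced cycle time: {even_share:.2f} min per step")
print(f"Balanced throughput: {throughput_per_hour(balanced):.1f} per hour")
```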

Benefits

The benefits of workload balance are:

  • All steps in the process have similar cycle times – so employee motivation is high
  • Throughput of the process is maximized
  • There are no points in the process where we need to accumulate inventory
  • Rework is reduced since fewer mistakes are made when the bottleneck process is not overloaded

Requirements

In order to implement workload balancing, we need to ensure that the following items are considered:

  • Each step in the process should be relatively free from quality problems
  • Each step should have equipment that does not break down often
  • Operators should be trained so that they don’t make too many mistakes
  • Operators should be cross-trained to cover absenteeism or any other absences.
  • Incoming work into the process should also be relatively stable with minimal rush orders

How to sustain the 5S program

How much time do you spend searching for stuff? It could be a simple tool to perform an operation or a report that you need to refer to or forward to others. The time you spend searching is non-value adding from a customer’s point of view. It is wasted effort that does not help in transforming a product or service into something a customer desires. In fact, it could make the customer wait longer while you search for the needed tool or report. The idea is that an organized workplace can be an efficient workplace. Some may argue that they can find things perfectly well even in a disorganized area, but what may work for one person may not work for another. In companies and organizations where multiple people share workplaces, an organized workplace works for everyone. It reduces the stress that comes with constantly having to look for the things you need to get your work done.

5S+1 is a lean technique to organize a workplace using a step-by-step approach. The name 5S comes from five Japanese words that describe the process of getting organized. The English equivalents of these words are: Sort, Set in Order, Shine, Standardize, and Sustain. More recently, a sixth S (Safety) was added to this list, and hence some of the literature refers to this as 5S+1. Let’s go through each of these process steps to explain the approach:

  1. Sort: The first step is to sort through the clutter in a work area and identify which items are needed and which are not. This activity can be done periodically (say every year), as items will have accumulated over time in a work area. Sometimes this process step is also called a Red tag event, where a red tag is attached to all items that are not required. Since some of these items may be required in the future, a red tag also has space for notes indicating the reason for this disposition. A red tag room is created to store these red tag items, just in case you have decided to throw away items that were in fact useful. It is a good idea to keep the red tag room locked and have a defined process to get items out of the room and to dispose of the items that are truly not needed.
  2. Set in Order: The second step of this process is to arrange the items that are left over in such a way as to minimize motion waste. Items that are used very often should be kept close to their point of use. Items that are used rarely can be kept slightly farther away. A spaghetti chart can be used to help identify where different items should be placed. The theme behind the step is that there should be a place for everything and everything should be in its place. Once a home for each item is defined, the area is generally marked to make it clear which items go where, so that if any items are missing, it becomes immediately obvious and actions can be taken to trace the item and put it back in its place.
  3. Shine: The third step in the process is to ensure that whatever items are kept in the work area are in pristine working condition. There is no point keeping stuff that does not work! The process of shining the items can help detect if any problems exist with the tools, and corrective actions can be taken to get them repaired so that we don’t have to waste time using tools that are broken. This step also acts as preventive maintenance so that problem areas can be identified before the problem gets out of hand.
  4. Standardize: The fourth step in the process is to ensure that there is a standard way of working. For example, the standards can define how the first three steps of this process are to be done. When should the sort activity happen, who should do it, and how should it be done? What is the process for finding a home for each item, how should the third step be done, and so on. We can also define standards for how an area should look (for example, with a reference picture), so that we can clearly identify whether an area has been following the 5S steps. This step ensures that there is a structured approach to the current way of working.
  5. Sustain: The fifth step in the process is to ensure that this activity is not a one-time activity. How do we ensure that the 5S+1 process is sustained over a period of time? One approach is to have an audit process so that the 5S activity can be rated to see how well an area is being maintained. Without any measurements in place, it would be hard to identify improvements or to tell whether the process is being followed in a structured way in the first place. Other ideas that can help sustain the effort include a rewards and recognition program.
  6. Safety: This is the sixth step in the process (some people may call it the first step!). A safe workplace is an effective workplace. No business can continue to run profitably if it does not provide a safe working environment for its employees. Safety issues can result in lost time and productivity in the workplace, increased overhead costs due to medical expenses, and possible lawsuits against the company. An unsafe workplace will also result in increased insurance costs for the company. More importantly, it is just bad business: we need to take care of our employees, customers and anyone else we may deal with as a part of our business. In this step, we review all possible near misses and take steps to eliminate or minimize them so that the number of accidents in the workplace can be reduced.

One tool we can use to check the effectiveness of the 5S+1 program is a 5S+1 Audit. An example form that can be used to audit the workplace is available in the Sigma Magic templates, which you can access by selecting Lean and then 5S+1 Audit. A sample screenshot of this template is shown below.

fig1

The top section contains standard information about the area that is being audited, such as the team name, team members, date of the audit, and any other observations by the auditors. It is usually recommended that the audit process follow a structured approach, with clear guidelines on who is on the audit team, when the audit will take place, and how the scores will be used, and that these guidelines are shared with the organization. It may also be a good idea to include other lean team members (peers) in the audit process so that best practices can be shared between teams. It may also be a good idea to track a team’s audit scores over time to see if the area is getting better, staying consistent, or getting worse. Based on this trend, appropriate corrective actions may be required.

fig2

The first three steps of the 5S+1 audit are shown above. This audit template was developed for a manufacturing or warehouse area and may need to be suitably modified for other areas/processes. Each element can be rated on a scale of 0-10. If there are multiple auditors, each can give a separate rating and the ratings are averaged, or the members can discuss among themselves before entering a common score on the template. The second approach is preferred. Of course, there is no need to spend a lot of time debating whether the score should be a 4 or a 5. Just go with one number! If the ratings are significantly different, say one team member says 2 and another says 8, then each needs to explain the reasoning for their rating so that, hopefully, a common agreement can be reached.
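
The handling of ratings from multiple auditors can be sketched in Python as follows. The function name `review_ratings` and the `max_spread` threshold of 3 points are hypothetical; the text only says that "significantly different" ratings should be discussed, so the threshold is an assumption to be set by the audit team. The average is offered as a starting point for the discussion, not as a replacement for the agreed common score.

```python
from statistics import mean


def review_ratings(ratings, max_spread=3):
    """Summarize 0-10 ratings from several auditors for one audit element.

    max_spread is a hypothetical threshold: a larger gap between the
    lowest and highest rating is flagged for discussion.
    """
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    spread = max(ratings) - min(ratings)
    return {"average": mean(ratings), "needs_discussion": spread > max_spread}


print(review_ratings([4, 5]))  # close ratings: just pick one number
print(review_ratings([2, 8]))  # far apart: auditors should explain their reasoning
```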

The second set of three steps of the 5S+1 audit is shown below. Once the team “walks-through” the area and provides a rating for each of these elements, an overall score and observations of the audit team can be shared with the area and a print out of the score can be shared with the team which they can use for future improvements. As a part of reinforcing the right behaviors, we should always take the time to recognize and appreciate efforts taken by the team in improving the 5S scores. Also, ensure that the audit team is trained to avoid grade inflation and any partiality between teams. You may want to calibrate the scores given by the audit team on a random basis just to ensure that the process is working in a robust manner.

fig3

In some companies, the 5S+1 audit performance scores are linked to the incentive program. This provides another motivation for the team to keep their work areas neat and tidy. Sufficient communication should be provided to the organization to ensure that this exercise does not become a paper exercise and people see value in keeping their areas organized.

In summary, an organized workplace is an effective and efficient workplace. 5S+1 is a structured approach to keep the workplace organized. In order to sustain the program, a 5S+1 audit should be instituted so that this program can continue to run in a sustainable manner for a long time.

Feel free to share your thoughts on the 5S+1 approach and the 5S+1 audit. What works for you and what doesn’t?

Hypothesis Testing Introduction

Introduction

Most of the time we work with samples and not populations, since it is usually too time consuming and expensive to collect population data. A sample is a subset of the population data, and if the sample is taken at random it can fairly be assumed to be representative of the population. Sometimes we need to compare the means of two populations using a sample drawn from each. For example, let’s say we want to compare the average height of all men and women in the country. We draw a random sample of men and a random sample of women from that country. We can calculate the sample average for men and for women and compare them. However, this alone may not answer the question about the population, since the next time you draw a random sample you will most likely get different numbers. So, the question now is what conclusions we can draw about the average height of men and women for the entire population. In order to answer these types of questions, we need to use hypothesis testing. Hypothesis testing belongs to a class of statistical techniques called inferential statistics, because we are inferring about the population from the sample data set.
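
The point that each random sample gives a different average can be seen in a short simulation. The Python sketch below uses a made-up population of heights (mean 170 cm, standard deviation 8 cm, both hypothetical) and draws two random samples of 30 from it; it illustrates sampling variation only and is not a hypothesis test.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of heights in cm (values are made up).
population = [random.gauss(170, 8) for _ in range(10_000)]

# Two independent random samples from the same population.
sample_a = random.sample(population, 30)
sample_b = random.sample(population, 30)

print(f"Population mean: {mean(population):.1f} cm")
print(f"Sample A mean:   {mean(sample_a):.1f} cm")
print(f"Sample B mean:   {mean(sample_b):.1f} cm")
```

Both samples come from the same population, yet their averages differ, so a difference between two sample averages does not by itself show a difference between the populations.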

When to use hypothesis testing?

Whenever we need to make decisions based on data and we are working with sample data rather than population data, we need to use hypothesis testing. Hypothesis testing considers the variation that exists in the data set before extrapolating and drawing conclusions about the population. Of course, we can always make a mistake when we extrapolate from a sample to a population, but with hypothesis testing it is possible to control the risk of such errors. In one of our subsequent modules, we will cover how to control the errors during hypothesis testing. If you don’t use hypothesis testing and only draw conclusions based on the sample results, you may draw the wrong conclusions.

Hypothesis Methodology

The following methodology needs to be used anytime you want to perform hypothesis testing:

  1. What is the practical question you are trying to answer?
  2. Convert the practical question to a statistical question (formulate your hypothesis statements)
  3. Determine the hypothesis test that needs to be used
  4. Determine the sample size required to control your alpha and beta errors
  5. Collect a random sample of the required data
  6. Perform the test and get the confidence intervals and P-values
  7. Draw statistical conclusions
  8. Convert the statistical conclusions to a practical conclusion to obtain an answer to the question

Hypothesis Formulation

In this article, we will discuss how to formulate the hypothesis statements. In general, there are 2 hypothesis statements. One of them is called the null hypothesis (Ho) and the other is called alternative hypothesis (Ha). When we initially formulate these statements, we usually don’t know what the answer will be. The data we collect later in the process will help us select whether we need to conclude Ho or Ha.

There are three rules to consider when you are writing the hypothesis statements:

RULE 1: The first rule is that the hypothesis statements are always about the population parameters and not about the sample. We know the exact statistics of the sample data we collect, and hence we don’t need to make any hypotheses about them. What we do not know for a fact are the parameters of the population. Hence, we always make hypotheses about the population. For example, suppose we are interested in the average height of a person in a country (say Brazil) and our hypothesis is that the average is equal to 6 feet. We can represent the null hypothesis as follows:

hyp1

Note that we use the Greek letter μ (mu) to represent the population average. If we are working with a sample of, say, 20 data points, we use the Latin letter x̄ (x-bar) to represent the sample statistic. The wrong way to make the hypothesis statement is the following:

hyp1w

RULE 2: The second rule is that the equality sign belongs to the null hypothesis. We create two hypotheses each time, the null (Ho) and the alternative (Ha). Only one of them can be true at any given time: if Ho is true then Ha is false, and vice versa. When you are creating the hypothesis statements, you need to ensure that the equality sign is always allocated to the null hypothesis. For example, if we want to test whether the average height is 6 feet, then one possibility for the hypothesis statements is: average height = 6 feet (Ho) and average height is not equal to 6 feet (Ha). This can be written mathematically as follows:

hyp2

The wrong way to write these hypothesis statements is:

hyp2w

Of course, there are more possibilities of writing these statements depending on what you want to prove or disprove. For example, if you want to show that the average height of a person in Brazil is less than six feet then the corresponding hypothesis statements are as follows:

hyp4

In all these statements, you will find that the equality sign has been allocated to Ho.
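
Written out in LaTeX, the statements described under Rules 1 and 2 would look roughly as follows. This is a reconstruction from the surrounding text (heights in feet), not a copy of the original figures; in particular, writing the one-sided null hypothesis with a "greater than or equal" sign is one common convention, and some texts keep a plain equality there.

```latex
% Rule 1: hypotheses are about the population mean \mu, not the sample mean \bar{x}
H_o: \mu = 6 \text{ ft}        % correct
H_o: \bar{x} = 6 \text{ ft}    % wrong: refers to the sample

% Rule 2: the equality sign belongs to the null hypothesis
H_o: \mu = 6 \text{ ft}, \qquad H_a: \mu \neq 6 \text{ ft}    % correct
H_o: \mu \neq 6 \text{ ft}, \qquad H_a: \mu = 6 \text{ ft}    % wrong: equality placed in H_a

% One-sided case: show that the average height is less than 6 ft
H_o: \mu \geq 6 \text{ ft}, \qquad H_a: \mu < 6 \text{ ft}
```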

RULE 3: The third rule is that we continue to believe in the null hypothesis (Ho) if we don’t have enough facts and data. Only with sufficient data can we reject the null hypothesis and accept the alternative hypothesis. Hence, what we want to prove is usually part of the alternative hypothesis, and the null hypothesis contains the status quo (no change). The null hypothesis can be thought of as a statement of no difference or zero difference. For example, suppose we want to prove that providing training improves the average productivity of an organization. The null hypothesis would be that the average productivity is the same whether we provide training or not. The alternative hypothesis would be that training improves productivity. If we don’t have sufficient data, our conclusion would be that we cannot show that training has an impact on productivity.

hyp5

Note that if we are hypothesizing about the average value of a property, we use the Greek letter mu to denote the population average. If we are hypothesizing about the variation of a value, we usually use the Greek letter sigma, which stands for standard deviation. Finally, if we are hypothesizing about discrete values (say number of defects), we usually use proportions (denoted by p). A proportion of defects or defectives is the number of defects divided by the total sample size. For example, if we have 50 items of which 4 are defective, then the proportion is 4/50 = 0.08.

Examples

The following examples help illustrate how to write the right hypothesis statements. Try to write out the hypothesis statements for yourself first and then compare your answers to the ones provided below.

  1. Prove that the average salary in company A for entry level employees is greater than the average salary in company B.
  2. Prove that the average salinity in the sea water is greater than 35 g/L.
  3. Show that the variation in delivery times for supplier A is different from variation in delivery times for supplier B.
  4. Show that the average number of footfalls for four department stores, A, B, C, and D are different.
  5. Show that the manual process of recording exam marks has more defects compared to the automated process for recording exam marks.

Solutions to Examples

Here are some sample solutions to the examples shown above.

hyp_table

Exercise

Try to come up with the right hypothesis statements for the following problems.

  1. An engineer comes up with a recommendation to improve the productivity of a generator. Data was collected for 10 days with the old way of working and for 10 days with the modification to the generator. Based on this analysis, we want to prove that the modification in fact increases the productivity of the generator.
  2. It was hypothesized that employees who work for longer than 10 hours per day make more quality defects compared to employees who work less than 10 hours per day.
  3. We want to show that the breaking strength of a material supplied by the old supplier is significantly lower than the breaking strength of a similar material supplied by a new supplier.

Histogram

Introduction

A histogram is a graphical representation of data. It is used for continuous data when you want to determine the nature of the distribution of the data. It gives you an idea of the central location of the data (roughly where the mean and median are located) and the amount of variation you have in your data (roughly the minimum and maximum values). By viewing the distribution of the data, you get an idea of whether the distribution is unimodal (has one peak), bimodal (has two peaks), or multimodal (has many peaks). It tells you if the distribution is symmetric or non-symmetric. If the distribution is not symmetric, does it have a long left tail or a long right tail? The shape of the distribution also gives you a clue as to whether the distribution is close to a normal distribution or any other shape you may be interested in (such as uniform or exponential). Hence, whenever you collect continuous data, it is always a good idea to plot the histogram, which will give you more information about the nature of the data you have collected.

How to build a histogram

Step 1: Determine the bins

To construct a histogram, we first divide the data into “bins”. The bins are usually consecutive, non-overlapping intervals of equal size. There is no single right way to determine the bin size. If the bin size is too large, the shape of the histogram is not clearly shown because many of the bars get grouped together. If the bin size is too small, there may not be enough data points within each bin, and you may find a lot of “gaps” or empty bins that make it hard to visualize the shape of the histogram. Hence, we should always try varying the bin size to see whether our interpretation of the shape of the histogram changes.

If you know the number of bins (# bins) you want, you can determine the bin size simply by using the formula:

hist_fig1

Usually, we do not know the number of bins in which case you can try one of the commonly used formulae which depends on the number of data points (n):

hist_fig2
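
The formulas in the two figures above are not reproduced here, so the following is only a sketch under stated assumptions. It assumes the bin size is the data range divided by the number of bins, and it shows two commonly used rules for choosing the number of bins from n (the square-root rule and Sturges' rule). These may or may not be the rules shown in the figure, and the function names are hypothetical.

```python
import math


def bin_size(data, num_bins):
    """Bin width when the data range is split into num_bins equal intervals (assumed formula)."""
    return (max(data) - min(data)) / num_bins


def bins_square_root(n):
    """Square-root rule: about sqrt(n) bins, rounded up."""
    return math.ceil(math.sqrt(n))


def bins_sturges(n):
    """Sturges' rule: about 1 + log2(n) bins, rounded up."""
    return math.ceil(1 + math.log2(n))


# Small illustrative data set with 12 values
data = [1, 4, 3, 2, 5, 6, 3, 2, 4, 3, 2, 1]
n = len(data)

print("square-root rule:", bins_square_root(n))
print("Sturges' rule:", bins_sturges(n))
print("bin size for 5 bins:", bin_size(data, 5))
```

Different rules generally suggest different bin counts, which is one more reason to try several bin sizes before interpreting the shape.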

Step 2: Place the data in the bins

Go through the data and determine which bin each data point falls into. For each data point that falls into a bin, increment the count for that bin by 1. For example, if the data is 1, 4, 3, 2, 5, 6, 3, 2, 4, 3, 2, 1 and the bins are as follows (bin size = 2), the following table is obtained after placing all the data into bins.

hist_fig3

At the end of the exercise, the sum of all the values in the frequency column should be equal to the number of data points you started off with.
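
The counting procedure can be sketched in a few lines of Python. The bin boundaries in the table above are not reproduced here, so this example assumes three bins of size 2 starting at 1 (1 to <3, 3 to <5, 5 to <7). The function name and the boundaries are assumptions for illustration.

```python
def bin_counts(data, start, size, num_bins):
    """Count how many values fall into each equal-width bin [low, low + size)."""
    counts = [0] * num_bins
    for value in data:
        index = int((value - start) // size)  # which bin this value belongs to
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts


data = [1, 4, 3, 2, 5, 6, 3, 2, 4, 3, 2, 1]
counts = bin_counts(data, start=1, size=2, num_bins=3)

for i, count in enumerate(counts):
    low = 1 + 2 * i
    print(f"{low} to <{low + 2}: {count}")

# Check from the text: frequencies must add up to the number of data points
print("total:", sum(counts), "of", len(data))
```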

Step 3: Plot the histogram

Plot the bins on the horizontal (or vertical) axis and the frequency on the vertical (or horizontal) axis. Depending on your objective, you may also want to superimpose the curve of a theoretical distribution on the histogram of the raw data in order to compare the theoretical shape with the raw data.

hist_fig4

Step 4: Interpret the data

Once you have plotted the histogram, look for the following items:

  1. Minimum value of the data. From the above histogram it can be seen that the minimum value is roughly close to 0 (note that it may not be possible to get the exact minimum value from a histogram since we are binning the data).
  2. Maximum value of the data. From the above histogram, the maximum value is roughly close to 10 (note that it may not be possible to get the exact maximum value from a histogram since we are binning the data).
  3. Range = Max – Min value. From the above histogram, the range is roughly 10 (note that it may not be possible to get the exact range from the histogram since we are binning the data).
  4. Is the distribution symmetric (about the center)? If not, is it left-tailed or right-tailed? From the above histogram, we may conclude that it is roughly symmetric. We would need more data / bins in order to really determine if the histogram is symmetric.
  5. Does the distribution follow the shape you have postulated prior to plotting the histogram? From the above histogram, it is hard to determine if the data follows, say a normal distribution. We usually need at least 30 data points to determine if a histogram follows a particular distribution.
  6. Are there any gaps in the histogram? From the above histogram, there do not seem to be any gaps in the data as all bars are placed contiguous to each other.
  7. Are there any outliers in the histogram? From the above histogram, there do not seem to be any outliers in the data as no bars are located significantly far from the rest of the histogram.

Example

Plot the histogram for the sunspot data. The data for the sunspot cycles can be obtained at: https://www.windows2universe.org/teacher_resources/suncycle_sheet.html.

A possible answer to the question is shown below for data from 1941-2000. From the figure below, we can conclude that the number of sunspots does not necessarily follow the normal distribution. It ranges from 0 to 220 sunspots per year.

hist_fig5

Reference

All the charts in this article were created using the Sigma Magic software. Visit http://www.sigmamagic.com.

Surface & Contour Plots

What is a contour plot?
If you have a response variable (say z) which is a function of one input variable (say x), then we can represent this mathematically as z = f(x). We can plot the variation in z with variation in x in a 2-dimensional line graph. However, if the response variable z is a function of two input variables (say x and y), then we can represent this mathematically as z = f(x, y). In order to plot the variation in z, we now need three dimensions: changes in x are shown on one axis, changes in y on a second axis, and the resulting changes in z on a third axis. Such a plot is called a 3-d surface plot. A sample 3-d surface for z = sin^2(x) + cos^2(y) is shown below.

contour1

Sometimes, we may be interested in determining values of x and y which give the same or constant z values. This sort of plot, which is typically drawn on a 2-d plane, is called a contour plot. The lines that represent constant z values are called iso-lines. For example, on a weather map for a given area, we may be interested in determining areas of high and low pressure. We would typically use a contour plot to represent such a figure. The lines of constant pressure are also called isobars. A sample 2-d contour plot for z = sin^2(x) + cos^2(y) is shown below.

contour2

Contour plots provide visual clues on what values to select for X and Y in order to optimize the response function.

How to create contour and surface plot in Sigma Magic?
The contour and surface plot can be created within the Sigma Magic software by clicking on Contour Plot within the Graph menu. Click on Update Inputs to open the dialog box. Specify the name of the X axis variable (without any spaces or special characters), the minimum and maximum values of X and the number of increments to generate intermediate values of X. Similarly, specify the parameters for the Y axis. Note that the name of the Y variable should be different from the X variable. Finally, define the response variable name and the function model equation Z= f(x,y).

contour3

When you click on the OK button, Sigma Magic will generate values for the X and Y variables and the response Z variable using the specified model equation. If no equation is specified, you can also define the equation in the Z column on the worksheet. This will automatically create the 3-d surface plot. If you need the contour plot, you will have to select the graph and change the graph type to contour plot using Excel functionality.
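
To make the idea of generating intermediate values concrete, here is a minimal sketch in plain Python of how a grid of X, Y and Z values could be built from a minimum, a maximum, a number of increments and a model equation. It is not Sigma Magic's implementation; the function names and the model z = x^2 + y^2 are assumptions chosen only for illustration.

```python
def axis_values(minimum, maximum, increments):
    """Evenly spaced values from minimum to maximum, split into `increments` steps."""
    step = (maximum - minimum) / increments
    return [minimum + i * step for i in range(increments + 1)]


def response_grid(model, x_values, y_values):
    """Evaluate z = model(x, y) at every (x, y) combination; one row per y value."""
    return [[model(x, y) for x in x_values] for y in y_values]


def model(x, y):
    """Hypothetical model equation, used only for illustration."""
    return x ** 2 + y ** 2


x_values = axis_values(-2, 2, 4)   # -2, -1, 0, 1, 2
y_values = axis_values(-2, 2, 4)
z_grid = response_grid(model, x_values, y_values)

for y, row in zip(y_values, z_grid):
    print(y, row)
```

A surface plot draws these z values as heights over the X-Y grid, while a contour plot joins the grid locations that share the same z value.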

How to interpret contour plot?
It is usually easier to interpret a 3d surface plot compared to a 2d contour plot. However, there are a few things to consider when looking at a contour plot.

  1. Contour plots can indicate peaks or valleys within the range of X and Y at the center of concentric shapes.
  2. If the contour lines are spaced close to each other, then the z values change rapidly, while if the contour lines are spaced far apart, then the z values change more slowly.
  3. If there are multiple sets of concentric shapes within the figure, then the figure usually points to a function with multiple peaks or valleys (multi-modal).
  4. Contour plots which contain no curves but straight lines may indicate a ridge shaped function or a planar surface (linear model).

Probability Plot

The probability plot is a graphical method of determining if the data follows a given distribution. One of the common ways of creating the probability plot is called a Quantile-Quantile (Q-Q) plot. The idea behind this plot is that if the data follows a certain distribution, then the quantiles of the given data should match the quantiles of the distribution. If the quantiles of the distribution are plotted on the X-axis and the quantiles of the data are plotted on the Y-axis, then the Q-Q plot should lie close to the 45 degree line when the data follows the given distribution.

Let’s first clarify what we mean by a quantile. Quantiles are the boundary values that divide the data into equal-sized subsets. For example, if we divide the data into 2 halves, half the values are less than the quantile and half the values are greater than the quantile. This quantile is also called the median (or the 2-quantile). If we divide the data into 4 equal buckets, 25% of the data are less than Q1, 25% lie between Q1 and Q2, 25% lie between Q2 and Q3, and 25% are greater than Q3. Q1, Q2, and Q3 are called quartiles (or the 4-quantiles). If we divide the data into 100 equal buckets, the quantile values are called percentiles, and so on. In general, q-quantiles break up the data into q equal-sized sets.
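
As a small illustration of these definitions, the sketch below uses Python's standard `statistics` module to compute the median and the quartiles of a hypothetical data set. Note that several conventions exist for interpolating quantiles, so other tools may report slightly different values for the same data.

```python
import statistics

# Hypothetical data set, used only for illustration
data = [1, 2, 3, 4, 5, 6, 7]

median = statistics.median(data)                 # the single 2-quantile
quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3 (the 4-quantiles)
percentiles = statistics.quantiles(data, n=100)  # 99 cut points (percentiles)

print("median:", median)
print("quartiles:", quartiles)
print("number of percentile cut points:", len(percentiles))
```

Dividing the data into q sets always needs q - 1 boundary values, which is why there are 3 quartiles and 99 percentiles.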

A Q-Q plot can be used to get an idea of the location (central value), scale (spread) and skewness (asymmetry) of the distribution. It is a more powerful approach than just looking at a histogram. It is always a good idea to plot the data and understand the nature of the distribution rather than looking only at P-values from a goodness of fit test. Here are some common interpretations of the Q-Q plot:

  1. If the Q-Q plot is flatter than the 45 degree line (Y=X), then the spread of the distribution plotted on the horizontal axis is greater than the spread of the distribution plotted on the vertical axis.
  2. If the Q-Q plot is arced or S-shaped, then one of the distributions is more skewed than the other, or one of them has heavier tails than the other.
  3. The intercept of the regression line is a measure of the location and the slope of the line is a measure of the scale of the distribution.
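
The points of a normal Q-Q plot, and the intercept and slope mentioned in the list above, can be sketched as follows. This is only one possible construction: it assumes the plotting positions (i + 0.5)/n for the theoretical quantiles and an ordinary least-squares line, and the function names and sample data are hypothetical.

```python
from statistics import NormalDist, mean


def normal_qq_points(data):
    """Pair each sorted data value with the matching standard normal quantile."""
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def fit_line(points):
    """Least-squares line through the Q-Q points; returns (intercept, slope)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical measurements, used only for illustration
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 3.9, 5.6, 4.5, 5.0, 4.7]
intercept, slope = fit_line(normal_qq_points(sample))
print("location estimate (intercept):", round(intercept, 3))
print("scale estimate (slope):", round(slope, 3))
```

When the theoretical quantiles come from the standard normal distribution, the intercept approximates the mean of the data and the slope approximates its standard deviation.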

How to pick the right chart

A picture, they say, is worth a thousand words! Whenever you collect data to analyze a problem or to quantify a process, you should always consider plotting the data to see what the data is actually telling you. Of course, you need to pick the right chart, as there are many different ways to plot the data. This module gives a high level overview of which chart to pick for a given data set.

The chart you pick depends on the type of data you have and the objective of plotting the data – what is the question you are trying to answer? There are separate charts for continuous data and separate charts for discrete data. The following flow charts can help you pick the right chart or charts to analyze your data set.

  • If you have continuous data and you want to determine if a process is stable, then you can pick the time series chart, run chart or the control chart. A time series chart can help you visually determine if a process is stable, while run charts and control charts give you specific markers or indicators to tell you if a process is stable.
  • If your data was collected in groups – for example, sales data was collected for different regions of the country (North, South, East, and West) and you want to compare the sales data by region, you could use the box plot, dot plot or the individual value plot.
  • If you want to determine the nature of your distribution, for example to check if the data follows a normal distribution, then you could use either the probability plot or the histogram. For a more statistical analysis, you would perform the normality test.
  • If you are more interested in where the variability is coming from, for example you have too much variation in the manufacture of a product and you want to know if it is coming from the machine that is making it, the time of the day the product is being made, the operator who is making the product etc., then you could use either the Box Plot or the Multi-Vari chart.
  • If you have two sets of data and want to determine if they are correlated, for example if you have humidity values and product quality readings and you want to check if there is any correlation between humidity and product quality, you could use the scatter plot to understand this correlation.
  • If your data is discrete and you want to compare various groups, then either a bar chart or a pie chart would be useful. A bar chart focuses on the absolute value of the data, while a pie chart focuses on the relative value or proportion of the data since the total adds up to one circle (100%).
  • If you have a lot of categories in your data and want to determine the vital few from the trivial many (80-20 rule), then you could use the Pareto chart to identify the most important factors or causes.
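
The guidelines above can be summarized as a simple lookup from objective to candidate charts. The sketch below only restates the list in code form; the objective labels and the function name are hypothetical.

```python
# Lookup table restating the guidelines above; the keys are hypothetical labels.
CHART_GUIDE = {
    "process stability": ["time series chart", "run chart", "control chart"],
    "compare groups (continuous)": ["box plot", "dot plot", "individual value plot"],
    "distribution shape": ["probability plot", "histogram"],
    "sources of variation": ["box plot", "multi-vari chart"],
    "correlation": ["scatter plot"],
    "compare groups (discrete)": ["bar chart", "pie chart"],
    "vital few": ["Pareto chart"],
}


def suggest_charts(objective):
    """Return the candidate charts for an objective, or an empty list if unknown."""
    return CHART_GUIDE.get(objective, [])


print(suggest_charts("correlation"))
print(suggest_charts("process stability"))
```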

It is always recommended that you should chart your data whenever you collect data in order to interpret what the data is telling you. However, you should be aware that making interpretations based on just the chart alone can be prone to mistakes – to make an actual decision you will need to perform additional statistical analysis. The chart should be used to identify potential trends which can be validated with further analysis. The chart should also be used to ensure that the analysis that you perform is correct and you are not misinterpreting the results or drawing the wrong conclusions.

To create any of the charts mentioned in this article, refer to http://www.sigmamagic.com/.

charts_selection

Scatter Plot

A scatter plot is used to visually determine if there is any relationship between two variables (say X and Y). This plot is usually used between two continuous variables. However, a scatter plot can also be drawn if the variables are not truly continuous, for example when a variable is discrete ordinal data with more than 2 categories. A scatter plot is a visual way of looking for a relationship and thus may be subjective. If a more objective way of establishing a relationship is required, then we can use a statistical tool like Regression analysis.

Data Collection

In order to create a scatter plot, we need to collect the data in pairs. For example, let’s say we want to determine if there is a relationship between the electricity consumption in a city and its population. The electricity consumption is measured in kWh, which is continuous data, while the population is not truly continuous but can be approximated as continuous since there are a lot of possible values. In order to collect the data for this analysis, we may choose, for example, 20 different cities and determine the population for each of these cities along with its power consumption for a given period (say a month).

Creating the Plot

In order to create the scatter plot, we plot one variable (say population), also called the explanatory variable, on the horizontal or X axis and the response variable (say power consumption) on the vertical or Y axis. Each data point on the scatter plot indicates one city. We would not explicitly identify the names of the cities as they are incidental to our analysis. Of course, we can show them as well if we like by adding labels.

A scatter plot can also be created by grouping the data. For example, we can identify some cities as belonging to a developing nation and some cities as belonging to an advanced nation. We can then plot each group in a different color to see if the relationship between these two variables is similar for the different groups. This lets us compare the relationship across groups.

Analysis

If there is no relationship between the two variables, then we would find that the points are randomly distributed across the X-Y space with no apparent pattern. If there is a linear relationship between the two variables, then we would find that the individual points on the scatter plot fall close to a straight line. The slope of the line could be positive or negative. A positive slope would indicate that as the population increases the energy consumption also increases. A negative slope would indicate the opposite (which is not what we would expect for this example). An example scatter plot is shown in the figure below.

scatter plot f1

However, depending on the type of data we are analyzing, we could find either slope. In some cases the same pair of variables may have a positive or a negative slope depending on the situation. For example, if we relate the height of a group of people to their age, the plot may have a positive slope for data collected among the younger population (say between 5-15 years of age), no correlation for the adult population (say between 25-40 years of age), and a negative slope for the older population (say between 50-70 years of age, due to factors like bone loss and osteoporosis).
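
Since reading a scatter plot is subjective, the direction and strength of a linear relationship are often summarized with the slope of a fitted line and the Pearson correlation coefficient. The sketch below is a minimal illustration of that idea; the city figures are made-up values, not real data, and the function names are hypothetical.

```python
import math


def slope_and_correlation(xs, ys):
    """Return the least-squares slope and the Pearson correlation of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Made-up values for six cities: population (thousands) and monthly consumption (GWh)
population = [120, 250, 400, 560, 730, 900]
consumption = [35, 70, 118, 160, 205, 260]

slope, r = slope_and_correlation(population, consumption)
print("slope:", round(slope, 3))        # positive: consumption rises with population
print("correlation:", round(r, 3))      # close to +1 or -1 means nearly linear
```

A correlation near zero corresponds to the randomly scattered pattern described above, while the sign of the slope matches the direction of the trend.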

Reference

Use the Sigma Magic software to actually create a scatter plot. Scatter Plots in Sigma Magic software can be found under Graphs > Scatter Plot.