Overview of Cluster Analysis


Cluster analysis is the process of grouping a set of items such that items in the same group (called a cluster) are more similar to each other than to items in other groups. The process of assigning particular objects to each cluster is called classification. Cluster analysis groups data based only on information found in the data itself that describes the objects and their relationships. It is useful in the exploratory phase of data analysis to understand the nature and structure of groups that may be present in the data. Here are some examples of cluster analysis in the real world:

  • Biologists have classified all living things into various families. All cat species belong to the family Felidae. Animals that belong to one family share more characteristics with each other than with animals from another family.
  • The World Wide Web (WWW) contains billions of web pages, and any search would probably return thousands of results. Clustering can be used to group similar search results together, which helps users make sense of them.
  • Businesses may collect large amounts of information on customers. Clustering can be used to understand the differences between customers and group them into a smaller number of segments for targeted marketing activities.
  • If you are trying to find the optimal location for a pizza restaurant chain, finding the nearest neighbors and grouping the population into clusters can help determine the best possible locations.

Types of Clusters

Clusters can be classified into the following types:

  • Hierarchical clusters – In this case, the clusters are nested: one cluster can contain multiple sub-clusters underneath. The opposite is partitional clustering, in which the data items are divided into separate, non-overlapping clusters. The classification of animals into families is an example of hierarchical clustering.
  • Clusters can also be divided into exclusive clusters, where each data item is assigned to exactly one cluster; overlapping clusters, where a data item may belong to multiple clusters; and fuzzy clusters, where every data item belongs to every cluster with some membership probability. For example, in web search, some topics may be classified under math, some under science, and some under both categories.
  • Cluster analysis can also be broken up into complete vs. partial clustering. A complete clustering assigns every object to a cluster, while in partial clustering some items may not belong to any identified cluster and are excluded from the classification. For example, when classifying business customers, some may be classified as key customers and some as non-key customers, while for a third set the data may not yet be sufficient to place them in either category, so they are left out of the analysis for the time being. This is partial clustering.

Clustering Techniques

There are a number of algorithms that can be used to perform cluster analysis. The two main algorithms are K-means and agglomerative hierarchical clustering.

  • K-means Clustering: This is a prototype-based, partitional clustering method. The algorithm requires that we know the number of clusters (K) upfront. The initial K centers are chosen at random; the algorithm then assigns each data point to its closest center, recalculates each centroid from the points assigned to it, and repeats the process until the centroids stop moving appreciably. The end result is the location of the K centers and the data items assigned to each cluster.
  • Agglomerative Clustering: This produces a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters. After each merge, the proximity matrix is updated to reflect the proximity between the new cluster and the remaining clusters, and the process repeats until only one cluster remains. Since each cluster contains multiple points, there are several ways the distance between clusters can be defined (for example, single, complete, or average linkage).
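As an illustration of the K-means loop just described, here is a minimal sketch in Python with made-up one-dimensional data. This is only a teaching sketch; production libraries such as scikit-learn add smarter initialization and convergence criteria.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means for 1-D data: pick K random centers, assign
    each point to its nearest center, recompute the centroids as
    cluster means, and repeat until the centroids stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # centroids stopped moving
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of made-up readings
data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.3]
centers, clusters = kmeans(data, k=2)
```

For the two well-separated groups above, the loop settles on centroids near 1.0 and 10.1 after a few iterations, regardless of which points happen to be drawn as the initial centers.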


Each of the algorithms described above has some limitations:

  • K-means clustering is simple and efficient and can be used for a wide variety of data types. However, it cannot handle non-globular clusters or clusters of very different sizes and densities, and it can have trouble with data that contain outliers.
  • Agglomerative hierarchical clustering is expensive in terms of computation and storage but can produce better-quality clusters. It can have trouble finding the right clusters when the original data set is noisy or high-dimensional, such as document data.

Control Charts – things to consider before deployment

Let’s say that you have come up with a new process and deployed it in your company. The process could be anything – from a simple process for how you sell your products to your customers or how you enter orders into your system, to a complicated process for manufacturing and assembling a critical component to tight tolerances. How do you ensure that your process continues to run the way it was designed? One option is to design the process in such a way that mistakes cannot happen in the first place – say through automation or some other means. If you are 100% sure that there is no chance for errors to creep in, then you don’t have to worry about your process; you can let it run, assured that it will perform as intended. However, the real world is not so forgiving. If the process contains human elements, mistakes can happen; and if we are relying on automation, the automation may fail under certain conditions. Hence, if it is critical that the process perform as intended, there is no option but to measure its critical parameters on a regular basis.

When you start measuring, there are two considerations. First, is the measurement precise and accurate – does it reflect reality? That is a topic for another discussion; let’s assume that our measurements are reliable and truly reflect what is happening in the process. The next question is whether the process is performing as intended. Note that all processes have variation: it is practically impossible to get the same readings every time, and your measurements will also show variation. So if you do see variation in your measurements, should you be alarmed that something is amiss, or should you ignore the variation in the hope that the process will return to its original value? When should you take corrective action?

If you don’t take corrective action when it is required, you are letting the process run out of control and the results will be less than expected. On the other hand, if the variation you are seeing is just the natural variation of the process, you may actually make the process worse by taking corrective action when none is required. A control chart helps you make this determination using statistical principles. It tells you when the variation is caused by common causes (natural factors), in which case you have nothing to worry about, and when it is caused by special causes (extraneous factors), in which case you are expected to take corrective action. By implementing control charts the right way, you can ensure that your process continues to stay in control and that you continue to derive the benefits of the process you have put in place over a long period of time. Here are some factors to consider when designing and implementing control charts:

  • What are the variables that you need to track and monitor to ensure that process is in control? Should you only measure the outputs or should you also measure the inputs?
  • What type of data is appropriate for creating control charts? How do you handle cases when the data is continuous and when it is discrete?
  • What is the right frequency to collect the data for your control charts? Should you collect the data every hour, every day, every week, every fortnight, or every month?
  • How much data should you collect each time you decide to collect the data? Is one data point sufficient, how about 5 or 50?
  • Who should collect the data to ensure that the process is in control?
  • What control chart should you use to determine if the process is in control? What are the different options available and when do you use which option?
  • How often should you update the control limits that you use to determine if a process is in control?
  • How are you going to display these out-of-control situations to your workforce? Is it okay to display these out-of-control charts in your work area?
  • What actions are you going to take if the process is out of control? How do you communicate and train your organization to act on these situations?
  • What are some best practices and lessons learned from other people who have deployed control charts that you can learn and leverage from?

As you can see, deploying control charts requires that you answer all of the above questions and put a robust strategy in place to ensure that your process stays in control. Failing to do so will make your process unsustainable in the long run; all the hard work you expended to develop the new process will be undone, and its benefits will be short-lived.
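As a sketch of the statistical principle a control chart applies, the limits of an individuals (I) chart can be computed from the average moving range; readings that fall outside the limits signal special-cause variation. This is illustrative Python with made-up readings, not the full set of run rules a control-charting package would apply:

```python
def individuals_limits(data):
    """Control limits for an individuals (I) chart using the
    average moving range (MR-bar) and the constant 2.66 = 3/d2."""
    mean = sum(data) / len(data)
    mrbar = sum(abs(a - b) for a, b in zip(data, data[1:])) / (len(data) - 1)
    return mean - 2.66 * mrbar, mean, mean + 2.66 * mrbar

# Made-up hourly readings from a stable process
readings = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7]
lcl, cl, ucl = individuals_limits(readings)

# Points outside the limits suggest special-cause variation
out_of_control = [x for x in readings if not lcl <= x <= ucl]
```

For this stable sample, all readings fall inside the limits, so no corrective action is signaled; a reading well outside the band would be flagged for investigation instead of being tampered with as routine noise.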

Waterfall Chart

A waterfall chart (also called a bridge chart) is used to graphically show the cumulative effect of several variables on a primary metric of interest. For example, let’s say your primary metric of interest is the revenue generated by the company, measured in USD. In 2015 the revenue generated by the company was $200,000, and in 2016 it was $230,000. We are interested in understanding what caused the increase in revenue between the two years. Let’s say our variable of interest is the region (North, South, East, and West). The revenue for each of the 4 regions for 2015 and 2016 is shown in the table below.


One way to show this data visually is to plot a bar chart comparing the two years 2015 and 2016 by region; the resulting graph is shown in the figure below.


A second way to show the same information is with a waterfall or bridge chart. In this chart, we start with the revenue generated in 2015 and then show the incremental revenue contributed by each region. The reformulated table looks as follows:


The resulting waterfall chart is shown below. Regions that add to the revenue (compared to last year) are shown in green while the regions that have performed worse (compared to last year) are shown in red. The starting bar shows the 2015 revenue and the ending bar shows the 2016 revenue.


The benefit of using a waterfall chart is that it can clearly highlight the areas that have done well and areas that need focused improvement.
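Under the hood, a waterfall chart is just a running total: each floating bar starts where the previous one ended. The Python sketch below computes the bar positions; the regional changes here are hypothetical numbers chosen only to bridge $200,000 to $230,000, not the figures from the example table.

```python
def waterfall_bars(start, deltas):
    """Return (bottom, height) for each floating bar of a waterfall
    chart: each bar starts where the running total left off, and
    negative changes hang downward from the running total."""
    bars, total = [], start
    for change in deltas.values():
        bottom = total if change >= 0 else total + change
        bars.append((bottom, abs(change)))
        total += change
    return bars, total

# Hypothetical regional changes bridging 200,000 to 230,000
deltas = {"North": 20000, "South": -5000, "East": 10000, "West": 5000}
bars, final = waterfall_bars(200000, deltas)
```

Each `(bottom, height)` pair can be drawn directly as a bar (for example via matplotlib’s `bottom` parameter), with the sign of the change deciding the green/red coloring described above.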

A waterfall chart can also be used to study the impact of several variables on the primary metric. For example, the following chart shows the impact of several variables on company profitability. Can you determine which areas of the company have gotten worse and need additional focus to improve profitability?


Using Sigma Magic Software

Here are the steps to create a waterfall chart within the Sigma Magic software. First, add the waterfall chart template to your workbook by clicking on Graph > Waterfall Chart. Next, enter the data for the plot on the Excel worksheet as shown in the figure below. There are currently no options that need to be specified for this analysis. Once all the data is entered, click on Compute Outputs to generate the graph.


For more details about this software, please visit http://www.sigmamagic.com/.

Changeover Time Reduction (SMED)


SMED stands for Single Minute Exchange of Dies. It is a Lean methodology for reducing changeover time. The name comes from the automotive industry, where dies are used to make auto-body parts. These machines use large hydraulic forces to press a pair of dies onto sheet metal in order to impart a specific shape to it, such as the front hood of a car. Due to the large forces involved, some of these machines can be quite large, and changing the dies from one shape to another requires significant effort and downtime. In the past, a changeover could take several days.

If the changeover time is large, then frequent changeovers mean a lot of non-value-added time, as the equipment is not productive during a changeover. Hence, the number of changeovers is usually minimized, which means these machines end up making hundreds of parts of one type before a changeover to another die is initiated. The number of parts made is larger than what is immediately required, causing excess inventory on the factory floor. Thus one of the root causes of large inventories is the high changeover time required to change dies. Taiichi Ohno, the father of Lean, along with Shigeo Shingo, came up with a method to reduce the changeover time to less than 10 minutes. Hence, the methodology was named Single Minute Exchange of Dies (SMED).

The technique developed is not only applicable to dies but can be used anywhere a changeover reduction is required. Don’t be confused by the name, either: a successful SMED program could reduce the changeover time from 1 day to less than 1 hour (it does not always have to be in minutes). In some cases, the changeover time can be reduced to less than 5-10 seconds (as with a NASCAR pit crew). Nor does the concept apply to manufacturing alone. For example, the time between when an aircraft lands at an airport and when it takes off for the next flight can be considered a changeover time. During this changeover, the aircraft has to be refueled and cleaned, all the necessary checks have to be performed, the luggage of the arriving passengers has to be removed and that of the departing passengers loaded, and so on. The longer this changeover takes, the lower the efficiency of the entire process, as the plane is not productive and generating revenue during the changeover.

Why is a large changeover time bad?

First of all, changeover time is not value-adding: during a changeover, no product is being produced. Though the changeover helps us make the right product that the customer requires, if we can reduce the changeover time, we can still accomplish what the customer wants without impacting production too much. From this perspective alone, a smaller changeover time is preferable. More importantly, a large changeover time implies that the organization will avoid frequent changeovers in order to maintain its productivity numbers. This means that even when a changeover is required, it may be delayed to protect efficiency numbers. Running the process for longer than required means that the produced inventory is in excess of what is needed; this is one of the wastes of Lean (over-production). In addition, the excess inventory needs to be stored somewhere, causing transportation waste, and there is a chance of obsolescence of the parts, rework, etc. Hence, it is clear that a large changeover time is not beneficial for any process. If the changeover time can be reduced, inventories can be reduced, and thus we reduce the waste in the process. Smaller inventory in the process also implies a shorter lead time and hence a faster reaction to what the customer expects.

Changeover reduction can facilitate the following benefits:

  • Lower cost (since there are fewer NVA activities and lower inventory levels)
  • Smaller lot sizes (since we can have more frequent changeovers)
  • Faster response to customer demand

SMED Methodology

We can use the SMED methodology to reduce the changeover time. A typical SMED methodology consists of the following steps:

Step 1: Identify all the activities that are currently performed during a changeover, including the time for each activity. You may want to videotape the changeover with an on-screen timer to ensure that all activities are captured. In addition, you may want to observe multiple changeovers to capture the variation as well as document all possible activities. One thing to watch out for is that the process is not altered by the videotaping itself – you should try to capture the as-is situation as close to reality as possible.

Step 2: Classify each activity as either an internal activity or an external activity. Activities that can only be performed when the equipment is stopped are called internal activities. Activities that can be performed while the machine is running are called external activities. For example, if you need to change the wheels of a car, the car has to be stopped while you perform the changeover – that would be classified as an internal activity. Checking the engine temperature or other parameters could possibly be done by the driver while the car is still in motion and hence could be an external activity. The most significant way to reduce changeover time is to convert internal activities to external activities. During a changeover, a number of activities are performed, such as:

    1. Allowing the machine tool to cool down (if required)
    2. Getting the required tooling & instruments to do the changeover
    3. Removing the old die (which may be attached using screws, fixtures etc.)
    4. Clean-up of the area as required
    5. Putting the new die (including fixtures, screws etc.)
    6. Calibration of the machine tool as appropriate
    7. Running the machine tool for the first few parts to ensure good quality output
    8. Removing the tools & instruments from the work area

By moving activities that are currently performed when the machine is stopped to when the machine is still in production, the overall changeover time can be reduced. For example, you could gather all the required tooling and instruments for the changeover while the machine tool is still in production. Before a flight lands at an airport, the flight attendants could clean up the aircraft as much as possible while the plane is still in the air – an example of an internal activity converted to an external activity. Of course, not all internal activities can be moved to external activities, but we should try to move as many as possible.
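The payoff of Step 2 is easy to quantify: only internal activities (machine stopped) contribute to changeover downtime, so every activity converted to external removes its time from the total. Here is a small Python sketch; the activity names and times are made up for illustration:

```python
# Each activity: (name, minutes, 'internal' or 'external')
activities = [
    ("Cool down machine", 30, "internal"),
    ("Fetch tools",       15, "internal"),
    ("Remove old die",    20, "internal"),
    ("Clean work area",   10, "internal"),
    ("Install new die",   25, "internal"),
]

def downtime(acts):
    """Changeover downtime = total time of internal activities only."""
    return sum(t for _, t, kind in acts if kind == "internal")

before = downtime(activities)

# Convert tool fetching and cleaning to external activities
# (done while the machine is still running)
converted = [(n, t, "external" if n in ("Fetch tools", "Clean work area")
              else kind) for n, t, kind in activities]
after = downtime(converted)
```

In this made-up example the downtime drops from 100 to 75 minutes simply by re-sequencing work, before any activity is actually made faster.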

Step 3: Review the internal activities that are left over to see if they can be eliminated, simplified, or combined with other steps in order to reduce the total changeover time. Techniques that can be used here include:

  • Eliminate bolts & screws using quick release mechanisms
  • Eliminate adjustments and use standardized settings
  • Eliminate motion by reorganizing the work area
  • Modularize equipment to reduce changeover time.

When performing SMED, don’t ignore the people side. In addition to the technical elements above, significant improvements can be made by paying attention to the people element: having clear roles and responsibilities, coaching team members on the concepts of kaizen and continuous improvement so that they constantly look for improvement opportunities, and standardizing the work instructions so that work gets done in a consistent and repeatable fashion.

A typical SMED event may last 1-3 days, but it should not be seen as a one-time activity. Once a SMED activity is completed and the changeover time has been reduced, we should ensure that the new process is followed for a period of time to standardize the new way of working. In the meantime, new ideas and suggestions may come from team members to further refine and improve the process. Once the process has stabilized, that is a good time to initiate a subsequent SMED activity to further reduce the changeover time. Remember that the ideal changeover time is zero.

It is also important to stress that once the changeover time has been reduced, the related processes should be streamlined as well – for example, you may need to recalculate the appropriate inventory levels, stock replenishment strategies, etc. There is no benefit to reducing the changeover time of a flight if the aircraft then sits idle at the terminal and we don’t capitalize on the reduction. The shorter the changeover time, the more changeovers you need to plan into your process so that the benefits of reduced changeover times can really be leveraged by the organization.

SMED Software

You can use software such as Sigma Magic to facilitate your changeover analysis. The screenshot below shows the list of activities and classifies them as internal & external. You can perform a time study to capture the changeover time before (with the existing process) and after (after making the improvements to the process using the 3 steps described above).


The output of the analysis software is shown below. It contains the summary of the data you have collected and the improvements you have made to the process along with a control chart which highlights the reduction in changeover time along with information on whether the new process is in control.




Workload Balancing

Workload balance refers to all steps in a process having roughly the same workload so that the flow of work through these process steps is balanced.


Let’s take a simple example of a three-step loan approval process. The first step is done by Jack, who enters all the customer information from a paper document into the computer. He takes on average 10 minutes to do this for each customer; he has to enter information such as name, address, phone numbers, social security number, credit details, etc. The second step is done by Jill. She has to check the credit rating for each customer using an online portal and enter the credit score in the customer record. She takes roughly 4 minutes per customer. The final step is done by Ron, who decides whether to approve or reject the loan application based on the customer details and then sends a notification to the customer. He takes roughly 6 minutes. Let’s also assume that each person starts the activity immediately and there is no buffer or work-in-process between the steps. If a customer calls in with their details, then roughly 20 minutes later he or she gets a notification of whether the loan is approved or rejected. The total cycle time (20 min) is the sum of the cycle times of the steps (10 + 4 + 6) when there is no work-in-process or buffer between the steps.

The jobs processed by each step in the process are shown in the figure below.


We would say that this process is not balanced, because each step takes a different amount of time. The first step takes 10 minutes, and while Jack is working, Jill has to wait for his output – so we do not have full utilization of the second resource. Once the first step is done, Jill can start her activity. Let’s assume a TAKT time of 10 minutes – meaning that on average a new customer calls in every 10 minutes with a loan approval request. In this scenario, Jack is always busy, since his activity takes 10 minutes: as soon as he finishes one job, a new customer calls and he has to repeat his activity – no rest or break! Jill does her job in only 4 minutes, so she basically waits 6 minutes for the next job and then works for 4 minutes on each job – a much better scenario compared to Jack’s. We are not fully utilizing this resource; her utilization is at best 40%. In addition, Jack may feel that he is being overworked while other people in the organization are having it “easy”.

A sample Cycle Time – TAKT Time chart is shown below.


The red line corresponds to a TAKT time of 10 min. From this chart it can be seen that Jack’s cycle time equals the TAKT time at 10 min, while the cycle times for Jill and Ron are less than 10 min. Thus both Jill and Ron have waiting time while Jack is fully busy with his activities. Jack’s step is called the bottleneck process, since its cycle time is the largest and it controls the overall throughput of the system. So, in an hour we can expect this process to complete at most 6 applications (60 min / 10 min per application). The throughput of the entire process can only be improved if the cycle time of the bottleneck process is reduced.


If all steps of the process consist of value-added activities, then the only way to improve throughput is to re-arrange the workload between the three steps. If we redistribute the activities so that some of the work performed by Jack moves to Jill and Ron, the workload for each step becomes similar – we call such a process balanced. The throughput of the entire process can then increase to 9 applications per hour (60 min / 6.67 min per application). Redistributing workload is one option for balancing; other techniques to reduce cycle time are SMED activities applied to the bottleneck process, Kaizen events, and the use of visual aids and Standard Operating Procedures (SOP) to minimize variation.
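The throughput arithmetic in this example can be checked with a few lines of Python, using the cycle times from the loan process above:

```python
cycle_times = {"Jack": 10, "Jill": 4, "Ron": 6}  # minutes per application

# The bottleneck is the step with the largest cycle time;
# it sets the pace for the whole process.
bottleneck = max(cycle_times, key=cycle_times.get)
throughput_per_hour = 60 // cycle_times[bottleneck]  # 60 / 10 = 6

# After balancing, each step takes total/steps = 20/3 ≈ 6.67 minutes
balanced_time = sum(cycle_times.values()) / len(cycle_times)
balanced_throughput = round(60 / balanced_time)      # 9 applications
```

The unbalanced process is limited to 6 applications per hour by Jack’s 10-minute step, while the balanced process reaches 9 per hour with the same total work content.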


The benefits of workload balance are:

  • All steps in the process have similar cycle times – so employee motivation is high
  • Throughput of the process is maximized
  • There are no points in the process where we need to accumulate inventory
  • Rework is reduced, since fewer mistakes are made when the bottleneck process is not overloaded


In order to implement workload balancing, we need to ensure that the following items are considered:

  • Each step in the process should be relatively free from quality problems
  • Each step should have equipment that does not breakdown often
  • Operators should be trained so that they don’t make too many mistakes
  • Operators should be cross-trained to cover absenteeism and other absences
  • Incoming work into the process should also be relatively stable with minimal rush orders

How to sustain the 5S program

How much time do you spend searching for stuff? It could be a simple tool needed to perform an operation, or a report that you need to refer to or forward to others. The time you spend searching is non-value-adding from a customer’s point of view. It is wasted effort that does not help transform a product or service into something a customer desires. In fact, it could make the customer wait longer while you are executing the search for the needed tool or report. The idea is that an organized workplace is an efficient workplace. Some may argue that they can find their stuff perfectly well even if it is disorganized, but what may work for one person may not work for another. In companies and organizations where multiple people share workplaces, an organized workplace works for everyone. It reduces the stress that comes with constantly having to look for the stuff you need to get your work done.

5S+1 is a lean technique for organizing a workplace using a step-by-step approach. The name 5S originated from the five Japanese words for the steps of the process; their English equivalents are Sort, Set in Order, Shine, Standardize, and Sustain. More recently, a sixth S (Safety) was added to the list, and hence some places in the literature refer to it as 5S+1. Let’s go through each of these steps to explain the methodology:

    1. Sort: The first step is to sort through the clutter in a work area and identify the items that are needed and those that are not. This activity can be done periodically (say every year), as items accumulate in a work area over time. This step is sometimes called a Red Tag event: a red tag is attached to every item that is not required. Since some items may be required in the future, the red tag also contains space for notes indicating the reason for the disposition. A red tag room is created to store the red-tagged items, just in case you have decided to throw away items that were in fact useful. It is a good idea to keep the red tag room locked and to have a defined process for getting items out of the room and disposing of those that are truly not needed.
    2. Set in Order: The second step is to arrange the remaining items in such a way as to minimize motion waste. Items that are used very often should be kept close to their point of use; items that are used rarely can be kept slightly farther away. A spaghetti chart can help identify where different items should be placed. The theme behind this step is that there should be a place for everything and everything should be in its place. Once a home for each item is defined, the area is generally marked to make clear which item goes where, so that if any item is missing it becomes immediately obvious and action can be taken to trace the item and put it back in its place.
    3. Shine: The third step is to ensure that whatever items are kept in the work area are in pristine working condition. There is no point keeping stuff that does not work! The process of shining the items can help detect problems with the tools so that corrective actions can be taken to repair them, and we don’t waste time using tools that are broken. This step also acts as preventive maintenance, as problem areas can be identified before they get out of hand.
    4. Standardize: The fourth step is to ensure that there is a standard way of working. For example, the standards can define how the first three steps of this process are to be done: when the sort activity should happen, who should do it, and how it should be done; what the process is for finding a home for each item; how the shine step should be done; and so on. We can also define standards for how an area should look, so that by comparing against this picture we can clearly tell whether an area has been following the 5S steps. This step ensures that there is a structured approach to the current way of working.
    5. Sustain: The fifth step is to ensure that this is not a one-time activity. How do we ensure that the 5S+1 process is sustained over time? One approach is an audit process, so that the 5S activity can be rated to see how well an area is being maintained. Without measurements in place, it would be hard to identify improvements or even to tell whether the process is being followed in a structured way in the first place. Another idea that can help sustain the effort is a rewards and recognition program.
    6. Safety: This is the sixth step in the process (some people may call it the first step!). A safe workplace is an effective workplace. No business can run profitably if it does not provide a safe working environment for its employees. Safety issues can result in lost time and productivity, increased overhead costs due to medical expenses, possible lawsuits against the company, and increased insurance costs. More importantly, an unsafe workplace is just bad business – we need to take care of our employees, customers, and anyone else we deal with as part of our business. In this step, we review all possible near misses and take steps to eliminate or minimize them so that the number of accidents in the workplace can be reduced.

One tool we can use to check the effectiveness of the 5S+1 program is to use a 5S+1 Audit. An example form that can be used to audit the workplace is available in the Sigma Magic templates which you can access by selecting Lean and then 5S+1 Audit. A sample screenshot of this template is shown below.


The top section contains standard information about the area being audited, such as the team name, team members, date of the audit, and any other observations by the auditors. It is usually recommended that there be a structured approach to the audit process, with clear guidelines on who is on the audit team, when the audit will take place, how the scores will be used, etc., that is shared with the organization. It may also be a good idea to include other lean team members (peers) in the audit process so that best practices can be shared between teams, and to track a team's audit scores over time to see if the area is getting better, staying consistent / repeatable, or getting worse. Based on this trend, appropriate corrective actions may be required.


The first three steps of the 5S+1 audit are shown above. This audit template was developed for a manufacturing or warehouse area and may need to be suitably modified for other areas/processes. Each element can be rated on a scale of 0-10. If there are multiple auditors, we can either have a separate rating from each observer that is averaged, or the members can discuss and enter a common score on this template. The second approach is preferred. Of course, there is no need to spend a lot of time debating whether the score should be a 4 or a 5 – just go with one number! If the ratings are significantly different, say one team member says 2 and another says 8, then each needs to explain the reasoning for their rating so that, hopefully, a common agreement can be reached.

The second set of three steps of the 5S+1 audit is shown below. Once the team “walks through” the area and provides a rating for each of these elements, the overall score and the audit team's observations can be shared with the area, and a printout of the score can be given to the team for use in future improvements. As a part of reinforcing the right behaviors, we should always take the time to recognize and appreciate the efforts taken by the team in improving their 5S scores. Also, ensure that the audit team is trained to avoid grade inflation and any partiality between teams. You may want to calibrate the scores given by the audit team on a random basis just to ensure that the process is working in a robust manner.


In some companies, the 5S+1 audit performance scores are linked to the incentive program. This provides another motivation for the team to keep their work areas neat and tidy. Sufficient communication should be provided to the organization to ensure that this exercise does not become a paper exercise and people see value in keeping their areas organized.

In summary, an organized workplace is an effective and efficient workplace. 5S+1 is a structured approach to keep the workplace organized. In order to sustain the program, a 5S+1 audit should be instituted so that this program can continue to run in a sustainable manner for a long time.

Feel free to share your thoughts on the 5S+1 approach and the 5S+1 audit. What works for you and what doesn’t?

Hypothesis Testing Introduction


Most of the time we work with samples rather than populations, since it is usually too time consuming and expensive to collect population data. A sample is a subset of the population, and if the sample is taken at random it can fairly be assumed to be representative of the population. Sometimes we need to compare the means of the populations from which two samples were drawn. For example, let’s say we want to compare the average height of all men and women in a country. We draw a random sample of men and a random sample of women from that country. We can calculate the sample averages for men and women and compare them. However, this may not answer the question about the population, since the next time you draw a random sample you will most likely get different numbers. So, the question now is: what conclusions can we draw about the average height of men and women for the entire population? In order to answer these types of questions, we need to use hypothesis testing. Hypothesis testing belongs to a class of statistical techniques called inferential statistics, because we are inferring about the population from the sample data set.

When to use hypothesis testing?

Whenever we need to make decisions based on data and we are working with sample data rather than population data, we need to use hypothesis testing. Hypothesis testing considers the variation that exists in the data set before extrapolating and drawing conclusions about the population. Of course, we can always make a mistake when we extrapolate from a sample to a population but using hypothesis testing it is possible to minimize the amount of error that may occur. In one of our subsequent modules, we will cover how to control the errors during hypothesis testing. If you don’t use hypothesis testing and only draw conclusions based on the sample results, you may draw the wrong conclusions.

Hypothesis Methodology

The following methodology needs to be used anytime you want to perform hypothesis testing:

  1. What is the practical question you are trying to answer?
  2. Convert the practical question to a statistical question (formulate your hypothesis statements)
  3. Determine the hypothesis test that needs to be used
  4. Determine the sample size required to control your alpha and beta errors
  5. Collect a random sample of the required data
  6. Perform the test and get the confidence intervals and P-values
  7. Draw statistical conclusions
  8. Convert the statistical conclusions to a practical conclusion to obtain an answer to the question
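As an illustrative sketch of steps 5 through 8 above, here is a minimal Python example comparing two sample means. The height samples are made-up numbers, and for simplicity the sketch uses a large-sample z statistic with a normal approximation; in practice a two-sample t-test is typically used for small samples.

```python
from statistics import NormalDist, mean, stdev
import math

def two_sample_z_test(a, b):
    """Test Ho: mu_a = mu_b vs Ha: mu_a != mu_b using a large-sample
    normal approximation. Returns the z statistic and two-sided p-value."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# hypothetical height samples (feet) for men and women
men   = [5.9, 6.1, 5.8, 6.0, 6.2, 5.7, 6.0, 5.9, 6.1, 5.8]
women = [5.4, 5.6, 5.3, 5.5, 5.7, 5.2, 5.5, 5.4, 5.6, 5.3]
z, p = two_sample_z_test(men, women)
print(f"z = {z:.2f}, p = {p:.4f}")
# a small p-value (less than alpha, e.g. 0.05) leads us to reject Ho
```

The statistical conclusion (reject or fail to reject Ho) is then translated back into the practical conclusion, for example "the data supports a difference in average heights."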

Hypothesis Formulation

In this article, we will discuss how to formulate the hypothesis statements. In general, there are two hypothesis statements: one is called the null hypothesis (Ho) and the other is called the alternative hypothesis (Ha). When we initially formulate these statements, we usually don’t know what the answer will be. The data we collect later in the process will help us decide whether to conclude Ho or Ha.

There are three rules to consider when you are writing the hypothesis statements:

RULE 1: The first rule is that hypothesis statements are always about the population parameters, not about the sample. What we mean by this is that we know the exact statistics of the sample data we collect, so we don’t need to make any hypotheses about them. What we do not know for a fact are the conclusions about the population; hence, we always make hypotheses about the population. For example, suppose we are interested in the average height of a person in a country (say Brazil) and our hypothesis is that the average equals 6 feet. We can represent the null hypothesis as follows:

Ho: μ = 6 feet

Note that we use the Greek letter μ (mu) to represent the population average. If we are working with a sample of, say, 20 data points, we use the symbol x̄ (x-bar) to represent the sample statistic. The wrong way to write the hypothesis statement is the following:

Ho: x̄ = 6 feet

RULE 2: The second rule is that the equality sign belongs to the null hypothesis. We create two hypotheses each time, the null (Ho) and the alternative (Ha). Only one of them can be true at any given time: if Ho is accepted as true then Ha is false, and vice-versa. When creating the hypothesis statements, you need to ensure that the equality sign is always allocated to the null hypothesis. For example, if we want to test whether the average height is 6 feet, then one possibility for the hypothesis statements is: the average height equals 6 feet (Ho) and the average height does not equal 6 feet (Ha). This can be written mathematically as follows:

Ho: μ = 6 feet
Ha: μ ≠ 6 feet

The wrong way to write these hypothesis statements is:

Ho: μ ≠ 6 feet
Ha: μ = 6 feet

Of course, there are more ways of writing these statements depending on what you want to prove or disprove. For example, if you want to show that the average height of a person in Brazil is less than six feet, then the corresponding hypothesis statements are as follows:

Ho: μ ≥ 6 feet
Ha: μ < 6 feet

In all these statements, you will find that the equality sign has been allocated to Ho.

RULE 3: The third rule is that we continue to believe in the null hypothesis (Ho) unless we have enough facts and data. Only sufficient data can disprove Ho, at which point we reject the null hypothesis and accept the alternative hypothesis as proven. Hence, what we want to prove or disprove is usually part of the alternative hypothesis, while the null hypothesis contains the status quo – no change. The null hypothesis can be thought of as a statement of no difference or zero difference. For example, suppose we want to prove that providing training improves the average productivity of an organization. The null hypothesis would be that the average productivity is the same whether or not we provide training; the alternative hypothesis would be that training improves productivity. If we don’t have sufficient data, our conclusion would be that we don’t have sufficient data to prove that training has an impact on productivity.


Note that if we are hypothesizing about the average value of a property, we use the Greek letter μ (mu) to denote the population average. If we are hypothesizing about the variation of a value, we usually use the Greek letter σ (sigma), which stands for the standard deviation. Finally, if we are hypothesizing about discrete values (say, the number of defects), we usually use proportions (denoted by p). A proportion of defects or defectives is the number of defects divided by the total sample size. For example, if we have 50 items, out of which 4 are defective, then the proportion is 4/50 = 0.08.


The following examples help illustrate how to write the right hypothesis statements. Try to write out the hypothesis statements for yourself first and then compare your answers to the ones provided below.

  1. Prove that the average salary in company A for entry level employees is greater than the average salary in company B.
  2. Prove that the average salinity in the sea water is greater than 35 g/L.
  3. Show that the variation in delivery times for supplier A is different from variation in delivery times for supplier B.
  4. Show that the average number of footfalls for four department stores, A, B, C, and D are different.
  5. Show that the manual process of recording exam marks has more defects compared to the automated process for recording exam marks.

Solutions to Examples

Here are some sample solutions to the examples shown above.

  1. Ho: μA ≤ μB; Ha: μA > μB (average entry-level salaries in companies A and B)
  2. Ho: μ ≤ 35 g/L; Ha: μ > 35 g/L (average salinity of sea water)
  3. Ho: σA = σB; Ha: σA ≠ σB (variation in delivery times for suppliers A and B)
  4. Ho: μA = μB = μC = μD; Ha: at least one of the store averages is different
  5. Ho: p(manual) ≤ p(automated); Ha: p(manual) > p(automated) (proportion of defects)

Try to come up with the right hypothesis statements for the following problems.

  1. An engineer comes up with a recommendation to improve the productivity of a generator. Data was collected for 10 days with the old way of working and with the modification to the generator. Based on this analysis, we want to prove that the modification in fact increases the productivity of the generator.
  2. It was hypothesized that employees who work longer than 10 hours per day produce more quality defects than employees who work less than 10 hours per day.
  3. We want to show that the breaking strength of a material supplied by the old supplier is significantly lower than the breaking strength of a similar material supplied by a new supplier.





A histogram is a graphical representation of data. It is used for continuous data when you want to determine the nature of distribution of the data. It gives you an idea of the central location of the data (roughly where the mean and median are located) and the amount of variation you have in your data (roughly the minimum and maximum values). By viewing the distribution of the data, you get an idea if the distribution is unimodal (has one peak), bimodal (has two peaks), or multimodal (has many peaks). It tells you if the distribution is symmetric or non-symmetric. If the distribution is not symmetric, does it have a long left tail or a long right tail? The shape of the distribution also gives you a clue if the distribution is close to a normal distribution or any other shape you may be interested in (such as uniform, exponential etc.) Hence, whenever you collect continuous data, it is always a good idea to plot the histogram which will give you more information about the nature of the data you have collected.

How to build a histogram

Step 1: Determine the bins

To construct a histogram, we first divide the data into “bins”. The bins are usually consecutive, non-overlapping intervals of equal size. There is no single right way to determine the bin size. If the bin size is too large, the shape of the histogram is not clearly shown because too many data points get grouped into each bar. If the bin size is too small, there may not be enough data points within each bin, and you may find a lot of “gaps” or empty bins, making it hard to visualize the shape of the histogram. Hence, we should always try varying the bin size to see if our interpretation of the shape of the histogram changes.

If you know the number of bins (# bins) you want, you can determine the bin size simply by using the formula:

bin size = (maximum value – minimum value) / # bins

Usually, we do not know the number of bins, in which case you can try one of the commonly used formulae that depend on the number of data points (n), such as:

# bins = √n (square-root rule)
# bins = 1 + log2(n) (Sturges’ rule)

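Two such rules, the square-root rule and Sturges’ rule, can be computed as follows (a minimal Python sketch; rounding up to a whole number of bins is a convention chosen here):

```python
import math

def suggested_bins(n):
    """Two common rules of thumb for the number of histogram bins."""
    sqrt_rule = math.ceil(math.sqrt(n))    # square-root rule: bins = sqrt(n)
    sturges = 1 + math.ceil(math.log2(n))  # Sturges' rule: bins = 1 + log2(n)
    return sqrt_rule, sturges

print(suggested_bins(50))    # (8, 7)
print(suggested_bins(1000))  # (32, 11)
```

Treat these only as starting points; as noted above, it is worth trying a few bin sizes to see whether the interpretation changes.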
Step 2: Place the data in the bins

Sort through the data and determine which bin each data point falls into. For each data point that falls into a bin, increment that bin’s count by 1. For example, if the data is 1, 4, 3, 2, 5, 6, 3, 2, 4, 3, 2, 1 and the bins (bin size = 2) are [1, 3), [3, 5), and [5, 7), the following table is obtained at the end of placing all the data into bins:

Bin      Frequency
[1, 3)   5
[3, 5)   5
[5, 7)   2

At the end of the exercise, the sum of all the values in the frequency column should equal the number of data points you started with.
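The binning step can be sketched in a few lines of Python (an illustrative helper, not part of any library; it assumes integer-valued data and equal-width bins):

```python
def bin_data(data, bin_size):
    """Count how many values fall into each consecutive bin of width bin_size."""
    lo = min(data)
    n_bins = (max(data) - lo) // bin_size + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - lo) // bin_size)] += 1
    return counts

data = [1, 4, 3, 2, 5, 6, 3, 2, 4, 3, 2, 1]
freq = bin_data(data, 2)       # bins: [1, 3), [3, 5), [5, 7)
print(freq)                    # [5, 5, 2]
print(sum(freq) == len(data))  # frequencies sum to the number of data points
```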

Step 3: Plot the histogram

Plot the bins on the horizontal (or vertical) axis and the frequency on the vertical (or horizontal) axis. Depending on your objective, you may also want to superimpose the shape of a theoretical distribution on the histogram of the raw data in order to compare the theoretical shape with the raw data.


Step 4: Interpret the data

Once you have plotted the histogram, look for the following items:

  1. Minimum value of the data. From the above histogram it can be seen that the minimum value is roughly close to 0 (note that it may not be possible to get the exact minimum value from a histogram since we are binning the data).
  2. Maximum value of the data. From the above histogram, the maximum value is roughly close to 10 (note that it may not be possible to get the exact maximum value from a histogram since we are binning the data).
  3. Range = Max – Min value. From the above histogram, the range is roughly 10 (note that it may not be possible to get the exact range from the histogram since we are binning the data).
  4. Whether the distribution is symmetric about the center and, if not, whether it is left-tailed or right-tailed. From the above histogram, we may conclude that it is roughly symmetric. We would need more data / bins in order to really determine if the histogram is symmetric.
  5. Does the distribution follow the shape you have postulated prior to plotting the histogram? From the above histogram, it is hard to determine if the data follows, say a normal distribution. We usually need at least 30 data points to determine if a histogram follows a particular distribution.
  6. Are there any gaps in the histogram? From the above histogram, there do not seem to be any gaps in the data, as all bars are placed contiguous to each other.
  7. Are there any outliers in the histogram? From the above histogram, there do not seem to be any outliers in the data, as no bars are located significantly far from the rest of the histogram.


Plot the histogram for the sunspot data. The data for the sunspot cycles can be obtained at: https://www.windows2universe.org/teacher_resources/suncycle_sheet.html.

A possible answer to the question is shown below for data from 1941-2000. From the figure below, we can conclude that the number of sunspots does not necessarily follow a normal distribution. It ranges from 0 to 220 sunspots per year.



All the charts in this article were created using the Sigma Magic software. Visit http://www.sigmamagic.com.

Surface & Contour Plots

What is a contour plot?
If you have a response variable (say z) which is a function of one input variable (say x), then we can represent this mathematically as z = f(x). We can plot the variation in z with variation in x as a 2-dimensional line graph. However, if the response variable z is a function of two input variables (say x and y), then we can represent this mathematically as z = f(x, y). In order to plot the variation in z, we now need three dimensions: changes in x shown on one axis, changes in y shown on another axis, and the resulting changes in z shown on a third axis. Such a plot is called a 3-d surface plot. A sample 3-d surface for z = sin^2(x) + cos^2(y) is shown below.


Sometimes, we may be interested in determining the values of x and y which give the same (constant) z value. This sort of plot, which is typically drawn on a 2-d plane, is called a contour plot. The lines that represent constant z values are called iso-lines. For example, on a weather map for a given area, we may be interested in locating regions of high and low pressure; we would typically use a contour plot to represent such a figure, and the lines of constant pressure are called isobars. A sample 2-d contour plot for z = sin^2(x) + cos^2(y) is shown below.


Contour plots provide visual clues on what values of X and Y to select in order to optimize the response function.
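As a rough sketch, the grid of z values behind such a plot can be generated as follows (a minimal pure-Python example for z = sin²(x) + cos²(y); `grid_values` is an illustrative helper, and a tool such as Excel or a plotting library would then render the surface or contours from this grid):

```python
import math

def grid_values(x_min, x_max, y_min, y_max, n):
    """Evaluate z = sin^2(x) + cos^2(y) on an n x n grid of (x, y) points."""
    xs = [x_min + i * (x_max - x_min) / (n - 1) for i in range(n)]
    ys = [y_min + j * (y_max - y_min) / (n - 1) for j in range(n)]
    return [[math.sin(x) ** 2 + math.cos(y) ** 2 for x in xs] for y in ys]

z = grid_values(-math.pi, math.pi, -math.pi, math.pi, 41)
flat = [v for row in z for v in row]
print(min(flat), max(flat))  # z should range between 0 and 2 for this function
```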

How to create contour and surface plot in Sigma Magic?
The contour and surface plots can be created within the Sigma Magic software by clicking on Contour Plot within the Graph menu. Click on Update Inputs to open the dialog box. Specify the name of the X axis variable (without any spaces or special characters), the minimum and maximum values of X, and the number of increments used to generate intermediate values of X. Similarly, specify the parameters for the Y axis. Note that the name of the Y variable should be different from that of the X variable. Finally, define the response variable name and the model equation Z = f(X, Y).


When you click on the OK button, Sigma Magic will generate values for the X and Y variables and the response Z variable using the specified model equation. If no equation is specified, you can also define the equation in the Z column on the worksheet. This will automatically create the 3-d surface plot. If you need the contour plot, you will have to select the graph and change the graph type to contour plot using Excel functionality.

How to interpret contour plot?
It is usually easier to interpret a 3-d surface plot than a 2-d contour plot. However, there are a few things to consider when looking at a contour plot.

  1. Contour plots can indicate peaks or valleys within the range of X and Y at the center of concentric shapes.
  2. If the contour lines are spaced close together, the z values change rapidly, while if the contour lines are spaced far apart, the z values change more slowly.
  3. If there are multiple sets of concentric shapes within the figure, the response surface is usually multi-modal (it has multiple peaks or valleys).
  4. Contour plots which contain no curves but only straight lines may indicate a ridge-shaped function or a planar surface (linear model).

Probability Plot

The probability plot is a graphical method of determining whether the data follows a given distribution. One of the common ways of creating a probability plot is the Quantile-Quantile (Q-Q) plot. The idea behind this plot is that if the data follows a certain distribution, then the quantiles of the data should match the quantiles of that distribution. If the quantiles of the distribution are plotted on the X-axis and the quantiles of the data are plotted on the Y-axis, and the data follows the given distribution, then the Q-Q plot should lie close to the 45-degree line.

Let’s first clarify what we mean by a quantile. A quantile divides the data into equal-sized subsets, and the boundary values are the quantiles. For example, if we divide the data into two halves, where half the values are less than the quantile and half the values are greater, this quantile is called the median (or the 2-quantile). If we divide the data into 4 equal buckets, where 25% of the data are less than Q1, 25% lie between Q1 & Q2, 25% lie between Q2 and Q3, and 25% are greater than Q3, then Q1, Q2, and Q3 are the quartiles (or the 4-quantiles). If we divide the data into 100 equal buckets, we call those quantile values percentiles. In general, a q-quantile means breaking up the data into q equal-sized sets of data.
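For instance, the quartiles can be computed with Python's standard library (a small illustration with made-up data; note that `statistics.quantiles` uses the "exclusive" interpolation method by default, and other conventions give slightly different cut points):

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15, 18]
# n=4 returns the three cut points Q1, Q2, Q3 (the 4-quantiles)
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)
print(statistics.median(data))  # the median equals Q2
```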

A Q-Q plot can be used to get an idea of the location (central value), scale (spread), and skewness (symmetry) of the distribution. It is a more powerful approach than just looking at a histogram. It is always a good idea to plot the data and understand the nature of the distribution rather than just looking at P-values on a goodness-of-fit test. Here are some common interpretations of the Q-Q plot:

  1. If the Q-Q plot is flatter than the 45-degree line (Y = X), then the spread of the data plotted on the horizontal axis is greater than the spread of the data shown on the vertical axis.
  2. If the Q-Q plot is S-shaped, then one of the distributions is more skewed than the other (i.e., one of the tails is longer than the other).
  3. The intercept of the regression line through the Q-Q points is a measure of the location, and the slope of the line is a measure of the scale of the distribution.
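As a sketch of how the Q-Q points are computed (assuming a standard normal reference distribution; `qq_points` is an illustrative helper using the common (i + 0.5)/n plotting-position convention, not a library function):

```python
from statistics import NormalDist

def qq_points(data):
    """Pair each sorted data value with the matching standard-normal quantile."""
    n = len(data)
    sorted_data = sorted(data)
    # plotting positions (i + 0.5) / n, one common convention
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted_data))

sample = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2, 2.3, 1.7, 2.5, 2.0]
for t, x in qq_points(sample):
    print(f"{t:+.3f}  {x:.2f}")
```

Plotting these pairs and checking how close they fall to a straight line gives the visual assessment described above.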