1  Processes and Plots

David Gerbing
The School of Business
Portland State University

1.1 Project the Past into the Future

1.1.1 Variation over Time

Doing data analysis to gain insight into business processes, with an emphasis on business forecasting, is the purpose of this course.

Data analysis: Analysis of the data values of one or more variables.

Many thousands of examples of variables exist, of which three follow.

  • OvertimeHours: The total hours of employee overtime for a company varies from month to month.
  • ArrivalTime: Arrival times of shipments vary with some shipments on time and others either early or late.
  • FuelCost: The cost of filling up a delivery truck’s fuel tank varies each time the tank is filled.

The data values for a variable naturally vary from one instance to another.

Variable: Property of the object of study that varies over different instances.

The variables examined in this document have values that vary over time.

  • One basis of predicting future values of these variables is the pattern of how their data values varied over previous times.

  • Another basis for prediction follows from how the values of different variables vary together, how they co-vary. The values of variables may vary together, usually not perfectly but as tendencies. Consider the following tendencies of co-varying variables:

    • The further the origin of the shipment from its destination, the higher its cost tends to be.

    • Higher octane gasoline tends to cost more than lower octane gasoline.

    • The more time spent on social media, the lower the GPA tends to be.

Applying statistics is the analysis of this variation of a single variable and co-variation of multiple variables. This concept leads to a concise, yet effective definition of the discipline of statistics.

Statistics: Analysis of variation.

Without variability there is no statistical analysis of data.

The material in this document addresses the variation of a single variable over time.1

  • 1 Prediction from the co-variation of multiple variables, the basis of regression analysis, is discussed elsewhere.

    Forecast from past variation of a variable

    To forecast, to predict the future, discover the pattern of variation of one or more variables over past time periods and then extend that pattern into the future.

    1.1.2 Randomness vs Stability

    Uncovering a pattern is complicated by the presence of randomness. The values of a variable not only vary, they vary at least partially according to random influences from one value to the next. Each data value is determined by an underlying stable component consistent with the underlying pattern and a random component that consists of many undetermined causes.

    Random variability: The sum of many undetermined influences that result in a data value that varies from the underlying pattern, known or unknown.

    Random variability is pervasive and its impact on data analysis profound.

    Implication of random variability

    Even if the structure of the process is understood, the exact next value cannot be known until that outcome occurs.

    For example, the hospital staff does not know when the next patient will arrive in the emergency room until the patient arrives. Nor do they know how many patients will be admitted on any one evening. A hospital may see 17 people admitted to the emergency room on one Saturday evening, and 21 people another Saturday evening. You do not know the amount of overtime hours in your department that will occur next month until next month happens. And you do not know how much the next tank of fuel will cost until you again fill up the tank.

    The opposite of randomness is pattern, stability and structure, the basic tendencies that underlie the observed random variation. The same hospital that admitted 17 and then 21 patients to the emergency room on a Saturday evening, admitted on the corresponding Wednesday evenings, 8 and 6 people. All four admittance numbers – 17, 21, 8, 6 – are different, but the pattern is that more people were admitted on a Saturday evening than on a Wednesday evening.

    Data analytics: Quest for a stable pattern that underlies obscuring random variation.

    1.1.3 The Future

    A central task of data analytics is to reveal and then quantify the underlying tendencies and pattern. The ever present random variation, however, obscures the underlying regularity, the pattern or signal. Sometimes the task of uncovering and quantifying structure is straightforward, and other times it is as much intuition and skill as it is the formal application of statistical principles. To delineate this stable pattern from the observed randomness, construct a set of relationships expressed as a model.

    Model: Mathematical expression that describes an underlying pattern apart from the random variation exhibited by the data.

    The outcomes of a process include a random component, but the model describes the underlying, stable pattern. The data consist of this stability with the added randomness that to some extent obscures the pattern.

    We need data from the past to forecast the future. Once we can describe a stable pattern, project this pattern into the future to generate a forecast. The process includes a sequence of basic steps.

    Decisions: Management decisions apply to the future.

    The ability to accurately forecast is necessary to business success. Try running a business in which the forecasted sales never materialize.

    1. Describe: Assess inherent variation in the data
    2. Infer: Build a model that expresses knowledge of the stable component that underlies this variation
    3. Forecast: From the model project this stable component into the future as the estimate of future reality
    4. Evaluate: Wait for some time to pass and then compare the forecast to what actually happened

    The knowledge obtained from this analysis begins with a description of what is, an inference of the underlying structure that culminates in a forecasting model, followed by a forecast of what will likely be, and then refinement of the model to improve the accuracy of the forecasts. The primary problem of identifying patterns from the past is the presence of sampling error.

    1.2 From Sample to Population

    Before discussing the application of sampling to forecasting, begin with the concept applied to estimating a single value, the mean.

    1.2.1 Samples vs Populations

    Video: Samples vs Populations [12:59]

    One primary problem we encounter in doing data analysis is that not all the data of interest are available for analysis. We are limited to analysis of only some of the observations of interest. For a study of the number of employees who call in sick on a Friday before a holiday Monday, the data may not exist for many years in the past, and regardless, do not exist at all for future such Fridays. A study of the blood chemistry of those who have Type II Diabetes cannot examine all people who have had, who do have, and who will have Type II Diabetes. The length of time to complete a specific procedure is of interest, but the times of past procedures may not have been recorded, and cannot have been recorded for future instances of the procedure.

    The distinction here is the population of data values compared to a sample of data taken randomly from the full population. We want the entire population, but typically only get one sample.

    Population: Set of all existing or potential observations.

    Sample: A subset of the population, the data for a specific analysis.

    To collect data, randomly gather a subset of the population, illustrated in Figure 1.1. Then generalize results to the entire population.

    Figure 1.1: A population defined by a process, and a sample from the process, where each dot represents a single observation.

    The population is of primary interest, which requires that we generalize beyond the usual one sample of data. The population may consist of a usually large number of fixed elements in a given location at a given time, such as all registered voters. More generally, the population consists of outcomes of a process ongoing over time, in which case the population and its associated values, such as its mean, are hypothetical. Whether a fixed population or an ongoing process, the population values, such as the population mean, are usually unknown to the analyst.

    Population value: True value based on the entire population, such as the population mean.

    One type of data analysis only describes the sample that is analyzed.

    Descriptive statistics: Summarize and display aspects of the sample data drawn from the larger population.

    Descriptive statistics are also referred to as summary statistics.

    Descriptive statistics

    Calculate descriptive statistics directly from the data.

    There is nothing to infer or guess regarding descriptive statistics. They are simply calculated, usually correctly with modern computers.

    Examples of descriptive statistics include the mean and standard deviation, and the value in the middle of the sorted data, the median.

    • Calculate the median for the midterm for a given class.
    • Calculate the mean ship time from a supplier for all shipments during the last 365 days.

    Compare estimation with calculation.

    Inferential statistics

    The estimation of unknown population values follows after the calculation of the corresponding descriptive statistics.

    Population values must be inferred from sample values with only guided guesses as to their true value.

    How are the elements in the sample selected from the larger population?

    Random sample: A sample in which every value of the population has an equal probability of selection.

    To select a pure random sample requires access to randomly generated numbers, each assigned to a corresponding element of the population, usually accomplished with a computer application such as R or Excel. Then, for example, to select a sample of 10 from the population, choose those elements with the lowest ten numbers. A pure random sample is difficult to implement completely. However, a good approximation of randomness of the sample is essential to properly generalizing statistical results computed from that sample to the corresponding population.
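    For concreteness, consider a minimal base R sketch of this selection, with a hypothetical population of 1000 numbered elements; the function sample() generates the needed random numbers internally.

    # hypothetical population: 1000 elements identified by number
    population_ids <- 1:1000
    # randomly select 10 elements, each with equal probability of selection
    chosen <- sample(population_ids, size=10)
    chosen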

    Even if a sample is random, were the data sampled from the correct population? When drawing a sample from the larger population, distinguish between what is wanted vs what is obtained.

    Sampling frame: The actual population from which the sample is drawn, distinguished from the desired population.

    The results of an analysis can only be properly generalized to the sampling frame. The sampling frame determines the scope of generalization of results, which may or may not be the desired population of interest per se. Sometimes the population actually sampled is not the desired population.

    Consider an example that limits generalizing sample results. Suppose a researcher conducts a market research survey. Define the population of interest as all city residents. Then draw a random sample of people listed in the phone book and collect data by calling those people from 9am to 5pm. The results of this analysis can only be properly generalized to people:

    • with a phone
    • with a listed phone number in the phone book
    • who answer all phone calls
    • who are available during daytime hours

    The sampling frame is not the population of interest in this example, all city residents, so these survey results cannot be properly generalized to all city residents. Inappropriate generalization of the results from a sampling frame to the wrong population, usually the intended population, is one of the most egregious errors of data analysis. Unfortunately, this error is not too uncommon, nor can the usual statistical analysis of the data correct this problem. As the saying goes: Garbage in, garbage out.

    1.2.2 Sampling Fluctuations

    Video: Sampling Fluctuations [6:16]

    The sample of data values is only the starting point of statistical analysis. Statistics such as the sample mean, \(m\), provide a summary of this distribution of data values. The specific values in any one sample differ from the values in another sample. Accordingly, the value of a sample statistic of interest, such as \(m\), arbitrarily varies from sample to sample. Each sample outcome, such as \(m\), is an arbitrary result, which only partially reflects the true, underlying population value. Typically, only one sample is taken and so only one \(m\) is observed.

    The following reality is the basic motivating concern addressed by statistical inference.

    Multiple samples

    If many samples were taken, then a different value of the sample mean, \(m\), would be observed for each sample.

    Consider the simple scenario of flipping a fair coin. The randomness is apparent, but is there some stable aspect of reality that underlies this variation? The answer is yes. Sometimes we obtain fewer Heads and sometimes more, but the random variability of the number of Heads follows a specific pattern: the proportion of Heads tends to be around 50%. In general you will get around 5 Heads on 10 flips of a fair coin.

    To illustrate, flip a fair coin 10 times, a sample size of \(n=10\). Encode the outcome of each flip as a value of the variable Y. Score each Head as \(Y=1\) and each Tail as \(Y=0\).

    Sample 1: Now get the first sample. Do the 10 flips. Consider the result in Figure 1.2.

    Figure 1.2: Ten flips of the coin, get 5 heads.

    Calculate a sample mean, \(m\), for a given sample with the standard formula: \[m = \dfrac{\sum Y_i}{n}\]

    Result: \(m_1 = \dfrac{0+0+1+1+0+1+1+0+0+1}{10} = .5\)

    Sample 2: Now gather another sample of size \(n=10\). Flip the same fair coin another 10 times. Again, score Y=1 for a Head and Y=0 for a Tail. Again, calculate \(m\) for \(n=10\), with the result in Figure 1.3.

    Figure 1.3: Ten flips of the coin, get 6 heads.

    Result: \(m_2 = \dfrac{1+1+1+0+0+1+1+0+1+0}{10} = .6\)

    We see that one sample of ten flips yields \(m_1=.5\). However, a second sample yields a different result, \(m_2=.6\).
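    A minimal base R sketch reproduces this two-sample illustration, though with randomly generated flips, so the obtained means will generally differ from those above.

    # two samples of 10 flips of a fair coin, scored Head=1, Tail=0
    flips1 <- sample(c(0,1), size=10, replace=TRUE)
    flips2 <- sample(c(0,1), size=10, replace=TRUE)
    # the two sample means generally differ from each other
    # and from the population value of 0.5
    mean(flips1)
    mean(flips2)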

    As managers, we are not concerned with coin flips per se, but we all gamble in the presence of uncertainty. Random variation about an underlying structure is always present in the data we analyze to inform managerial decisions. The application of statistics to data analytics cannot provide certainty, but it can provide a tool to more effectively address uncertainty. We wish to construct a potentially useful form of knowledge, a model that delineates the underlying pattern from the observed random variation.

    For example, move away from coin flips to a real-world example. Consider the mean length of time required to complete a procedure, here the mean time for each of 10 people to complete an assembly. Sample 1: \(m_1\) = 14.53 minutes. Then, with the same tools, the same source of materials, and the same employees, repeat the measurement on a second sample. Sample 2: \(m_2\) = 13.68 minutes.

    Sampling variability: The value of a statistic, such as the sample mean, \(m\), randomly varies from sample to sample.

    An essential concept of data analysis

    The variation of results across samples of coin flips applies to data analysis in general. All of our measurements are obtained from purported random samples with results that vary across samples.

    Why? Sampling variation of a statistic results from random variation of the data values from random sample to random sample. A descriptive statistic describes only a sample; its generality for describing the corresponding population value is limited. The focus of statistical analysis of data is the underlying, stable population values, not statistics calculated from an arbitrary sample.

    1.2.3 Law of Large Numbers

    Video: Law of Large Numbers [16:52]

    The population mean, \(\mu\), is an abstraction. The distinction between sample and population values is apparent from the way in which information about these values is obtained. Calculate sample statistics directly from the sample data. If the calculation is done correctly, the sample value is known exactly. In contrast, estimate the underlying reality, the population values.

    For example, to understand reality, management wants to know the value of the population mean, \(\mu\). The difficulty is that \(\mu\) is an abstraction, a hypothetical, never directly calculated, and so must be estimated without knowing its precise value. To estimate the unknown requires information from which to provide the basis for the estimate.

    Only one kind of information is considered in traditional statistical analysis, which employs what is called the classical or frequentist model of statistics2, a model that defines probability in terms of long-run averages. The classical model specifies that information regarding a population value, such as \(\mu\), be obtained only from data, preferably lots of data.

  • 2 The primary competing model is the Bayesian model, in which other types of information are also considered in the estimation of a population value.

    More of the same is better

    More information, more data randomly sampled from the same process, generally provides better estimation of underlying population values.

    The data used to estimate a population value must, reasonably enough, all be generated from the same process. Further, the sampling process must be random.

    Consider a coin flip in which the truth is that the coin is fair. How do we express the underlying reality, the true structure? The coin is fair, so unlike the typical data analysis, directly compute the reality, the value of \(\mu\), from the scoring protocol of Y=1 for a Head and Y=0 for a Tail. The probability of a Head (1) on a given toss is \(\frac{1}{2}\), as is the probability of a Tail (0). \[\mu=\frac{1}{2}(1) + \frac{1}{2}(0)=0.5\]

    Can data analysis reveal this reality of a fair coin? How can the analysis of sample data reveal a population value?

    The underlying reality results from the long-term outcome of 50% Heads. Random variability, however, as it always does, obscures knowledge of the underlying reality. To maximize the chances of winning the most favorable outcome on a single coin flip, a gambler needs to understand the underlying reality of what happens over the long term of many coin flips. Any one specific sample result does not, by itself, reveal the underlying reality. Flipping a coin 10 times may result in 7 Heads (the outcome for 11.7% of samples of 10 flips), a result far from the true reality of a fair coin but not so rare either.

    Betting on a 50% chance of getting a Head with a fair coin generally provides better outcomes than would betting with a false representation of reality. For example, suppose one sample result of 10 flips resulted in 7 Heads, which leads the analyst to conclude that there is a 70% chance of getting a Head on any one flip. A gambler betting against a fair coin with such false knowledge will lose far more often than winning.

    Here, we explore the relation between the underlying pattern, here \(\mu=0.5\), and the size of the sample used to estimate it. To illustrate the usefulness of larger samples, successively plot a mean over many sample sizes. Generate the data by flipping a coin a specified number of times, recording a Head or a Tail after each flip. Record the data values for a variable Y: score each Head as Y=1 and each Tail as Y=0. Over the entire population, one-half of all flips result in a 1 and one-half of all flips result in a 0.

    Running Mean: From a sequence of data values, re-calculate the sample mean, \(m\), after each new data value is collected.

    The resulting plot of the running mean shows the value of the sample mean, \(m\), as the sample size increases.

    Here, we simulate repeated coin flips. Anyone with a coin can conduct the analysis and then plot the running mean over as many flips as desired. Of course, such a task is exceedingly tedious. Fortunately the computer provides the data as if we have many coin flips.
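    For intuition regarding what the computer computes, the following base R sketch generates the simulated flips and the running mean directly; the lessR function introduced next automates the simulation and the plot.

    # simulate 250 flips of a fair coin: Head=1, Tail=0
    flips <- sample(c(0,1), size=250, replace=TRUE)
    # running mean: after each flip, the mean of all flips so far
    running_mean <- cumsum(flips) / seq_along(flips)
    # plot the running mean with the population value as a reference line
    plot(running_mean, type="l", xlab="Flip", ylab="Running mean of Y")
    abline(h=0.5, lty="dashed")  # population value, mu = 0.5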

    Question of interest

    How close does the mean computed from sample data, \(m\), tend to be to the true value of \(\mu=0.5\), as the sample size increases?

    Computer simulation allows us to easily explore what happens for different sample sizes. Here, the simulation is of different numbers of coin flips, from 1 to the specified maximum number of flips. A different result follows each time the simulation is run.

    The lessR function simFlips() provides the simulation, plotting the running mean as the sample size increases as if repeatedly flipping the same coin.

    • Required: n, the maximum number of flips
    • Default: prob=.5, probability of a Head, coin is fair
    • Optional: enter ?simFlips for all options, e.g., pause=TRUE for each flip

    Mirroring the reality of actually flipping a real coin, every time a simulation is performed, even for the same sample size, a different result is generally obtained.

    Begin with a simulated sequence of only 10 flips, with the result in Figure 1.4.

    simFlips(n=10)

    Figure 1.4: Running mean over a sequence of Heads (1) and Tails (0) for 10 coin flips.

    The value of the sample mean of 10 coin flips estimates the true value of the entire population, \(\mu\), which equals 0.5. The first (simulated) coin flip was a Tail, and the second flip was a Head. Together, these 10 flips resulted in 4 Heads.

    \[{m} = \dfrac{\sum Y_i}{n} = \dfrac{\textrm{Number of Heads}}{\textrm{Number of Flips}}= \dfrac{4}{10}=0.4\]

    Now consider 250 flips (and a certain appreciation that the computer will do the work of all those “flips” of the coin for us). The simulation result appears in Figure 1.5.

    simFlips(n=250)

    Figure 1.5: Running mean over a sequence of Heads (1) and Tails (0) for 250 coin flips.

    The value of \(m\) for each sequence of coin flips is relatively unstable for the first 50 or so coin flips, and then after 200 or so flips, the estimate of \(\mu=0.5\) nicely settles down. After 250 coin flips:

    \[{m} = \dfrac{\sum Y_i}{n} = \dfrac{\textrm{Number of Heads}}{\textrm{Number of Flips}} = \dfrac{117}{250}=0.468\]

    The estimate of \(\mu\), the sample mean, \(m\), is closer to \(\mu=0.5\) at the end of 250 flips than at the end of a sample of 10 flips.

    Now consider a much larger sequence of simulated coin flips, \(n=10,000\). Find the simulation result in Figure 1.6.

    simFlips(n=10000)

    Figure 1.6: Running mean over sequence of Heads (1) and Tails (0) for 10,000 coin flips.

    Again, there is much fluctuation in the value of \(m\) for small sample sizes, but much stability in the estimate as the sample size continues to grow. After 10,000 coin flips:

    \[{m} = \dfrac{\sum Y_i}{n} = \dfrac{\textrm{Number of Heads}}{\textrm{Number of Flips}}= \dfrac{5112}{10,000}=0.511\]

    Even after 10,000 coin flips the true value of \(\mu\) is not attained, but a closer estimate resulted.

    How close is the estimate to the actual population value? Not a big surprise here: We conclude that the more information, the better the estimate.

    From the sample to the population

    Generalize information obtained from a sample, such as the sample mean, to the population as a whole.

    Sample the data values from the same population so that a common \(\mu\) and \(\sigma\) underlie each data value \(Y_i\).

    Law of Large Numbers: The larger the random sample of data values all from the same population, the closer the estimate tends to be to the underlying population value.

    This result of large samples is only a tendency. There is no guarantee that a statistic from any one larger sample yields a closer estimate to the population value, but generally, more data provides better estimation.
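    A brief simulation sketch in base R illustrates this tendency, estimating \(\mu=0.5\) at the three sample sizes examined above.

    # estimate mu = 0.5 from random samples of increasing size
    for (n in c(10, 250, 10000)) {
      m <- mean(sample(c(0,1), size=n, replace=TRUE))
      cat("n =", n, "  m =", m, "\n")
    }
    # the estimates tend to, but need not, move closer to 0.5 as n grows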

    1.3 Patterns Over Time

    A fundamental property of process output is that the outcome of any process varies over time. The data values of a variable over time are the output of an on-going process. There are two basic ways to plot the data values of a variable to reveal the pattern of their variability over time: the run chart and the time series plot. Each of these techniques is discussed in the following material.

    1.3.1 Run Chart

    Video: Application and Definition [7:02]

    1.3.1.1 Processes

    The process is the core unit for organizing business activities. To evaluate process performance, consider variables that generate values over time such as:

    • time to complete the procedure
    • daily inventory for physical parts
    • ongoing satisfaction measured with customer ratings

    Business Process: Structured set of procedures that generate output over time to accomplish a specific business goal.

    A functioning business is essentially a set of interrelated business processes that ultimately lead to delivery and servicing of the product. Managing a business is managing its processes, so evaluating the on-going performance of the constituent business processes is a central task for managers.

    Consider some examples of outcome variables for business processes.

    • Supply Chain: Ship Time of raw materials following the submission of each purchase order
    • Manufacturing: Length of a critical dimension of each machined part
    • Production: Amount of cereal by weight in each cereal box
    • Order Fulfillment: Pick time, elapsed time from order placement until the order is boxed and ready for shipment
    • Accounting: Time required to forward a completed invoice from the time the order is placed
    • Sales: Satisfaction rating of customers after purchasing a new product
    • Health Care: Elapsed time from an abnormal mammogram until the biopsy

    1.3.1.2 Definition and Example

    Consider the time dimension of an ongoing process. Effective management decisions, such as when to change the process, when to leave it alone, and how to evaluate the effectiveness of a deliberate change, require knowledge of how the system behaves over time. Evaluation of a process first requires accurate measurements of one or more relevant outcome variables over time.

    Run Chart: Plot of the values of a variable identified by their order of occurrence, with line segments connecting individual points.

    Index: The ordinal position of each value in the overall sequence, usually numbered from 1 to the last data value.

    Plot the values on the vertical axis. On the horizontal axis display the Index.

    A run chart may also be called a sequence chart, an example of what is generally called a line chart.

    As an example of a run chart, consider pick time, the elapsed time from order placement until the order is packaged and ready for shipment. Pick time is central to the more general order fulfillment process, and requires management oversight to minimize times and to detect any bottlenecks should they occur. The variable is Hours, the average number of business hours required for pick time, assessed weekly, in file pick.csv.

    First, get the data.

    d <- Read("https://web.pdx.edu/~gerbing/data/pick.csv")
    #d <- Read("~/Documents/BookNew/Ch02/data/pick.csv")  # local copy
    
    >>> Suggestions
    Recommended binary format for data files: feather
      Create with Write(d, "your_file", format="feather")
    To read a csv or Excel file of variable labels:  var_labels=TRUE
      Each row of the file:  Variable Name, Variable Label
    Read into a data frame named l  (the letter el)
    
    More details about your data, Enter:  details()  for d, or  details(name)
    
    Data Types
    ------------------------------------------------------------
    double: Numeric data values with decimal digits
    ------------------------------------------------------------
    
        Variable                  Missing  Unique 
            Name     Type  Values  Values  Values   First and last values
    ------------------------------------------------------------------------------------------
     1     Hours    double     24       0      17   3.3  1.9  4.2 ... 6.4  4.1  6.7
    ------------------------------------------------------------------------------------------

    Obtain the run chart and associated statistics with the lessR function Plot() to display the values in sequence. To inform the function that the values should be plotted sequentially, in the order in which they occur in the data table, set the run parameter to TRUE. Plot() automatically connects adjacent points with a line segment. The run chart appears in Figure 1.7.

    Plot(Hours, run=TRUE)
    >>> Suggestions
    Plot(Hours, run=TRUE, size=0)  # just line segments, no points
    Plot(Hours, run=TRUE, lwd=0)  # just points, no line segments
    Plot(Hours, run=TRUE, fill="on")  # default color fill 
    
         n   miss     mean       sd      min      mdn      max 
         24      0     3.53     1.54     1.00     3.70     6.70 
    
    ------------
    Run Analysis
    ------------
    
    Total number of runs: 12 
    Total number of values that do not equal the median: 24
    Total number of values ignored that equal the median: 0 

    Figure 1.7: Weekly average pick time.

    The center line, the median, is automatically added if the values of the variable tend to oscillate about the center.

    What does the manager seek to understand from a run chart?

    Random influences contribute to every process outcome, obscuring the underlying process characteristics.

    A primary task of process management is to assess process performance in the context of this random variation, to know

    • The average level of performance of the process
    • The amount of random variation about the average level inherent in the process

    The next task is to actively manage process performance. We see from Figure 1.7 that there is a concerning trend for pick time to deteriorate toward the end of data collection. The last three data values are above the median and the maximum value of 6.70 is obtained as the last data value. Are these larger pick time values random variation or do they signal a true deterioration in the process? More data would answer that question, data carefully scrutinized by management. Adjust the average level of performance up or down to the target level, as needed. Continue to minimize the random variation about the desired average level of performance.

    1.3.2 Process Stability

    Video: Process Stability [3:45]

    Video: Process Stability with Error [4:20]

    Processes always exhibit variation but that variation can result from underlying stable population values of the mean and standard deviation.

    Stable process or system in control: Data values generated by the process result from the same overall level of random variation about the same mean.

    The run chart of a stable process displays random variation about the mean at a constant level of variability.

    A stable process or constant-cause system produces variable output but with an underlying stability in the presence of random variation. The output changes but the process itself is constant. W. Edwards Deming popularized the stable process as a prerequisite to establishing quality control of process output. He describes a stable process as follows.

    W. Edwards Deming, “Some Principles of the Shewhart Methods of Quality Control,” Mechanical Engineering, 66, 1944, 173-177.

    There is no such thing as constancy in real life. There is, however, such a thing as a constant-cause system. The results produced by a constant-cause system vary, and in fact may vary over a wide band or a narrow band. They vary, but they exhibit an important feature called stability. … [T]he same percentage of varying results continues to fall between any given pair of limits hour and hour, day after day, so long as the constant-cause system continues to operate. It is the distribution of results that is constant or stable. When a … process behaves like a constant-cause system … it is said to be in statistical control.

    The model on which a forecast is based depends on the type of pattern the data exhibit. For a stable process, the forecast of the next value is the mean of the process because the process has a constant mean and a constant level of variability about that mean.

    Forecast of a stable process

    Stable process model: \(Y_{t+1} = m_Y + e_t\)
    Forecast from a stable process: \(\hat Y_{t+1} = m_Y\)

    As always, there is the ever-present random error. The forecasted value is generally not equal to the value that occurs as the process continues to operate.
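    The following base R sketch, with hypothetical values of the process mean and standard deviation, simulates a stable process and forecasts the next value with the sample mean.

    # stable process: constant mean and constant variability
    # hypothetical values: mean 10, standard deviation 2
    Y <- rnorm(100, mean=10, sd=2)
    m_Y <- mean(Y)    # estimate of the stable level
    m_Y               # forecast of the next value, Y at time t+1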

    To illustrate this random, sampling error, consider the following four stable processes shown in Figure 1.8, Figure 1.9, Figure 1.10, and Figure 1.11. To better compare the processes, their visualizations all share the same y-axis scale, from \(-3\) to \(3\). In the following figure captions, \(m\) is the sample mean and \(s\) is the sample standard deviation. For illustrative purposes, the run chart of each of the four stable processes is drawn with the median as the center line.

    Figure 1.8: Stable process with almost no random error: \(m=-0.0001\), \(s=0.0010\).

    Figure 1.9: Stable process with a small amount of random error: \(m=0.0079\), \(s=0.1146\).

    Figure 1.10: Stable process with an intermediate amount of random error: \(m=0.0957\), \(s=0.6788\).

    Figure 1.11: Stable process with much random error: \(m=0.2576\), \(s=1.3364\).

    What differs across these four stable processes is their variability. All four processes are stable, but the variability of their output differs. The sample standard deviation of these four stable processes varies from 0.0010 to 1.3364. The definition of a stable process is not a small variability of output but rather a constant level of variability about the same mean.

    To create a forecast you first need to understand the underlying process. You must first identify the pattern you will project into the future. If you view data with a large amount of random error, such as appears in Figure 1.11, you need to recognize the process as stable. Specifically, you need to recognize that the fluctuations are not the regular fluctuations of seasonality but instead are irregular with no apparent pattern. With no seasonality and a constant mean, the forecast for the next value remains the mean of the previous values.3

  • 3 Assumptions of a stable process are better evaluated from an enhanced version of a run chart called a control chart, not, however, discussed here. An analytic technique for evaluating the presence of seasonality will be presented later.

    1.3.3 Non-Stable Processes

    Video: Non-Stable Processes [11:35]

    Some other patterns found in data for a variable collected over time are described next.

    1.3.3.1 Outlier

    Of particular interest in the analysis of a process is an outlier.

    Outlier: A data value considerably different from all or most of the other data values in the sample.

    The presence of an outlier indicates that a special cause, a temporary event, likely resulted in a deviant data value. Figure 1.12 contains an outlier.

    Figure 1.12: Otherwise stable process with an Outlier.

    Forecast from a process with an outlier: Understand why the outlier occurred and ensure the conditions that generated the outlier do not occur for the forecasted values.

    The concern when observing an outlier is always to understand how and why the outlier occurred. This understanding is important because it often leads to a change in the procedure, presumably for the better. The most trivial cause is a simple data entry error, not the type of data on which to base management decisions.

    1.3.3.2 Level Shift

    Another pattern is a process in which some event occurs that shifts the level of the process up or down, essentially transforming one process into another. Figure 1.13 illustrates the underlying structure, not the data: a stable process without error that then shifts.

    Figure 1.13: Structure of two stable processes, the second process a level shift of the first process.

    Figure 1.14 illustrates this process as observed in reality. Random error partially obscures the underlying stability, followed by the upward shift of the process mean to define a new, stable process.

    Figure 1.14: Data recorded from two stable processes, the second process a level shift of the first process.
    Forecast from the pattern after a level shift

    Ensure there is enough data from which to discern the underlying structure after the shift.

    Once a process has changed, such as a level shift, the data values that occurred before the change are no longer relevant for discerning the current underlying signal.

    1.3.3.3 Trend

    Another type of non-stability is trend.

    Trend: Long-term direction of movement of the data values over time, a general tendency to either increase or decrease.

    The example in Figure 1.15 is a positive, linear trend with a considerable amount of random error obscuring the underlying signal.

    Figure 1.15: Data recorded from a positive linear trend.

    Without random error, a linear trend plots as a straight line, with either a positive or a negative slope.

    Forecast trend

    Extend the trend “line” into the future.

    The trend line can be an actual straight line, linear, or curvilinear, such as exponential or logarithmic growth or exponential decay.

    1.3.3.4 Seasonality

    Another common, non-stable pattern in data over time is seasonality.

    Seasonality: Pattern of fluctuations that repeats at regular intervals, such as the four quarters of a year.

    This first plot, in Figure 1.16, is of the underlying structure, the signal without the contamination of random error.

    Figure 1.16: Seasonal process with perfect structure, that is, without random error characteristic of data.

    In the plot in Figure 1.17, some random error is added.

    Figure 1.17: Seasonal process with some random error added to the underlying structure.

    The process illustrated in Figure 1.18 exhibits much random error that tends to obscure the underlying signal.

    Figure 1.18: Seasonal process with much random error.

    The forecast for data that reflects seasonality with no pattern of increase or decrease depends only on the season.

    Forecast seasonality

    Estimate and apply the seasonal effect for the season at which the forecasted data value occurs.

    As always, the more random error, the more difficult it is to estimate the seasonal effects. How these seasonal effects are estimated is discussed later.

    1.3.3.5 Trend with Geometric Seasonality

    Consider a successful swimwear company that generally experiences more robust sales each year but suffers from relatively lower fall and especially winter sales. The highest quarter for sales is spring (probably peaking in late spring) as customers prepare for summer. Sales are up in general but comparatively less so for fall and winter.

    Geometric seasonality: The seasonal swings up and down increase in size forward across time.

    Following is an example of the underlying structure of quarterly geometric seasonality with an overall positive, linear trend.

    Because there is no random error for the sales variable YStable, the plot in Figure 1.19 is of the structure, not actual data, which is always confounded with random sampling error. The optional parameter scale_x specifies a custom numbering of the x-axis to better highlight the seasonality over the default numbering, starting each season, Quarter 1 or Winter, at a vertical line on the plot. Completely optional, but this custom numbering provides a visual guide to help delineate the seasons. The three values of the scale_x vector are the starting value, -1, the last value, 40, and the interval between values listed on the axis, 4.
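    The exact construction of the plotted structure is not shown here, but the following hypothetical sketch builds a comparable structure, a linear trend multiplied by repeating quarterly factors so that the seasonal swings grow over time, then plots the sequence with the scale_x values just described.

    # hypothetical structure: linear trend times quarterly seasonal factors
    # factors ordered so that Winter, the low quarter, lands at
    # Times 3, 7, ..., 31, the vertical lines implied by scale_x
    Time <- 1:40
    seasonal <- rep(c(1.2, 0.8, 0.5, 1.6), times=10)
    YStable <- (10 + 2*Time) * seasonal
    Plot(YStable, run=TRUE, scale_x=c(-1, 40, 4))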

    For example, consider Time 31, a Winter quarter. Sales are low, then increase much at Time 32 for Spring, remain high but diminish slightly for Time 33, Summer, and then diminish again for Time 34, the fourth quarter, Fall.

    Figure 1.19: Perfect geometric four seasons with trend.

    Given data, that is structure plus sampling error, the underlying, stable pattern is obscured to some extent but remains apparent.

    Figure 1.20: Four seasons with trend with random error.

    With even more random error shown in Figure 1.21, it is easy to miss the seasonal sales signal and falsely conclude that the up and down movements of the data over time are due only to random error, though the trend remains obvious.

    Figure 1.21: Seasonal process with much random error.

    To help delineate the stable pattern, the signal, from the noise, Figure 1.22 highlights the groups of four seasonal data values from Figure 1.21 and explicitly numbers the seasons for three of the groups.

    Figure 1.22: Annotated four seasons with trend with much random error.
    Forecast trend with seasonality

    Project the trend “line” into the future and then add the seasonal effect for each corresponding forecasted value.

    Discerning signal from noise and then forecasting from the signal is the goal of these types of forecasts, not always easy but always at least easier with more data. Fortunately, analytic methods exist for disentangling the trend and seasonal components from each other and from the random error. This separation of components is available in R and shown later.

    Here, the purpose is to encounter some of the various patterns in time series data and to understand better how the random error always present in data obscures the underlying pattern. The goal is not only to rely upon analytic software such as R to understand time series data but also to develop some visual intuition from examining these visualizations. The more structure the analyst can discern from a visualization, the more effective the use and understanding of the results provided by the analytic software.

    1.3.3.6 Random Walk

    [No video]

    One of the simplest processes for generating data over time is a random walk. The next value of the process is the current value plus random error. There is no underlying trend or stability. What happens next does not depend on the past, just random variation added to the present. The overall path (trajectory) is unpredictable.

    Random walk: Process over time in which the next value is the present value plus a random influence.

    Model of a random walk

    \(Y_{t+1} = Y_t + e_t\)

    For random walk data, the next value is generated from random error added to the current value of the series, \(Y_t\).

    Forecast a random walk

    The forecasted next value of a random walk is the current value.

    The only term left after removing the random error \(e\) is \(Y_t\), which is the forecast, \(\hat Y_{t+1} = Y_t\). The population mean of \(e\) is zero, so the next value of \(Y\) is just as likely to increase as decrease. This forecast is the simplest possible: the current value of \(Y\) is the forecasted value.
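    A short base R sketch simulates a random walk and forecasts the next value as the current value.

    # random walk: each value is the previous value plus random error
    e <- rnorm(100, mean=0, sd=1)   # errors with population mean zero
    Y <- cumsum(e)                  # cumulative sum: Y[t+1] = Y[t] + e[t]
    Y[length(Y)]                    # forecast: the current (last) value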

    The random walk model is not only simple, but descriptive of many real-world processes. Many financial forecasters maintain that the behavior over time of general stock market indices, such as the S&P 500, and of many individual stocks as well, is best characterized by a random walk.

    Interpreting random walk data is, however, not always straightforward. Data generated by a random walk model with an average error of zero display no trends over time. The time series “meanders”, moving up and down, sometimes misleadingly appearing to follow a pattern over a small number of trials. Some of these misleading patterns are demonstrated in the following example.

    Figure 1.23: Although only meandering, the random walk time series on the left appears to be increasing, and the random walk time series on the right appears to be decreasing.

    A time series accurately described by a random walk meanders and drifts, but not in a predictable direction. The search for an underlying stable, long-term pattern in random walk data is futile.

    How can a set of time series data be identified as a random walk? The key to identifying a random walk follows from the definition of the underlying model, \(Y_{t+1} = Y_t + e_t\). Subtracting values at two successive time periods yields \(Y_{t+1} - Y_t = e_t\). For a random walk time series, the only distinction between successive observations is random error.

    A test for random walk data is the systematic analysis of these differences for all successive values of the original data Y. Computing these differences for all values yields a new time series.

    Differenced time series: \(\Delta Y = Y_{t+1} - Y_t\)

    The \(\Delta\) refers to the difference of successive observations from the original time series Y.

    Analysis of the differenced time series, \(\Delta Y\), indicates whether or not the original time series Y is a random walk. If the differences are nothing more than random error, exhibiting no pattern or structure except variation about the mean, then the original time series Y is a random walk.

    Random walk process

    The differenced time series for a random walk is a stable process.

    If the model correctly identifies the underlying structure, and if this structure is removed from the data, then all that is left, \(Y_{t+1} - Y_t\), is random error.

    When a random walk is identified, no more search need be conducted for uncovering underlying structure such as trend or seasonality.
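    To sketch the test with a simulated random walk such as the one above: compute the differenced series and view it as a run chart. For a true random walk, the differences display only random variation about a constant mean.

    # difference the series: each value of dY is Y[t+1] - Y[t]
    dY <- diff(Y)
    # for a random walk, dY is a stable process: no trend, no seasonality
    Plot(dY, run=TRUE)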

    1.4 Time Series

    Video: Time Series and Data [5:10]

    A time series orders the values of a variable by time just as does a run chart. The time series, however, also provides the corresponding dates or times.

    Time Series chart: Plotted sequence of values against the corresponding dates and/or times at which the values were recorded, usually at regular intervals.

    The time series is one of the basic concepts for forecasting. Examples include ship times and inventory levels, any process that generates values over time.

    Discover the underlying structure, disentangled from the random variation.

    Forecast: Extrapolate the past structure into the future.

    Analytic forecasting methods are important but the most important analysis is to visualize the time series to directly view any pattern present over time. If you believe the underlying processes will remain unchanged, an intuitive forecast informally extends the pattern over time simply by drawing an extension of the pattern. If you believe the underlying process will change, such as a rise in interest rates or some type of political or international issue, then adjust the forecast accordingly.

    1.4.1 Data

    The data for the following examples are the stock prices of Apple, IBM, and Intel from 1985 through mid-2022, obtained from finance.yahoo.com. The data were obtained from three separate downloads and then concatenated into a single data file. Read the stock price data, adjusted for stock splits, from the data file included within lessR called StockPrice. Make sure to enclose the file reference in quotes within the call to the lessR function Read().

    d <- Read("StockPrice")

    The output of Read() provides much useful information regarding the data file read into R. Always review this information to make sure that the data was read and formatted correctly.

    We see that there are three variables in the file, date, Company, and Price, each formatted differently. To generate a time series, the date/time variable must be formatted accordingly, here, of type Date. Because the file is an Excel file, this formatting has already been done.

    We also see that there are 1350 rows of data in the file, with 450 unique dates. The dates are repeated for each of the three companies, and the dates are monthly. There is no missing data.

    Following are some sample rows of data. The first column of numbers contains the row numbers.

    • The first four rows of data, which are the first four rows of Apple data.
    • The first four rows of IBM data.
    • The first four rows of Intel data.
            date Company    Price
    1 1985-01-01   Apple 0.101050
    2 1985-02-01   Apple 0.086241
    3 1985-03-01   Apple 0.077094
    4 1985-04-01   Apple 0.074045
              date Company    Price
    451 1985-01-01     IBM 12.71885
    452 1985-02-01     IBM 12.49734
    453 1985-03-01     IBM 11.94072
    454 1985-04-01     IBM 11.89371
              date Company    Price
    901 1985-01-01   Intel 0.379217
    902 1985-02-01   Intel 0.345303
    903 1985-03-01   Intel 0.342220
    904 1985-04-01   Intel 0.339137

    Now that we understand the data read into R for analysis, we can create different visualizations.

    1.4.2 Plot a Single Time Series

    Video: Plot a Single Time Series [8:23]

    Set up the time series visualization by plotting share price vs. date. Plot with the lessR Plot() function of the form Plot(x,y). The x-variable is the date, here named date, and the y-variable is Price.

    If we wish to visualize the data for only one company, then we need to select just the rows of data for that company. Select specified rows from the data table for analysis according to a logical condition.

    • The R double equal sign, ==, means is equal to.
    • The == does not set equality, it evaluates equality, resulting in a value that is either TRUE or FALSE.
    • The expression (Company=="Apple") evaluates to TRUE only for those rows of data for which the data value for the variable Company equals "Apple".

    Specify the logical condition for selecting rows of data with the rows parameter.

    Plot(date, Price, rows=(Company=="Apple"))

    Time series of Apple stock price.

    Fill the area under the curve with the area_fill parameter. Set the value to on to obtain the fill color for the given color theme.

    Plot(date, Price, rows=(Company=="Apple"), area_fill="on")

    Time series of Apple stock price with the default fill color.

    Or, specify a specific color.

    Plot(date, Price, rows=(Company=="Apple"), area_fill="mistyrose")

    Time series of Apple stock price with a custom fill color.

    View all the many R color names from the lessR function showColors(), which creates a pdf file of all the color names, showing a color patch for each name.

    The color parameter sets the color of a line, edge, or border, how the object is viewed from its exterior. The lwd parameter sets the line width of the line, here made thicker from its default value of 1.

    Plot(date, Price, rows=(Company=="Apple"),
         color="red4", lwd=3, area_fill="seashell")

    Time series of Apple stock price with a custom border color.

    Further customize the visualization by setting a black background and a transparent fill color. Customize this formatting of the visualization with the lessR function style(). Here, set the background to black by setting the sub_theme parameter. Styles set with style() are persistent, that is, they remain set across the remaining visualizations until explicitly changed.

    style(sub_theme="black")

    Set the transparency level with the parameter transparency, which can be shortened to trans. Either way works. The value is a proportion from 0, no transparency, to 1, complete transparency, that is, invisible.

    Plot(date, Price, rows=(Company=="Apple"),
         color="steelblue2", area_fill="steelblue3", trans=.55)

    Time series of Apple stock price with a transparent fill color against a black background.

    1.4.3 Plot Multiple Time Series

    1.4.3.1 Plot on the Same Panel

    Video: Plot on the Same Panel [6:29]

    The variable Company is a categorical variable with three values: Apple, IBM, and Intel. Plot() will plot a different visualization for each value of a specified categorical variable according to the by parameter. The by parameter instructs Plot() to visualize multiple plots on the same panel.

    Plot(date, Price, by=Company)

    Time series of stock price for Apple, IBM, and Intel plotted on the same panel.

    Set the parameter stack to TRUE to stack the plots on top of each other. When stacked, the Price variable on the y-axis is the sum of the corresponding Price values for each Company. The y-value for Apple at each date is its actual value because it is listed first (alphabetically by default). The y-value for IBM is the corresponding value for Apple plus IBM’s value. And, for Intel, listed last, each point on the y-axis is the sum of all three prices.

    Plot(date, Price, by=Company, stack=TRUE, trans=0.4)

    Stacked time series of stock price for Apple, IBM, and Intel plotted on the same panel.

    The default color assigned to each of the companies is not necessarily retained across different visualizations.

    Many different color palettes are available for visualizations. It is also possible to define your own palettes, but here we stay with pre-defined versions.

    Sequential color palette: A sequence of colors of the same hue but different shades.

    One way to plot with a sequential palette is to use one of the lessR pre-defined palettes. The names are plural color names, ordered in 30 degree increments around the 360 degree color wheel, plus grays. The colors are reds, rusts, browns, olives, greens, emeralds, turquoises, aquas, blues, purples, violets, magentas, and grays. Make sure to enclose the name in quotes in the Plot() function call. Here, access the reds value for the fill parameter.

    Plot(date, Price, by=Company, stack=TRUE, trans=0.4, fill="reds")

    Figure 1.24: Stacked time series of stock price for Apple, IBM, and Intel plotted on the same panel.

    1.4.3.2 Plot on Different Panels

    Video: Plot on Different Panels [2:55]

    The by1 parameter indicates to plot each time series on a separate panel. Remember that the background has earlier been set to black with the style() function. To reset to the default color theme, just enter the function call with nothing between the parentheses, or specify the default theme of "colors".

    Plot(date, Price, by1=Company)

    Figure 1.25: Time series of stock price for Apple, IBM, and Intel, each plotted on a different panel.

    Spruce up the plot with the transparent blue fill against the black background, shown in Figure 1.26, as done with a previous visualization.

    style(sub_theme="black")
    Plot(date, Price, by1=Company,
         color="steelblue2", area_fill="steelblue3", trans=.55)

    Figure 1.26: Time series of stock price for Apple, IBM, and Intel, each plotted on a different panel with customized colors.

    Try the same with orange colors, shown in Figure 1.27.

    Plot(date, Price, by1=Company,
         color="orange", area_fill="orange3", trans=.55)

    Figure 1.27: Time series of stock price for Apple, IBM, and Intel, each plotted on a different panel with customized colors.

    Using lessR, it is easy to make attractive, presentation-ready data visualizations, including time series visualizations.

    1.4.4 Data Preparation for Dates

    Any data analysis system, such as Excel or R, stores data values differently depending on what the data values represent. Common storage types are integers, numbers with decimal digits (double precision), character strings, and dates. Each variable, represented by a column of data values in the data table, can contain only one type of data value.

    The material in this section is based closely on an excerpt from R Visualization: Derive Meaning from Data, David Gerbing, CRC Press, May, 2020, pp 170-172.

    Variable type: The type of the data values for the variable, such as integer, double precision, character, or date.

    To plot a time series with Plot() we need a variable formatted as type Date. To read dates into R, consider two possibilities.

    • Read from an Excel file that has stored the dates as the Excel Date type.
    • Read the dates from character strings, stored in a plain text file.

    Previous examples read the date variable directly from an Excel file, so the conversion to an R Date type occurred automatically. When reading data from a text file, the analyst must explicitly do the conversion from text to a date.
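To check how R stored a variable, apply the base R function class() to the variable, here assuming the data frame d with the variable date, as in the examples of this document:

class(d$date)   # "Date" if converted to a date, "character" if still text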

    1.4.4.1 Read Excel Date Type Directly as Dates

With Excel data files, getting dates into R as a variable of type Date is easy. Excel has its own Date type, which Excel invokes automatically when it recognizes a character string as a date, whether entered from the keyboard or read from a text data file. As shown in the screenshot of an Excel worksheet in Figure 1.28, Excel automatically converts a character string it recognizes as a date to its Date type.

    Figure 1.28: Excel date field.

When an Excel file stores dates formatted as the Excel Date type, R properly interprets the dates as dates when the file is read. That is, when reading an Excel Date into R, the variable with the date data values is automatically set to type Date.
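For example, a call such as the following, with a hypothetical Excel file name, reads the data and automatically yields any date variables as type Date:

d <- Read("Sales.xlsx")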

    1.4.4.2 Convert Character Strings to Dates

For text data files, such as comma-separated values (csv) or tab-delimited files, all data values, including dates, are stored as character strings. That is the definition of a text file: a collection of plain, unformatted, individual characters.

When reading data from a text file, R automatically converts numeric data values from the character strings stored in the data file to the corresponding numeric variable type. However, when R reads character strings that represent dates, there is no automatic conversion to an R date format, such as type Date. Instead, by default, R reads the dates into the data frame as type character. That is, R interprets a date such as "7/1/2022" read from a text data file no differently from any other character string, such as "ABC" or "Hi There".

The complication is that many different expressions of a date exist, such as, for the first day of July, 2022, "July 1, 2022" or "7/1/2022" or "1-7-22", among many other possibilities. Instead of trying to guess the date format that applies to a variable with values as dates, R requires that the user specify the transformation. To do the conversion, run a function that transforms the data values of dates read into R as type character to the R variable type Date. Invoke the function as.Date() to do this conversion.

One possibility for converting the character string values to date values is to first read the data into Excel and let Excel implicitly do the conversion. More elegantly, the conversion can be done within R. The key to the date conversion is the format parameter of the as.Date() function, which links each part of the date character string – year, month, and day – to the corresponding part of the date with a formatting code. Present the parts of the format code in the order in which the parts of the date appear in the character string. R needs to be instructed as to how a particular date is represented in the text data file. The format parameter provides that instruction.

    Each formatting code begins with the % sign. The code %Y identifies a four-digit year in the character string that contains a date. Identify a year expressed as two digits with %y. To identify a month expressed numerically, as digits from 1 to 12, use %m. Identify a month spelled out alphabetically in full with %B, and an abbreviated alphabetic month with %b. Identify the day of a month with %d. Include any delimiters in the character string such as a hyphen, forward slash, blank, or comma in the format expression in the same position found in the character string.

    R has a default date format, the ISO 8601 international standard, defined by a four-digit year, a hyphen, a two-digit month, a hyphen, and then a two-digit day. For example, express the first day of July, 2022 with this standard as “2022-07-01”. Dates expressed as character strings in this format, or the closely related form with forward slashes in place of the hyphens, need no explicit format parameter in the call to as.Date() to convert to type Date. All other date representations embedded in a character string require the format parameter for R to correctly transform a string of characters to an R date.
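For example, each of the following calls converts to type Date without an explicit format parameter:

as.Date("2022-07-01")
as.Date("2022/07/01")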

Table @ref(tab:dates) presents a variety of character string expressions of a single date, Sept 1, 2022. Each of these function calls outputs the same date, "2022-09-01". The formatting codes are defined in the strptime() function help file, but for most practical purposes the examples in the following table suffice for the conversion.

Table: Alternate expressions of dates in character strings, each converted to "2022-09-01" by a call to as.Date() with the proper format.
    as.Date("2022-09-01", format="%Y-%m-%d")
    as.Date("2022/9/1")
    as.Date("2022.09.01", format="%Y.%m.%d")
    as.Date("09/01/2022", format="%m/%d/%Y")
    as.Date("9/1/15", format="%m/%d/%y")
    as.Date("September 1, 2022", format="%B %d, %Y")
    as.Date("Sep 1, 2022", format="%b %d, %Y")
    as.Date("20220901", format="%Y%m%d")

To apply these calls to as.Date() to transform the entire column of data values for a variable, first read the data values stored in the web text file PPStechLong.csv into the R data frame d.

    d <- Read("http://web.pdx.edu/~gerbing/data/PPStechLong.csv")

The output from Read() shows the variables just read, as well as the data type by which the data values of each variable are represented in R. The output indicates that the variable named date was read as type character.

       Variable                  Missing  Unique
           Name     Type  Values  Values  Values  First and last values
    --------------------------------------------------------------------------
     1     date character   1350       0     450  1/1/85  2/1/85 ... 5/1/22  6/1/22
     2  Company character   1350       0       3  Apple  Apple ... Intel  Intel
     3    Price    double   1350       0    1331  0.101  0.086 ... 44.072  43.39
    --------------------------------------------------------------------------

Before we can convert the character string values of the variable date in the first column of our data file to type Date, we need to know the format of the dates. Read() lists sample data values for each variable, so examine its output. The first date value for the variable date is the character string 1/1/85. We see that the variable date formats the dates as the number of the month, the number of the day of the month, and then a two-digit year.

    Referring back to the formats displayed in Table @ref(tab:dates), the following function call to as.Date() converts one specific date stored as a character string, “9/1/22”, to an R Date data type expressed in the international standard: “2022-09-01”.

    as.Date("9/1/22", format="%m/%d/%y")
    [1] "2022-09-01"

The R format of dates in this form is "%m/%d/%y". Ultimately, convert all the data values for the variable, not just one specific date. To do so, apply the date format to the entire column of data values for the variable date, analogous to an Excel Fill Down of a formula from one cell to the remaining cells in the column.

Convert the variable named date in the d data frame from type character to the Date type as follows. Identify the variable by both its name and the data frame in which it is stored. The R notation to identify a variable is the data frame name, $, then the variable name. In this example, d$date refers to the variable named date in the d data frame. The <- indicates an assignment of the value on the right to the object named on the left.

Apply the base R function as.Date() with this format to the entire column:
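d$date <- as.Date(d$date, format="%m/%d/%y")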

    This example takes the column of date values for the variable date stored in the d data frame, applies the as.Date() transformation with the specified format, and then names the transformed variable with the same name, d$date. The result is that the Date data values of the transformed variable replace the original character string data values.

    Note that this transformation takes place in the d data frame as it exists within your active R session. As with any work on a computer file, including Excel, nothing is changed in the original data stored in your computer file until you save the changes, which we are not doing here. To save the changes we would call the function Write().
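For example, a call such as the following, with a hypothetical file name and assuming the default data frame d, writes the converted data to a new csv file:

Write("PPStechDates", format="csv")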

To verify the correct specification of the date format of the transformed variable date in the d data frame, use the lessR function details(), here invoked with its brief version, abbreviated db(). The output of db() is the same output that Read() provides. The name of the data frame appears between the parentheses, with d as the default value, so no need to specify the name unless the data frame is named something other than d. Look for both the correct display of the dates as well as the variable type of Date.

    > db()
    
       Variable                  Missing  Unique
           Name     Type  Values  Values  Values  First and last values
    --------------------------------------------------------------------------
     1     date      Date   1350       0     450  1985-01-01 ... 2022-06-01
     2  Company character   1350       0       3  Apple  Apple ... Intel  Intel
     3    Price    double   1350       0    1331  0.101  0.086 ... 44.071  43.389
    --------------------------------------------------------------------------

After the as.Date() transformation, the type of the variable date is now Date. Further, the dates are now stored in the R default ISO standard of a four-digit year, two-digit month, and two-digit day, each separated by a hyphen.

    1.4.4.3 Internal Storage of Dates

Both R and Excel store dates internally as integers, called serial numbers, sequentially numbered by day from an arbitrarily set origin. R uses the origin "1970-01-01", which corresponds to serial number 0. Windows and Macintosh versions of Excel originally based their date systems on different origins. Recent versions of Macintosh Excel (since 2011) standardize on the Windows origin of "1900-01-01". Earlier versions of Macintosh Excel used "1904-01-01".
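For example, apply the base R function as.numeric() to reveal the integer that R stores for a given date, the number of days from the R origin:

as.numeric(as.Date("2022-09-01"))
[1] 19236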

For reference, as.Date() also converts serial numbers to R Date values. To convert, provide the serial number as an integer along with the corresponding origin. The following two function calls show this conversion for the Excel origin and the R origin.

The following two applications of as.Date() convert different integers (serial numbers) that each identify the same date, "2015-09-01", but with different origins.

as.Date(42246, origin="1900-01-01")
as.Date(16679, origin="1970-01-01")

The serial number 42246 yields Sept 1, 2015 when measured from the Excel origin. The serial number 16679 yields the same date when measured from the R origin.