Visualize Data Collected over Time

Plot ordered data values collected over time in one of two ways that correspond to how the values are labeled.

Run Chart

Create a run chart from two variables: the \(x\)-variable, the Index values, which are the consecutive integers from 1 to the number of data values, and the \(y\)-variable, which specifies the corresponding data values to plot. A run chart is meaningful for numerical data values that are sequentially ordered, such as by time. To plot a run chart of a single variable, have Plot() generate the Index values by specifying the name of the \(x\)-variable, typically the first variable listed, as .Index. The name begins with a \(.\) to avoid confusion with an existing variable. Analogous to a time series visualization, the run chart plots the data values sequentially, but without dates or times. An analysis of the runs can also be obtained.

d <- Read("Employee")

## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

The data values for the variable Salary were not collected over time but, for illustration, create a run chart of Salary as if they were. Specifying the \(x\)-variable as .Index instructs Plot() to create the indices, the sequence of integers from 1 to the number of data values, and to plot the data in sequential order as a run chart.

Plot(.Index, Salary)

##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
##  
## ------------
## Run Analysis
## ------------
## 
## Total number of runs: 21 
## Total number of values that do not equal the median: 36

The default run chart displays the plotted points in a small size with connecting line segments. Change the size of the points with the parameter size, here set to zero to remove the points entirely. Fill the area under the line segments with the parameter area_fill, here set to "on" for the default fill color, though any color can be specified. Remove the center line with the parameter center_line set to "off". List the individual runs in the run analysis with the parameter show_runs set to TRUE.

Plot(.Index, Salary, size=0, area_fill="on", center_line="off", show_runs=TRUE)

##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
##  
## ------------
## Run Analysis
## ------------
## 
## size=2  Run  5:    5    6
## size=2  Run  6:    7    8
## size=3  Run  7:    9   10   11
## size=2  Run  8:   12   13
## size=5  Run 10:   15   16   17   18   19
## size=4  Run 17:   26   27   28   29
## size=2  Run 18:   30   31
## size=2  Run 20:   33   34
## size=3  Run 21:   35   36   37
## 
## Total number of runs: 21 
## Total number of values that do not equal the median: 36
## 
## Values ignored that equal the median: 
##     #29 69547.6
## Total number of values ignored: 1
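
The run counts reported above can be approximated with a few lines of base R. The following is a minimal sketch, not the lessR implementation. It assumes that a run is a maximal sequence of consecutive values on the same side of the median and that values equal to the median are ignored, as the output above indicates. The function name count_runs and the data values are hypothetical.

```r
# Hypothetical helper: count runs above and below the median
count_runs <- function(x) {
  x <- x[!is.na(x)]            # drop missing values
  med <- median(x)
  keep <- x != med             # ignore values equal to the median
  above <- x[keep] > med       # TRUE if above the median, FALSE if below
  r <- rle(above)              # run-length encoding of the above/below sequence
  list(n_runs = length(r$lengths),
       n_used = sum(keep),
       run_lengths = r$lengths)
}

# Made-up data values in their order of collection
res <- count_runs(c(5, 7, 3, 9, 9, 2, 6, 1, 8))
res$n_runs
res$n_used
```

Fewer runs than expected by chance suggest a trend or shift in the process, whereas many short runs suggest oscillation.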

Time Series Chart

Dates

Create a time series from two variables: the \(x\)-variable as a date, and the \(y\)-variable that specifies the corresponding measured values to plot. Internally, the \(x\)-variable is stored as a variable of R type Date. Traditionally, the Date variable is created with the R function as.Date() before calling Plot(). However, Plot() can also implicitly convert a date written as a character string of digits, such as "08/18/2024", to a formal Date data value, as explained below. Plotting a variable of type Date as the \(x\)-variable in a scatterplot automatically creates a time series visualization with each pair of adjacent points connected by a line segment.

R does not automatically convert character string dates to a formal Date variable, likely because the conversion is inherently ambiguous. A numerical date can be written in multiple ways, so inferring the date format from the data values usually works but is not guaranteed. Plot() attempts the conversion for you and, to facilitate verification, displays its inferred format. To override the inferred format, specify the date format explicitly with the parameter ts_format. View the list of all possible date formats by entering ?strptime to display the corresponding help file.

Following are the five formats of numerical dates, read as character strings, that Plot() converts to actual dates, an R variable of type Date. Expressing the year with all four digits is recommended though not usually necessary. The following examples use the hyphen, -, as the delimiter, but the forward slash, /, and period, ., can also be used.

2024-08-18: Four digit year, one or two digit month, one or two digit day.
08-18-2024: One or two digit month, one or two digit day, four digit year.
08-18-24: One or two digit month, one or two digit day, two digit year.
18-08-2024: One or two digit day, one or two digit month, four digit year.
18-08-24: One or two digit day, one or two digit month, two digit year.
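
The ambiguity among these formats is easy to demonstrate with the base R function as.Date() and the format codes documented in ?strptime. This sketch only illustrates why an inferred format should be verified. It does not show how Plot() performs its inference.

```r
# The same character string read under two different formats
s <- "04/03/2024"
as_mdy <- as.Date(s, format = "%m/%d/%Y")   # month first: April 3, 2024
as_dmy <- as.Date(s, format = "%d/%m/%Y")   # day first: March 4, 2024

# A two-digit year with hyphen delimiters, day first
d_short <- as.Date("18-08-24", format = "%d-%m-%y")

format(c(as_mdy, as_dmy, d_short))
```

Both readings of "04/03/2024" are valid dates, so only knowledge of how the data were recorded can settle which format is correct.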

Daily Data

Enter the dates for daily data values in one of the above five numerical formats. Or, use the ts_format parameter to specify a format for non-numerical date values that can include the name of the corresponding month (as per ?strptime).

Weekly Data

Enter the dates for weekly data values as with daily data values except that consecutive dates are one week apart. For example, each date represents the first day of the corresponding week, such as "04/03/2024" for the fourth day of March 2024, which begins the first full week in March 2024, followed by "11/03/2024" for the 11th day of the same month.

Monthly Data

Two possibilities exist for entering monthly data. Enter the dates for monthly data values as either:

as with daily data values except that consecutive dates are one month apart. For example, each date represents the first day of the corresponding month, such as "01/03/2024" for the first day of March 2024, followed by "01/04/2024" for the first day of April 2024.
four digit year followed by the three-letter month abbreviation, all as a single data value. For example, "2024 Jan" followed by "2024 Feb".

Quarterly Data

Two possibilities exist for entering quarterly data. Enter the dates for quarterly data values as either:

as with daily data values except that consecutive dates are one quarter, or three months, apart. For example, represent a quarter with the first day of the first month of the corresponding quarter, such as "01/01/2024" for the first day of the first quarter followed by "01/04/2024" for the first day of the second quarter.
four digit year followed by either Q1, Q2, Q3, or Q4, all as a single data value. For example, "2024 Q1" followed by "2024 Q2".

Annual Data

Two possibilities exist for entering annual data. Enter the dates for annual data values as either:

as with daily data values except that consecutive dates are one year apart. For example, each date represents the first day of the corresponding year, such as "01/01/2024" for the first day of 2024, followed by "01/01/2025" for the first day of the following year.
four digit year. For example, "2024" followed by "2025".
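
The alternative labels for monthly, quarterly, and annual data each correspond to the first day of the period. As a rough base R sketch of that correspondence, and not the conversion code used by Plot(), the following hypothetical helper functions map each kind of label to a Date. The helpers assume labels laid out exactly as in the examples above.

```r
# Hypothetical helpers: map a period label to the first day of the period

# "2024 Jan" -> 2024-01-01, using the built-in English abbreviations month.abb
parse_year_month <- function(s) {
  yr <- as.integer(substr(s, 1, 4))
  mo <- match(substr(s, 6, 8), month.abb)
  as.Date(sprintf("%d-%02d-01", yr, mo))
}

# "2024 Q2" -> 2024-04-01, the first month of quarter q is 3 * (q - 1) + 1
parse_year_quarter <- function(s) {
  yr <- as.integer(substr(s, 1, 4))
  q <- as.integer(substr(s, 7, 7))
  as.Date(sprintf("%d-%02d-01", yr, (q - 1L) * 3L + 1L))
}

# "2024" -> 2024-01-01
parse_year <- function(s) as.Date(sprintf("%d-01-01", as.integer(s)))

months <- parse_year_month(c("2024 Jan", "2024 Feb"))
quarters_start <- parse_year_quarter(c("2024 Q1", "2024 Q2"))
years <- parse_year(c("2024", "2025"))
```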

A Single Time Series

Read time series data of stock Price for three companies: Apple, IBM, and Intel. The data table is part of lessR, called StockPrice.

d <- Read("StockPrice")

## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## Date: Date with year, month and day
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Month      Date   1419       0     473   1985-01-01 ... 2024-05-01
##  2   Company character   1419       0       3   Apple  Apple ... Intel  Intel
##  3     Price    double   1419       0    1400   0.100055  0.085392 ... 30.346739  30.555891
##  4    Volume    double   1419       0    1419   6366416000 ... 229147100
## ------------------------------------------------------------------------------------------

d[1:5,]

##        Month Company    Price     Volume
## 1 1985-01-01   Apple 0.100055 6366416000
## 2 1985-02-01   Apple 0.085392 4733388800
## 3 1985-03-01   Apple 0.076335 4615587200
## 4 1985-04-01   Apple 0.073316 2868028800
## 5 1985-05-01   Apple 0.059947 4639129600

Activate a time series plot by setting the \(x\)-variable to a variable of R type Date, which is true of the variable Month in this data set. Alternatively, plot a time series by passing a time series object, created with the base R function ts(), as the variable to plot. Plot() will also attempt to convert to type Date either a four-digit integer year that increases in increments of one year, or a date expressed as digits with / or - delimiters, such as "08/18/2024". This conversion is not free of ambiguity, so if the inferred format is incorrect, specify the correct date format with the parameter ts_format.

Here, plot the stock price over time just for Apple, with the two variables Month and Price, stock price. The parameter filter specifies the rows of the input data frame retained for the analysis.

Plot(Month, Price, filter=(Company=="Apple"))

## 
## filter:  (Company == "Apple") 
## -----
## Rows of data before filtering:  1419 
## Rows of data after filtering:   473

Add the default fill color by setting the area_fill parameter to "on". A custom color can also be specified.

Plot(Month, Price, filter=(Company=="Apple"), area_fill="on")

## 
## filter:  (Company == "Apple") 
## -----
## Rows of data before filtering:  1419 
## Rows of data after filtering:   473

Multiple Time Series

On One Panel

With the by parameter, plot all three companies on the same panel.

Plot(Month, Price, by=Company)

Stack the plots by setting the parameter stack to TRUE.

Plot(Month, Price, by=Company, stack=TRUE)

Facets

With the facet1 parameter, plot each of the three companies on a different panel, a Trellis plot.

Plot(Month, Price, facet1=Company)

## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]

Do the Trellis plot with some color. Learn more about customizing visualizations in the vignette utilities.

style(sub_theme="black", window_fill="gray10")
Plot(Month, Price, facet1=Company, n_col=1, fill="darkred", color="red", trans=.55)

## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]

Return to the default style and turn off text output for subsequent analyses.

style()

## theme set to "colors"

style(quiet=TRUE)

Set a baseline of 25 with the area_origin parameter for a Trellis plot, with default fill color.

Plot(Month, Price, facet1=Company, xlab="", area_fill="on", area_origin=25)

Change the aspect ratio with the aspect parameter defined as height divided by width.

Plot(Month, Price, facet1=Company, aspect=.5, area_fill="slategray3")

Stack the three time series and fill under each curve with a version of the lessR sequential range "emeralds".

Plot(Month, Price, by=Company, trans=0.4, stack=TRUE, area_fill="emeralds")

Aggregation by Time

This example aggregates monthly stock price data by quarter. Available time units are "years", "quarters", "months", "weeks", and "days". Also included is the special time unit "days7", explained below in the Forecast section. Aggregate with the parameter ts_unit (which relies upon functions from the xts package). Generate and display the first several months of the monthly data.

The stock price for each company is reported monthly in the data table. To aggregate to quarters, use the ts_unit parameter. The default aggregation is the sum over the specified time period. That value is appropriate if we are, for example, aggregating monthly sales over each quarter, but for stock Price we want the mean stock price over the specified time period. Set the parameter ts_agg to "mean". Focus just on the Apple stock price data with the filter parameter.

d <- Read("StockPrice", quiet=TRUE)

Plot(Month, Price, ts_unit="quarters", ts_agg="mean", filter=(Company=="Apple"))
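
To make the difference between the two aggregation choices concrete, the following base R sketch aggregates six made-up monthly values to quarters, once as the mean and once as the sum. It uses aggregate() and quarters() from base R rather than the xts functions on which Plot() relies, so it illustrates the idea, not the lessR internals.

```r
# Six months of made-up prices covering two quarters
month <- seq(as.Date("2024-01-01"), by = "month", length.out = 6)
price <- c(10, 12, 14, 20, 22, 24)

# Label each month with its year and quarter, such as "2024 Q1"
qtr <- paste(format(month, "%Y"), quarters(month))

# Aggregate within each quarter
agg_mean <- aggregate(price, by = list(quarter = qtr), FUN = mean)
agg_sum <- aggregate(price, by = list(quarter = qtr), FUN = sum)

agg_mean
agg_sum
```

The mean keeps the aggregated values on the scale of a single price, whereas the sum grows with the number of months in the period, which is only meaningful for quantities that accumulate, such as sales.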

Or, aggregate by years to smooth the curve further, with a darkred line.

Plot(Month, Price, ts_unit="years", ts_agg="mean", filter=(Company=="Apple"),
     color="darkred")

In the following example, aggregate by years for each of the three companies.

Plot(Month, Price, by=Company, ts_unit="years", ts_agg="mean")

Forecast

Plot() implements time series forecasting based on trend and seasonality with either exponential smoothing or regression analysis, including the accompanying visualization. Time series parameters include:

ts_method: Set at "es" for exponential smoothing, the default, or "lm" for linear model regression.
ts_unit: The time unit, either as the natural occurring interval between dates in the data, the default, or aggregated to a wider time interval.
ts_ahead: The number of time units to forecast into the future.
ts_agg: If aggregating the time unit, aggregate as the "sum", the default, or as the "mean".
ts_PIlevel: The confidence level of the prediction intervals, with 0.95 the default.
ts_format: Provides a specific format for the date variable if not detected correctly by default.
ts_seasons: Set to FALSE to turn off seasonality in the estimated model.
ts_trend: Set to FALSE to turn off trend in the estimated model.
ts_type: Applies to exponential smoothing to specify additive or multiplicative seasonality, with additive the default.

To forecast Apple’s stock price, focus here on the last several years of the data, Row 400 through Row 473, the last row of data for Apple. In this example, forecast ahead 24 months.

d <- d[400:473,]
Plot(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24)

Or, do the regression with seasonality to forecast according to the parameter ts_method, here changed from its default value of "es" for exponential smoothing to "lm" for linear model. The data are de-seasonalized, the regression analysis is performed, and then the seasonality is added back.

Plot(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24, ts_method="lm")

Here, do the linear regression forecast but without seasonality according to the parameter ts_seasons.

Plot(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24, ts_method="lm", ts_seasons=FALSE)
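
For readers who want to see the ingredients of a regression forecast with trend and seasonality, here is a minimal base R sketch on simulated monthly data. It is an assumption-laden stand-in, not the lessR procedure: instead of de-seasonalizing and re-seasonalizing, it estimates trend and monthly effects jointly with lm() and seasonal indicator variables, then obtains 95% prediction intervals with predict(). All variable names and data values are made up.

```r
set.seed(1)

# Simulated monthly series: linear trend plus a yearly cycle plus noise
n <- 48
t <- 1:n
season <- factor(rep(month.abb, length.out = n), levels = month.abb)
y <- 50 + 0.5 * t + 5 * sin(2 * pi * t / 12) + rnorm(n)

# Trend and seasonality estimated together
fit <- lm(y ~ t + season)

# Forecast 24 months ahead with 95% prediction intervals
h <- 24
future_t <- (n + 1):(n + h)
future <- data.frame(
  t = future_t,
  season = factor(rep(month.abb, length.out = n + h)[future_t],
                  levels = month.abb)
)
fc <- predict(fit, newdata = future, interval = "prediction", level = 0.95)
head(fc)
```

Dropping the season term from the model formula gives the trend-only forecast, the counterpart of setting ts_seasons to FALSE.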

It is better to visually understand the characteristics of the time series before trying to forecast. To aid this understanding, consider the decomposition of the time series into its seasonal and trend components with the lessR function STL(), which relies upon the base R function stl() but provides more information and allows more flexible input.

STL(Month, Price)

## 
## Total variance of Price: 2728.807
## Proportion of variance for components:
##   seasonality --- 0.006 
##   trend --------- 0.936 
##   remainder ----- 0.026
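
The proportions of variance reported above can be approximated directly from the base R function stl(). The following sketch uses simulated data and assumes each proportion is the variance of a component divided by the variance of the series. Whether STL() computes its proportions in exactly this way is an assumption.

```r
set.seed(1)

# Simulated monthly series with a strong trend and a weaker yearly cycle
t <- 1:72
y <- ts(50 + 0.5 * t + 5 * sin(2 * pi * t / 12) + rnorm(72),
        frequency = 12, start = c(2018, 1))

# Decompose into seasonal, trend, and remainder components
dec <- stl(y, s.window = "periodic")
comp <- dec$time.series

# Variance of each component relative to the variance of the series
total <- var(as.numeric(y))
prop <- c(
  seasonality = var(as.numeric(comp[, "seasonal"])),
  trend = var(as.numeric(comp[, "trend"])),
  remainder = var(as.numeric(comp[, "remainder"]))
) / total
round(prop, 3)
```

The components sum to the original series, but the proportions need not sum to exactly 1 because the components can be correlated with each other.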

The traditional time units, such as "days" or "quarters", evaluate seasonality over the entire year. Quarterly and even monthly data can usually be meaningfully assessed for seasonality over the entire year. With daily data, however, seasonality is generally more meaningfully assessed over the days of the week. For example, sales may typically be higher on Monday than they are on Sunday.

Consider the following daily data for which we wish to evaluate seasonality over the days of the week. To indicate potential seasonality of daily data within a week, specify the time unit with parameter ts_unit set to "days7".

Plot(days, sales, ts_ahead=8, ts_unit="days7")

We now have a seasonality coefficient for each day of the week, each of which is projected into the future for forecasting.
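
The idea of a seasonal effect for each day of the week can be illustrated in base R by averaging daily values within each weekday. This is a simplified sketch with made-up sales, not the estimation method used by Plot(). It reads the weekday from as.POSIXlt(), for which 0 is Sunday and 1 is Monday.

```r
# Four weeks of made-up daily sales, beginning on Monday, January 1, 2024
days <- seq(as.Date("2024-01-01"), by = "day", length.out = 28)
base <- c(120, 100, 95, 98, 105, 80, 60)          # Monday through Sunday
sales <- rep(base, times = 4) + rep(c(0, 2, 4, 6), each = 7)

# Day of the week for each date: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(days)$wday

# Mean sales for each weekday and its deviation from the overall mean
weekday_mean <- tapply(sales, wday, mean)
effect <- weekday_mean - mean(sales)
round(effect, 2)
```

A positive effect marks a weekday with sales above the overall average, a negative effect a weekday below it.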

Missing Data

Entire Record Missing

If the date value and the y-value are missing, then the nearest adjacent points are connected by a line segment that runs over the missing data value, effectively linearly interpolating the missing value across the two adjacent present values. For example, consider a daily time series related to the Tableau Superstore data such that “2021-01-07” and “2021-01-09” are both present with their corresponding y values, but there is no date value or y value for January 8, that is, “2021-01-08”. To yield a single data value of Sales for each day, aggregate Sales by day.

d <- read.table(text="
Order.Date    Sales
2021-01-05   19.536
2021-01-06  473.820
2021-01-06    5.480
2021-01-06   12.780
2021-01-06  609.980
2021-01-06   31.120
2021-01-06    6.540
2021-01-06   19.440
2021-01-07  176.728
2021-01-07   10.430
2021-01-09    9.344
2021-01-09   31.200
2021-01-10   51.940
2021-01-10    2.890",
header = TRUE)

Two sales are recorded on January 7 and two sales are recorded on January 9 but there is no record for any sales or even a date for January 8. The entire row of data for January 8 is missing.

Next, plot the Sales data aggregated by day for the dates from January 5 through January 10.

Plot(Order.Date, Sales, ts_unit="days")

## 
## Best guess for the date format: %Y-%m-%d

## If this format is wrong, specify with parameter: ts_format
## To see all possible formats, enter: ?strptime
## Examples:  "08/18/2024" format is "%m/%d/%Y"
##            "18-08-24"   format is "%d-%m-%y"
##            "August 18, 2024" format is "%B %d, %Y"

The resulting visualization plots the y-value for January 7 and also for January 9, with a line segment connecting those two points. There is no corresponding label on the x-axis for the missing data value nor is there a plotted point. And, the January 9 value is appropriately placed two days after the January 7 value on the visualization.
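
The connecting line segment amounts to a linear interpolation across the gap. The following base R sketch finds the absent date and computes the value that the line segment implies for it, using only the January 7 and January 9 rows from the data above. The daily aggregation as a sum follows the documented default of ts_agg, and the remaining steps are illustrative assumptions, not lessR code.

```r
# Rows for January 7 and January 9, with no record for January 8
d <- data.frame(
  Order.Date = as.Date(c("2021-01-07", "2021-01-07",
                         "2021-01-09", "2021-01-09")),
  Sales = c(176.728, 10.430, 9.344, 31.200)
)

# One value of Sales per day, aggregated as the sum
daily <- aggregate(Sales ~ Order.Date, data = d, FUN = sum)

# Dates in the full daily sequence that do not appear in the data
all_days <- seq(min(daily$Order.Date), max(daily$Order.Date), by = "day")
missing_days <- all_days[!(all_days %in% daily$Order.Date)]

# Value on the straight line between the two adjacent days
interp <- approx(x = as.numeric(daily$Order.Date), y = daily$Sales,
                 xout = as.numeric(missing_days))$y
```

Here interp is the midpoint of the two daily totals, which is where the line segment passes over January 8.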

Only the y-value is Missing

If the date value exists and the corresponding y-value is missing, with value NA, then the visualization leaves the corresponding y-value blank. Here, insert the missing row for January 8 with missing data, NA, for Sales on that date.

new_row <- data.frame(
  Order.Date = "2021-01-08",
  Sales = NA
)
d <- rbind(d, new_row)
d <- order_by(d, by=Order.Date)

d[9:12,]

##    Order.Date   Sales
## 9  2021-01-07 176.728
## 10 2021-01-07  10.430
## 15 2021-01-08      NA
## 11 2021-01-09   9.344

Now, plot.

Plot(Order.Date, Sales, ts_unit="days")

## 
## Best guess for the date format: %Y-%m-%d

## If this format is wrong, specify with parameter: ts_format
## To see all possible formats, enter: ?strptime
## Examples:  "08/18/2024" format is "%m/%d/%Y"
##            "18-08-24"   format is "%d-%m-%y"
##            "August 18, 2024" format is "%B %d, %Y"

There is now a blank space in the visualization for January 8. If it is instead better to treat the missing value as zero sales for that day, specify the value of 0 for the parameter ts_NA.

Plot(Order.Date, Sales, ts_unit="days", ts_NA=0)

## 
## Best guess for the date format: %Y-%m-%d

## If this format is wrong, specify with parameter: ts_format
## To see all possible formats, enter: ?strptime
## Examples:  "08/18/2024" format is "%m/%d/%Y"
##            "18-08-24"   format is "%d-%m-%y"
##            "August 18, 2024" format is "%B %d, %Y"
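
The two treatments of a missing y-value can be mimicked in base R by merging the daily totals with a complete sequence of dates. This sketch starts from daily totals computed from the data above and is only an illustration of the concept. It does not show how Plot() processes ts_NA.

```r
# Daily totals of Sales, with no entry for January 8
daily <- data.frame(
  Order.Date = as.Date(c("2021-01-07", "2021-01-09", "2021-01-10")),
  Sales = c(187.158, 40.544, 54.830)
)

# Complete sequence of days from the first to the last date
full <- data.frame(
  Order.Date = seq(min(daily$Order.Date), max(daily$Order.Date), by = "day")
)

# Keep every day; days without data receive NA for Sales
filled <- merge(full, daily, by = "Order.Date", all.x = TRUE)

# Alternative: treat a day without data as zero sales
zeroed <- filled
zeroed$Sales[is.na(zeroed$Sales)] <- 0
```

Leaving NA in place yields the gap in the plotted line, whereas replacing it with 0 pulls the line down to the axis for that day.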

Data Structures

Data can be stored in different types of structures, that is, different forms of organization. Plot() can plot a time series from three different data structures:

long-form
wide-form
time-series object

Long-Format Data

The previous examples of plotting time series data read data stored in long format. Long format data organizes data with each row of the data table containing only a single measurement. If the entity provides multiple data values, then the data values are stored in multiple rows.

For example, if observations of Apple’s stock price are taken monthly, then the data for each row of the data table contain only a single stock price. Or, from another perspective, the data values for each company are each stored on a separate row.

d <- Read("StockPrice", quiet=TRUE)

head(d)

##        Month Company    Price     Volume
## 1 1985-01-01   Apple 0.100055 6366416000
## 2 1985-02-01   Apple 0.085392 4733388800
## 3 1985-03-01   Apple 0.076335 4615587200
## 4 1985-04-01   Apple 0.073316 2868028800
## 5 1985-05-01   Apple 0.059947 4639129600
## 6 1985-06-01   Apple 0.062103 5811388800

Many data analysis and visualization functions across a variety of statistical systems require long format data. As such, this organization of data is the most common data structure but other possibilities do exist.

Wide-Format Data

Plot() also reads wide-format data, which stores multiple data values across a single row. We have no available wide form time data with lessR, so first convert the long form data file as read to the wide form. In the wide form, the three companies each have their own column of data, repeated for each date. Use the lessR function reshape_wide() to do the conversion.

dw <- reshape_wide(d, group="Company", response="Price", ID="Month")
head(dw)

##        Month    Apple      IBM    Intel
## 1 1985-01-01 0.100055 11.71846 0.359457
## 2 1985-02-01 0.085392 11.51437 0.327310
## 3 1985-03-01 0.076335 11.00154 0.324388
## 4 1985-04-01 0.073316 10.95822 0.321466
## 5 1985-05-01 0.059947 11.14231 0.308315
## 6 1985-06-01 0.062103 10.81489 0.303932
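
For comparison, the same long-to-wide conversion can be sketched with the base R function reshape(). The example uses the first two months of each company from the output above and is offered as an equivalent in spirit, not as a description of how reshape_wide() works internally.

```r
# Long form: one row per Month and Company combination
long <- data.frame(
  Month = rep(as.Date(c("1985-01-01", "1985-02-01")), times = 3),
  Company = rep(c("Apple", "IBM", "Intel"), each = 2),
  Price = c(0.100055, 0.085392, 11.71846, 11.51437, 0.359457, 0.327310)
)

# Wide form: one row per Month, one column of Price per Company
wide <- reshape(long, idvar = "Month", timevar = "Company",
                direction = "wide")

# reshape() names the new columns Price.Apple and so on; keep only the company
names(wide) <- sub("^Price\\.", "", names(wide))
wide
```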

Now repeat a previous analysis, but with wide-form data. Because the data frame is not the default d, explicitly identify it with the data parameter. Specify a range of blue colors from light to dark blue to fill the area under each time series.

Plot(Month, c(Intel, Apple, IBM), area_fill="blues", stack=TRUE, trans=.4, data=dw)

R Time-Series Object Data

Plot() can also plot directly from an R time series object, created with the base R ts() function.

a1.ts <- ts(dw$Apple, frequency=12, start=c(1985, 1))
Plot(a1.ts)
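
A time series object stores no dates, only a starting point and a frequency, so the dates are implied. The following base R sketch builds a small monthly series from the first six Apple prices shown earlier, assuming the series begins in January 1985 as in that output, and then recovers a Date for each observation from the year and the position within the year.

```r
# First six monthly prices for Apple, beginning January 1985
apple <- c(0.100055, 0.085392, 0.076335, 0.073316, 0.059947, 0.062103)
a <- ts(apple, frequency = 12, start = c(1985, 1))

start(a)   # year and month of the first observation
end(a)     # year and month of the last observation

# Recover the implied dates: the year from time(), the month from cycle()
yr <- as.integer(floor(time(a) + 1e-8))   # small offset guards against rounding
mo <- as.integer(cycle(a))
dates <- as.Date(sprintf("%d-%02d-01", yr, mo))
format(dates)
```

If the start argument does not match the first date in the data, every implied date shifts accordingly, so it is worth checking start() and end() before plotting.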

With the lessR style() function, many themes can be selected, such as "lightbronze", "dodgerblue", "darkred", and "gray" for gray scale. Calling style() with no theme or any other parameter value returns to the default theme, "colors".

style()