Plot ordered data values collected over time in one of two ways that correspond to how the values are labeled.
Create a run chart from two variables: the \(x\)-variable, the Index values, which are the consecutive integers from 1 to the number of data values, and the \(y\)-variable, the corresponding data values to be plotted. A run chart is meaningful for sequentially ordered numerical data values, such as values ordered by time. To plot a run chart of a single variable, have `Plot()` generate the Index values by specifying the name of the \(x\)-variable, typically the first variable listed, as `.Index`. The name begins with a period, \(.\), to avoid confusion with an existing variable. Analogous to a time series visualization, the run chart plots the data values sequentially, but without dates or times. An analysis of the runs can also be obtained.
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
##     Variable                     Missing  Unique
##         Name       Type  Values   Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years    integer      36        1      16   7  NA  7 ... 1  2  10
##  2    Gender  character      37        0       2   M  M  W ... W  W  M
##  3      Dept  character      36        1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary     double      37        0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat  character      35        2       3   med  low  high ... high  low  high
##  6      Plan    integer      37        0       3   1  1  2 ... 2  2  1
##  7       Pre    integer      37        0      27   82  62  90 ... 83  59  80
##  8      Post    integer      37        0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------
The data values for the variable Salary were not collected over time, but for illustration, create a run chart of Salary as if they were. `Plot()` creates the indices, the sequence of integers from 1 to the number of data values, when the \(x\)-variable is specified as `.Index`. Invoke the `run` parameter to instruct `Plot()` to plot the data in sequential order as a run chart.
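The call that produced the output that follows is not shown in the text. A minimal sketch, assuming that the lessR package is installed and that the data table listed above is the Employee data bundled with lessR (an assumption based on the variable names), might look like this:

```r
library(lessR)

# Assumption: the data table listed above is lessR's built-in Employee data
d <- Read("Employee")

# .Index asks Plot() to generate the integers 1, 2, ..., n for the x-axis;
# run=TRUE requests the run chart, as described in the text
Plot(.Index, Salary, run=TRUE)
```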
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
## ------------
## Run Analysis
## ------------
##
## Total number of runs: 21
## Total number of values that do not equal the median: 36
The default run chart displays the plotted points in a small size with connecting line segments. Change the size of the points with the parameter `size`, here set to zero to remove the points entirely. Fill the area under the line segments with the parameter `area_fill`, here set to `on` for the default fill color, though it can be set to any color. Remove the center line with the parameter `center_line` set to `off`. Display the analysis of the runs with the parameter `show_runs` set to `TRUE`.
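The customized call itself is not shown in the text. The following sketch assembles it from the parameter values named in the paragraph above, again assuming the lessR package and its built-in Employee data:

```r
library(lessR)

# Assumption: lessR's built-in Employee data, as in the listing above
d <- Read("Employee")

Plot(.Index, Salary, run=TRUE,
     size=0,              # remove the plotted points
     area_fill="on",      # default fill color under the line segments
     center_line="off",   # remove the center line
     show_runs=TRUE)      # display the analysis of the individual runs
```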
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
## ------------
## Run Analysis
## ------------
##
## size=2 Run 5: 5 6
## size=2 Run 6: 7 8
## size=3 Run 7: 9 10 11
## size=2 Run 8: 12 13
## size=5 Run 10: 15 16 17 18 19
## size=4 Run 17: 26 27 28 29
## size=2 Run 18: 30 31
## size=2 Run 20: 33 34
## size=3 Run 21: 35 36 37
##
## Total number of runs: 21
## Total number of values that do not equal the median: 36
##
## Values ignored that equal the median:
## #29 69547.6
## Total number of values ignored: 1
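To make the run analysis concrete, here is a base R sketch of one common way to count runs about the median: drop the values that equal the median, label each remaining value as above or below it, and count the stretches of consecutive identical labels. The data vector `y` is hypothetical, and this is not necessarily how lessR computes its run analysis.

```r
# Hypothetical data values in collection order (not from the Employee data)
y <- c(12, 15, 9, 8, 14, 11, 16, 17, 7, 11, 13)

med  <- median(y)              # the center line of the run chart
kept <- y[y != med]            # values equal to the median are ignored
side <- ifelse(kept > med, "above", "below")

runs      <- rle(side)         # consecutive values on the same side form one run
n_runs    <- length(runs$lengths)
n_ignored <- sum(y == med)

cat("Median:", med, "\n")
cat("Total number of runs:", n_runs, "\n")
cat("Total number of values ignored:", n_ignored, "\n")
```

For this vector the median is 12, one value is ignored, and the remaining ten values form seven runs.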
Create a time series from two variables: the \(x\)-variable as a date and the \(y\)-variable as the corresponding measured values to be plotted. Internally, the \(x\)-variable is stored as a variable of R type `Date`. Traditionally, the `Date` variable is created with the R function `as.Date()` before calling `Plot()`. However, `Plot()` can also implicitly convert a character string that expresses a date numerically, such as `"08/18/2024"`, to a formal `Date` data value, as explained below. Plotting a variable of type `Date` as the \(x\)-variable in a scatterplot automatically creates a time series visualization with each pair of adjacent points connected by a line segment.
R does not automatically convert character string dates to a formal date variable, likely because the conversion is inherently ambiguous. A numerical date can be written in multiple ways, so inferring the date format from the data values usually works but is not guaranteed. `Plot()` will attempt the conversion for you. To facilitate verification of the correct date format, `Plot()` displays its inferred format. `Plot()` also allows an explicit date format specification with the parameter `ts_format`. View the list of all possible date formats by entering `?strptime` to display the corresponding help file.
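The ambiguity is easy to see in base R. In this sketch the same character string is read under two different `strptime` formats and yields two different dates. Only base R's `as.Date()` is used; according to the text, the same kind of format string is what `ts_format` accepts.

```r
x <- "04/03/2024"

day_first   <- as.Date(x, format = "%d/%m/%Y")   # day/month/year
month_first <- as.Date(x, format = "%m/%d/%Y")   # month/day/year

print(day_first)     # 2024-03-04
print(month_first)   # 2024-04-03

# A two-digit year requires %y instead of %Y
two_digit <- as.Date("18-08-24", format = "%d-%m-%y")
print(two_digit)     # 2024-08-18
```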
Following are the five different possibilities of numerical date values read as character strings that `Plot()` will convert to actual dates, an R variable of type `Date`. Expressing the year with all four digits is recommended though not usually necessary. The following examples use the hyphen, `-`, as the delimiter, but the forward slash, `/`, and the period, `.`, can also be used.
Enter the dates for daily data values in one of the above five numerical formats. Or, use the `ts_format` parameter to specify a format for non-numerical date values that can include the name of the corresponding month (as per `?strptime`).
Enter the dates for weekly data values as with daily data values except that consecutive dates are one week apart. For example, each date represents the first day of the corresponding week, such as `"04/03/2024"` for the fourth day of March 2024, which begins the first full week in March 2024, followed by `"11/03/2024"` for the 11th day of the same month.
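As an illustration, a sequence of such weekly dates can be generated in base R. This sketch starts at March 4, 2024 and steps one week at a time; the variable names are illustrative only.

```r
# Start of the first full week in March 2024, then step by one week
week_start <- seq(as.Date("2024-03-04"), by = "week", length.out = 4)

# Express in the day/month/year form used in the text
week_label <- format(week_start, "%d/%m/%Y")
print(week_label)

# Check that every date falls on a Monday (wday: 0 = Sunday, 1 = Monday)
is_monday <- as.POSIXlt(week_start)$wday == 1
print(is_monday)
```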
Two possibilities exist for entering monthly data. Enter the dates for monthly data values as either:

- `"01/03/2024"` for the first day of March 2024, followed by `"01/04/2024"` for the first day of April 2024.
- `"2024 Jan"` followed by `"2024 Feb"`.

Two possibilities exist for entering quarterly data. Enter the dates for quarterly data values as either:

- `"01/01/2024"` for the first day of the first quarter, followed by `"01/04/2024"` for the first day of the second quarter.
- `"2024 Q1"` followed by `"2024 Q2"`.

Two possibilities exist for entering annual data. Enter the dates for annual data values as either:

- `"01/01/2024"` for the first day of the year for 2024, followed by `"01/01/2025"` for the first day of the following year.
- `"2024"` followed by `"2025"`.

Read time series data of stock Price for three companies: Apple, IBM, and Intel. The data table, called `StockPrice`, is part of lessR.
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## Date: Date with year, month and day
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
##     Variable                     Missing  Unique
##         Name       Type  Values   Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Month       Date    1419        0     473   1985-01-01 ... 2024-05-01
##  2   Company  character    1419        0       3   Apple  Apple ... Intel  Intel
##  3     Price     double    1419        0    1400   0.100055  0.085392 ... 30.346739  30.555891
##  4    Volume     double    1419        0    1419   6366416000 ... 229147100
## ------------------------------------------------------------------------------------------
## Month Company Price Volume
## 1 1985-01-01 Apple 0.100055 6366416000
## 2 1985-02-01 Apple 0.085392 4733388800
## 3 1985-03-01 Apple 0.076335 4615587200
## 4 1985-04-01 Apple 0.073316 2868028800
## 5 1985-05-01 Apple 0.059947 4639129600
Activate a time series plot by setting the \(x\)-variable to a variable of R type `Date`, which is true of the variable Month in this data set. A time series can also be plotted by passing a time series object, created with the base R function `ts()`, as the variable to plot. `Plot()` will attempt to convert to a variable of type `Date` either a four-digit integer year sequentially organized in increments of one year or a date expressed as digits with `/` or `-` delimiters, such as `"08/18/2024"`. However, this conversion is not without some ambiguity, so if it is incorrect, specify the correct date format with the parameter `ts_format`.
Here, plot the stock price over time just for Apple, with the two variables Month and Price. The parameter `filter` specifies the rows of the input data frame retained for the analysis.
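The call is not shown in the text. A sketch consistent with the filter output that follows, assuming the lessR package is installed, might be:

```r
library(lessR)

# StockPrice is the data table bundled with lessR, per the text
d <- Read("StockPrice")

# Month is of type Date, so Plot() draws a time series;
# filter retains only the rows for Apple
Plot(Month, Price, filter=(Company == "Apple"))
```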
##
## filter: (Company == "Apple")
## -----
## Rows of data before filtering: 1419
## Rows of data after filtering: 473
Add the default fill color by setting the `area_fill` parameter to `"on"`. A custom color can also be specified.
##
## filter: (Company == "Apple")
## -----
## Rows of data before filtering: 1419
## Rows of data after filtering: 473
With the `by` parameter, plot all three companies on the same panel.
Stack the plots by setting the parameter `stack` to `TRUE`.
With the `facet1` parameter, plot all three companies on different panels, a Trellis plot.
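The three layouts just described can be sketched as follows. The calls are assembled from the parameter names in the text and assume the lessR package is installed; any additional arguments used in the original examples are not known.

```r
library(lessR)
d <- Read("StockPrice")

# All three companies on the same panel
Plot(Month, Price, by=Company)

# The same, with the three series stacked
Plot(Month, Price, by=Company, stack=TRUE)

# One panel per company, a Trellis plot
Plot(Month, Price, facet1=Company)
```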
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
Do the Trellis plot with some color. Learn more about customizing visualizations in the vignette utilities.
style(sub_theme="black", window_fill="gray10")
Plot(Month, Price, facet1=Company, n_col=1, fill="darkred", color="red", trans=.55)
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
Return to the default style and turn off text output for subsequent analyses.
## theme set to "colors"
Set a baseline of 25 with the `area_origin` parameter for a Trellis plot, with the default fill color.
Change the aspect ratio with the `aspect` parameter, defined as height divided by width.
Stack the three time series and fill under each curve with a version of the lessR sequential range `"emeralds"`.
This example aggregates monthly stock price data by quarter. Available time units are `"years"`, `"quarters"`, `"months"`, `"weeks"`, and `"days"`. Also included is the special time unit `"days7"`, explained below in the Forecasting section. Aggregate with the parameter `ts_unit` (which relies upon functions from the xts package). Generate and display the first several months of the monthly data.
The stock price for each company is reported monthly in the data table. To aggregate to quarters, use the `ts_unit` parameter. The default aggregation is the sum over the specified time period. The sum is appropriate when, for example, aggregating monthly sales over each quarter, but for stock Price we want the mean stock price over the specified time period, so set the parameter `ts_agg` to `"mean"`. Focus just on the Apple stock price data with the `filter` parameter.
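Putting the three parameters from this paragraph together gives a call along the following lines. This is a sketch that assumes the lessR package is installed, along with the xts package on which, per the text, the aggregation relies.

```r
library(lessR)
d <- Read("StockPrice")

# Aggregate the monthly Apple prices to quarterly means
Plot(Month, Price,
     ts_unit="quarters",             # aggregate months into quarters
     ts_agg="mean",                  # mean instead of the default sum
     filter=(Company == "Apple"))    # Apple rows only
```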
Or, aggregate by years to smooth the curve further, with a darkred line.
In the following example, aggregate by years for each of the three companies.
`Plot()` implements time series forecasting based on trend and seasonality with either exponential smoothing or regression analysis, including the accompanying visualization. Time series parameters include:

- `ts_method`: Set at `"es"` for exponential smoothing, the default, or `"lm"` for linear model regression.
- `ts_unit`: The time unit, either as the naturally occurring interval between dates in the data, the default, or aggregated to a wider time interval.
- `ts_ahead`: The number of time units to forecast into the future.
- `ts_agg`: If aggregating the time unit, aggregate as the `"sum"`, the default, or as the `"mean"`.
- `ts_PIlevel`: The confidence level of the prediction intervals, with 0.95 the default.
- `ts_format`: Provides a specific format for the date variable if not detected correctly by default.
- `ts_seasons`: Set to `FALSE` to turn off seasonality in the estimated model.
- `ts_trend`: Set to `FALSE` to turn off trend in the estimated model.
- `ts_type`: Applies to exponential smoothing to specify additive or multiplicative seasonality, with additive the default.

To forecast Apple's stock price, focus here on the last several years of the data, from Row 400 through Row 473, the last row of data for Apple. In this example, forecast ahead 24 months.
Or, do the regression with seasonality by changing the parameter `ts_method` from its default value of `"es"`, exponential smoothing, to `"lm"` for linear model. The data are de-seasonalized, the regression analysis is performed, and then the seasonality is added back.
Here, do the linear regression forecast but without seasonality according to the parameter `ts_seasons`.
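The three forecasts described above can be sketched as follows, assuming the lessR package is installed. The row subset follows the text (Rows 400 through 473 are Apple rows); whether the original examples subset the data in exactly this way is an assumption.

```r
library(lessR)
d <- Read("StockPrice")

# Keep Rows 400 through 473, the most recent Apple data
d <- d[400:473, ]

# Exponential smoothing, the default method, 24 months ahead
Plot(Month, Price, ts_ahead=24)

# Linear regression with seasonality
Plot(Month, Price, ts_ahead=24, ts_method="lm")

# Linear regression without seasonality
Plot(Month, Price, ts_ahead=24, ts_method="lm", ts_seasons=FALSE)
```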
It is better to visually understand the characteristics of the time series before trying to forecast. To facilitate this understanding, consider the decomposition of the time series into the seasonal and trend components with the lessR function `STL()`, which relies upon the base R function `stl()` but provides more information and allows more flexible input.
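Because the arguments of the lessR `STL()` call are not shown in the text, the following sketch uses the underlying base R `stl()` directly. It decomposes R's built-in monthly `co2` series, which stands in for the stock price data, and reports the variance of each component as a proportion of the total variance. This is one plausible reading of the proportions in the output that follows, not necessarily the lessR computation.

```r
# Assumption: the built-in monthly co2 series stands in for the stock prices
fit  <- stl(co2, s.window = "periodic")
comp <- fit$time.series              # columns: seasonal, trend, remainder

total_var <- var(as.numeric(co2))
prop <- apply(comp, 2, var) / total_var

cat("Total variance:", round(total_var, 3), "\n")
print(round(prop, 3))
```

The proportions need not sum to one because the estimated components are not exactly uncorrelated.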
##
## Total variance of Price: 2728.807
## Proportion of variance for components:
## seasonality --- 0.006
## trend --------- 0.936
## remainder ----- 0.026
The traditional time units, such as `"days"` or `"quarters"`, evaluate seasonality over the entire year. Quarterly and even monthly data can usually be meaningfully assessed for seasonality over the entire year. With daily data, however, seasonality is generally more meaningfully assessed over the days of the week. For example, sales may typically be higher on Monday than they are on Sunday.
Consider the following daily data for which we wish to evaluate seasonality over the days of the week. To indicate potential seasonality of daily data within a week, specify the time unit with the parameter `ts_unit` set to `"days7"`.
We now have a seasonality coefficient for each day of the week, and these coefficients are projected into the future for forecasting.
If the date value and the y-value are missing, then the nearest adjacent points are connected by a line segment that runs over the missing data value, effectively linearly interpolating the missing value across the two adjacent present values. For example, consider a daily time series related to the Tableau Superstore data such that “2021-01-07” and “2021-01-09” are both present with their corresponding y values, but there is no date value or y value for January 8, that is, “2021-01-08”. To yield a single data value of Sales for each day, aggregate Sales by day.
d <- read.table(text="
Order.Date Sales
2021-01-05 19.536
2021-01-06 473.820
2021-01-06 5.480
2021-01-06 12.780
2021-01-06 609.980
2021-01-06 31.120
2021-01-06 6.540
2021-01-06 19.440
2021-01-07 176.728
2021-01-07 10.430
2021-01-09 9.344
2021-01-09 31.200
2021-01-10 51.940
2021-01-10 2.890",
header = TRUE)
Two sales are recorded on January 7 and two sales are recorded on January 9 but there is no record for any sales or even a date for January 8. The entire row of data for January 8 is missing.
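The aggregation by day can be sketched in base R with `aggregate()`. This is an illustration of the idea rather than the lessR mechanism; the names `daily` and `gaps` are introduced here only for the example. The gap of two days between consecutive dates exposes the missing January 8.

```r
d <- read.table(text="
Order.Date Sales
2021-01-05 19.536
2021-01-06 473.820
2021-01-06 5.480
2021-01-06 12.780
2021-01-06 609.980
2021-01-06 31.120
2021-01-06 6.540
2021-01-06 19.440
2021-01-07 176.728
2021-01-07 10.430
2021-01-09 9.344
2021-01-09 31.200
2021-01-10 51.940
2021-01-10 2.890",
header = TRUE)

# One row per day: sum the individual sales recorded on each date
daily <- aggregate(Sales ~ Order.Date, data = d, FUN = sum)
daily$Order.Date <- as.Date(daily$Order.Date)
print(daily)

# Days between consecutive dates; a value of 2 marks a missing day
gaps <- diff(daily$Order.Date)
print(gaps)
```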
Next, plot the aggregated Sales data by day for dates from January 5 through January 10.
##
## Best guess for the date format: %Y-%m-%d
## If this format is wrong, specify with parameter: ts_format
## To see all possible formats, enter: ?strptime
## Examples: "08/18/2024" format is "%m/%d/%Y"
## "18-08-24" format is "%d-%m-%y"
## "August 18, 2024" format is "%B %d, %Y"
The resulting visualization plots the y-value for January 7 and also for January 9, with a line segment connecting those two points. There is no corresponding label on the x-axis for the missing data value nor is there a plotted point. And, the January 9 value is appropriately placed two days after the January 7 value on the visualization.
In terms of missing data, if the date value exists and the corresponding y-value is missing, with value `NA`, then the visualization leaves the corresponding y-value blank. Here, insert the missing row for January 8 with missing data, `NA`, for that date.
new_row <- data.frame(
Order.Date = "2021-01-08",
Sales = NA
)
d <- rbind(d, new_row)
d <- order_by(d, by=Order.Date)
## Order.Date Sales
## 9 2021-01-07 176.728
## 10 2021-01-07 10.430
## 15 2021-01-08 NA
## 11 2021-01-09 9.344
Now, plot.
##
## Best guess for the date format: %Y-%m-%d
## If this format is wrong, specify with parameter: ts_format
## To see all possible formats, enter: ?strptime
## Examples: "08/18/2024" format is "%m/%d/%Y"
## "18-08-24" format is "%d-%m-%y"
## "August 18, 2024" format is "%B %d, %Y"
There is now a blank space in the visualization for January 8. If it is instead better to treat the missing value as zero sales for that day, specify the value of 0 for the parameter `ts_NA`.
##
## Best guess for the date format: %Y-%m-%d
## If this format is wrong, specify with parameter: ts_format
## To see all possible formats, enter: ?strptime
## Examples: "08/18/2024" format is "%m/%d/%Y"
## "18-08-24" format is "%d-%m-%y"
## "August 18, 2024" format is "%B %d, %Y"
Data can be stored in different types of structures, different forms of organization. `Plot()` can plot a time series from three different data structures: long-format data, wide-format data, and an R time series object.
The previous examples of plotting time series data read data stored in long format. Long format data organizes data with each row of the data table containing only a single measurement. If the entity provides multiple data values, then the data values are stored in multiple rows.
For example, if observations of Apple's stock price are taken monthly, then each row of the data table contains only a single stock price. Or, from another perspective, the data values for each company are each stored on a separate row.
## Month Company Price Volume
## 1 1985-01-01 Apple 0.100055 6366416000
## 2 1985-02-01 Apple 0.085392 4733388800
## 3 1985-03-01 Apple 0.076335 4615587200
## 4 1985-04-01 Apple 0.073316 2868028800
## 5 1985-05-01 Apple 0.059947 4639129600
## 6 1985-06-01 Apple 0.062103 5811388800
Many data analysis and visualization functions across a variety of statistical systems require long format data. As such, this organization of data is the most common data structure but other possibilities do exist.
`Plot()` also reads wide-format data, which stores multiple data values across a single row. No wide-form time series data is available with lessR, so first convert the long-form data as read to the wide form. In the wide form, the three companies each have their own column of data, repeated for each date. Use the lessR function `reshape_wide()` to do the conversion.
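The arguments of `reshape_wide()` are not given in the text, so the following sketch shows the same long-to-wide conversion with base R's `reshape()` instead, as a stand-in. The data are the first two months for each company, taken from the listings in this section; the object names `long` and `wide` are illustrative.

```r
# Long form: one Price per row
long <- data.frame(
  Month   = rep(c("1985-01-01", "1985-02-01"), times = 3),
  Company = rep(c("Apple", "IBM", "Intel"), each = 2),
  Price   = c(0.100055, 0.085392, 11.71846, 11.51437, 0.359457, 0.327310)
)

# Wide form: one row per Month, one Price column per Company
wide <- reshape(long, idvar = "Month", timevar = "Company",
                v.names = "Price", direction = "wide")

# reshape() names the new columns Price.Apple, etc.; keep the company name only
names(wide) <- sub("^Price\\.", "", names(wide))
wide$Month <- as.Date(wide$Month)
print(wide)
```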
## Month Apple IBM Intel
## 1 1985-01-01 0.100055 11.71846 0.359457
## 2 1985-02-01 0.085392 11.51437 0.327310
## 3 1985-03-01 0.076335 11.00154 0.324388
## 4 1985-04-01 0.073316 10.95822 0.321466
## 5 1985-05-01 0.059947 11.14231 0.308315
## 6 1985-06-01 0.062103 10.81489 0.303932
Now the analysis, which repeats a previous analysis, but with wide-form data. Because the data frame is not the default d, explicitly indicate the data frame with the `data` parameter. Specify a range of blue colors from light to dark blue to fill the area under each time series.
`Plot()` can also plot directly from an R time series object, created with the base R `ts()` function.
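As a minimal sketch of creating such an object, the following builds a monthly `ts` object from the first six Apple prices in the long-format listing above. The object name `apple_ts` is illustrative.

```r
# First six monthly Apple prices, January through June 1985
price <- c(0.100055, 0.085392, 0.076335, 0.073316, 0.059947, 0.062103)

# frequency = 12 declares monthly data; start = c(1985, 1) is January 1985
apple_ts <- ts(price, start = c(1985, 1), frequency = 12)
print(apple_ts)
```

According to the text, an object such as `apple_ts` can then be passed to `Plot()` as the variable to plot.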
With the lessR `style()` function, many themes can be selected, such as `"lightbronze"`, `"dodgerblue"`, `"darkred"`, and `"gray"` for gray scale. Calling `style()` with no `theme` or any other parameter value returns to the default theme, `"colors"`.
The annotations in the following visualization consist of the text field "iPhone" with an arrowhead that points to the time that the first iPhone became available. With lessR, list each component of the annotation as a vector for `add`. Any value listed that is not a keyword such as "rect" or "arrow" is interpreted as a text field. Then, in order of their occurrence in the vector for `add`, list the needed coordinates for the objects. To place the text field "iPhone" requires one coordinate, `<x1,y1>`. To place an "arrow" requires two coordinates, `<x1,y1>` and `<x2,y2>`. For example, the second element of the `y1` vector is the `y1` value for the "arrow". The text field does not require a second coordinate, so specify `x2` and `y2` as single elements instead of vectors.
Use the base R `help()` function to view the full manual for `Plot()`. Simply enter a question mark followed by the name of the function.
?Plot
More on Scatterplots, Time Series plots, and other visualizations from lessR and other packages such as ggplot2 at:
Gerbing, D., R Visualizations: Derive Meaning from Data, CRC Press, May, 2020, ISBN 978-1138599635.