Generates a synthetic version of a data.frame, with similar characteristics to the original. See Details for the algorithm used.
Usage
synthetic(
data,
model_expression = ranger(x = x, y = y),
predict_expression = predict(model, data = xsynth)$predictions,
missingness_expression = NULL,
verbose = TRUE
)
Arguments
- data
A data.frame of which to make a synthetic version.
- model_expression
An R-expression to estimate a model. Defaults to ranger(x = x, y = y), which uses the fast implementation of random forests in ranger. The expression is evaluated in an environment containing objects x and y, where x is a data.frame with the predictor variables, and y is a vector of outcome values (see Details).
- predict_expression
An R-expression to generate predicted values based on the model estimated by model_expression. Defaults to predict(model, data = xsynth)$predictions. This expression must return a vector of predicted values. The expression is evaluated in an environment containing objects model and xsynth, where model is the model estimated by model_expression, and xsynth is the data.frame of synthetic data used to predict the next column (see Details).
- missingness_expression
Optional. An R-expression to impute missing values. Defaults to NULL, which means listwise deletion is used. The expression is evaluated in an environment containing the object data, as specified in the call to synthetic. It must return a data.frame with the same dimensions and column names as the original data. For example, use missingness_expression = missRanger::missRanger(data = data) for a fast implementation of the excellent 'missForest' single imputation technique.
- verbose
Logical. Defaults to TRUE. Whether to show a progress bar while running the algorithm and provide informative messages.
Details
Based on the work by Nowok, Raab, and Dibben (2016), this function uses a simple algorithm to generate a synthetic dataset with similar characteristics to the original. The algorithm is as follows:
Let x be the original data.frame, with columns 1:j
Let xsynth be a synthetic data.frame, with columns 1:j
Column 1 of xsynth is a bootstrapped version of column 1 of x
Using model_expression, a predictive model is built for column c, for c along 2:j, with c predicted from columns 1:(c-1) of the original data.
Using predict_expression, columns 1:(c-1) of the synthetic data are used to predict synthetic values for column c.
Variables are thus imputed in order of occurrence in the data.frame. To impute in a different order, reorder the data.
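The steps above can be sketched in base R. This is a minimal illustration for numeric columns only, not the implementation used by synthetic: lm() stands in for the default ranger model, and the helper name synth_sketch is hypothetical.

```r
# Hypothetical helper: sequential synthesis for an all-numeric data.frame.
# lm() is an assumption made for this sketch; synthetic() defaults to ranger.
synth_sketch <- function(x) {
  stopifnot(is.data.frame(x), all(vapply(x, is.numeric, logical(1))))
  n <- nrow(x)
  # Column 1 of the synthetic data is a bootstrap of column 1 of the original
  xsynth <- data.frame(sample(x[[1]], size = n, replace = TRUE))
  names(xsynth) <- names(x)[1]
  for (col_idx in seq_len(ncol(x))[-1]) {
    predictors <- x[, seq_len(col_idx - 1), drop = FALSE]
    # Fit column col_idx on the preceding columns of the ORIGINAL data
    model <- lm(.outcome ~ .,
                data = data.frame(.outcome = x[[col_idx]], predictors))
    # Predict column col_idx from the preceding columns of the SYNTHETIC data
    xsynth[[names(x)[col_idx]]] <- unname(predict(model, newdata = xsynth))
  }
  xsynth
}

set.seed(1)
syn <- synth_sketch(iris[, 1:4])
head(syn)
```

Because each column is predicted only from the columns before it, changing the column order of the input changes which relationships the models can capture.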
Note that, for data synthesis to work properly, it is essential that the class of variables is defined correctly. The default algorithm ranger supports numeric, integer, factor, and logical data. Other types of variables should be converted to one of these types. Alternatively, users can supply a custom model_expression and predict_expression to use a different algorithm when calling synthetic.
As demonstrated in the example, users could call lm as a model_expression to use linear regression, which preserves linear marginal relationships but can give rise to values out of range of the original data. Or users could call sample as a predict_expression to bootstrap each variable, a very quick solution that maintains univariate distributions but loses all marginal relationships. These examples are not exhaustive, and users can even create custom functions.
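The trade-offs described above can be illustrated without calling synthetic. The following base-R sketch resamples two iris columns independently, roughly as a sample()-based predict_expression would, and fits a simple lm() to show how a linear prediction can leave the observed range. The object names are chosen for this illustration only.

```r
set.seed(42)
orig <- iris[, c("Petal.Length", "Petal.Width")]

# Bootstrap each column independently of the other
boot <- as.data.frame(
  lapply(orig, function(y) sample(y, size = length(y), replace = TRUE))
)

# Univariate values are preserved: every resampled value occurs in the original
all(boot$Petal.Length %in% orig$Petal.Length)

# The relationship between the two columns is not preserved
cor_orig <- cor(orig$Petal.Length, orig$Petal.Width)
cor_boot <- cor(boot$Petal.Length, boot$Petal.Width)
c(original = cor_orig, bootstrapped = cor_boot)

# A linear model keeps the relationship but can predict outside the data range
fit <- lm(Petal.Width ~ Petal.Length, data = orig)
pred <- predict(fit,
                newdata = data.frame(Petal.Length = min(orig$Petal.Length)))
c(predicted = unname(pred), observed_min = min(orig$Petal.Width))
```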
References
Nowok, B., Raab, G. M., and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11
Examples
if (FALSE) {
# Example using the iris dataset and default ranger algorithm
iris_syn <- synthetic(iris)
# Example using lm as prediction algorithm (only works for numeric variables)
# note that, within the model_expression, a new data.frame is created because
# lm() requires a separate data argument. The model is fit on the original
# predictors x and outcome y; xsynth is only used for prediction:
dat <- iris[, 1:4]
synthetic(dat,
          model_expression = lm(.outcome ~ .,
                                data = data.frame(.outcome = y,
                                                  x)),
          predict_expression = predict(model, newdata = xsynth))
}
# Example using bootstrapping:
synthetic(iris,
model_expression = NULL,
predict_expression = sample(y, size = length(y), replace = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.6 3.3 4.6 0.2 virginica
#> 2 4.4 3.8 3.9 0.4 setosa
#> 3 6.1 2.7 5.0 0.2 versicolor
#> 4 5.4 2.9 1.6 1.8 setosa
#> 5 5.6 3.3 1.5 2.3 versicolor
#> 6 4.4 2.8 4.5 0.2 setosa
#> 7 5.9 3.2 1.4 1.3 versicolor
#> 8 4.9 3.0 1.4 1.3 versicolor
#> 9 6.5 3.1 3.6 1.9 virginica
#> 10 5.4 3.0 4.9 0.2 versicolor
#> 11 5.1 3.5 4.9 0.2 setosa
#> 12 5.6 2.7 1.5 2.2 setosa
#> 13 6.4 3.0 1.5 1.8 versicolor
#> 14 5.6 2.7 1.5 1.4 versicolor
#> 15 5.6 3.3 4.0 1.6 versicolor
#> 16 4.8 2.5 4.2 1.6 virginica
#> 17 5.0 3.1 4.2 1.3 setosa
#> 18 7.3 2.8 4.2 2.4 virginica
#> 19 5.8 3.0 1.3 2.1 virginica
#> 20 6.1 3.2 4.1 0.2 versicolor
#> 21 5.2 3.2 1.5 0.2 setosa
#> 22 4.6 2.8 1.7 0.3 versicolor
#> 23 6.4 3.4 4.5 2.1 setosa
#> 24 4.9 3.4 3.5 2.3 virginica
#> 25 7.2 3.1 5.1 2.5 virginica
#> 26 6.7 3.2 5.5 1.8 setosa
#> 27 7.6 3.1 1.7 0.4 setosa
#> 28 7.4 2.8 1.5 2.1 setosa
#> 29 4.9 3.3 1.3 1.7 versicolor
#> 30 5.6 2.2 4.5 1.9 versicolor
#> 31 6.3 4.1 1.9 1.0 setosa
#> 32 5.9 3.2 1.6 2.3 versicolor
#> 33 6.3 2.5 1.3 2.0 versicolor
#> 34 4.9 3.0 1.4 0.2 virginica
#> 35 6.1 2.6 4.9 1.4 virginica
#> 36 5.1 2.2 4.5 0.1 virginica
#> 37 4.8 3.2 5.1 1.0 versicolor
#> 38 5.0 3.1 5.7 1.5 versicolor
#> 39 4.5 3.6 3.0 1.8 setosa
#> 40 6.7 3.2 1.5 2.3 virginica
#> 41 5.8 3.0 4.9 1.1 virginica
#> 42 7.1 3.1 1.4 0.4 setosa
#> 43 5.0 2.7 1.9 1.3 setosa
#> 44 5.2 2.8 1.3 2.3 virginica
#> 45 4.3 3.0 3.3 0.1 setosa
#> 46 5.0 3.4 4.1 0.2 setosa
#> 47 5.8 2.7 4.4 0.3 setosa
#> 48 6.3 3.1 1.5 0.2 virginica
#> 49 5.4 3.5 5.8 2.4 setosa
#> 50 4.8 2.7 3.5 0.2 setosa
#> 51 5.7 3.4 1.4 0.3 setosa
#> 52 6.8 2.5 1.4 1.8 versicolor
#> 53 5.4 2.8 4.5 0.2 versicolor
#> 54 6.5 3.4 5.0 2.0 versicolor
#> 55 5.4 2.6 1.3 1.3 virginica
#> 56 5.0 3.9 5.0 1.5 versicolor
#> 57 4.6 2.8 4.6 2.0 versicolor
#> 58 5.5 3.4 5.6 1.3 versicolor
#> 59 6.3 3.4 1.4 1.0 virginica
#> 60 7.0 3.0 1.6 2.5 versicolor
#> 61 6.5 3.4 4.6 0.4 setosa
#> 62 6.7 3.2 3.3 1.3 virginica
#> 63 4.8 2.9 5.7 1.3 virginica
#> 64 6.9 2.9 1.6 0.2 virginica
#> 65 5.1 3.0 1.4 1.2 versicolor
#> 66 4.6 3.4 1.5 1.6 virginica
#> 67 5.0 3.8 5.0 1.8 versicolor
#> 68 5.5 2.9 1.2 1.3 virginica
#> 69 5.0 3.1 4.0 1.4 virginica
#> 70 6.3 3.4 1.5 1.2 virginica
#> 71 7.0 3.5 6.1 2.0 setosa
#> 72 6.1 2.9 4.6 1.8 versicolor
#> 73 6.7 3.0 1.9 2.0 versicolor
#> 74 4.8 3.2 1.4 1.0 virginica
#> 75 4.4 2.4 1.5 1.8 setosa
#> 76 4.7 3.2 5.6 1.7 setosa
#> 77 4.6 3.2 3.3 2.4 versicolor
#> 78 4.8 3.0 6.1 2.1 virginica
#> 79 4.4 3.0 1.4 2.1 setosa
#> 80 6.0 2.6 5.4 1.3 setosa
#> 81 4.5 3.4 1.5 1.9 virginica
#> 82 5.1 3.5 4.0 0.3 versicolor
#> 83 6.8 2.3 4.9 1.6 versicolor
#> 84 5.8 2.0 4.8 1.5 virginica
#> 85 5.0 2.8 5.1 2.5 virginica
#> 86 5.4 3.1 5.6 1.8 setosa
#> 87 4.9 2.8 6.1 1.0 setosa
#> 88 5.2 2.7 5.8 1.4 virginica
#> 89 6.1 3.4 1.6 1.4 setosa
#> 90 5.8 4.2 4.5 1.4 setosa
#> 91 6.9 3.2 4.4 1.6 versicolor
#> 92 6.7 3.2 1.4 1.4 setosa
#> 93 5.5 3.5 4.5 0.1 virginica
#> 94 7.3 3.1 6.0 1.1 versicolor
#> 95 7.9 3.5 3.0 0.2 versicolor
#> 96 5.8 2.7 6.4 1.9 virginica
#> 97 5.7 2.6 4.2 1.3 versicolor
#> 98 6.4 2.3 1.5 0.2 setosa
#> 99 6.7 3.1 4.8 2.4 virginica
#> 100 4.8 3.9 4.5 0.2 versicolor
#> 101 5.1 3.8 4.7 2.5 versicolor
#> 102 5.6 2.4 1.7 1.8 versicolor
#> 103 4.9 4.1 5.6 1.3 setosa
#> 104 6.3 3.9 1.0 1.4 virginica
#> 105 5.6 2.9 5.7 1.0 setosa
#> 106 5.2 3.3 1.4 1.3 versicolor
#> 107 5.3 3.1 5.7 0.2 versicolor
#> 108 6.0 4.4 1.3 0.2 versicolor
#> 109 4.6 2.7 4.7 1.9 setosa
#> 110 6.3 3.5 1.4 1.0 versicolor
#> 111 4.8 3.4 3.0 1.3 versicolor
#> 112 5.5 3.1 4.3 1.5 virginica
#> 113 5.1 3.3 1.1 2.2 setosa
#> 114 6.1 2.8 4.7 2.4 virginica
#> 115 6.1 2.7 5.2 2.5 virginica
#> 116 6.3 2.0 1.7 0.2 setosa
#> 117 4.8 2.8 4.8 2.0 versicolor
#> 118 4.7 3.0 4.7 1.3 virginica
#> 119 6.8 2.4 4.7 1.2 versicolor
#> 120 6.1 3.0 1.3 0.3 virginica
#> 121 6.9 3.0 1.5 1.3 versicolor
#> 122 6.7 4.2 1.4 1.5 setosa
#> 123 5.9 3.4 5.1 1.0 versicolor
#> 124 5.0 3.9 1.6 1.5 versicolor
#> 125 6.1 2.8 1.3 0.2 virginica
#> 126 6.3 2.0 4.0 1.3 setosa
#> 127 6.5 3.6 1.7 1.3 setosa
#> 128 7.2 3.2 3.7 1.4 versicolor
#> 129 6.3 2.5 1.6 0.2 setosa
#> 130 4.4 3.2 5.2 2.0 setosa
#> 131 5.2 3.2 4.5 1.5 versicolor
#> 132 5.7 3.4 4.7 0.2 versicolor
#> 133 6.7 3.0 5.6 2.3 setosa
#> 134 5.8 3.4 5.1 0.4 virginica
#> 135 4.9 2.9 3.9 0.2 setosa
#> 136 5.4 3.6 3.0 2.5 versicolor
#> 137 5.6 3.1 5.5 0.2 setosa
#> 138 5.2 3.3 3.6 1.4 setosa
#> 139 4.8 2.5 5.0 0.2 versicolor
#> 140 6.7 3.0 5.6 2.0 setosa
#> 141 6.4 3.1 4.9 1.9 virginica
#> 142 4.4 2.5 1.5 0.6 virginica
#> 143 6.4 3.1 4.6 0.1 setosa
#> 144 7.7 3.6 1.6 0.3 setosa
#> 145 6.5 3.0 1.7 1.9 versicolor
#> 146 5.2 2.7 1.4 1.9 setosa
#> 147 6.0 3.2 1.3 2.5 virginica
#> 148 6.0 2.8 1.4 0.2 setosa
#> 149 5.5 2.8 5.1 1.8 setosa
#> 150 4.5 3.2 5.1 2.0 versicolor
if (FALSE) {
# Example with missing data, no imputation
iris_missings <- iris
for(i in 1:10){
  iris_missings[sample.int(nrow(iris_missings), 1, replace = TRUE),
                sample.int(ncol(iris_missings), 1, replace = TRUE)] <- NA
}
iris_miss_syn <- synthetic(iris_missings)
# Example with missing data, imputation by median/mode substitution
# First, define a simple function for median/mode substitution:
imp_fun <- function(x){
  if(is.data.frame(x)){
    # lapply() keeps each column's class; sapply() would simplify the
    # columns to a matrix of one common type and lose the factor class
    return(data.frame(lapply(x, imp_fun)))
  } else {
    out <- x
    if(is.numeric(x)){
      # Numeric (including integer) columns: substitute the median
      out[is.na(out)] <- median(x, na.rm = TRUE)
    } else {
      # Other columns: substitute the most frequent value (mode)
      out[is.na(out)] <- names(sort(table(out), decreasing = TRUE))[1]
    }
    out
  }
}
# Then, call synthetic() with this function as missingness_expression:
iris_miss_syn <- synthetic(iris_missings,
                           missingness_expression = imp_fun(data))
}