parabar with foreach

The goal of this article is to provide a minimal example of how to
use the parabar and
foreach
packages together. The foreach package is a popular package
that provides syntactic sugar for executing tasks sequentially (i.e.,
via the %do% operator) or in parallel (i.e., via the
%dopar% operator). In this article, I will provide a brief
introduction to the foreach package and show how it can be
used to run tasks in parallel with the parabar package. If
you are not yet familiar with the parabar package, make
sure to check out the documentation for
information on how to get started.
In a nutshell, the foreach package provides a way to
iterate over a collection of elements. For iterating over the respective
collection sequentially, one can use the %do% operator as
follows:
# Load the library.
library(foreach)
# For each element.
foreach(i = 1:5) %do% {
    # Do something.
    i * 2
}
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 6
#>
#> [[4]]
#> [1] 8
#>
#> [[5]]
#> [1] 10

In this example, the line library(foreach)
loads the foreach package, making all of its functions
and operators available in the main session. More interestingly, the
call foreach(i = 1:5)
takes the named argument i = 1:5 provided as input and
returns an iterator object of class foreach. Then, the
%do% operator is used to execute the expression on the
right-hand side of the operator (i.e., i * 2)
for each element of the iterator object.
Note. The foreach::foreach function may take
additional arguments that control the behavior of the iteration process,
accumulation of the results, and the task execution. For example, by
default, the foreach::foreach function returns the
accumulated results as a list. However, the
foreach::foreach function can take a .combine argument
that specifies how the results of each iteration should be combined into
a single object. Specifying, for instance, .combine = c for
the example above instructs foreach::foreach that we expect
the results back as a vector instead of a list:
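A minimal sketch of this variant, reusing the loop from the first example (the variable name doubled is only illustrative):

```r
# Load the library.
library(foreach)

# For each element, combine the results with `c` into a vector.
doubled <- foreach(i = 1:5, .combine = c) %do% {
    # Do something.
    i * 2
}

# Print the combined results.
print(doubled)
#> [1]  2  4  6  8 10
```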
Moreover, using the .final argument, we can provide a
function that acts on the accumulated results right before they are
returned to the user. This is useful when we want to perform some
final operation on the results before returning them. For example,
suppose we want to sum the results of the iterations. We can do this as
follows:
# For each element.
foreach(i = 1:5, .combine = c, .final = sum) %do% {
    # Do something.
    i * 2
}
#> [1] 30

As you may have noticed, the arguments that pertain to the behavior
of the foreach::foreach function are prepended with a dot.
There are more arguments available. For a complete list, see the
documentation for foreach::foreach and the vignette Using
the foreach package.
If we want to run a task in parallel, we need to provide a backend
that supports parallelizing the task. Since the foreach
package is not a parallelization package per se, it does not provide a
backend for parallelizing tasks by default. Instead, it provides a
flexible mechanism to register any parallelization backend with it, as
long as that backend supports the %dopar% operator.
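As a small illustration, assuming a fresh R session in which no backend has been registered, the %dopar% operator is expected to fall back on sequential execution, and foreach emits a warning saying so. The variable name fallback is only illustrative.

```r
# Load the library.
library(foreach)

# No backend has been registered in this session, so `%dopar%` is expected
# to run the task sequentially (with a warning), much like `%do%` would.
fallback <- foreach(i = 1:3, .combine = c) %dopar% {
    i * 2
}

# The results are the same as in the sequential case.
print(fallback)
#> [1] 2 4 6
```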
The workflow for running a task in parallel with the
foreach package involves:
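1. Creating a parallelization backend.
2. Registering the backend with the foreach package.
3. Running the task in parallel via the %dopar% operator.

While the parabar package provides synchronous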
and asynchronous
parallelization backends, it does not work out of the box with the
foreach package. This is where the doParabar
package comes into play. The doParabar package encapsulates the
necessary logic to adapt parabar backends to work
seamlessly with the foreach package.
At a high level, the doParabar package consists of two
main functions:
- doPar: provides an implementation for the %dopar% operator (i.e.,
  think of it as an adapter that connects the foreach and
  parabar packages). This function implements the various
  arguments of the foreach::foreach function and determines
  how the tasks are parallelized using a parabar backend.
- registerDoParabar: registers the doPar implementation with the
  foreach package. This function sets up the necessary hooks
  in the foreach package to use the doPar
  implementation for the %dopar% operator. In other words, it
  tells foreach that as long as a parabar
  backend is registered, it should use the doPar
  implementation in doParabar for the %dopar% operator.

Note. Two particularly relevant
foreach::foreach arguments in the context of parallelizing
R code are .export and .packages.
The .export argument specifies the variables that need to
be exported to the backend, while the .packages argument
specifies the packages that need to be loaded on the backend.
doParabar

Unlike other foreach adapter packages out there (e.g.,
doParallel), the doParabar package does
not automatically load other packages. Instead, I recommend explicitly
loading the necessary packages in your scripts. In a similar
vein, R package developers should add the necessary
packages to the Imports field in the
DESCRIPTION file of their package. Therefore, the first
step in using parabar with foreach is to load
the necessary packages:
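A minimal sketch of this step, assuming all three packages are installed:

```r
# Load the adapter, the backend provider, and the looping construct.
library(doParabar)
library(parabar)
library(foreach)
```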
Next, we proceed by using parabar to create an asynchronous
parallelization backend that supports progress tracking as follows:
# Create an asynchronous `parabar` backend.
backend <- start_backend(
    cores = 2, cluster_type = "psock", backend_type = "async"
)

At this point, we have a parallelization backend that we can register
with the foreach package. We do this by passing the backend to the
registerDoParabar function (i.e., registerDoParabar(backend)).
To verify that the backend has been registered successfully, we can
use some of the functions provided by the foreach package to
query information about the backend:
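For instance, foreach exports the getDoParRegistered, getDoParName, and getDoParWorkers functions for this purpose. The sketch below only shows the calls; what they return depends on the backend registered in your session, so no output is shown.

```r
# Load the library.
library(foreach)

# Check whether a parallel backend is currently registered.
getDoParRegistered()

# Get the name of the registered backend.
getDoParName()

# Get the number of workers reported by the registered backend.
getDoParWorkers()
```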
Now, we can use the %dopar% operator to run tasks in
parallel. For example:
# Define some variables unknown to the backend.
x <- 10
y <- 100
z <- "Not to be exported."
# Use the registered backend to run a task in parallel via `foreach`.
results <- foreach(
    i = 1:300, .export = c("x", "y"), .combine = c
) %dopar% {
    # Sleep a bit to simulate a long-running task.
    Sys.sleep(0.01)
    # Compute and return.
    i + x + y
}
#> completed 0 out of 300 tasks [ 0%] [ 0s]
#> ...
#> completed 60 out of 300 tasks [ 20%] [ 1s]
#> ...
#> completed 300 out of 300 tasks [100%] [ 2s]

Note. The doParabar package does not
automatically export objects (or packages, for that matter) to the
backend. While this breaks “tradition” with other foreach
adapter packages, it is a deliberate design choice made to encourage
users to keep their scripts tidy and be mindful of what they export to
the backend (see the .export,
.noexport, and .packages arguments of the
foreach function).
We can verify that objects are not automatically exported to the
backend by checking the value of the z variable on the
backend. We expect this call to throw an error, since z was
never exported to the backend:
# Verify that the variable `z` was not exported.
try(evaluate(backend, z))
#> Error : ! in callr subprocess.
#> Caused by error in `checkForRemoteErrors(lapply(cl, recvResult))`:
#> ! 2 nodes produced errors; first error: object 'z' not found

Finally, we can stop the backend when we are done with it, as we would normally do, by calling stop_backend(backend).
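Putting the steps of this section together, a minimal end-to-end sketch could look as follows. It assumes the doParabar, parabar, and foreach packages are installed. The variable offset and the small 1:4 range are illustrative and not part of the example above.

```r
# Load the packages explicitly (nothing is attached automatically).
library(doParabar)
library(parabar)
library(foreach)

# 1. Create a `parabar` backend (same settings as in the example above).
backend <- start_backend(
    cores = 2, cluster_type = "psock", backend_type = "async"
)

# 2. Register the backend with `foreach`.
registerDoParabar(backend)

# 3. Run a task in parallel, exporting only what the task needs.
offset <- 10
results <- foreach(
    i = 1:4, .export = "offset", .combine = c
) %dopar% {
    # Compute and return.
    i + offset
}

# 4. Stop the backend when done.
stop_backend(backend)
```

Stopping the backend as the last step releases the worker processes once the results have been collected in the main session.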
In this article, I provided a short introduction on how to run tasks
in parallel on parabar
backends using foreach
semantics. This integration is possible via the doParabar
package, which provides an implementation for the %dopar%
operator (i.e., the doPar function) and a function to
register the implementation with the foreach package (i.e.,
the registerDoParabar function). The source code for the
doParabar package can be consulted on GitHub
at github.com/mihaiconstantin/doParabar.
I kindly welcome any feedback or contributions to improving
parabar or doParabar.