Setting up AEM Author Workflow Offloading

July 10, 2018 By Tad Reeves

An issue with AEM that has persisted since the earliest days of the platform is that the Authoring environment has never been good at scaling horizontally. This extends to all of the other things that the Author AEM instance usually does, like image workflows, PDF rendering, video transcoding, and the like.

One model for attempting to horizontally scale the AEM author is to do Workflow Offloading, where you offload the heavy-duty tasks from the AEM author onto a separate AEM instance which is there only to process workflows and then return the payload back to the primary author instance. This has the purported benefit of being able to take major CPU-intensive and I/O-intensive ops and have them executed by a secondary server which is NOT the one your lag-sensitive authoring users are clicking around on.

However, please be warned – setting up offloading is fraught with pitfalls, and you’ll want to be very-super-extra-sure that you really want to go the offloading route before you try, because usually you’ll just be better-served by beefing up your author box and optimizing your workflows.

Diagrams of AEM Offloading Setup Architectures

AEM Offloading Author with Shared NFS/NAS Datastore

Above is the way that Adobe recommends setting up your workflow offloading, based on the offloading best practices document here. Specifically:

You’ve already split your segmentstore and datastore at AEM installation time
You are using FileDatastore on an externalized NAS/NFS mount, and are sharing this datastore with the AEM Offload Instance
You are using binary-less replication to get the workflows and their payloads back and forth between the AEM Author master (leader) instance that your users are logging in to, and the Offload Author(s) that are handling the offloaded workflows.

It is also theoretically possible to do a similar setup using S3Datastore, though such a scenario isn’t explicitly documented by Adobe. (EDIT: after working with Adobe support, a working model was demonstrated on test gear; the steps of such a setup are documented below.)

This setup would look like the following:

Diagram of AEM Offloading with S3 Datastore

Theoretically, offloading also works using an S3 Datastore. When one has an AEM Assets environment spanning several (or tens of) terabytes, it’s obviously advantageous to be able to use S3’s low-cost storage to store AEM assets once, rather than multiplying this storage across a shared-nothing publish environment, all on higher-cost EBS storage.

However, even getting workflow offloading to work at all using S3 is an undocumented and yet extremely effective source of pain and agony. One of the reasons for this is that when an item is uploaded, it’s first written to a local S3 cache while an async call is sent off to persist the binary to S3. There’s no flag in the workflow to wait until the item is persisted out to S3 before kicking off the offload. The offload then attempts binary-less replication of the workflow to the offload author, which tries to access a binary that perhaps is not available yet. Race condition excellence.
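To make the race concrete, here is a toy model in Python. It is only a sketch of the ordering problem, not of how AEM or the S3 connector is implemented: the class, method names, and blob ID are all made up for illustration, and the "async" upload is simulated by an explicit flush() call.

```python
class CachedS3Store:
    """Toy datastore: writes land in a local cache first and reach the
    shared remote store only when flush() runs (standing in for the
    asynchronous upload). All names here are hypothetical."""

    def __init__(self, shared_remote):
        self.local_cache = {}
        self.pending = []            # blob ids staged locally, not yet uploaded
        self.remote = shared_remote  # dict shared by every instance

    def write(self, blob_id, data):
        self.local_cache[blob_id] = data
        self.pending.append(blob_id)

    def flush(self):
        # Simulates the async upload finishing.
        while self.pending:
            blob_id = self.pending.pop(0)
            self.remote[blob_id] = self.local_cache[blob_id]

    def read(self, blob_id):
        if blob_id in self.local_cache:
            return self.local_cache[blob_id]
        return self.remote.get(blob_id)  # None when the blob is not there yet


shared_bucket = {}
master = CachedS3Store(shared_bucket)
worker = CachedS3Store(shared_bucket)

master.write("blob-1", b"image bytes")
early = worker.read("blob-1")  # offload fired before the upload finished
master.flush()
late = worker.read("blob-1")   # same read once the binary is in the bucket
```

Binary-less replication only ships a reference to the blob, so the worker's first read finds nothing, while the identical read after the upload completes succeeds.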

Steps to Configure Workflow Offloading on Authors with binary-less replication & a Shared Amazon S3 Datastore

After working with Adobe support extensively on this issue, the following basic instructions were used to get AEM author workflow offloading working on test instances which were connected based on the diagram above, using a shared S3 datastore.

Configure Master and Worker instances with S3 Datastore

Create folders for two AEM instances: master, worker
Configure the S3 datastore as per this Adobe documentation: https://helpx.adobe.com/experience-manager/6-3/sites/deploying/using/data-store-config.html

Start master AEM instance. Verify that it’s ready and accessible.
Start AEM worker. Verify that it’s ready and accessible.
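The important part of this step is that both instances point at the same bucket. As a rough sketch, the Python below renders one S3 datastore config body and reuses it for both instances. The property names (accessKey, secretKey, s3Bucket, s3Region) and the file name in the comment are my assumptions about the S3 connector config and should be checked against the Adobe documentation linked above; the bucket, region, and credentials are placeholders.

```python
def render_s3_datastore_config(bucket, region, access_key, secret_key):
    """Render the body of an OSGi .config file for the S3 datastore.
    Property names are assumptions; verify them against Adobe's docs."""
    props = {
        "accessKey": access_key,
        "secretKey": secret_key,
        "s3Bucket": bucket,
        "s3Region": region,
    }
    return "\n".join(f'{key}="{value}"' for key, value in props.items()) + "\n"


# Placeholder values, not real credentials.
shared = dict(bucket="my-aem-assets", region="us-east-1",
              access_key="AKIA-PLACEHOLDER", secret_key="SECRET-PLACEHOLDER")

# Assumed target file on each instance:
# crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.S3DataStore.config
master_config = render_s3_datastore_config(**shared)
worker_config = render_s3_datastore_config(**shared)
```

Generating both files from one set of values makes it harder for master and worker to drift onto different buckets by accident.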

Administering Topology

Log in to the OSGi console on the Master instance: http://localhost:4502/system/console/topology
Note the instance Sling ID and confirm that the master AEM is shown as local. Note that if master and worker reside on two different servers, you need to configure the Discovery.Oak Service at http://<host>:<port>/system/console/configMgr/org.apache.sling.discovery.oak.Config
Verify that the worker instance is shown under Connectors.

Log in to the OSGi console on the worker: http://localhost:4504/system/console/topology
Note the instance Sling ID and confirm that the Worker AEM is shown as local. Note that if master and worker reside on two different servers, you need to configure the Discovery.Oak Service at http://<host>:<port>/system/console/configMgr/org.apache.sling.discovery.oak.Config
Verify that the worker’s Outgoing topology connectors point to the master AEM.
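When the two instances live on different servers, the worker's Discovery.Oak Service needs the master's topology connector URL. A small Python sketch of building that value follows. It assumes the standard Sling connector path /libs/sling/topology/connector and the topologyConnectorUrls property name; the hostname is a placeholder.

```python
def topology_connector_url(host, port):
    """Build the URL a worker would list as an outgoing topology connector.
    The path is the usual Sling discovery endpoint (an assumption here)."""
    return f"http://{host}:{port}/libs/sling/topology/connector"


# Hypothetical master host; the value goes into the worker's
# Discovery.Oak Service config (assumed property: topologyConnectorUrls).
worker_discovery_config = {
    "topologyConnectorUrls": [topology_connector_url("master.example.com", 4502)],
}
```

After saving, the master should list the worker under incoming connectors on its topology page, and the worker should list the master under outgoing connectors.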

Configure Topic Consumption

On the master AEM, switch to Offloading Browser at: http://localhost:4502/libs/granite/offloading/content/view.html
Locate Topic: com/adobe/granite/workflow/offloading
Disable the Topic for the master instance.

Turning off automatic agent management

Adobe recommends that you turn off automatic agent management because it does not support binary-less replication and can cause confusion when setting up a new offloading topology. Moreover, it does not automatically support the forward replication flow required by binary-less replication.

Open Configuration Manager from the URL http://localhost:4502/system/console/configMgr.
Open the configuration for OffloadingAgentManager (http://localhost:4502/system/console/configMgr/com.adobe.granite.offloading.impl.transporter.OffloadingAgentManager).
Disable automatic agent management.

Repeat the same steps on the worker instance.

Offloading replication agents

Open replication agents in miscadmin console: http://localhost:4502/miscadmin#/etc/replication/agents.author
Delete all 3 agents named “offloading_replication_agent”, “offloading_outbox” and “offloading_reverse_*” on the Master.
On the Master, create a new replication agent with the Title and Name “offloading_*”, where * is the Sling ID of the Worker.
Edit the properties of this replication agent as follows:

Property	Value
Settings > Serialization Type	Binary less
Transport > Transport URI	http://<ip of worker instance>:<port>/bin/receive?sling:authRequestLogin=1&binaryless=true
Transport > Transport User	Replication user on target instance
Transport > Transport Password	Replication user password on target instance
Extended > HTTP Method	POST
Triggers > Ignore Default	True

Repeat the same steps on the worker, with the following changes:

The Name and Title are “offloading_*”, with * being the Sling ID of the Master
The Transport URI needs to point to the Master
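Since the two agents mirror each other, it can help to derive both from the same inputs. The Python sketch below builds the settings from the table above for each direction. The dictionary keys are just the dialog labels from the table, not JCR property names, and the Sling IDs, hosts, and credentials are placeholders.

```python
def offloading_agent(target_sling_id, target_host, target_port, user, password):
    """Settings for a binary-less offloading agent pointing at the target.
    Keys mirror the dialog labels in the table, not repository properties."""
    name = f"offloading_{target_sling_id}"
    return {
        "Name": name,
        "Title": name,
        "Settings > Serialization Type": "Binary less",
        "Transport > Transport URI": (
            f"http://{target_host}:{target_port}"
            "/bin/receive?sling:authRequestLogin=1&binaryless=true"
        ),
        "Transport > Transport User": user,
        "Transport > Transport Password": password,
        "Extended > HTTP Method": "POST",
        "Triggers > Ignore Default": True,
    }


# Placeholder Sling IDs and hosts; read the real IDs off each topology page.
MASTER_ID = "11111111-aaaa-bbbb-cccc-000000000001"
WORKER_ID = "22222222-aaaa-bbbb-cccc-000000000002"

agent_on_master = offloading_agent(WORKER_ID, "10.0.0.2", 4504, "repl-user", "changeme")
agent_on_worker = offloading_agent(MASTER_ID, "10.0.0.1", 4502, "repl-user", "changeme")
```

The agent created on each instance is named after the Sling ID of the other instance and points its Transport URI at that other instance.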

Offloading the Processing of DAM Assets

On the master AEM, go to the Workflow console and switch to the Launchers tab: http://localhost:4502/libs/cq/workflow/content/console.html
There are four launchers for the DAM Update Asset workflow.
For each launcher, change the workflow drop-down value from “DAM Update Asset” to “DAM Update Asset Offloading”.
On the worker, open the DAM Update Asset workflow and uncheck Transient Workflow, then save the changes.
Open the Workflow launcher console on the worker.
Disable all DAM Update Asset workflow launchers.

Using forward replication

On the Worker, open the configuration for OffloadingDefaultTransporter (http://localhost:4502/system/console/configMgr/com.adobe.granite.offloading.impl.transporter.OffloadingDefaultTransporter, using the worker’s host and port in place of localhost:4502).
Change the value of the property default.transport.agent-to-master.prefix from offloading_outbox to offloading.

Turning off transport packages

On master, open the component configuration of OffloadingDefaultTransporter component at http://localhost:4502/system/console/configMgr/com.adobe.granite.offloading.impl.transporter.OffloadingDefaultTransporter
Disable the property Replication Package (default.transport.contentpackage).
Repeat the same step on the worker.

Disabling the transport of workflow model

On master, open the workflow console from http://localhost:4502/libs/cq/workflow/content/console.html.
Open the Models tab.
Open the DAM Update Asset Offloading workflow model.
Open step properties for the DAM Workflow Offloading step.
Open the Arguments tab, and unselect the Add Model To Input and Add Model To Output options.
Save the changes to the model.

Optimizing the polling interval

Workflow offloading is implemented using an external workflow on the master that polls for the completion of the offloaded workflow on the worker. The default polling interval for the external workflow processes is five seconds. Adobe recommends that you increase the polling interval of the Assets offloading step to at least 15 seconds to reduce the offloading overhead on the master.

Open the workflow console from http://localhost:4502/libs/cq/workflow/content/console.html.
Open the Models tab.
Open the DAM Update Asset Offloading workflow model.
Open the step properties for the DAM Workflow Offloading step.
Open the Commons tab, and adjust the value of the Period property.
Save the changes to the model.
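For a rough feel of what the longer period buys you, the Python below counts polls per offloaded workflow. It assumes, as a simplification, that the master polls at a fixed period until the worker finishes; the 60-second workflow duration and the batch of 100 assets are made-up numbers.

```python
import math


def polls_per_workflow(workflow_seconds, period_seconds):
    """Approximate number of completion polls for one offloaded workflow,
    assuming fixed-period polling until the worker is done."""
    return math.ceil(workflow_seconds / period_seconds)


# Hypothetical workload: each offloaded workflow takes about 60 seconds.
default_polls = polls_per_workflow(60, 5)    # default 5-second period
tuned_polls = polls_per_workflow(60, 15)     # recommended 15-second minimum

# For a batch of 100 assets in flight, the difference adds up.
batch_default = 100 * default_polls
batch_tuned = 100 * tuned_polls
```

Under these assumptions the master makes a third as many polls, at the cost of noticing a finished workflow up to 15 seconds late instead of up to 5.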

Testing

Upload an image to Assets on the master.
Verify that the same image appears in Assets on the Worker.
Check the worker workflow console > Archive tab and note the DAM Update Asset workflow instance.
Check the master workflow console > Archive tab and note the DAM Update Asset Offloading instance.
Check the error.log on the Worker and ensure the processing is executed here:

…

17.08.2018 23:12:39.349 *INFO* [JobHandler: /etc/workflow/instances/server0/2018-08-17/update_asset_5:/content/dam/myfolder/BinaryLogs.jpg/jcr:content/renditions/original] com.adobe.xmp.worker.files.ncomm.XMPFilesNComm [PERF][EXECUTE_START] | C:\Users\ela\AppData\Local\Temp\cq-dam-wf-file8971817675171803421.tmp | XMP extraction

…

Check the error.log on the Master and ensure the renditions are retrieved from Worker:

…

17.08.2018 23:12:37.534 *INFO* [JobHandler: /etc/workflow/instances/server0/2018-08-17/dam-xmp-writeback_6:/content/dam/myfolder/BinaryLogs.jpg/jcr:content/metadata] com.day.cq.dam.core.process.XMPWritebackProcess payload path :/content/dam/myfolder/BinaryLogs.jpg/jcr:content/metadata
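If you would rather not eyeball the logs, a small Python parser can pull the workflow instance and payload out of lines like the ones above. This is a sketch that assumes the error.log layout shown here (timestamp, level, bracketed thread name, logger, message); the regular expressions are mine and may need adjusting for other log patterns.

```python
import re

LINE_RE = re.compile(
    r"^(?P<timestamp>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\*(?P<level>\w+)\* \[(?P<thread>.*?)\] (?P<logger>\S+) (?P<message>.*)$"
)
THREAD_RE = re.compile(
    r"^JobHandler: (?P<instance>/etc/workflow/instances/\S+?):(?P<payload>/content/\S+)$"
)


def parse_workflow_line(line):
    """Return a dict of fields for a workflow JobHandler log line, else None."""
    match = LINE_RE.match(line)
    if not match:
        return None
    thread = THREAD_RE.match(match.group("thread"))
    if not thread:
        return None
    return {**match.groupdict(), **thread.groupdict()}


sample = (
    "17.08.2018 23:12:37.534 *INFO* [JobHandler: /etc/workflow/instances/"
    "server0/2018-08-17/dam-xmp-writeback_6:/content/dam/myfolder/"
    "BinaryLogs.jpg/jcr:content/metadata] "
    "com.day.cq.dam.core.process.XMPWritebackProcess payload path "
    ":/content/dam/myfolder/BinaryLogs.jpg/jcr:content/metadata"
)
parsed = parse_workflow_line(sample)
```

Running this over the error.log of each instance and grouping by the instance field shows which workflow instances ran where, which is the whole point of the test.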