28 Nov 2014

Troubleshooting SMA (Service Management Automation) – Part 1

As I work more and more with SMA in my daily job, I’m of course also running into situations where things go wrong. I decided to share some troubleshooting tips, so this is the first post in a series of three explaining how to troubleshoot the SMA infrastructure, failing jobs, stuck / stalled jobs, etc.

In this post I’m going to focus on the job states and give some insight into how SMA distributes jobs. If you have no clue what I’m writing about, have a look here.

Job States

SMA has different states for jobs as they are executed.

1 = QUEUED
The runbook job has been assigned to a worker but is still waiting to be picked up for execution
2 = STARTING (ACTIVATING)
When a runbook gets started, a new instance of a runbook job is created and assigned to a runbook worker
3 = RUNNING
The runbook is currently executing on the assigned worker
4 = COMPLETED
The runbook has finished without any terminating errors
5 = FAILED
The runbook execution failed (usually due to an internal error or runbook code syntax error)
6 = STOPPED
The runbook was stopped, either because of a terminating error condition or by a manual stop call (for example Stop-SmaJob)
8 = SUSPENDED
The runbook has been paused (suspended), caused either by a manual suspend call or by a non-terminating error
11 = STOPPING
The runbook execution is being terminated (the sandbox is discarded)
12 = RESUMING
The runbook continues from the last known good state (workflow checkpoint) after a non-terminating error or a manual suspend action
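
If you query the database directly you only get the numeric IDs back, so a small lookup helper can be handy. The following is a minimal sketch: the IDs are taken from the list above, while the function name Get-SmaJobStateName and the mixed-case state names are my own assumptions for illustration and not part of the SMA module.

```powershell
# State IDs as listed above. IDs that are not in the list (7, 9, 10) are
# deliberately left out and reported as unknown.
$SmaJobStates = @{
    1  = 'Queued'
    2  = 'Starting'
    3  = 'Running'
    4  = 'Completed'
    5  = 'Failed'
    6  = 'Stopped'
    8  = 'Suspended'
    11 = 'Stopping'
    12 = 'Resuming'
}

# Hypothetical helper: translate a StatusId into a readable state name.
function Get-SmaJobStateName {
    param(
        [Parameter(Mandatory = $true)]
        [int]$StatusId
    )

    if ($SmaJobStates.ContainsKey($StatusId)) {
        return $SmaJobStates[$StatusId]
    }
    return "Unknown ($StatusId)"
}

Get-SmaJobStateName -StatusId 8
```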

The job state info can be gathered via PowerShell (through the SMA web service) or via T-SQL against the SMA database.

PowerShell example (requires the SMA PowerShell module to be installed)

Get-SmaJob -WebServiceEndpoint "https://smaweb.domain.com" | Where-Object {$_.JobStatus -eq 'Failed'}

T-SQL example

USE [SMA]
SELECT * FROM [Core].[Jobs]
WHERE StatusId = 6 -- 6 = STOPPED according to the list above; use 5 for FAILED
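
When troubleshooting, I often want a quick overview of how many jobs are in which state before digging into a single one. Here is a rough sketch of how that could be done. Get-JobStateSummary is a hypothetical helper name; it only assumes that the piped objects have a JobStatus property, as the Get-SmaJob output used above does.

```powershell
# Hypothetical helper: count jobs per state, most frequent state first.
function Get-JobStateSummary {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [object]$Job
    )

    begin {
        $all = New-Object 'System.Collections.Generic.List[object]'
    }
    process {
        # Collect every job coming down the pipeline
        $all.Add($Job)
    }
    end {
        $all |
            Group-Object -Property JobStatus |
            Sort-Object -Property Count -Descending |
            Select-Object -Property Name, Count
    }
}

# Against a live endpoint (the URL is the placeholder from the example above):
# Get-SmaJob -WebServiceEndpoint "https://smaweb.domain.com" | Get-JobStateSummary
```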

How job execution works in SMA

Assuming you have multiple runbook workers deployed (I hope you do), SMA distributes the runbook executions among all active workers. Static random partitions are used as the distribution mechanism. This basically means that each runbook job is assigned to a runbook worker at launch time. Each worker owns a fixed, numbered range (its queue). At launch time, the job gets a random partition ID, which is nothing more than a random number that falls into the partition range of one of the worker queues. The partitioning info can be found by querying two tables: Core.Jobs and Queues.Deployment

There is no least-loaded logic and no round robin; pure randomness decides.
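
To make the idea a bit more tangible, here is a small simulation of static random partitioning. It is only a sketch of the principle described above, not SMA's actual implementation: the worker names, the range boundaries and the function names are all made up for illustration.

```powershell
# Hypothetical layout: three workers, each owning a fixed partition range.
$WorkerQueues = @(
    [pscustomobject]@{ Worker = 'SMAWORKER01'; Low = 0;  High = 9  }
    [pscustomobject]@{ Worker = 'SMAWORKER02'; Low = 10; High = 19 }
    [pscustomobject]@{ Worker = 'SMAWORKER03'; Low = 20; High = 29 }
)

# Find the worker whose fixed range contains the given partition ID.
function Get-WorkerForPartition {
    param(
        [Parameter(Mandatory = $true)]
        [int]$PartitionId,
        [Parameter(Mandatory = $true)]
        [object[]]$Queues
    )

    $Queues |
        Where-Object { $PartitionId -ge $_.Low -and $PartitionId -le $_.High } |
        Select-Object -First 1 -ExpandProperty Worker
}

# Draw a random partition ID at "launch time" and look up the owning worker.
# The current load of the workers plays no role at all.
function New-JobAssignment {
    param(
        [Parameter(Mandatory = $true)]
        [object[]]$Queues
    )

    [int]$min = ($Queues | Measure-Object -Property Low -Minimum).Minimum
    [int]$max = ($Queues | Measure-Object -Property High -Maximum).Maximum

    # -Maximum is exclusive for Get-Random, hence the + 1
    $partitionId = Get-Random -Minimum $min -Maximum ($max + 1)
    $worker = Get-WorkerForPartition -PartitionId $partitionId -Queues $Queues

    [pscustomobject]@{ PartitionId = $partitionId; Worker = $worker }
}

# Simulate five job launches
1..5 | ForEach-Object { New-JobAssignment -Queues $WorkerQueues }
```

Because the ranges are fixed, a job whose partition ID falls into the range of a worker that is down stays with that worker.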

After the job has been assigned to a worker, the Runbook Service on the worker starts a process called Orchestrator.Sandbox. This process is in charge of compiling and executing the runbook code.
Now imagine something goes wrong with the worker: the Runbook Service dies or the worker server goes down. What will happen to the runbooks currently being started or executed? Well, it depends 🙂

Runbooks which were in the Starting status will be started normally when the worker comes back online
Runbooks which were running will continue from their last known good state (checkpoint) if there is one; otherwise they will fail

In no case will the runbooks be redistributed to one of the remaining workers, which is why the mechanism is called “static random partitioned”.
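
The recovery behaviour described above boils down to a tiny decision table. The following sketch just restates it as code; Get-RecoveryOutcome is a hypothetical function name, and the returned texts are my own wording, not SMA output.

```powershell
# Hypothetical helper: what happens to a job once its worker is back online?
function Get-RecoveryOutcome {
    param(
        [Parameter(Mandatory = $true)]
        [ValidateSet('Starting', 'Running')]
        [string]$JobStatus,

        # Whether the workflow persisted at least one checkpoint before the outage
        [bool]$HasCheckpoint = $false
    )

    if ($JobStatus -eq 'Starting') {
        return 'Starts normally on the same worker'
    }

    # Running jobs depend on an existing checkpoint
    if ($HasCheckpoint) {
        return 'Continues on the same worker from the last checkpoint'
    }
    return 'Fails'
}

Get-RecoveryOutcome -JobStatus 'Running' -HasCheckpoint $true
```

Note that "a different worker" never shows up as an outcome, which is exactly the static part of the mechanism.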

A word on suspended runbooks

Suspended runbooks stay in this state forever, until someone (or another runbook) resumes or stops them. The database holds the state, including all checkpoints, of each suspended runbook. I generally tend to keep SMA jobs for longer than the default 30 days, but I really don’t want hundreds or thousands of runbooks sitting in the suspended state, with their checkpoint data eventually consuming a huge amount of space. I’m using a housekeeping runbook to clean up runbooks which have been in a suspended state for a while.

It looks like this:

Workflow Start-JobCleanup
{
    param(
        [Parameter(Mandatory = $true)]
        [string]$JobState,
        [Parameter(Mandatory = $true)]
        [int]$DaysToKeep
    )

    # SMA web service endpoint (adjust to your environment)
    $SMAWebSvcURI = "https://smaweb"

    # Get all the jobs matching the requested state
    $jobs = InlineScript
    {
        Get-SmaJob -WebServiceEndpoint $Using:SMAWebSvcURI | Where-Object {$_.JobStatus -eq $Using:JobState}
    }

    # Stop every job that was last modified more than $DaysToKeep days ago
    Foreach -parallel ($j in $jobs)
    {
        If ((Get-Date).AddDays(-$DaysToKeep) -gt ([datetime]$j.LastModifiedTime))
        {
            Write-Output "SMA Job: $($j.JobId) will be stopped"
            Stop-SmaJob -Id $j.JobId -WebServiceEndpoint $SMAWebSvcURI
        }
    }
}
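
Before letting a runbook like this loose on a production environment, it can be useful to preview which jobs would actually be hit. Below is a minimal dry-run sketch that applies the same age check as the workflow above without stopping anything. Select-StaleJob and its -Now parameter are hypothetical names I made up for this example; the function only assumes that the job objects have a LastModifiedTime property.

```powershell
# Hypothetical helper: return the jobs that were last modified more than
# $DaysToKeep days before $Now. Nothing is stopped or changed.
function Select-StaleJob {
    param(
        [Parameter(Mandatory = $true)]
        [AllowEmptyCollection()]
        [object[]]$Jobs,

        [Parameter(Mandatory = $true)]
        [int]$DaysToKeep,

        # Reference time, overridable so the check can be reproduced
        [datetime]$Now = (Get-Date)
    )

    $cutoff = $Now.AddDays(-$DaysToKeep)
    $Jobs | Where-Object { ([datetime]$_.LastModifiedTime) -lt $cutoff }
}

# Possible usage against a live endpoint (same placeholder URL as in the workflow):
# $suspended = @(Get-SmaJob -WebServiceEndpoint "https://smaweb" | Where-Object { $_.JobStatus -eq 'Suspended' })
# Select-StaleJob -Jobs $suspended -DaysToKeep 30 | Select-Object JobId, LastModifiedTime
```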

In the next post I’ll show how to troubleshoot hung or stalled jobs and how to deal with “lost” workers.

Hope this already helped.

Stay tuned!