Automating the restart of selected error jobs

Let us assume that a profile sends data to a target system that is frequently restarted at unpredictable times. At peak times, the profile processes 3000 jobs per hour. Because the data is serialised, the absence of just one output file stops the process.

Consequently, every time the target system becomes unavailable, a large number of jobs has to be restarted.
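A rough estimate gives a feel for the scale. The following sketch assumes, purely for illustration, that the peak rate of 3000 jobs per hour holds throughout and that every job started during the outage fails. The function name and the 12-minute outage are hypothetical.

```python
def estimated_failed_jobs(jobs_per_hour, outage_minutes):
    """Estimate how many jobs fail if every job started during the outage fails."""
    return jobs_per_hour * outage_minutes // 60


# Hypothetical example: peak load and a 12-minute outage of the target system.
print(estimated_failed_jobs(3000, 12))  # prints 600
```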

All errors from this profile that are caused by the unavailability of the target system share characteristics that distinguish them from other errors. This is the starting point for automating the restarts.

To automate the restart of selected error jobs, two classes are provided: ErrorRetrieveCron and RestartFailedJob. Used in a profile, they allow the error list to be read from the Control Center and selected error jobs from that list to be restarted automatically. Both classes are described in the following sections.

The profile itself is responsible for the evaluation of the error list and the selection of the job numbers by means of the error number and the error text.

Because system landscapes, error scenarios and user objectives differ from case to case, this profile must be developed by the user. A general solution is not possible here. It may also be necessary to develop one or more custom classes.

Reading from the error list in the Control Center

The ErrorRetrieveCron class allows a profile, referred to here as the management profile, to read the errors that are listed in the Control Center under "Logs/Errors".

The class generates an error list with two record types per error.

Record type "Error": A master record of an error entry that occurs once per job.
Record type "Stack": An optional record that can occur several times after the error record and allows a precise identification of the error in the stack trace.

The error list can be restricted to specific profiles and/or to specific phases.
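The relationship between the two record types described above can be pictured with a small sketch. It is not the output format of ErrorRetrieveCron: the tuple layout, the field names (job_number, error_number, error_text) and the sample values are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErrorEntry:
    """One "Error" record plus the "Stack" records that follow it (assumed fields)."""
    job_number: str
    error_number: str
    error_text: str
    stack: List[str] = field(default_factory=list)


def group_records(records):
    """Attach each "Stack" record to the "Error" record that precedes it."""
    entries = []
    for record_type, *values in records:
        if record_type == "Error":
            job_number, error_number, error_text = values
            entries.append(ErrorEntry(job_number, error_number, error_text))
        elif record_type == "Stack" and entries:
            entries[-1].stack.append(values[0])
    return entries


# Hypothetical sample: one error with two stack records, one error without any.
sample = [
    ("Error", "4711", "E100", "Connection refused"),
    ("Stack", "at example.Sender.send"),
    ("Stack", "at example.Profile.run"),
    ("Error", "4712", "E200", "Mapping failed"),
]
entries = group_records(sample)
```

Keeping the stack lines with their error record makes it straightforward to evaluate the error text and the stack trace together for each job.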

The management profile uses a time-driven Input Agent and can be started at regular intervals or triggered by another profile.

After that, the error list is evaluated.

Evaluating the error list

In the management profile, the errors can be narrowed down, for example by evaluating the error text or the stack trace, to determine the job numbers of the error jobs that are to be restarted automatically. These job numbers are passed as a list, one job number per line, to the custom class RestartFailedJob. See section “Automatically restart a list of error jobs”.
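The selection logic can be sketched outside the product as follows. In practice, this evaluation is built in the management profile itself; the dictionary keys, the error number "E100" and the search pattern below are hypothetical and only show the idea of matching on error number plus error text or stack trace.

```python
import re


def select_job_numbers(errors, error_number, text_pattern):
    """Return the matching job numbers as text, one job number per line."""
    pattern = re.compile(text_pattern, re.IGNORECASE)
    selected = []
    for err in errors:
        # Search the error text and, if present, every stack trace line.
        haystack = [err["error_text"], *err.get("stack", [])]
        matches_text = any(pattern.search(line) for line in haystack)
        if err["error_number"] == error_number and matches_text:
            if err["job_number"] not in selected:
                selected.append(err["job_number"])
    return "\n".join(selected)


# Hypothetical error list with assumed field names.
errors = [
    {"job_number": "4711", "error_number": "E100",
     "error_text": "Connection refused", "stack": ["at example.Sender.send"]},
    {"job_number": "4712", "error_number": "E200",
     "error_text": "Mapping failed", "stack": []},
    {"job_number": "4713", "error_number": "E100",
     "error_text": "Send failed",
     "stack": ["java.net.ConnectException: Connection refused"]},
]
job_list = select_job_numbers(errors, "E100", r"connection refused")
```

Here job_list contains the two job numbers 4711 and 4713, each on its own line, which is the form expected by RestartFailedJob.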

The management profile could also use an environment check class or a function to check whether the target system is available again, so that jobs are not restarted while it is still down.
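The idea behind such a check can be shown with a minimal sketch. It is not the product's environment check class: it only assumes that the target system accepts TCP connections on a known host and port, and both function names are hypothetical. A successful connection shows that the port is open, not that the application behind it is ready.

```python
import socket


def target_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def jobs_to_restart(job_numbers, host, port):
    """Pass the job numbers on only if the target system is reachable."""
    return list(job_numbers) if target_is_reachable(host, port) else []
```

If the check fails, the list stays empty and the error jobs remain in the error list until the next run of the management profile.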

Automatically restart a list of error jobs

The class RestartFailedJob expects, from the destination structure of the management profile, a list of the job numbers of the jobs to be restarted. Each job number must be on its own line. The class does not expect a configuration file.

Any job number in the list that does not correspond to a job number of a current error job in the Control Center is ignored. The class can therefore only process actual error jobs.
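The effect of this rule can be illustrated with a short sketch. It does not show how RestartFailedJob is implemented; the function names and the sample job numbers are hypothetical, and the set of current error jobs is simply assumed to be known.

```python
def parse_job_numbers(text):
    """Read one job number per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_restartable(requested, current_error_jobs):
    """Separate job numbers that are current error jobs from those that are not."""
    current = set(current_error_jobs)
    restartable = [job for job in requested if job in current]
    ignored = [job for job in requested if job not in current]
    return restartable, ignored


# Hypothetical input: 9999 is not a current error job and is therefore ignored.
requested = parse_job_numbers("4711\n9999\n\n4713\n")
restartable, ignored = split_restartable(requested, {"4711", "4712", "4713"})
```

In this example, restartable holds 4711 and 4713, while 9999 ends up in ignored.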

Successfully restarted jobs are removed from the error list in the Control Center. The job log of the restarted error job contains an entry indicating that it was restarted by another profile. A further entry with a timestamp records the removal of the original job from the list of error jobs.

By activating the profile logging for "Phase 6/Custom" of the management profile in the Control Center, further detailed information can be obtained.
