Derby Database

Cleaning the Derby DB

Cleaning a corrupted DB

If the logs show that your Derby DB is corrupted, the Derby DB must be re-generated from scratch.

Below you can find the procedure and some suggestions.

First of all, plan a downtime and, very importantly, advise users to download the outputs of their Done batch jobs, because after the operation those jobs will no longer be visible on the portal. Running jobs are not affected.

If groups (basically, tags) are defined in the Admin portal, you need to export the users to a zip file so they can be re-imported after the clean.

The clean procedure:

  1. Stop the EnginFrame service
    systemctl stop enginframe
  2. Move or rename the directory /home/appli/nice/enginframe/data/derby/EnginFrameDB. For example:
    mv /home/appli/nice/enginframe/data/derby/EnginFrameDB /home/appli/nice/enginframe/data/derby/EnginFrameDB.old
  3. Start the EnginFrame service
    systemctl start enginframe

You will see that the directory /home/appli/nice/enginframe/data/derby/EnginFrameDB has been re-created and is being populated with records based on what is currently seen on the system. After a while the old closing sessions will disappear, meaning that the update.sessions trigger worked properly.
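As a sketch, the three steps above can be wrapped in a small shell helper. Note that regenerate_derby_db is a hypothetical name, and the path and service name are the ones from the procedure; adapt them to your installation.

```shell
#!/bin/sh
# Hedged sketch of the clean procedure above.
# regenerate_derby_db is a hypothetical helper name; paths and service
# name come from the numbered steps in this article.
regenerate_derby_db() {
    db_dir="${1:-/home/appli/nice/enginframe/data/derby/EnginFrameDB}"

    if [ ! -d "$db_dir" ]; then
        echo "no Derby DB directory at $db_dir" >&2
        return 1
    fi

    # stop EnginFrame only when the service is actually installed
    if command -v systemctl >/dev/null 2>&1 \
            && systemctl cat enginframe >/dev/null 2>&1; then
        systemctl stop enginframe
    fi

    # keep the corrupted DB for later inspection instead of deleting it
    mv "$db_dir" "$db_dir.old"
    echo "moved $db_dir to $db_dir.old"

    if command -v systemctl >/dev/null 2>&1 \
            && systemctl cat enginframe >/dev/null 2>&1; then
        systemctl start enginframe
    fi
}
# usage: regenerate_derby_db [/path/to/EnginFrameDB]
```

Keeping the old directory around (instead of removing it) lets you inspect the corrupted DB later or roll back if needed.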

Avoiding the issue

PBS

To help prevent this issue in the future, you can apply a small workaround that reduces the GridML produced by EnginFrame.

In some cases, job cache GridML entries (GridML is EnginFrame's internal XML representation of a cluster resource, such as a job) might exceed VARCHAR limits. Derby DB, for example, has a 32 KB limit for a single job's GridML. This can be caused by the number of grid:execution-host tags produced by a core-oriented job resource request, e.g.

qsub -l ncpus=1024

This request produces a number of grid:execution-host tags equal to the number of cores, in no particular order.
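As a rough back-of-the-envelope check (the node names below are fabricated for the estimate), 1024 such tags already exceed Derby's 32 KB limit on their own:

```shell
# count the bytes taken by 1024 execution-host tags alone
# (node names are made up; real GridML carries additional tags too)
size=$(for i in $(seq 1 1024); do
    printf '    <grid:execution-host>node%04d</grid:execution-host>\n' "$i"
done | wc -c)
echo "$size bytes of execution-host tags (Derby limit: 32768)"
```

Each tag line is roughly 50-60 bytes, so a 1024-core request alone pushes the job's GridML well past the 32 KB VARCHAR limit.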

One solution is adding a further filter to the grid.list.jobs script, which queries the underlying scheduler and produces EnginFrame GridML.

Reducing the number of grid:execution-host tags to one per host does not produce any known issue.

To apply the filter on a specific grid plugin (for PBS, the script is ${EF_ROOT}/plugins/pbs/grid/grid.list.jobs):

Save the following content as gridml-reduce.awk and follow the steps listed in the comments.

# reduces redundant grid:execution-host tags in EnginFrame GridML
# (note: asorti is a GNU awk extension, so the script requires gawk)
# setup:
# 1. cp gridml-reduce.awk "${EF_ROOT}"/plugins/pbs/grid/
# 2. chmod 755 "${EF_ROOT}"/plugins/pbs/grid/gridml-reduce.awk
# 3. sed -i \
#     's/ | ef_qstat_f2xml)$/ | ef_qstat_f2xml | awk -f \"${EF_ROOT}\"\/plugins\/pbs\/grid\/gridml-reduce.awk)/g' \
#     "${EF_ROOT}"/plugins/pbs/grid/grid.list.jobs

BEGIN {
    exec_host_section = 0
}

# sort saved host array if present
!/^ *<grid:execution-host>/ {
    if ( exec_host_section == 1) {
        n = asorti(execution_hosts, execution_hosts_sorted)
        for ( i=1; i<=n; i++ ) {
            print execution_hosts_sorted[i] 
        }
        delete execution_hosts
        delete execution_hosts_sorted
        exec_host_section = 0
    }
    print $0
}

# save entries in a host array
/^ *<grid:execution-host>/ {
    if ( exec_host_section == 0) {
        exec_host_section = 1
    }
    execution_hosts[$0]++
}
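To see the effect of this kind of filtering, here is a condensed, sort-free variant of the same dedup idea run on a fabricated GridML fragment: it keeps only the first occurrence of each execution-host tag and needs just POSIX awk, whereas the script above additionally sorts the hosts via gawk's asorti.

```shell
# fabricated GridML fragment with duplicated execution-host tags
cat > /tmp/sample-gridml.xml <<'EOF'
<grid:job id="42.pbs">
    <grid:execution-host>node02</grid:execution-host>
    <grid:execution-host>node01</grid:execution-host>
    <grid:execution-host>node02</grid:execution-host>
    <grid:execution-host>node01</grid:execution-host>
</grid:job>
EOF

# drop repeated execution-host lines, keep everything else as-is
awk '/^ *<grid:execution-host>/ { if (seen[$0]++) next } { print }' \
    /tmp/sample-gridml.xml
```

The output keeps node02 and node01 once each, with all surrounding tags untouched.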

There is no need to restart services.

Try submitting a job or a session to check that everything is working properly.

SGE

For SGE, instead of that patch, you can try the EF_SGE_JOB_LIST_MODE parameter in the $EF_ROOT/enginframe/conf/plugins/sge/ef.sge.conf file.

Possible options:

# Job list mode
# -----------------------------------------------------------------------------
# modes available: default, qacct-on-my-jobs, no-qacct, brief, sge
#
# default mode displays finished jobs, using qstat + qacct
#         may suffer bad performances coping with a large number of jobs
#         due to qacct command query
# qacct-on-my-jobs invokes qacct only for my jobs or job/host details pages
# no-qacct never invokes qacct. Does not display finished jobs
# brief   limited info for job details. Does not display finished jobs
#         recommended for clusters with many jobs
# sge is provided for backward compatibility only and should be avoided
#
# EF_SGE_JOB_LIST_MODE="qacct-on-my-jobs"

You can also reduce the stored command size from the default 512 characters to a smaller value with the GRID_TAG_SIZE_LIMIT parameter, in the same configuration file mentioned above.

Check the full list of parameters in the file $EF_ROOT/enginframe/2021.0-r1657enginframe/plugins/sge/conf/ef.sge.conf, but do not edit that file directly.

A good first try is the qacct-on-my-jobs mode.
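For example, enabling that mode amounts to setting the corresponding line in your site copy of ef.sge.conf (at the path given above):

```shell
# in the site ef.sge.conf mentioned above
EF_SGE_JOB_LIST_MODE="qacct-on-my-jobs"
```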

GRID_TAG_SIZE_LIMIT reduces the content of the <grid:command> tag in the generated GridML.

There are no known side effects: the value only affects how the command is shown on web pages and stored in the database.

You can try reducing it to 300 characters, for example:

GRID_TAG_SIZE_LIMIT=300

After SGE configuration changes you need to restart the EnginFrame service (systemctl restart enginframe).

Check job submission and verify the finished job/spooler information.