Linux – FAQ

nVidia driver not working anymore

If the kernel is updated, e.g. via the yum update command, the nVidia driver needs to be reinstalled. The commands typically look like this (remember to run dcvgldiag afterwards to check the DCV installation):

# reinstall the nVidia driver
sudo sh NVIDIA-Linux-x86_64-430.26.run
sudo nvidia-xconfig --preserve-busid --enable-all-gpus
# Add the line 'Option "UseDisplayDevice" "None"' to the Screen section
sudo vim /etc/X11/xorg.conf
# ensure that X server is running; you might get logged out by these commands
sudo systemctl isolate multi-user.target
sudo systemctl isolate graphical.target
# enable DCV for 3D
sudo dcvgladmin enable
# verify the installation with dcvgldiag
dcvgldiag    # typical output is shown below
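The manual vim edit in the listing above can also be scripted. The following is only a minimal sketch: add_use_display_device is a hypothetical helper name, and it assumes the file contains a line starting with Section "Screen". It uses awk to append the option after each such line and keeps a backup copy next to the file.

```shell
# Sketch: add 'Option "UseDisplayDevice" "None"' to the Screen section(s)
# of an xorg.conf-style file. The helper name is hypothetical.
add_use_display_device() {
    conf="$1"
    # Do nothing if the option is already present (keeps the helper idempotent).
    if grep -q '"UseDisplayDevice"' "$conf"; then
        return 0
    fi
    # Keep a backup, then rewrite the file from the backup.
    cp "$conf" "$conf.BK" || return 1
    awk '
        { print }
        /^Section "Screen"/ { print "    Option         \"UseDisplayDevice\" \"None\"" }
    ' "$conf.BK" > "$conf"
}
```

From a root shell, this could be called as add_use_display_device /etc/X11/xorg.conf before the X server is restarted.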

dcvgldiag, which is part of the NICE DCV download, is a convenient tool to check the DCV installation on Linux:

root@host# dcvgldiag
NICE DCV - Diagnostic Script
 Host:             ip-172-31-19-22.eu-west-1.compute.internal
 Architecture:     x86_64
 Operating System: Red Hat Enterprise Linux Server release 7.7 (Maipo)
 Kernel Version:   3.10.0-1062.1.2.el7.x86_64
 Nvidia GPU:       GRID K520
 Nvidia Driver:    430.26
 Runlevel:         5
 X configuration file: /etc/X11/xorg.conf
 DCV GL is enabled for 64 bit applications.
 Running tests: ………………. DONE
 No problem found.
 A detailed report about the tests is available in '/root/dcvgldiag-qc1nmo'

Then check that the nVidia driver is working again:

# check if the driver is working properly
root@host# nvidia-smi
Mon Jun 29 12:46:29 2020
+---------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|------------------------------+---------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GRID K520           Off  | 00000000:00:03.0 Off |                  N/A |
| N/A   46C    P0    44W / 125W |    357MiB /  4037MiB |      0%      Default |
 
+--------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      2500      G   /usr/bin/X                                    62MiB |
|    0      2778      G   /usr/bin/gnome-shell                           4MiB |
|    0      3452      C   /usr/libexec/dcv/dcvagent                     60MiB |
|    0      4004      G   /usr/bin/gnome-shell                          59MiB |
|    0      7712      G   ...ownloads/lsprepost4.6_centos7/lsprepost   162MiB |

Cannot create directory ‘/run/user/15798’: Permission denied

Please check if the following environment variable is defined on the system:

echo $XDG_RUNTIME_DIR

If XDG_RUNTIME_DIR is set but the directory /run/user/<UID> does not exist, unset the environment variable and try to create the session again:

unset XDG_RUNTIME_DIR

XDG_RUNTIME_DIR is managed by PAM. For more details, see https://manpages.ubuntu.com/manpages/xenial/en/man8/pam_systemd.8.html
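The check and the unset can be combined into one small shell function. This is a sketch only; fix_xdg_runtime_dir is a hypothetical name, and the function has to be defined and called in the same shell that later creates the session, because unset only affects the current shell.

```shell
# Sketch: unset XDG_RUNTIME_DIR when it points to a directory that does
# not exist. The function name is hypothetical.
fix_xdg_runtime_dir() {
    if [ -n "${XDG_RUNTIME_DIR:-}" ] && [ ! -d "$XDG_RUNTIME_DIR" ]; then
        echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR does not exist, unsetting it" >&2
        unset XDG_RUNTIME_DIR
    fi
}
```

If the variable is unset, empty, or points to an existing directory, the function leaves the environment untouched.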

“Authentication is required to refresh the system repositories”

When a new DCV virtual session is created on RHEL 8, an “Authentication is required to refresh the system repositories” prompt is displayed.

It can be disabled by creating a new polkit rule:

  • create a new file as /etc/polkit-1/rules.d/99-system-sources-refresh.rules
  • add the following lines to the new file:
polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.packagekit.system-sources-refresh") {
        return polkit.Result.YES;
    }
});
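Both steps can be done in one go with a here-document. The sketch below wraps this in a hypothetical helper, write_polkit_rule, which takes the destination path as an optional argument so that it can be tried against a scratch file first. Writing to the default path requires root.

```shell
# Sketch: write the polkit rule shown above to a file.
# write_polkit_rule is a hypothetical helper name.
write_polkit_rule() {
    dest="${1:-/etc/polkit-1/rules.d/99-system-sources-refresh.rules}"
    # The quoted 'EOF' keeps the shell from expanding anything in the rule.
    cat > "$dest" <<'EOF'
polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.packagekit.system-sources-refresh") {
        return polkit.Result.YES;
    }
});
EOF
}
```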

ANSYS Workbench issues

Depending on the configuration, there is an issue with NVIDIA GPUs and power management that can limit the GPU frame rate (not related to DCV or DCV-GL). The issue can be solved by disabling the HardDPMS option, which is available in NVIDIA driver versions 415 and later. To disable the HardDPMS option, add this row to the Device section of the /etc/X11/xorg.conf file:

Section "Device"
...
  Option         "HardDPMS" "false"
...
EndSection

Deny the usage of nvenc and nvfbc (NVIDIA framebuffer capture)

Modify the /etc/dcv/dcv.conf file and add the following in the display section:

[display]
display-encoders = ['ffmpeg', 'turbojpeg', 'lz4']
framebuffer-readers=['desktopduplication', 'gdi']

This disables nvfbc, as most GPUs do not support it (nvfbc is marked as deprecated).

GPU frame rate drops to 1 fps after some minutes

On Linux instances with NVIDIA GPUs the Display Power Management (DPMS) can reduce the performance of the GPU and limit the GPU frame rate to 1 fps. The issue can be reproduced on physical hosts, with driver 415.xx and later.

The issue is not related to DCV. It can be reproduced without a running DCV server, e.g. by running glxgears for some minutes in an SSH connection:

DISPLAY=:0 glxgears
...
300 frames in 5.0 seconds = 59.972 FPS
300 frames in 5.0 seconds = 59.972 FPS
300 frames in 5.0 seconds = 59.972 FPS
101 frames in 5.7 seconds = 17.767 FPS
5 frames in 5.0 seconds =  1.000 FPS
5 frames in 5.0 seconds =  1.000 FPS
...
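To spot the drop without watching the output, the FPS lines can be filtered with awk. This is a rough sketch: detect_fps_drop is a hypothetical name, the default threshold of 5 FPS is an arbitrary assumption, and the parser only relies on the "N frames in X seconds = Y FPS" line format shown above.

```shell
# Sketch: read glxgears output on stdin, print the lowest FPS value seen,
# and return 1 if it is below the threshold (first argument, default 5).
# Returns 2 if no FPS lines were found. The function name is hypothetical.
detect_fps_drop() {
    awk -v limit="${1:-5}" '
        / frames in .* seconds = / {
            fps = $(NF - 1) + 0          # second-to-last field is the FPS value
            if (!seen || fps < min) { min = fps; seen = 1 }
        }
        END {
            if (!seen) { print "no FPS lines found"; exit 2 }
            printf "lowest FPS: %.3f\n", min
            exit ((min < limit) ? 1 : 0)
        }'
}
```

The output of a glxgears run captured to a file can then be checked with detect_fps_drop < glxgears.log, where glxgears.log is a placeholder file name.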

The issue can be solved by disabling the HardDPMS option of NVIDIA driver 415 and later.

To disable the HardDPMS option add this row to the Device section of the /etc/X11/xorg.conf file:

Section "Device"
    ...
    Option         "HardDPMS" "false"
    ...
EndSection

The following three shell commands add this option to /etc/X11/xorg.conf and keep a backup of the original file:

# Add in /etc/X11/xorg.conf: Option "HardDPMS" "false"
sudo cp /etc/X11/xorg.conf /etc/X11/xorg.conf.BK
sudo sed '/^Section "Device"/a \ \ \ \ Option         "HardDPMS" "false"'  /etc/X11/xorg.conf > /tmp/xorg.conf
sudo mv /tmp/xorg.conf /etc/X11/xorg.conf

Alternatively, disable DPMS in an Extensions section:

Section "Extensions"
    Option      "DPMS" "Disable"
EndSection

Cannot log in due to WebSocket handshake issue

Error:

Loginform.js:388 WebSocket connection to 'wss://<ip>:<port>/auth' failed: Error during WebSocket handshake: Unexpected response code: 404
show @ loginform.js:388

ParallelCluster configures NICE DCV to authenticate with a key file. If you want to log in to NICE DCV with a username and password instead, edit /etc/dcv/dcv.conf as follows:

  1. Comment out 'auth-token-verifier="https://localhost:8444"'
  2. Comment out 'file="/etc/parallelcluster/ext-auth-certificate.pem"'
  3. Uncomment 'authentication = none' and change the authentication value to 'system'
  4. Restart the DCV server
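The first three steps can be sketched with sed. The exact spelling of the lines in dcv.conf may differ between ParallelCluster versions, so the patterns below are assumptions: they match any auth-token-verifier assignment, any uncommented line that mentions ext-auth-certificate.pem, and an authentication assignment with or without a leading #. use_system_auth is a hypothetical helper name, and the quoted value "system" assumes the usual quoted-string style of dcv.conf.

```shell
# Sketch: switch a dcv.conf-style file from external authentication to
# system authentication. Patterns and helper name are assumptions.
use_system_auth() {
    conf="$1"
    # Keep a backup, then rewrite the file from the backup.
    cp "$conf" "$conf.BK" || return 1
    sed -E \
        -e 's|^[[:space:]]*(auth-token-verifier[[:space:]]*=.*)$|#\1|' \
        -e '/^[[:space:]]*#/!s|^(.*ext-auth-certificate\.pem.*)$|#\1|' \
        -e 's|^[[:space:]]*#?[[:space:]]*authentication[[:space:]]*=.*$|authentication="system"|' \
        "$conf.BK" > "$conf"
}
```

After running use_system_auth /etc/dcv/dcv.conf from a root shell, review the result against the backup and restart the server, for example with sudo systemctl restart dcvserver.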

Troubleshooting GNOME startup issues

Here is a checklist to help you identify GNOME startup issues.

  1. Check if the package “xorg-x11-xinit-session” is installed.
  2. Check whether the file /etc/dcv/dcvsessioninit looks for the /etc/gdm/Xsession file. If it does not, add the following lines before the corresponding “fi” line:
    elif [ -x /etc/gdm/Xsession ]; then
    SESSIONBIN="/etc/gdm/Xsession gnome-session"
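The two checks above can be scripted roughly as follows. This is a sketch that assumes an RPM-based distribution; both function names are hypothetical.

```shell
# Sketch: helpers for the two GNOME startup checks. Names are hypothetical.

# Succeeds if the xorg-x11-xinit-session package is installed (RPM systems).
xinit_session_installed() {
    rpm -q xorg-x11-xinit-session >/dev/null 2>&1
}

# Succeeds if the DCV session init script mentions /etc/gdm/Xsession.
# The path of the script can be overridden with the first argument.
references_gdm_xsession() {
    grep -q '/etc/gdm/Xsession' "${1:-/etc/dcv/dcvsessioninit}"
}
```

For example, references_gdm_xsession || echo "dcvsessioninit does not look for /etc/gdm/Xsession" prints a hint only when the entry is missing.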

SELinux – Troubleshooting

Default SELinux policies in RHEL 8 can lead to failures in XDM processes that use NVIDIA drivers and libraries, so gnome-shell and the DCV system agent (both children of GDM) can be affected.


If you think that you are facing SELinux problems, you can check some logs that can have more details:

  • /var/log/audit/audit.log
  • /var/log/avc.log
  • /var/log/audit.log 

Some commands that can show SELinux denials:

  • dmesg | grep -i -e type=1300 -e type=1400
  • journalctl -t setroubleshoot

If you still do not see any messages, they may be suppressed by “dontaudit” rules. Disable those rules with the command semodule -DB, then repeat the action that may be affected by SELinux and check the audit messages. To enable the rules again, run semodule -B.

You can switch SELinux to permissive mode with the command setenforce 0 and test again. You can switch back to enforcing mode with setenforce 1, and check the current mode with the getenforce command. Note: setenforce does not persist across reboots. To make the setting permanent, edit the file /etc/selinux/config.
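To get a quick overview of which processes are being denied, the AVC records in the audit log can be grouped by command name. This is a minimal sketch that only relies on the type=AVC, denied and comm="..." fields of audit records; summarize_avc_denials is a hypothetical name, and reading the default log path requires root.

```shell
# Sketch: count SELinux AVC denials per command name in an audit log.
# The function name is hypothetical.
summarize_avc_denials() {
    log="${1:-/var/log/audit/audit.log}"
    grep 'type=AVC' "$log" \
        | grep 'denied' \
        | grep -o 'comm="[^"]*"' \
        | sort | uniq -c | sort -rn
}
```

Each output line is a count followed by the comm field, with the most frequently denied command first. Entries for gnome-shell, X or the DCV agent would point to the problem described in this section.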


Spinning wheel after login

Sometimes, after a successful login, DCV gets stuck on a spinning wheel. The reason is that the client is not receiving any frames on the display channel (no pixels reach the client). It is not a problem with authentication or permissions, as otherwise the client would directly return an error without showing the spinning wheel.

There can be different causes. Here are some possible issues and solutions:

  1. The X server is not running. If on Linux, check if the X server is up and running and if the user has access to the Desktop. This is especially needed in the case of a console session. In particular you need to check that the system is running in graphical mode. This is frequently overlooked on EC2 instances since many AMIs do not automatically start the X server after a reboot. To enable X at boot use the following command (and reboot):
    sudo systemctl set-default graphical.target
  2. The display channel is not enabled for the user or session. This can be verified by looking into server.log. The solution is easy: enable the user or the session to use the display feature.
    Please check https://docs.aws.amazon.com/dcv/latest/adminguide/managing-sessions.html
  3. Issues with NVIDIA drivers. If you are running DCV 2017.0 on Windows or Linux with an NVIDIA card, some versions of the NVIDIA drivers have problems with the NvIFR library. Starting from DCV 2017.1 the default has changed to the NvENC encoder, which is not affected by the NvIFR problems. If you run into this problem, please update to DCV 2017.1 and also update the NVIDIA driver to a version greater than or equal to 390.x. If you are not able to update to the latest DCV version, an alternative solution is to change the display section of the configuration so that NvIFR is not used.
    See this guide to understand how to configure it: https://docs.aws.amazon.com/dcv/latest/adminguide/config-param-ref.html
    and add this setting to the display section:
    display-encoders = ['nvenc', 'turbojpeg', 'lz4']
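For the first item in the list, the boot target can be checked before rebooting. The sketch below uses systemctl get-default; check_boot_target is a hypothetical name, and the optional argument only exists so that the logic can be tried without systemd.

```shell
# Sketch: report whether the system boots into graphical mode.
# Pass a target name as the first argument to override the systemctl query.
# The function name is hypothetical.
check_boot_target() {
    target="${1:-$(systemctl get-default)}"
    if [ "$target" = "graphical.target" ]; then
        echo "ok: default target is graphical.target"
    else
        echo "default target is $target; run: sudo systemctl set-default graphical.target"
        return 1
    fi
}
```

A non-zero return value indicates that the X server will probably not be started automatically after the next reboot.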

Headless environment

Black screen or weird resolutions

This configuration allows you to start a desktop environment and enable GPU acceleration without a monitor connected. No dummy display plugs are required.

Set these parameters in the "Screen" section of your xorg.conf file:

Option "AllowEmptyInitialConfiguration" "True"
Option "NoPowerConnectorCheck" "True"
Option "ModeValidation" "NoDFPNativeResolutionCheck,NoVirtualSizeCheck,NoEdidMaxPClkCheck,NoMaxPClkCheck,NoHorizSyncCheck,NoVertRefreshCheck,NoWidthAlignmentCheck,AllowNonEdidModes"

Explanation of the ModeValidation settings:

  • NoDFPNativeResolutionCheck: This disables checking if the configured display resolution matches the monitor’s native resolution reported by DFP (Digital Flat Panel). This allows using non-native resolutions on your monitor.
  • NoVirtualSizeCheck: This bypasses the check for virtual screen sizes reported by the monitor. Virtual screens are a way to extend the desktop beyond the physical limitations of a single display. Disabling this check might lead to unexpected behavior with virtual screen setups.
  • NoEdidMaxPClkCheck: This skips the check for the maximum pixel clock (PClk) supported by the monitor as reported by EDID (Extended Display Identification Data). PClk determines the number of pixels the monitor can refresh per second. Disabling this check might result in setting a resolution that the monitor cannot handle.
  • NoMaxPClkCheck: Similar to the previous option, this disables checking the graphics card’s maximum PClk. Setting a resolution that exceeds the graphics card’s PClk capabilities could lead to display issues.
  • NoHorizSyncCheck and NoVertRefreshCheck: These options disable checking the horizontal and vertical synchronization ranges reported by the monitor. Monitors can only display images within a specific horizontal and vertical refresh rate range. Disabling these checks might cause screen tearing or flickering.
  • NoWidthAlignmentCheck: This bypasses the check for monitor limitations regarding the alignment of the horizontal resolution with the monitor’s internal clock. Disabling this check could lead to blurry or distorted images on some monitors.
  • AllowNonEdidModes: This allows using video modes that are not reported by the EDID information. EDID provides detailed information about the monitor’s capabilities. Enabling this allows using custom resolutions or refresh rates that might not be officially supported by the monitor.

How to troubleshoot sessions not starting

The checklist

  1. On older distributions, check the file $HOME/.xsession-errors for session errors related to the Xorg service.
  2. On newer distributions, use the journalctl command to check for session problems.
    To check realtime log: journalctl -f
    To check the entire log: journalctl
    To check specific user log: journalctl _UID=1000
    To get a user ID: id --user myuser
  3. Check the system message log:
    dmesg -H
  4. Sometimes the error is caused by the user configuration. Back up and remove the user directories below, then check what happens.
    • .gnome
    • .kde
    • .config
    • and any other directories related with your session manager
  5. Setup a fresh user and check if the problem persists
  6. Check whether wrong permissions on the files and directories listed below prevent the user from writing to them.
    • .dbus
    • .Xauthority
  7. Create and test a failsafe session (check the next topic)
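Step 4 of the checklist can be sketched as a helper that moves the configuration directories aside instead of deleting them, so that they can be restored later. backup_session_config is a hypothetical name, the .bak-<timestamp> suffix is an arbitrary choice, and the affected user should be logged out while it runs.

```shell
# Sketch: move .gnome, .kde and .config out of the way with a timestamped
# suffix. The home directory defaults to $HOME. The name is hypothetical.
backup_session_config() {
    home_dir="${1:-$HOME}"
    stamp=$(date +%Y%m%d-%H%M%S)
    for name in .gnome .kde .config; do
        if [ -e "$home_dir/$name" ]; then
            mv "$home_dir/$name" "$home_dir/$name.bak-$stamp" || return 1
            echo "moved $home_dir/$name to $home_dir/$name.bak-$stamp"
        fi
    done
}
```

To undo the change, move the .bak-<timestamp> directory back to its original name.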

Create and test a failsafe session

A simple way to narrow down session problems is to create a dummy session and check whether it works as expected.

Create a file named init.sh, make it executable with chmod a+x init.sh, and fill it with this content:

#!/bin/sh
metacity & xterm

Then create the dummy session with the command:

dcv create-session dummy --init init.sh

You can also create a dummy session using another user:

sudo dcv create-session test --user USER --owner USER --init init.sh

On an AWS AMI, you can use this command:

dcv create-session --storage-root %home% --init /home/centos/initscript.sh session

Then you can run a test application such as dcvgltest or glxgears inside the session to verify that other services and libraries, such as OpenGL, are working.

Note: check that the metacity and xterm packages are installed.
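The preparation for the failsafe session can be scripted as well. In this sketch both function names are hypothetical: missing_tools prints every given command that is not found in PATH, and write_failsafe_init writes the init.sh content shown above and makes it executable.

```shell
# Sketch: helpers for preparing the failsafe session. Names are hypothetical.

# Print each given command name that cannot be found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Write the failsafe init script (default: ./init.sh) and make it executable.
write_failsafe_init() {
    script="${1:-init.sh}"
    printf '%s\n' '#!/bin/sh' 'metacity & xterm' > "$script" && chmod a+x "$script"
}
```

If missing_tools metacity xterm prints nothing, run write_failsafe_init and then create the dummy session with dcv create-session dummy --init init.sh as described above.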