- nVidia driver not working anymore
- Cannot create directory ‘/run/user/15798’: Permission denied
- “Authentication is required to refresh the system repositories”
- ANSYS Workbench issues
- Deny the usage of nvenc (nvidia framebuffer capture)
- GPU frame rate drops to 1 fps after some minutes
- Cannot login due websocket handshake issue
- Troubleshooting GNOME startup issues
- SELinux – Troubleshooting
- Spinning wheel after login
- Headless environment
- Black screen or weird resolutions
- How to troubleshoot sessions not starting
- The checklist
- Create and test a failsafe session
nVidia driver not working anymore
If the kernel is updated, e.g. via the yum update command, the nVidia driver needs to be reinstalled. The commands typically look like this (remember to use dcvgldiag afterwards to check the DCV installation):
# reinstall the nVidia driver
sudo sh NVIDIA-Linux-x86_64-430.26.run
sudo nvidia-xconfig --preserve-busid --enable-all-gpus
# Add the line 'Option "UseDisplayDevice" "None"' to the Screen section
sudo vim /etc/X11/xorg.conf
# ensure that X server is running; you might get logged out by these commands
sudo systemctl isolate multi-user.target
sudo systemctl isolate graphical.target
# enable DCV for 3D
sudo dcvgladmin enable
# verify the installation with dcvgldiag
dcvgldiag
dcvgldiag is a convenient tool for checking the DCV installation on Linux; it is part of the NICE DCV download. Typical output looks like this:
root@host# dcvgldiag
NICE DCV - Diagnostic Script
Host: ip-172-31-19-22.eu-west-1.compute.internal
Architecture: x86_64
Operating System: Red Hat Enterprise Linux Server release 7.7 (Maipo)
Kernel Version: 3.10.0-1062.1.2.el7.x86_64
Nvidia GPU: GRID K520
Nvidia Driver: 430.26
Runlevel: 5
X configuration file: /etc/X11/xorg.conf
DCV GL is enabled for 64 bit applications.
Running tests: ………………. DONE
No problem found.
A detailed report about the tests is available in '/root/dcvgldiag-qc1nmo'
Then check that the nVidia driver is working again:
# check if the driver is working properly
root@host# nvidia-smi
Mon Jun 29 12:46:29 2020
+---------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: 10.2 |
|------------------------------+---------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 GRID K520 Off | 00000000:00:03.0 Off | N/A |
| N/A 46C P0 44W / 125W | 357MiB / 4037MiB | 0% Default |
+--------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| 0 2500 G /usr/bin/X 62MiB |
| 0 2778 G /usr/bin/gnome-shell 4MiB |
| 0 3452 C /usr/libexec/dcv/dcvagent 60MiB |
| 0 4004 G /usr/bin/gnome-shell 59MiB |
| 0 7712 G ...ownloads/lsprepost4.6_centos7/lsprepost 162MiB |
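The manual xorg.conf edit in the reinstall steps above can also be scripted. The following is a minimal sketch, not part of the DCV or NVIDIA tooling: the helper name add_use_display_device is hypothetical, and it assumes the Screen section header is written as Section "Screen" at the start of a line. Back up xorg.conf before trying it.

```shell
#!/bin/sh
# Sketch: add 'Option "UseDisplayDevice" "None"' after every Screen section
# header of an xorg.conf-style file, unless the option is already present.
# add_use_display_device is a hypothetical helper name.
add_use_display_device() {
    conf="$1"
    # Nothing to do if the option already appears somewhere in the file.
    if grep -q '"UseDisplayDevice"' "$conf"; then
        return 0
    fi
    tmp="$(mktemp)" || return 1
    # Print every line; after each 'Section "Screen"' header, emit the option.
    awk '
        { print }
        /^[ \t]*Section[ \t]+"Screen"/ {
            print "    Option         \"UseDisplayDevice\" \"None\""
        }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Called as root with /etc/X11/xorg.conf as the argument, it replaces the vim step; running it a second time leaves the file unchanged.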
Cannot create directory ‘/run/user/15798’: Permission denied
Please check if the following environment variable is defined on the system:
echo $XDG_RUNTIME_DIR
If XDG_RUNTIME_DIR is set but the directory /run/user/<UID> does not exist, unset the variable and try to create the session again:
unset XDG_RUNTIME_DIR
XDG_RUNTIME_DIR is managed by PAM. More details are available at https://manpages.ubuntu.com/manpages/xenial/en/man8/pam_systemd.8.html
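The check and the unset described above can be combined into one step. This is a minimal sketch; fix_xdg_runtime_dir is a hypothetical helper name, and it has to run in the same shell that later creates the session, because unset only affects the current shell.

```shell
#!/bin/sh
# Sketch: unset XDG_RUNTIME_DIR when it points to a directory that does not exist.
# fix_xdg_runtime_dir is a hypothetical helper name.
fix_xdg_runtime_dir() {
    if [ -n "${XDG_RUNTIME_DIR:-}" ] && [ ! -d "$XDG_RUNTIME_DIR" ]; then
        echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR does not exist; unsetting it" >&2
        unset XDG_RUNTIME_DIR
    fi
}
```

If the directory exists, the variable is left untouched.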
“Authentication is required to refresh the system repositories”
When a new DCV virtual session is created on RHEL 8, an authentication prompt with the message “Authentication is required to refresh the system repositories” is shown.
It can be disabled by creating a new polkit rule:
- Create a new file named /etc/polkit-1/rules.d/99-system-sources-refresh.rules
- Add the following lines to the new file:
polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.packagekit.system-sources-refresh") {
        return polkit.Result.YES;
    }
});
ANSYS Workbench issues
Depending on the configuration, NVIDIA GPU power management can limit the GPU frame rate (this is not related to DCV or DCV-GL). The issue can be solved by disabling the HardDPMS option, which is available in NVIDIA driver 415 and later. To disable the HardDPMS option, add this row to the Device section of the /etc/X11/xorg.conf file:
Section "Device"
...
Option "HardDPMS" "false"
...
EndSection
Deny the usage of nvenc (nvidia framebuffer capture)
Modify the /etc/dcv/dcv.conf file and add the following to the display section:
[display]
display-encoders = ['ffmpeg', 'turbojpeg', 'lz4']
framebuffer-readers=['desktopduplication', 'gdi']
This also disables nvfbc, since most GPUs do not support it (nvfbc is marked as deprecated).
GPU frame rate drops to 1 fps after some minutes
On Linux instances with NVIDIA GPUs, Display Power Management Signaling (DPMS) can reduce the performance of the GPU and limit the GPU frame rate to 1 fps. The issue can be reproduced on physical hosts with driver 415.xx and later.
The issue is not related to DCV. It can be reproduced without a running DCV server, e.g. by running glxgears for some minutes in an SSH connection:
DISPLAY=:0 glxgears
...
300 frames in 5.0 seconds = 59.972 FPS
300 frames in 5.0 seconds = 59.972 FPS
300 frames in 5.0 seconds = 59.972 FPS
101 frames in 5.7 seconds = 17.767 FPS
5 frames in 5.0 seconds = 1.000 FPS
5 frames in 5.0 seconds = 1.000 FPS
...
The issue can be solved by disabling the HardDPMS option of NVIDIA driver 415 and later.
To disable the HardDPMS option, add this row to the Device section of the /etc/X11/xorg.conf file:
Section "Device"
...
Option "HardDPMS" "false"
...
EndSection
The following three shell commands add this option to /etc/X11/xorg.conf (a backup is kept as /etc/X11/xorg.conf.BK):
# Add in /etc/X11/xorg.conf: Option "HardDPMS" "false"
sudo cp /etc/X11/xorg.conf /etc/X11/xorg.conf.BK
sudo sed '/^Section "Device"/a \ \ \ \ Option "HardDPMS" "false"' /etc/X11/xorg.conf > /tmp/xorg.conf
sudo mv /tmp/xorg.conf /etc/X11/xorg.conf
Alternatively, disable DPMS in an Extensions section:
Section "Extensions"
Option "DPMS" "Disable"
EndSection
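To confirm whether a host shows the frame rate drop, before and after the configuration change, the glxgears output can be watched automatically. This is a minimal sketch: detect_fps_drop is a hypothetical helper name, the threshold of 5 FPS is an arbitrary assumption, and it relies only on the "N frames in S seconds = F FPS" line format shown above.

```shell
#!/bin/sh
# Sketch: read glxgears output on stdin and report the first sample whose
# frame rate falls below a threshold (default 5 FPS, an arbitrary choice).
# detect_fps_drop is a hypothetical helper name.
detect_fps_drop() {
    threshold="${1:-5}"
    awk -v min="$threshold" '
        # glxgears lines look like: "300 frames in 5.0 seconds = 59.972 FPS"
        $NF == "FPS" {
            n++
            fps = $(NF - 1) + 0
            if (fps < min) {
                printf "drop detected at sample %d: %s FPS\n", n, $(NF - 1)
                found = 1
                exit
            }
        }
        # Exit status 0 means a drop was seen, 1 means none was seen.
        END { exit (found ? 0 : 1) }
    '
}
```

One possible use is DISPLAY=:0 glxgears | detect_fps_drop 5 in an SSH connection; the helper prints a message and returns success as soon as a sample falls below the threshold.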
Cannot login due websocket handshake issue
Error:
Loginform.js:388 WebSocket connection to 'wss://<ip>:<port>/auth' failed: Error during WebSocket handshake: Unexpected response code: 404
show @ loginform.js:388
AWS ParallelCluster configures NICE DCV to authenticate with a key file. If you want to log in to NICE DCV with a username and password instead, edit /etc/dcv/dcv.conf as follows:
- Comment out ‘auth-token-verifier="https://localhost:8444"’
- Comment out ‘file="/etc/parallelcluster/ext-auth-certificate.pem"’
- Uncomment ‘authentication = none’ and change the value to ‘system’
- Restart the DCV server
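The edits listed above can be sketched as a small sed wrapper. This is an illustration under assumptions, not a ParallelCluster or DCV tool: enable_system_auth is a hypothetical helper name, each setting is assumed to sit on its own line, and the certificate line is matched by its path rather than by its key name. The original file is kept as a .bak copy.

```shell
#!/bin/sh
# Sketch: switch a ParallelCluster-generated dcv.conf to system authentication.
# Assumptions: one setting per line; the authentication line may or may not be
# commented out. enable_system_auth is a hypothetical helper name.
enable_system_auth() {
    conf="$1"
    cp "$conf" "$conf.bak" || return 1
    # 1) comment out the auth-token-verifier line
    # 2) comment out the (not yet commented) line that references the certificate
    # 3) uncomment the authentication line and set it to "system"
    sed -E \
        -e 's|^([[:space:]]*)(auth-token-verifier[[:space:]]*=.*)$|\1#\2|' \
        -e '\|/etc/parallelcluster/ext-auth-certificate\.pem| s|^([[:space:]]*)([^#[:space:]].*)$|\1#\2|' \
        -e 's|^[[:space:]]*#?[[:space:]]*authentication[[:space:]]*=.*$|authentication="system"|' \
        "$conf.bak" > "$conf"
}
```

After running it as root on /etc/dcv/dcv.conf, review the result against the .bak copy and restart the DCV server, e.g. with sudo systemctl restart dcvserver.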
Troubleshooting GNOME startup issues
Here is a checklist to help you identify GNOME startup issues.
- Check if the package “xorg-x11-xinit-session” is installed.
- Check whether the file /etc/dcv/dcvsessioninit looks for the /etc/gdm/Xsession file. If not, add the following before the respective “fi” line:
elif [ -x /etc/gdm/Xsession ]; then
SESSIONBIN="/etc/gdm/Xsession gnome-session"
SELinux – Troubleshooting
Default SELinux policies in RHEL 8 can lead to failures in XDM processes that use NVIDIA drivers and libraries, so gnome-shell and the DCV system agent (both children of GDM) can be affected.
Related info:
- XDM and SELinux: https://github.com/fedora-selinux/selinux-policy/pull/312
- NVEnc and DCV system agent: https://issues-dub.amazon.com/issues/DCV-2839
If you think that you are facing SELinux problems, you can check some logs that can have more details:
/var/log/audit/audit.log
/var/log/avc.log
/var/log/audit.log
Some commands that can show SELinux denials:
dmesg | grep -i -e type=1300 -e type=1400
journalctl -t setroubleshoot
If you still do not see any messages, they may be suppressed by “dontaudit” rules. Disable those rules with the command semodule -DB, then repeat the action that may be affected by SELinux and check the audit messages. To enable those rules again, run semodule -B.
You can temporarily switch SELinux to permissive mode with the command setenforce 0 and test again. You can switch back to enforcing mode with setenforce 1. You can check the current status with the getenforce command. Note: setenforce does not persist across reboots. Edit the file /etc/selinux/config for a permanent change.
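When the audit log contains many entries, it helps to see which processes are being denied. The following is a minimal sketch that counts AVC denials per process name; summarize_avc_denials is a hypothetical helper name, and it assumes the usual audit record layout in which denial lines contain type=AVC, the word denied, and a comm="..." field.

```shell
#!/bin/sh
# Sketch: summarize SELinux AVC denials in an audit log by process name (comm).
# summarize_avc_denials is a hypothetical helper name; the log path is the
# first argument and defaults to /var/log/audit/audit.log.
summarize_avc_denials() {
    log="${1:-/var/log/audit/audit.log}"
    grep 'type=AVC' "$log" \
        | grep 'denied' \
        | sed -n 's/.*comm="\([^"]*\)".*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

Each output line is a count followed by a process name, most frequent first. Denials attributed to gnome-shell or the DCV agent would point to the XDM policy issue described above.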
Some good references about SELinux troubleshooting:
- Find details and fix https://wiki.gentoo.org/wiki/SELinux/Tutorials/Where_to_find_SELinux_permission_denial_details
- Official RedHat guide https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/using_selinux/troubleshooting-problems-related-to-selinux_using-selinux
Spinning wheel after login
Sometimes, after a successful login, DCV gets stuck on a spinning wheel. The reason is that the client is not receiving any frames on the display channel (pixels are not reaching the client). It is not a problem with authentication or permissions, as otherwise the client would directly return an error without showing the spinning wheel.
There can be several causes. Here are some possible issues and solutions:
- The X server is not running. If on Linux, check if the X server is up and running and if the user has access to the Desktop. This is especially needed in the case of a console session. In particular you need to check that the system is running in graphical mode. This is frequently overlooked on EC2 instances since many AMIs do not automatically start the X server after a reboot. To enable X at boot use the following command (and reboot):
sudo systemctl set-default graphical.target
- The display channel is not enabled for the user or session. This can be verified by looking into server.log. The solution is easy: enable the user or the session to use the display feature. Please check https://docs.aws.amazon.com/dcv/latest/adminguide/managing-sessions.html
- Issues with NVIDIA drivers. If you are running DCV 2017.0 on Windows or Linux with an NVIDIA card, some versions of the NVIDIA drivers have problems with the NvIFR library. Starting with DCV 2017.1, the default is the NvENC encoder, which is not affected by the NvIFR problems. If you run into this problem, update to DCV 2017.1 and also update the NVIDIA driver to version 390.x or later. If you are not able to update to the latest DCV version, an alternative solution is to change the display section of the configuration so that NvIFR is not used. See this guide to understand how to configure it: https://docs.aws.amazon.com/dcv/latest/adminguide/config-param-ref.html and add this setting to the display section:
display-encoders = ['nvenc', 'turbojpeg', 'lz4']
Headless environment
Black screen or weird resolutions
This configuration allows you to start a desktop environment and enable GPU acceleration without a monitor connected. No dummy display plugs are required.
Set these parameters in the "Screen" section of your xorg.conf file:
Option "AllowEmptyInitialConfiguration" "True"
Option "NoPowerConnectorCheck" "True"
Option "ModeValidation" "NoDFPNativeResolutionCheck,NoVirtualSizeCheck,NoEdidMaxPClkCheck,NoMaxPClkCheck,NoHorizSyncCheck,NoVertRefreshCheck,NoWidthAlignmentCheck,AllowNonEdidModes"
Explanation of the ModeValidation options:
- NoDFPNativeResolutionCheck: This disables the check that the configured display resolution matches the native resolution of a DFP (Digital Flat Panel). This allows using non-native resolutions on your monitor.
- NoVirtualSizeCheck: This bypasses the check for virtual screen sizes reported by the monitor. Virtual screens are a way to extend the desktop beyond the physical limitations of a single display. Disabling this check might lead to unexpected behavior with virtual screen setups.
- NoEdidMaxPClkCheck: This skips the check for the maximum pixel clock (PClk) supported by the monitor as reported by EDID (Extended Display Identification Data). PClk determines the number of pixels the monitor can refresh per second. Disabling this check might result in setting a resolution that the monitor cannot handle.
- NoMaxPClkCheck: Similar to the previous term, this disables checking the graphics card’s maximum PClk. Setting a resolution that exceeds the graphics card’s PClk capabilities could lead to display issues.
- NoHorizSyncCheck and NoVertRefreshCheck: These terms disable checking the horizontal and vertical synchronization ranges reported by the monitor. Monitors can only display images within a specific horizontal and vertical refresh rate range. Disabling these checks might cause screen tearing or flickering.
- NoWidthAlignmentCheck: This bypasses the check for monitor limitations regarding the alignment of the horizontal resolution with the monitor’s internal clock. Disabling this check could lead to blurry or distorted images on some monitors.
- AllowNonEdidModes: This allows using video modes that are not reported by the EDID information. EDID provides detailed information about the monitor’s capabilities. Enabling this allows using custom resolutions or refresh rates that might not be officially supported by the monitor.
How to troubleshoot sessions not starting
The checklist
- For old OS distros, check the file $HOME/.xsession-errors for session errors related to the Xorg service
- For newer OS distros, use the journalctl command to check for possible session problems.
To check the realtime log: journalctl -f
To check the entire log: journalctl
To check the log of a specific user: journalctl _UID=1000
To get a user ID: id --user myuser
- Check the system message logs:
dmesg -H
- Sometimes the error is due to the user configuration. Back up and remove the user directories below and check what happens.
- .gnome
- .kde
- .config
- and any other directories related with your session manager
- Setup a fresh user and check if the problem persists
- Check whether wrong permissions on the paths listed below prevent the user from writing to them.
- .dbus
- .Xauthority
- Create and test a failsafe session (check the next topic)
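The permission check from the list above can be sketched as follows. check_session_paths is a hypothetical helper name; it only tests whether .dbus and .Xauthority, if they exist, are writable by the user running it. Run it as the affected user, since root can write regardless of the permission bits.

```shell
#!/bin/sh
# Sketch: report session-related paths in a home directory that exist but are
# not writable by the current user. check_session_paths is a hypothetical name.
check_session_paths() {
    home="${1:-$HOME}"
    bad=0
    for name in .dbus .Xauthority; do
        path="$home/$name"
        if [ -e "$path" ] && [ ! -w "$path" ]; then
            echo "not writable: $path"
            bad=1
        fi
    done
    # Return 0 when every existing path is writable, 1 otherwise.
    return "$bad"
}
```

Without an argument it inspects the home directory of the current user; a non-zero exit status means at least one path needs its ownership or mode fixed.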
Create and test a failsafe session
One simple way to find session problems is to create a dummy session and check whether it works as expected.
Create a file named init.sh, make it executable with chmod a+x init.sh, and fill it with this content:
#!/bin/sh
metacity & xterm
Then create the dummy session with the command:
dcv create-session dummy --init init.sh
You can also create a dummy session using another user:
sudo dcv create-session test --user USER --owner USER --init init.sh
For AWS AMI, you can use this command:
dcv create-session --storage-root %home% --init /home/centos/initscript.sh session
Then you can run a test application such as dcvgltest or glxgears inside the session to verify that applications work with other services or libraries, such as OpenGL.
Note: check that the metacity and xterm packages are installed.
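The preparation of the init script can be sketched in a few lines. write_failsafe_init is a hypothetical helper name; it writes the same two-line init.sh shown above, makes it executable, and warns if metacity or xterm cannot be found in PATH. It does not create the session itself.

```shell
#!/bin/sh
# Sketch: write the failsafe init script and make it executable.
# write_failsafe_init is a hypothetical helper name.
write_failsafe_init() {
    target="${1:-init.sh}"
    cat > "$target" <<'EOF'
#!/bin/sh
metacity & xterm
EOF
    chmod a+x "$target" || return 1
    # Warn if the window manager or the terminal is missing from PATH.
    for tool in metacity xterm; do
        command -v "$tool" >/dev/null 2>&1 || echo "warning: $tool not found in PATH" >&2
    done
}
```

After calling write_failsafe_init init.sh, create the dummy session with the dcv create-session command shown above.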