'\" t
.\" Man page generated from reStructuredText.
.
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.TH "CONDOR_GPU_DISCOVERY" "1" "Apr 01, 2024" "" "HTCondor Manual"
.SH NAME
condor_gpu_discovery \- HTCondor Manual
.sp
Output GPU\-related ClassAd attributes
.SH SYNOPSIS
.sp
\fBcondor_gpu_discovery\fP \fB\-help\fP
.sp
\fBcondor_gpu_discovery\fP [\fB\fP ]
.SH DESCRIPTION
.sp
\fIcondor_gpu_discovery\fP outputs ClassAd attributes corresponding to a host\(aqs GPU capabilities. It can presently report CUDA and OpenCL devices; which type(s) of device(s) it reports is determined by which libraries, if any, it can find when it runs; this reflects what GPU jobs will find on that host when they run. (Note that some HTCondor configuration settings may cause the environment to differ between jobs and the HTCondor daemons in ways that change library discovery.)
.sp
If \fBCUDA_VISIBLE_DEVICES\fP or \fBGPU_DEVICE_ORDINAL\fP is set in the environment when \fIcondor_gpu_discovery\fP is run, it will report only devices present in those lists.
.sp
This tool is not available for macOS platforms.
.sp
With no command line options, the single ClassAd attribute \fBDetectedGPUs\fP is printed. If the value is 0, no GPUs were detected.
If one or more GPUs were detected, the value is a string, presented as a comma and space separated list of the GPUs discovered, where each is given a name further used as the \fIprefix string\fP in other attribute names. Where there is more than one GPU of a particular type, the \fIprefix string\fP includes a GPU id value identifying the device; these can be integer values that monotonically increase from 0 when the \fB\-by\-index\fP option is used or globally unique identifiers when the \fB\-short\-uuid\fP or \fB\-uuid\fP argument is used.
.sp
For example, a discovery of two GPUs with \fB\-by\-index\fP may output
.INDENT 0.0
.INDENT 3.5
.sp
.EX
DetectedGPUs=\(dqCUDA0, CUDA1\(dq
.EE
.UNINDENT
.UNINDENT
.sp
Further command line options use \fB\(dqCUDA\(dq\fP either with or without one of the integer values 0 or 1 as the name of the device properties ad for \fB\-nested\fP properties, or as the \fIprefix string\fP in attribute names when \fB\-not\-nested\fP properties are chosen.
.sp
For machines with more than one or two NVIDIA devices, it is recommended that you also use the \fB\-short\-uuid\fP or \fB\-uuid\fP option. The uuid value assigned by NVIDIA to each GPU is unique, so using this option provides stable device identifiers for your devices. The \fB\-short\-uuid\fP option uses only part of the uuid, but it is highly likely to still be unique for devices on a single machine. As of HTCondor 9.0 \fB\-short\-uuid\fP is the default. When \fB\-short\-uuid\fP is used, discovery of two GPUs may look like this
.INDENT 0.0
.INDENT 3.5
.sp
.EX
DetectedGPUs=\(dqGPU\-ddc1c098, GPU\-9dc7c6d6\(dq
.EE
.UNINDENT
.UNINDENT
.sp
Any NVIDIA runtime library later than 9.0 will accept the above identifiers in the \fBCUDA_VISIBLE_DEVICES\fP environment variable.
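.sp
The \fBDetectedGPUs\fP line shown above can be turned into a value for \fBCUDA_VISIBLE_DEVICES\fP\&. The following is a minimal Python sketch, assuming the default ClassAd output form and the default resource tag; \fBparse_detected_gpus\fP and \fBcuda_visible_devices\fP are hypothetical helper names used only for illustration and are not part of HTCondor.
.INDENT 0.0
.INDENT 3.5
.sp
.EX
```python
def parse_detected_gpus(line):
    """Return the device names found in one DetectedGPUs output line."""
    # Hypothetical helper: assumes the form DetectedGPUs="name1, name2".
    name, sep, value = line.strip().partition("=")
    if sep != "=" or name.strip() != "DetectedGPUs":
        raise ValueError("not a DetectedGPUs line: " + repr(line))
    value = value.strip()
    if value == "0":
        # A value of 0 means that no GPUs were detected.
        return []
    # Otherwise the value is a quoted, comma and space separated list.
    names = value.strip('"').split(",")
    return [n.strip() for n in names if n.strip()]


def cuda_visible_devices(names):
    """Join device names into a comma separated environment value."""
    return ",".join(names)


sample = 'DetectedGPUs="GPU-ddc1c098, GPU-9dc7c6d6"'
print(cuda_visible_devices(parse_detected_gpus(sample)))
```
.EE
.UNINDENT
.UNINDENT
.sp
With the sample line, the sketch prints the two short uuid names joined by a comma and no space. A job wrapper could export that string as \fBCUDA_VISIBLE_DEVICES\fP\&.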
.sp
If the NVML library is available, and a multi\-instance GPU (MIG) \-capable device is present, has MIG enabled, and has created compute instances for each MIG instance, \fIcondor_gpu_discovery\fP will report those instances as distinct devices. Their names will be in the long UUID form unless the \fB\-short\-uuid\fP option is used, because they cannot be enumerated via CUDA. MIG instances don\(aqt have some of the properties reported by the \fB\-properties\fP, \fB\-extra\fP, and \fB\-dynamic\fP options; these properties will be omitted. If MIG is enabled on any GPU in the system, some properties become unavailable for every GPU in the system; \fIcondor_gpu_discovery\fP will report what it can.
.SH OPTIONS
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.TP
\fB\-help\fP
Print usage information and exit.
.TP
\fB\-properties\fP
In addition to the \fBDetectedGPUs\fP attribute, display some of the attributes of the GPUs. Each of these attributes will be in a nested ClassAd (\fB\-nested\fP) or have a \fIprefix string\fP at the beginning of its name (\fB\-not\-nested\fP). The displayed CUDA attributes are \fBCapability\fP, \fBDeviceName\fP, \fBDriverVersion\fP, \fBECCEnabled\fP, \fBGlobalMemoryMb\fP, and \fBRuntimeVersion\fP\&. The displayed OpenCL attributes are \fBDeviceName\fP, \fBECCEnabled\fP, \fBOpenCLVersion\fP, and \fBGlobalMemoryMb\fP\&.
.TP
\fB\-nested\fP
.INDENT 7.0
.TP
.B Default.
Display properties that are common to all GPUs in a \fBCommon\fP nested ClassAd, and properties that are not common to all in a nested ClassAd using the GPUid as the ClassAd name. Use the \fB\-not\-nested\fP argument to disable nested ClassAds and return to the older behavior of using a \fIprefix string\fP for individual property attributes.
.UNINDENT
.TP
\fB\-not\-nested\fP
.INDENT 7.0
.TP
.B Display properties that are common to all GPUs using \fBCUDA\fP or \fBOCL\fP as the attribute prefix, and properties that are not common to all using a GPUid prefix.
Versions of \fIcondor_gpu_discovery\fP prior to 9.11.0 support only this mode.
.UNINDENT
.TP
\fB\-extra\fP
Display more attributes of the GPUs. Each of these attributes will be added to a nested property ClassAd (\fB\-nested\fP) or have a \fIprefix string\fP at the beginning of its name (\fB\-not\-nested\fP). The additional CUDA attributes are \fBClockMhz\fP, \fBComputeUnits\fP, and \fBCoresPerCU\fP\&. The additional OpenCL attributes are \fBClockMhz\fP and \fBComputeUnits\fP\&.
.TP
\fB\-dynamic\fP
Display attributes of NVIDIA devices that change values as the GPU is working. Each of these attributes will be added to the nested property ClassAd (\fB\-nested\fP) or have a \fIprefix string\fP at the beginning of its name (\fB\-not\-nested\fP). These are \fBFanSpeedPct\fP, \fBBoardTempC\fP, \fBDieTempC\fP, \fBEccErrorsSingleBit\fP, and \fBEccErrorsDoubleBit\fP\&.
.TP
\fB\-mixed\fP
When displaying attribute values, assume that the machine has a heterogeneous set of GPUs, so always include the integer value in the \fIprefix string\fP\&.
.TP
\fB\-device\fP \fI\fP
Display properties only for GPU device \fI\fP, where \fI\fP is the integer value defined for the \fIprefix string\fP\&. This option may be specified more than once; additional \fI\fP are listed along with the first. This option adds to the device(s) specified by the environment variables \fBCUDA_VISIBLE_DEVICES\fP and \fBGPU_DEVICE_ORDINAL\fP, if any.
.TP
\fB\-tag\fP \fIstring\fP
Set the resource tag portion of the intended machine ClassAd attribute \fBDetected\fP to be \fIstring\fP\&. If this option is not specified, the resource tag is \fB\(dqGPUs\(dq\fP, resulting in attribute name \fBDetectedGPUs\fP\&.
.TP
\fB\-prefix\fP \fIstr\fP
When naming \fB\-not\-nested\fP attributes, use \fIstr\fP as the \fIprefix string\fP\&. When this option is not specified, the \fIprefix string\fP is either \fBCUDA\fP or \fBOCL\fP unless \fB\-uuid\fP or \fB\-short\-uuid\fP is also used.
.TP
\fB\-by\-index\fP
Use the prefix and device index as the device identifier.
.TP
\fB\-short\-uuid\fP
Use the first 8 characters of the NVIDIA uuid as the device identifier. When this option is used, devices will be shown as \fBGPU\-\fP where is the first 8 hex digits of the NVIDIA device uuid. Unlike device indices, the uuid of a device will not change if other devices are taken offline or drained.
.TP
\fB\-uuid\fP
Use the full NVIDIA uuid as the device identifier rather than the device index.
.TP
\fB\-simulate:[D,N[,D2,...]]\fP
For testing purposes, assume that N devices of type D were detected, and N2 devices of type D2, etc. No discovery software is invoked. D can be a value from 0 to 6 which selects a simulated GPU from the following table.
.SH SIMULATED GPUS
.TS
center;
|l|l|l|l|.
_
T{
T}	T{
DeviceName
T}	T{
Capability
T}	T{
GlobalMemoryMB
T}
_
T{
0
T}	T{
GeForce GT 330
T}	T{
1.2
T}	T{
1024
T}
_
T{
1
T}	T{
GeForce GTX 480
T}	T{
2.0
T}	T{
1536
T}
_
T{
2
T}	T{
Tesla V100\-PCIE\-16GB
T}	T{
7.0
T}	T{
24220
T}
_
T{
3
T}	T{
TITAN RTX
T}	T{
7.5
T}	T{
24220
T}
_
T{
4
T}	T{
A100\-SXM4\-40GB
T}	T{
8.0
T}	T{
40536
T}
_
T{
5
T}	T{
NVIDIA A100\-SXM4\-40GB MIG 3g.20gb
T}	T{
8.0
T}	T{
20096
T}
_
T{
6
T}	T{
NVIDIA A100\-SXM4\-40GB MIG 1g.5gb
T}	T{
8.0
T}	T{
4864
T}
_
.TE
.TP
\fB\-opencl\fP
Prefer detection via OpenCL rather than CUDA. Without this option, CUDA detection software is invoked first, and no further OpenCL software is invoked if CUDA devices are detected.
.TP
\fB\-cuda\fP
Do only CUDA detection.
.TP
\fB\-nvcuda\fP
For Windows platforms only, use a CUDA driver rather than the CUDA runtime.
.TP
\fB\-config\fP
Output in the syntax of HTCondor configuration, instead of ClassAd language. An additional attribute, \fBNUM_DETECTED_GPUs\fP, is produced and set to the number of GPUs detected.
.TP
\fB\-repeat\fP [\fIN\fP]
Repeat listed GPUs \fIN\fP (default 2) times. This results in a list that looks like \fBCUDA0, CUDA1, CUDA0, CUDA1\fP\&.
.sp
If used with \fB\-divide\fP, the last one on the command line wins, but you must specify \fB2\fP if you want it; the default value only applies to the first flag.
.TP
\fB\-divide\fP [\fIN\fP]
Like \fB\-repeat\fP, except also divide the attribute \fBGlobalMemoryMb\fP by \fIN\fP\&. This may help you avoid overcommitting your GPU\(aqs memory.
.sp
If used with \fB\-repeat\fP, the last one on the command line wins, but you must specify \fB2\fP if you want it; the default value only applies to the first flag.
.TP
\fB\-packed\fP
When repeating GPUs, repeat each GPU \fIN\fP times, not the whole list. This results in a list that looks like \fBCUDA0, CUDA0, CUDA1, CUDA1\fP\&.
.TP
\fB\-cron\fP
This option suppresses the \fBDetectedGPUs\fP attribute so that the output is suitable for use with \fIcondor_startd\fP cron. Combine this option with the \fB\-dynamic\fP option to periodically refresh the dynamic GPU information such as temperature. For example, to refresh GPU temperatures every 5 minutes
.INDENT 7.0
.INDENT 3.5
.sp
.EX
use FEATURE : StartdCronPeriodic(DYNGPUS, 5*60, $(LIBEXEC)/condor_gpu_discovery, \-dynamic \-cron)
.EE
.UNINDENT
.UNINDENT
.TP
\fB\-verbose\fP
For interactive use of the tool, output extra information to show detection while in progress.
.TP
\fB\-diagnostic\fP
Show diagnostic information, to aid in tool development.
.UNINDENT
.UNINDENT
.UNINDENT
.SH EXIT STATUS
.sp
\fIcondor_gpu_discovery\fP will exit with a status value of 0 (zero) upon success, and it will exit with the value 1 (one) upon failure.
.SH AUTHOR
HTCondor Team
.SH COPYRIGHT
1990-2024, Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, US. Licensed under the Apache License, Version 2.0.
.\" Generated by docutils manpage writer.
.