.\" .\" *********************************************************************
.\" .\" *                                                                   *
.\" .\" *             Copyright 2015-2018, Intel Corporation                *
.\" .\" *                                                                   *
.\" .\" *                       All Rights Reserved.                        *
.\" .\" *                                                                   *
.\" .\" *********************************************************************
.TH opafindgood 8 "Intel Corporation" "Copyright(C) 2015\-2018" "Master map: IFSFFCLIRG (Man Page)"
.SH NAME
opafindgood
.NL
.PP
Checks for hosts that can be pinged, can be accessed via SSH, and are active on the Intel(R) Omni-Path Fabric. Produces a list of good hosts that meet all criteria. Typically used to identify good hosts to undergo further testing and benchmarking during initial cluster staging and startup.
.PP
The resulting good file lists each good host exactly once and can be used as input to create \fImpi\(ulhosts\fR files for running mpi\(ulapps and the HFI-SW cable test. The files alive, running, active, good, and bad are created in the selected directory. Each file lists the hosts that fall into the corresponding category.
.PP
This command assumes the Node Description for each host is based on the \fIhostname\fR -s output in conjunction with an optional hfi1\(ul# suffix. When using a /etc/opa/hosts file that lists the hostnames, this assumption may not be correct.
.PP
This command automatically generates the file FF\(ulRESULT\(ulDIR/punchlist.csv. This file provides a concise summary of the bad hosts found. It can be imported into Excel directly as a *.csv file. Alternatively, it can be copied and pasted into Excel, and the \fBData/Text to Columns\fR toolbar can be used to separate the information into multiple columns at the semicolons.
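.PP
As a rough illustration of working with the semicolon-separated punchlist, the following shell sketch counts entries per problem description. The function name summarize\(ulpunchlist is hypothetical and is not part of opafindgood. The sketch assumes that each line holds three fields: timestamp, location, and problem.
.PP
.nf
```shell
# Hypothetical helper, not part of opafindgood.
# Assumes each line is: timestamp;location;problem
summarize_punchlist() {
    # $1 = path to a punchlist.csv file
    # Count lines per problem text, then print "count;problem" sorted.
    awk -F';' 'NF >= 3 { count[$3]++ } END { for (p in count) print count[p] ";" p }' "$1" | sort
}
```
.fi
.PP
For example, summarize\(ulpunchlist "$FF\(ulRESULT\(ulDIR/punchlist.csv" prints one count;problem line for each distinct problem.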
.PP
A sample generated output is:
.PP
.br
# opafindgood
.br
3 hosts will be checked
.br
2 hosts are pingable (alive)
.br
2 hosts are ssh\[aq]able (running)
.br
2 total hosts have FIs active on one or more fabrics (active)
.br
No Quarantine Node Records Returned
.br
1 hosts are alive, running, active (good)
.br
2 hosts are bad (bad)
.br
Bad hosts have been added to /root/punchlist.csv
.br
# cat /root/punchlist.csv
.br
2015/10/04 11:33:22;phs1fnivd13u07n1 hfi1\(ul0 p1 phs1swivd13u06 p16;Link errors
.br
2015/10/07 10:21:05;phs1swivd13u06;Switch not found in SA DB
.br
2015/10/09 14:36:48;phs1fnivd13u07n4;Doesn\[aq]t ping
.br
2015/10/09 14:36:48;phs1fnivd13u07n3;No active port
.br
.PP
For a given run, a line is generated for each failing host. Hosts are reported exactly once for a given run. Therefore, a host that does not ping is NOT also listed as "can\[aq]t ssh" or "No active port". There may be cases where ports could be active for hosts that do not ping, especially if Ethernet host names are used for the ping test. However, the lack of ping often implies there are other fundamental issues, such as PXE boot or inability to access DNS or DHCP to get proper host name and IP address. Therefore, reporting further failures for hosts that do not ping is typically of limited value.
.PP
Note that opafindgood queries the SA for NodeDescriptions to determine hosts with active ports. As such, ports may be active for hosts that cannot be accessed via SSH or pinged.
.PP
By default, opafindgood checks for and reports nodes that are quarantined for security reasons. To skip this, use the -Q option.
.SH Syntax
.NL
opafindgood [-R|-A|-Q] [-d \fIdir\fR] [-f \fIhostfile\fR] [-h \[aq]\fIhosts\fR\[aq]]
.br
[-t \fIportsfile\fR] [-p \fIports\fR] [-T \fItimelimit\fR]
.SH Options
.NL
.TP 10
--help
.NL
Produces full help text.
.TP 10
-R
.NL
Skips the running test (SSH). Recommended if password-less SSH is not set up.
.TP 10
-A
.NL
Skips the active test.
Recommended if Intel(R) Omni-Path Fabric software or fabric is not up.
.TP 10
-Q
.NL
Skips the quarantine test. Recommended if Intel(R) Omni-Path Fabric software or fabric is not up.
.TP 10
-d \fIdir\fR
.NL
Specifies the directory in which to create the alive, active, running, good, and bad files. Default is the /etc/opa directory.
.TP 10
-f \fIhostfile\fR
.NL
Specifies the file with the hosts in the cluster. Default is the /etc/opa/hosts file.
.TP 10
-h \fIhosts\fR
.NL
Specifies the list of hosts to ping.
.TP 10
-t \fIportsfile\fR
.NL
Specifies the file with the list of local HFI ports used to access fabric(s) for analysis. Default is the /etc/opa/ports file.
.TP 10
-p \fIports\fR
.NL
Specifies the list of local HFI ports used to access fabric(s) for analysis.
.IP
Default is the first active port. The first HFI in the system is 1. The first port on an HFI is 1. Uses the format hfi:port,
.br
for example:
.RS
.TP 10
.sp
0:0
First active port in system.
.RE
.RS
.TP 10
.sp
0:y
Port \fIy\fR within system.
.RE
.RS
.TP 10
.sp
x:0
First active port on HFI \fIx\fR.
.RE
.RS
.TP 10
.sp
x:y
HFI \fIx\fR, port \fIy\fR.
.RE
.TP 10
-T \fItimelimit\fR
.NL
Specifies the time limit in seconds for a host to respond to SSH. Default is 20 seconds.
.SH Environment Variables
.NL
.PP
The following environment variables are also used by this command:
.TP 10
\fBHOSTS\fR
.NL
List of hosts, used if the -h option is not supplied.
.TP 10
\fBHOSTS\(ulFILE\fR
.NL
File containing list of hosts, used in absence of -f and -h.
.TP 10
\fBPORTS\fR
.NL
List of ports, used in absence of -t and -p.
.TP 10
\fBPORTS\(ulFILE\fR
.NL
File containing list of ports, used in absence of -t and -p.
.TP 10
\fBFF\(ulMAX\(ulPARALLEL\fR
.NL
Maximum concurrent operations.
.SH Examples
.NL
opafindgood
.br
opafindgood -f allhosts
.br
opafindgood -h \[aq]arwen elrond\[aq]
.br
HOSTS=\[aq]arwen elrond\[aq] opafindgood
.br
HOSTS\(ulFILE=allhosts opafindgood
.br
opafindgood -p \[aq]1:1 1:2 2:1 2:2\[aq]
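.PP
To illustrate what "meeting all criteria" means for the good file, the following shell sketch intersects three per-test host lists. It is not the opafindgood implementation. The function name good\(ulhosts is hypothetical, and the sketch assumes that the alive, running, and active files in the given directory each hold one bare host name per line.
.PP
.nf
```shell
# Hypothetical sketch, not the opafindgood implementation.
# Assumes alive, running, and active each list one host name per line.
good_hosts() {
    # $1 = directory holding the alive, running, and active files
    # comm needs sorted input, so sort each list into a temporary copy.
    sort -u "$1/alive" > "$1/alive.sorted"
    sort -u "$1/running" > "$1/running.sorted"
    sort -u "$1/active" > "$1/active.sorted"
    # Keep only the hosts that appear in all three lists.
    comm -12 "$1/alive.sorted" "$1/running.sorted" | comm -12 - "$1/active.sorted"
    rm -f "$1/alive.sorted" "$1/running.sorted" "$1/active.sorted"
}
```
.fi
.PP
For example, good\(ulhosts /etc/opa prints the hosts that are present in all three files, one per line.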