colmux - multiplex communications to multiple systems running collectl from a single system
colmux [-command "collectl-switches... [-p filespec]"] [-address addr1[,addr2,...]|-addr filename] [-cols col1[,col2...]] | [-column num]
This utility gathers data generated by collectl on multiple systems and multiplexes it into a single consolidated format. It runs in essentially 2 distinct modes. The first is known as real-time mode, because data is retrieved and displayed in real time. The second is playback mode, because data is played back from existing collectl data files.
There are also 2 general formats for the data being displayed. The first is a multi-line display in which the data is displayed in the native form that collectl displays it, except that it is sorted by a distinct column, essentially allowing one to see the TOP producers of that data. The second format is a single line display in which one or more distinct data elements from each source are displayed on the same line. This latter format is never sorted, but rather positionally organized by the name of the system that generated it.
Collectl will be executed, using any optional switches specified by -command, on each of the systems specified by -address. If the target of that switch is a filename rather than a list of hosts, the addresses are read from that file. If -address is not specified, collectl runs on the local system. See collectl for details of the various switches. Certain collectl switches do not make sense in a colmux environment and will generate an error if chosen. If hosts are specified with -address, they should be individual addresses or hostnames separated by commas. Any of them can also be in what those familiar with pdsh would recognize as -w format.
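To illustrate how a host specification such as n[1-10,15] maps to individual hostnames, here is a minimal Python sketch. It is not colmux's own parser. The function name expand_hosts is hypothetical, and it assumes only simple numeric ranges and lists inside a single pair of square brackets, which is a subset of what pdsh accepts.

```python
import re

def expand_hosts(spec):
    """Expand a comma-separated host list with optional [ranges] (illustrative only)."""
    hosts = []
    # Split on commas that are NOT inside square brackets.
    for item in re.split(r",(?![^\[]*\])", spec):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", item)
        if not m:
            hosts.append(item)          # plain hostname or address
            continue
        prefix, body, suffix = m.groups()
        for part in body.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo               # a single number is a range of one
            # Preserve zero padding such as [01-03] (assumption for this sketch).
            width = len(lo) if lo.startswith("0") else 0
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
    return hosts

print(expand_hosts("n[1-10,15]"))
print(expand_hosts("abc,def,xyz"))
```

Under these assumptions, n[1-10,15] yields eleven names, n1 through n10 followed by n15, while a plain list such as abc,def,xyz is returned unchanged.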
Colmux will then execute the collectl command, gather the results from all sources for a particular interval and display them one result per line, sorted by the specified column OR all on the same line in groups specified by -cols. The number of lines displayed is set to the size of the terminal window by default, but can be changed using -lines. The one exception is the use of -nosort which only applies to the playback of existing collectl raw files. In this mode all records for a particular interval will be displayed and the sorting bypassed, making this a speedy and convenient mechanism for gathering all data from all systems in one place for potential further processing.
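The sort-and-truncate behaviour of the multi-line format can be sketched roughly as follows. This is an illustration only, not colmux's code. The function name top_lines is hypothetical, the sample data is made up, and the sketch assumes columns are whitespace-separated fields counted from 0 with the hostname in column 0.

```python
def top_lines(lines, column, max_lines):
    """Return the max_lines lines with the largest numeric value in the given column."""
    def key(line):
        try:
            return float(line.split()[column])
        except (IndexError, ValueError):
            # Lines without a numeric value in that column sort last.
            return float("-inf")
    return sorted(lines, key=key, reverse=True)[:max_lines]

# Made-up sample: hostname followed by two numeric fields.
sample = ["n1 10 5", "n2 90 1", "n3 50 7"]
for line in top_lines(sample, column=1, max_lines=2):
    print(line)
```

Sorting in descending order puts the largest values at the top of the display, which is what makes the TOP-style view possible.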
Colmux will never modify the size of the terminal window, so to see more or wider lines either expand the window or override the number of display lines and run it again. If the number of display lines is set to 0 or to a value greater than the terminal height, colmux will no longer overlay the previous display and will simply run in a continuous scrolling mode.
In single line mode, the timestamp will come either from the host system in real-time mode OR from the first host when run in playback mode. This is the most common use/need for this switch. But be careful in choosing column numbers with -cols, as the position of the data shifts by 1 when time is included and by 2 if both date and time are. Using -test will correctly show the shifted positions, but only if you include -o with the command at the same time you use -test.
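The column shift described above is simple arithmetic, sketched here in Python. The function name shifted_cols and the parameter timestamp_fields are hypothetical. The sketch assumes that a field found at position p without timestamps is found at position p plus the number of timestamp fields once they are included. The authoritative check is still -test.

```python
def shifted_cols(cols, timestamp_fields):
    """Shift column numbers by the number of timestamp fields (0, 1 for time, 2 for date and time)."""
    if timestamp_fields not in (0, 1, 2):
        raise ValueError("timestamp_fields must be 0, 1 or 2")
    return [c + timestamp_fields for c in cols]

print(shifted_cols([6, 7], timestamp_fields=1))  # time only
print(shifted_cols([6, 7], timestamp_fields=2))  # date and time
```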
In real-time/top mode this switch is not allowed since colmux simply reports the current time of the system it is running on.
When playing back multi-line formatted data from one or more files, a timestamp for each interval is reported, consisting of the time of that interval. When this switch is included, each line will be tagged with an appropriate timestamp since on rare occasions they may not necessarily all be identical.
In single-line format, this controls the number of lines displayed between headers. A value of 0 will display the header only once.
Playback Mode Specific
The following additional switches only apply to playback mode. There are no real-time mode specific switches.
However, once colmux is running, one might want to look at subsequent lines, i.e., those below the bottom of the screen and therefore invisible. If the ReadKey module is installed, one can simply use the PageDown key to move down the display and the PageUp key to move in the other direction. If ReadKey is not installed, typing the multi-key sequences pd<ENTER> or pu<ENTER> will cause the same thing to happen.
You can also change the column number interactively with the RIGHT/LEFT arrow keys IF the ReadKey module is installed (see colmux -version) OR simply type it in followed by the <ENTER> key.
OR simply type the r key and <ENTER>.
Exception Reporting Specific
In single-line format, rather than wait for all hosts to report their data, colmux simply reports the last data seen when the time to generate a line of output has come. In most cases these do reflect the most recent data values, but in times of load the data may be late getting to colmux and so a previous value may be reported. If the age of that data exceeds a defined number of intervals (the default is currently 2), an exception value of -1 will be reported. Kernel/driver bugs have also been seen to cause incorrect values to be reported as negative numbers, and those values are reported as -1 as well. Both the age and exception values can be changed with the following switches.
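The exception rule described above can be summarized in a few lines of Python. This is a sketch of the stated behaviour, not colmux's implementation. The function name reported_value and its parameter names are hypothetical, and the defaults mirror the values given in the text (an age limit of 2 intervals and an exception value of -1).

```python
def reported_value(value, age, max_age=2, exception=-1):
    """Return the value to display for one host in single-line format.

    value: last value seen from the host
    age:   number of intervals since that value arrived
    """
    if age > max_age:
        return exception    # data is too old to be trusted
    if value < 0:
        return exception    # negative values are treated as bogus
    return value

print(reported_value(100, age=1))   # fresh data is shown as is
print(reported_value(100, age=3))   # stale data becomes the exception value
print(reported_value(-5, age=0))    # negative data becomes the exception value
```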
The following switches are intended more for diagnostic purposes than normal operation, though are also worth using on appropriate occasions.
When a connection is received from an unexpected address, a warning is also reported and the request promptly ignored. This switch suppresses those messages as well. For more information on problems connecting, see CONNECTION PROBLEMS.
There are 2 switches whose descriptions don't really fit anywhere else:
PLAYBACK MODE RESTRICTIONS
All logs being played back must have been collected using the same interval, because colmux only looks at the first file/host to determine the appropriate value.
It is assumed all clocks are reasonably well synchronized, as colmux uses time to determine which data is to be displayed as a set.
All files must be in the same directory on all systems and that directory must be included in the playback file specification.
All files on a remote host must be for that host only.
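The clock synchronization restriction follows from the fact that records are matched up by time. A minimal Python sketch of that idea, using made-up records and the hypothetical function name group_by_interval, might look like this. It is not colmux's code, and it assumes each record has already been reduced to a (host, timestamp, line) tuple.

```python
from collections import defaultdict

def group_by_interval(records):
    """Group (host, timestamp, line) records into one set per timestamp."""
    groups = defaultdict(list)
    for host, timestamp, line in records:
        groups[timestamp].append((host, line))
    # Return the sets in time order, with hosts ordered by name within each set.
    return [(ts, sorted(groups[ts])) for ts in sorted(groups)]

# Made-up records: n2's second sample is stamped one second late.
records = [
    ("n1", "05:00:00", "a"), ("n2", "05:00:00", "b"),
    ("n1", "05:00:10", "c"), ("n2", "05:00:11", "d"),
]
for ts, rows in group_by_interval(records):
    print(ts, rows)
```

In this sketch the skewed sample from n2 ends up in a set of its own, which shows why poorly synchronized clocks lead to incomplete sets.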
Run collectl on 3 nodes, showing CPU, Disk and Network statistics once a second and sorted by column 1, which happens to be total cpu.
colmux -addr abc,def,xyz
Dynamically display top processes on nodes n1-n10 of a cluster once a second, sorted by column 5.
colmux -addr n[1-10] -command "-sZ -i:1" -column 5
Do the same for yesterday, between the hours of 5AM and 6AM, being sure to stall for 1/2 second between intervals. Note, if you leave off -addr you could put all the logs into /var/log/collectl on the local host and play them back from there.
colmux -addr n[1-10] -command "-sZ -p/var/log/collectl/YESTERDAY -from 05:00-06:00" -column 5 -delay .5
Look at the amount of mapped and slab memory consumed on nodes n1-n10 and n15 in real-time, every 2 seconds using single-line format. Include totals and preface each line with the time. Since memory sizes tend to be rather large, divide each by 1024 so we see MB rather than KB. Note that the column numbers are always displayed in ascending order regardless of their order in -cols. To be sure, first test the column numbers.
colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk -test
colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk
Display most active disks, based on KB written, on nodes n1, n4 and n5.
colmux -addr n1,n4,n5 -command "-sD" -column 6
Here is a cool trick. Collectl currently lets you look at top processes with the --top switch and even choose a sort column by name. However, if you want to change the column you need to exit, then rerun collectl with a different sort column name. But if you run it like this example, you get the power of colmux to dynamically change the sort columns with the arrow keys! You can also use this technique to have collectl dynamically sort any local multi-line data such as slabs or even detail data like CPU, Disk, Lustre and Networks too! Naturally this technique works just as well when playing back data.
colmux -command "-sZ -i:1"
colmux requires passwordless ssh between the node it is running on and those it is monitoring. Also be sure the port you are using for communications (the default is 2655) is open.
Colmux works by choosing an address it wants to communicate over and starting up one or more remote copies of collectl, telling them to connect back to colmux using that address. The easiest way to see this is to run colmux with -noesc, which tells it NOT to issue any escape sequences and therefore not to run in full screen mode. The additional switch -debug 1 tells it to show the remote collectl startup command. When there is a communications problem you will typically see 'connection timed out' messages displayed.
There are a couple of possibilities here. One is that a firewall is preventing connections, and the easiest way to test this is to run collectl on the local machine like this: collectl -Aserver. This tells collectl to run as a server, listening for connections just like colmux. Then log into a remote machine and run /usr/share/collectl/util/client.pl addr-of-server, which tells client.pl to open a socket to that copy of collectl. It should fail just as it did when run via colmux, so try opening the firewall and run it again. If that fixes the problem, the firewall was indeed blocking things and colmux should now work just fine.
Sometimes there are multiple interfaces defined on the machine hosting colmux and in some cases only some addresses will allow socket connections. Again, using client.pl on the remote machine, try connecting back to collectl over different addresses. When you find one that works, tell colmux to use that address for communication via the -retaddr switch.
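If client.pl is not at hand, a generic TCP reachability check can serve the same diagnostic purpose. The Python sketch below is not part of collectl or colmux. The function name can_connect is hypothetical, and it assumes something is already listening on the target address and port, with 2655 used as the default only because that is the default port mentioned above.

```python
import socket

def can_connect(host, port=2655, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable or unresolvable hosts.
        return False

if __name__ == "__main__":
    import sys
    # Usage: pass one or more candidate addresses of the colmux host.
    for addr in sys.argv[1:]:
        print(addr, "reachable" if can_connect(addr) else "NOT reachable")
```

Run from a remote machine against each address of the colmux host, this shows which addresses accept connections. An address that is reported reachable is a candidate for -retaddr.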
This program was written by Mark Seger (email@example.com).
Copyright 2016 Hewlett-Packard Development Company, L.P.