Martin Pool's blog

TCP/SSH transport independence

I have been thinking for a long time about a new program (working title of Superlifter) that would take some of the best ideas from rsync and leave behind the historical baggage.

There are already some old and sketchy notes about it here.

rsync follows an interesting pattern in open source projects of being quasi-transport-independent by running over either a plain TCP socket or some other two-way stream provided by something like OpenSSH.

(The structure of the discussion here is based on the Pattern Form.)

forces

  1. We want to optionally support as-fast-as-possible (wire speed?) operation with no overhead for an anonymous, typically read-only, configuration.
  2. We also want the option of strong cryptographic protection: authentication, encryption, and integrity.
  3. Some people will not want to increase the attack surface of their machine by adding new daemons listening for connections. In particular Subversion got a lot of resistance from administrators who didn't want to run Apache 2 with mod_dav_svn, regardless of how well written it is. Another listening process is another body of code that might possibly have bugs, that might need to be upgraded, that needs to be allowed for in firewall and intrusion-detection rules, that needs to be configured, ...
  4. People developing an application don't want to worry too much about security, for similar reasons: having any security functionality potentially means needing to rush out upgrades and certainly to worry a lot more about the code.
  5. Authentication should ideally be done over standard mechanisms. For most Linux machines, SSH is pretty standard. SSH in turn has some plugins for Kerberos, certificates, etc.

solution

Write the application so that it only relies on a bidirectional byte stream, and then put in little mechanisms so that it can connect to the server either by opening a socket directly, or by invoking the server over ssh. (From the server's point of view this looks a lot like being started by inetd.)
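
A minimal sketch of the client side of this idea, in Python. The names open_tcp, open_command and request are hypothetical, and the one-line protocol is invented purely for illustration. The point it tries to show is that the protocol code only ever sees a connected stream, whichever way the connection was opened.

```python
import socket
import subprocess


def open_tcp(host, port):
    """TCP mode: connect straight to a listening server."""
    return socket.create_connection((host, port))


def open_command(argv):
    """Command mode: run argv (for example ssh) with one end of a
    socketpair as its stdin and stdout; return the other end and the child."""
    ours, theirs = socket.socketpair()
    child = subprocess.Popen(argv, stdin=theirs, stdout=theirs)
    theirs.close()  # the child keeps its own copies as fds 0 and 1
    return ours, child


def request(stream, line):
    """Toy protocol: send one line, read one line back.

    It only needs a connected stream socket, so it cannot tell (and does
    not care) which of the two functions above opened it.
    """
    stream.sendall(line + b"\n")
    reply = b""
    while not reply.endswith(b"\n"):
        chunk = stream.recv(4096)
        if not chunk:
            break  # peer closed the stream
        reply += chunk
    return reply
```

In SSH mode the argv would be something like ["ssh", host, "distccd", "--inetd"]; for trying the sketch locally, any program that reads stdin and writes stdout can stand in for ssh.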

known uses

CVS is the earliest program that I know of to follow this pattern. It is very common to have developers access the repository over SSH, and the rest of the world access a read-only mirror using pserver. However, some security-conscious projects (OpenBSD?) allow anonymous access over SSH so that public data is not modified in transit.

rsync also uses this pattern. Both modes are popular. In particular, TCP connections are often used for anonymous mirroring. One problem with the way this is used in rsync is that the two modes behave very differently: anonymous mode uses a "modules" virtual filesystem, allows access control, etc.

distcc borrowed this idea from rsync.

Subversion has recently added this mode of operation through the svn: URL protocol, and I think this will greatly ease adoption. Previously authenticated access required Apache 2 with the special mod_dav_svn module, which encountered some resistance from administrators.

consequences

(If it sounds like there are a lot of negative consequences then that is mostly because familiarity with this pattern has made me look more critically. It's really quite good.)

When run in SSH mode, the application is very secure without needing to include any security or crypto code of its own. As soon as the application starts running, it knows that the user has passed the system's requirements to open an SSH connection. This is not to say that the user may not still try to get up to mischief, but the program is in the second line, not the front line.

In TCP mode connections are about as fast and simple as you can get. Very standard Unix, Stevens and all that.

Server processes run under the persona of the user who started them. They should not interfere with each other; if they want to access other system resources they do so subject to normal security mechanisms.

Ordinary users can install the server and use it over their SSH connection without needing any administrative setup, assuming they are able to install programs in their own directory. They are not exposing the system to attacks in the way that they would be by listening on a port.

sshd can be configured to allow particular users to only run particular commands.

Although the code to open SSH connections is a little complex, it can be adapted (subject to licensing restrictions) from other programs. If the code can't be used, then it is straightforward to rewrite.

The default remote username is the same as the local user, which is often reasonable. It can be easily overridden.

SSH is happiest with a separate OS-level account for all users, although you could probably set it up to allow different people to use different keys to get into the same account with different limitations.

A lot of the documentation of the program has to say "see the SSH manual for details", or just replicate it. This makes the documentation easier to write, but perhaps not easier to understand.

Normally, allowing somebody to SSH in to run the server allows them to run arbitrary commands too, though it can be tied down.

It is harder, though not impossible, to put limitations on what people connected by SSH can do, because they run an arbitrary command. This is a little different to typical FTP or HTTP servers, where root can impose limits on authenticated and anonymous users alike.
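
One way of tying it down can be sketched as a small wrapper installed as a forced command, for example through a command="..." option on the key's line in authorized_keys. sshd then runs the wrapper instead of what the client asked for, and puts the client's request in the SSH_ORIGINAL_COMMAND environment variable. The allowlist contents and the function names below are assumptions made for illustration.

```python
import os
import shlex

# Hypothetical allowlist: the exact requests this account may make,
# mapped to the argv that is really executed.
ALLOWED = {
    ("distccd", "--inetd"): ["/usr/local/bin/distccd", "--inetd"],
}


def resolve_request(original_command, allowed=ALLOWED):
    """Return the argv to execute, or None if the request is refused."""
    if not original_command:
        return None  # no command requested, i.e. an interactive login
    try:
        words = tuple(shlex.split(original_command))
    except ValueError:
        return None  # unbalanced quotes and similar
    # Only exact matches pass; nothing is ever handed to a shell.
    return allowed.get(words)


def main():
    argv = resolve_request(os.environ.get("SSH_ORIGINAL_COMMAND"))
    if argv is None:
        raise SystemExit("this account may only run the server")
    os.execv(argv[0], argv)  # replace the wrapper with the server
```

The wrapper script would call main(). Because only exact matches are accepted, a request such as "distccd --inetd; rm -rf /" is refused rather than interpreted.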

There is an asymmetry between the server directly listening on a port in TCP mode, and the server being started by sshd in SSH mode. It might be nice if instead of opening the socket ourselves, there was something like in.rshd that would allow anonymous connections to run one particular program.

The application is limited to very traditional client-server TCP applications. UDP, peers connecting back to clients, and multicast can't happen (or at least can't be secured.)

The server has no control over when it is invoked, at least in SSH mode. It can't do things like preforking or server pools that are relatively straightforward when it accepts its own connections. Apache is the most famous example of this, but distcc does it too from version 2.5 and apparently gains a small benefit, and also a simple way to restrict the number of incoming connections.
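
To make the contrast concrete, here is a rough sketch of a server pool in TCP mode. It uses threads as a stand-in for preforked processes to keep the example short, so it is not how Apache or distcc do it; handle, worker and start_pool are hypothetical names, and the echo behaviour is a placeholder for real work.

```python
import socket
import threading


def handle(conn):
    """Toy per-connection work: echo one chunk back in upper case."""
    with conn:
        data = conn.recv(4096)
        if data:
            conn.sendall(data.upper())


def worker(listener):
    """Each worker blocks in accept() on the shared listening socket."""
    while True:
        conn, _addr = listener.accept()
        try:
            handle(conn)
        except OSError:
            pass  # a broken connection should not kill the worker


def start_pool(listener, size):
    """Start `size` workers up front. At most `size` connections are
    served at once; later ones wait in the kernel's listen queue."""
    threads = [
        threading.Thread(target=worker, args=(listener,), daemon=True)
        for _ in range(size)
    ]
    for thread in threads:
        thread.start()
    return threads
```

The pool size doubles as the connection limit. None of this is available when sshd starts one server process per connection.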

SSH is slightly heavy, particularly in the latency for opening connections, compared to possible other protocols. The details will depend on the application but for example distcc is about 25% slower when run over SSH in a typical environment.

Secured connections tie together several issues that might usefully be separated: authentication, assignment to a server-side OS persona, integrity of the connection, confidentiality of the connection, etc.

implementation

For a reasonably readable implementation, see the distcc source. Some particular points to note:

  • The distccd server can run either standalone, or (equivalently) from inetd or sshd. This is configured by command line parameters. For friendliness the daemon tries to guess if neither --daemon nor --inetd is specified. It looks like the most reliable thing is to see whether stdin is a socket or a tty. But there are plausible situations where the server just cannot tell: one might reasonably do ssh root@fatso distccd and want it to start a daemon.

  • The client needs to fork a child process that will run ssh with a command line ssh $host distccd --inetd.

  • The "near end" of ssh needs to be connected to the client program. The best way to do this on Unix is with a socketpair, which creates an anonymous unix-domain socket. This is not available on all systems, so you may have to fall back to using a pair of pipe()s. These work fine, except that they are unidirectional and therefore all the client code needs to be able to cope with having separate input and output fds. (Or you could just not support systems that don't have socketpair(), but using separate fds is not so hard if you start from the beginning.)
  • I have no idea how you'd do this on Windows. I would start looking at Putty and see what that supports, if anything.
  • It's obviously useful for the user to be able to change the name of the command used to open the connection.
  • Because ~/.ssh/config can contain sections switched by destination host name it is less necessary to have a way for the user to append arbitrary arguments to the command line.
  • It can be faster/more reliable to execvp() ssh directly rather than through a shell. This goes against the user being able to specify anything other than just the first word of the command. (CVS and distcc have this behaviour; rsync does not.)
  • The way $PATH gets set for commands invoked over ssh can be quite confusingly different to what users see interactively for many reasons. distccd is installed by default into /usr/local/bin rather than sbin to make it more likely that it can be found on the default path. This has persistently been a problem for people with rsync installed in say /opt/freeware/rsync/bin/rsync. There must be a way for the full path to the remote server to be specified on the client to fix this kind of problem.
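
Two of the points above can be sketched in a few lines of Python. This is not distcc's code: guess_server_mode and build_remote_argv are hypothetical names, and treating a pipe on stdin as inetd-style is an assumption, made because sshd may hand the command a pipe rather than a socket.

```python
import os
import stat


def guess_server_mode(fd=0):
    """Guess how the server was started from what stdin is connected to."""
    if os.isatty(fd):
        return "daemon"  # started by hand from a terminal
    mode = os.fstat(fd).st_mode
    if stat.S_ISSOCK(mode) or stat.S_ISFIFO(mode):
        return "inetd"   # assumption: a socket or pipe means inetd or sshd
    return "daemon"      # anything else, such as /dev/null


def build_remote_argv(host, connect_command="ssh", user=None,
                      server_path="distccd"):
    """Build the argv for running the remote server with no local shell.

    connect_command lets the user substitute another program for ssh;
    server_path lets them give the full remote path when the remote
    $PATH does not include the server.
    """
    if host.startswith("-"):
        raise ValueError("host name must not look like an option")
    argv = [connect_command]
    if user is not None:
        argv += ["-l", user]  # ssh's option for the remote login name
    argv += [host, server_path, "--inetd"]
    return argv
```

A client would fork and call os.execvp(argv[0], argv) in the child. The guess gets the ssh root@fatso distccd case wrong in exactly the way described above, which is why the explicit --daemon and --inetd options still matter.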

variations

  • Use the SSH "subsystem" facility. I don't know much about it.

  • Leave out entirely the ability for the server to listen on sockets of its own accord, and depend on something like inetd or netcat to run it and possibly to do access control. This means the server sees almost no difference between TCP and SSH connections.
  • In a first cut of the client, use netcat or socket to open connections.
  • rsync offers a weak challenge-response authentication protocol for use over TCP. This can be useful on mostly-secure networks. It does not send cleartext passwords, but file data is sent clear, and it would be vulnerable to a man-in-the-middle attack.
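
The general shape of a challenge-response exchange can be sketched as follows. This is not rsync's wire protocol: HMAC-SHA256 is swapped in here as a convenient standard-library primitive, and the function names are hypothetical. It shows why the password never crosses the wire, and also why nothing protects the data that follows.

```python
import hashlib
import hmac
import secrets


def make_challenge():
    """Server: a fresh random challenge for each connection."""
    return secrets.token_bytes(16)


def respond(secret, challenge):
    """Client: prove knowledge of the secret without sending it."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret, challenge, response):
    """Server: recompute the expected response, compare in constant time."""
    return hmac.compare_digest(respond(secret, challenge), response)
```

Only the challenge and the response are sent. The exchange authenticates the client once at the start and does nothing for the rest of the stream, which is why the file data is still sent in the clear and open to a man in the middle.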

alternative solutions

Sun wanted GSSAPI to do something like this: NFSv4 connections can either be weakly or strongly authenticated. But it does not seem to have become very generally popular on Linux, and therefore does not really have the "no overhead" benefit of SSH.

On a platform with a well-established secure RPC like Windows it might make sense to use it. But that is possibly less efficient than plain TCP, and it certainly constrains the application.
