\documentclass[10pt,letterpaper]{article}

\usepackage{vmargin}
\usepackage{multicol}
\usepackage{times}
\usepackage{epsfig}
\usepackage{float}
\usepackage{color}
\usepackage{calc}
\usepackage{alltt}
\usepackage{verbatim}
\usepackage{moreverb}
\usepackage{fancyheadings}
\usepackage{supertabular}
\usepackage{colortbl}
\usepackage{rotating}

\floatstyle{boxed}

\makeatletter

\let\hangafter\@hangfrom

\definecolor{dark}{rgb}{0.5,0.5,0.5}
\definecolor{medium}{rgb}{0.7,0.7,0.7}
\definecolor{light}{rgb}{0.95,0.95,0.95}
\newcommand{\grayrow}{\rowcolor[gray]{0.8}}

\setpapersize{USletter}
\setmarginsrb{1.0in}{1.0in}{1.0in}{0.5in}{0pt}{0pt}{0.5in}{0.5in}

%\renewcommand{\sfdefault}{phv}
\renewcommand{\section}{\@startsection%
    {section}{1}{\z@}%
    {-3.5ex \@plus -1ex \@minus -.2ex}%
    {2.3ex \@plus.2ex}%
    {\normalfont\Large\bfseries}}
\renewcommand{\subsection}{\@startsection%
    {subsection}{2}{\z@}%
    {-3.25ex\@plus -1ex \@minus -.2ex}%
    {1.5ex \@plus .2ex}%
    {\normalfont\large\bfseries}}
\renewcommand{\subsubsection}{\@startsection%
    {subsubsection}{3}{\z@}%
    {-3.25ex\@plus -1ex \@minus -.2ex}%
    {1.5ex \@plus .2ex}%
    {\normalfont\normalsize}}
\renewcommand{\paragraph}{\@startsection%
    {paragraph}{4}{\z@}%
    {3.25ex \@plus1ex \@minus.2ex}%
    {-1em}%
    {\normalfont\normalsize\bfseries}}
\renewcommand{\subparagraph}{\@startsection%
    {subparagraph}{5}{\parindent}%
    {3.25ex \@plus1ex \@minus .2ex}%
    {-1em}%
    {\normalfont\normalsize}}

\renewcommand{\labelitemi}{   \rule{6pt}{6pt} }
\renewcommand{\labelitemii}{  \rule{5pt}{5pt} }
\renewcommand{\labelitemiii}{ \rule{4pt}{4pt} }
\renewcommand{\labelitemiv}{  \rule{3pt}{3pt} }

\newcommand{\mypar}[1]{\medskip\noindent{\large\bfseries #1}\smallskip}

\makeatother

\pagestyle{fancy}
\lhead{}
\chead{}
\rhead{}
\headrulewidth=0pt
\lfoot{}
\cfoot{\thepage}
\rfoot{}

\begin{document}

\begin{center}
{\huge\bfseries Portable Multithreading}\\[1mm]
{\large\bfseries The Signal Stack Trick For User-Space Thread Creation}\\[5mm]
{\large Ralf S. Engelschall}\\[1mm]
{\large\textit{Technische Universit\"at M\"unchen (TUM)}}\\[1mm]
rse@engelschall.com, http://www.engelschall.com\\[8mm]
%{\large June 18th, 2000}
\end{center}

\begin{abstract}
\noindent This paper describes a pragmatic but portable fallback
approach for creating and dispatching between the machine contexts of
multiple threads of execution on Unix systems that lack a dedicated
user-space context switching facility. Such a fallback approach
for implementing machine contexts is a vital part of a user-space
multithreading environment, if it has to achieve maximum portability
across a wide range of Unix flavors. The approach is entirely based
on standard Unix system facilities and ANSI-C language features and
especially does not require any assembly code or platform specific
tricks at all. The most interesting issue is the technique of creating
the machine context for threads, which this paper explains in detail.
The described approach closely follows the algorithm as implemented by
the author for the popular user-space multithreading library \textit{GNU
Portable Threads} (\textit{GNU Pth}, \cite{pth}) which this way quickly
gained the status of one of the most portable user-space multithreading
libraries.

\medskip
\noindent
\textsf{\bfseries Keywords:}
portability, multithreading, Unix, POSIX, SUSv2, ANSI-C, user-space, context
creation, context switching, signal handler, stack, makecontext,
switchcontext, sigaltstack, setjmp, longjmp.

\medskip
\noindent
\textsf
{\bfseries Publishing:}
Early drafts of this paper were distributed with the GNU Pth
distribution. The final release version was published on the USENIX
Annual Technical Conference, June 18-23, 2000, San Diego, California,
USA.
\end{abstract}

\medskip
\columnsep=0.25in
\begin{multicols}{2}[\section{Introduction}]

\subsection{Multithreading}

The paradigm of programming with multiple threads of execution (aka
\textit{multithreading}) is already a very old one and dates back to
the decades of programming with \textit{co-routines} \cite{co1,co2}.
Paradoxically, the use of threads on Unix platforms did not become
popular until the early 1990s.

\mypar{Multithreading Advantages}

\noindent
Multithreading can provide many benefits for applications (good runtime
concurrency, parallel programming techniques can be implemented more easily,
the popular procedural programming style can be combined with multiple
threads of execution, \textit{etc.}) but the most interesting ones are
usually performance gains and reduced resource consumption. Because
in contrast to multiprocess applications, multithreaded ones usually
require less system resources (mainly memory) and their internal
communication part can leverage from the shared address space.

\mypar{Multithreading and Applications}

\noindent
Nevertheless there still exist just a few real applications in the free
software world that use multithreading for their benefit, although
their application domains are predestined for multithreading. For
instance, the popular Apache webserver as of version 1.3 still uses
a pre-forking process model for serving HTTP requests, although
two experiments with multithreaded Apache variants in 1996 (with
\textit{rsthreads} \cite{rsthreads}) and 1998 (with \textit{NSPR}
\cite{nspr}) already showed great performance boosts. The same applies
to many similar applications.

The reason for this restraint mainly is that for a long time,
multithreading facilities under Unix were rare. The situation
became better after some vendors like \textit{Sun} and \textit{DEC}
incorporated threading facilities into their Unix flavors and
\textit{POSIX} standardized a threading \textit{Application Programming
Interface} (API) (aka \textit{Pthreads} \cite{iso}). But an API and a
few vendor implementations are not enough to fulfill the portability
requirements of modern free software packages. Here stand-alone and
really portable multithreading environments are needed.

The author collected and evaluated over twenty (mostly user-space)
available multithreading facilities for Unix systems (see Table 1),
but only a few of them are freely available and showed to be really
portable. And even the mostly portable ones suffered from the fact that
they partly depend on assembly code or platform specific tricks usually
related to the creation and dispatching of the individual threads. This
means that the number of platforms they support is limited and
applications which are based on these facilities are only portable to
those platforms. This situation is not satisfactory, so application
authors still avoid the use of multithreading if they want to (or have
to) achieve maximum portability for their application.

A pragmatic and mostly portable fallback technique for implementing
user-space threads can facilitate wider use of multithreading in free
software applications.

\newcommand{\he}[1]{\begin{rotate}{45}\textbf{\small #1}\end{rotate}}
\begin{figure*}[t]
\noindent
\fbox
{\small
\setlength{\tabcolsep}{2.5pt}
\begin{tabular}{lllllllllllllllll}
\rule{0pt}{26mm}%
\textbf{Package} & \he{Genesis} & \he{Latest Version} & \he{Implementation Space} & \he{Thread Mapping} & \he{Active Development} & \he{Experimental State} & \he{Open Source} & \he{Pthread API} & \he{Pthread Shared Memory} & \he{Native API} & \he{Native API $\ge$ Pthread API} & \he{Native API is Pthread API} & \he{Preemptive Scheduling} & \he{Portability} & \he{Assembly Code} & \he{SysCall Wrap.} \\
\hline
gnu-pth         & 1999 & 1.3.5  & user        & n:1  & yes  & no  & yes & yes & no  & yes & yes & no     & no  & full/mcsc+sjlj & no  & partly \\[-1pt]
cmu-lwp         & 1984 & 1.4    & user        & n:1  & yes  & no  & yes & no  & -   & yes & yes & partly & no  & semi/fixed:8   & yes & no     \\[-1pt]
fsu-pthread     & 1992 & 3.13   & user        & n:1  & no   & no  & yes & yes & no  & no  & -   & -      & yes & semi/fixed:6   & yes & yes    \\[-1pt]
mit-pthread     & 1993 & 1.8.9  & user        & n:1  & no   & no  & yes & yes & no  & no  & -   & -      & yes & semi/fixed:17  & yes & yes    \\[-1pt]
ptl             & 1997 & 990622 & user        & n:1  & no   & no  & yes & yes & no  & no  & -   & -      & yes & semi/fixed:10  & yes & yes    \\[-1pt]
linuxthreads    & 1997 & 2.1.2  & user+kernel & 1:1  & yes  & no  & yes & yes & no  & no  & -   & -      & yes & semi/fixed:5   & yes & yes    \\[-1pt]
uthread         & 1998 & 3.4    & user        & n:1  & yes  & no  & yes & yes & no  & no  & -   & -      & yes & semi/fixed:2   & yes & yes    \\[-1pt]
cthread         & 1991 & 991115 & user        & n:1  & no   & no  & yes & no  & -   & yes & yes & no     & no  & semi/fixed:8   & yes & yes    \\[-1pt]
openthreads/qt  & 1996 & 2.0    & user        & n:1  & no   & no  & yes & no  & -   & yes & no  & no     & no  & semi/fixed:9   & yes & no     \\[-1pt]
rt++/qt         & 1996 & 1.0    & user        & n:1  & no   & no  & yes & no  & -   & yes & yes & no     & no  & semi/fixed:9   & yes & no     \\[-1pt]
rsthreads       & 1996 & 980331 & user        & n:1  & no   & yes & yes & no  & -   & yes & no  & no     & no  & semi/fixed:9   & yes & no     \\[-1pt]
pcthread        & 1996 & 1.0    & user        & n:1  & no   & yes & yes & yes & no  & no  & -   & -      & yes & semi/fixed:1   & yes & no     \\[-1pt]
bbthreads       & 1996 & 0.3    & kernel      & 1:1  & no   & yes & yes & no  & -   & yes & no  & -      & yes & semi/fixed:1   & yes & no     \\[-1pt]
jkthreads       & 1998 & 1.2    & kernel      & 1:1  & no   & yes & yes & no  & -   & yes & no  & -      & yes & semi/fixed:1   & yes & no     \\[-1pt]
nthreads        & 1997 & 970604 & user        & n:1  & no   & yes & yes & no  & -   & yes & no  & -      & no  & semi/fixed:9   & yes & partly \\[-1pt]
rexthreads      & 1993 & 930614 & user        & n:1  & no   & yes & yes & no  & -   & yes & no  & -      & no  & semi/fixed:4   & yes & no     \\[-1pt]
coro            & 1999 & 1.0.3  & user        & n:1  & no   & yes & yes & no  & -   & yes & no  & -      & no  & semi/fixed:1   & yes & no     \\[-1pt]
greenthreads    & 1995 & 1.2    & user        & n:1  & no   & no  & no  & no  & -   & yes & yes & -      & yes & full/mcsc      & no  & no     \\[-1pt]
solaris-pthread & NN   & 2.7    & user+kernel & n:m  & yes  & no  & no  & yes & yes & yes & yes & no     & yes & NN             & NN  & yes    \\[-1pt]
tru64-pthread   & NN   & 5.0    & user+kernel & n:m  & yes  & no  & no  & yes & yes & no  & no  & no     & yes & NN             & NN  & yes    \\[-1pt]
aix-pthread     & NN   & 4.3    & user+kernel & 1:1  & yes  & no  & no  & yes & yes & no  & no  & no     & yes & NN             & NN  & yes    \\[-1pt]
\end{tabular}\ \ \ \ \ \ \ 
}\\[4pt]
\parbox{\linewidth}{\footnotesize\baselineskip=0pt
\hangafter{\textbf{Table 1:}\quad}%
Summary of evaluated multithreading packages and some of their
determined characteristics. Notice that mostly all packages con\-tain
assembly code and are just semi-portable, \textit{i.e.}, they support
only a fixed set of platforms and do not automatically adjust for new
ones.
}
\end{figure*}

\mypar{Ingredients of a Thread}

\noindent
A Unix process has many ingredients, but the most important ones are its
memory mapping table, the signal dispatching table, the signal mask, the
set of file descriptors and the machine context. The machine context
in turn consists of at least the CPU registers including the program
counter and the stack pointer. In addition, there can be light-weight
processes (LWP) or threads, which usually share all attributes with the
underlying (heavy-weight) process except for the machine context.

\mypar{Kernel-Space vs. User-Space}

\noindent
Those LWPs or threads, on a Unix platform classically can be implemented
either in kernel-space or in user-space. When implemented in
kernel-space, one usually calls them LWPs or kernel threads, otherwise
(user-space) threads. If threads are implemented by the kernel, the
thread context switches are performed by the kernel without notice by
the application, similar to the dispatching of processes. If threads are
implemented in user-space, the thread context switches are performed
usually by an application library without notice by the kernel.
Additionally, there exist hybrid threading approaches, where typically a
user-space library binds one or more user-space threads to one or more
kernel-space LWPs.

\mypar{Thread Models}

\noindent
The vendor threading facilities under \textit{Sun Solaris},
\textit{IBM AIX}, \textit{DEC Tru64} (formerly \textit{DIGITAL
UNIX} or \textit{OSF/1}) and \textit{SGI IRIX} use a \textbf{M:N}
mapping \cite{str,aix}, \textit{i.e.}, \textbf{M} user-space threads
are mapped onto \textbf{N} kernel-space LWPs. On the other hand,
\textit{LinuxThreads} \cite{linuxthreads} under \textit{GNU/Linux}
uses a \textbf{1:1} mapping and pure user-space implementations like
\textit{GNU Pth}, \textit{FSU pthreads} or \textit{MIT pthreads},
\textit{etc.} use a \textbf{M:1} mapping \cite{pth,fsu,mit}.

From now on we focus on such \textbf{M:1} user space threading
approaches, where one or more user space threads are implemented
inside a single kernel space process. The exercise is to implement
this by using standardized Unix system and ANSI-C language facilities
\textit{only}.

\subsection{The Exercise}

As we have mentioned, a thread shares its state with the underlying
process except for the machine context. So the major task for a
user-space threading system is to create and dispatch those machine
contexts.

In practice, the second major task it has to do is to ensure that no
thread by accident blocks the whole process (and thereby all other
threads). Instead when an operation would block, the threading library
should suspend only the execution of the current thread and in the
meantime dispatch the remaining threads. But this task is outside the
scope of this paper (see \cite{bmv} for details about this task). We
focus only on the aspect of machine context handling.

\subsection{The Curse of Portability}

Our goal of real portability for a threading system causes some
non-trivial problems which have to be solved. The most obvious one is
that dealing with machine contexts usually suffers from portability,
because it is a highly CPU dependent task for which not every Unix
flavor provides a standardized API. Although such an API would be not
too hard for vendors to provide, because in principle it is just a matter
of switching a few CPU registers (mainly the program counter and the
stack pointer).

\mypar{Assembly Code Considered Harmful}

\noindent
Additionally, we disallow the use of any assembly solutions or platform
specific tricks, because then the threading system again would be only
semi-portable, \textit{i.e.}, it can be ported to \textbf{N} platforms
but on the \textbf{(N+1)}th platform one has to manually adjust or even
extend it to work there, too.

This is usually not acceptable, even if it also makes solving the
problems harder. At least most of the known free software user-space
threading systems \cite{fsu,mit,ptl2} do not restrict themself to this
and therefore are just semi-portable. But real portability should be a
major goal.

\section{Problem Analysis}

\subsection{The Task in Detail}

Our task is simple in principle: provide an API and corresponding
implementation for creating and dispatching machine contexts on which
user-space threads can be implemented.

\mypar{The Proposed API}

\noindent
In detail we propose the following \textit{Application Programmers
Interface} (API) for the machine context handling:

\begin{itemize}
\item A data structure of type \texttt{mctx\_t} which holds the machine context.

\item A function 
      ``\textbf{void} \texttt{mctx\_crea\-te(}\texttt{mctx\_t} \texttt{*}\textit{mctx},
      \textbf{void} \texttt{(*}\textit{sf\_addr}\texttt{)(}\textbf{void *}\texttt{),}
      \textbf{void} \texttt{*}\textit{sf\_arg}, 
      \textbf{void} \texttt{*}\textit{sk\_addr}, 
      \textbf{size\_t} \textit{sk\_size}\texttt{)}''
      which creates and initializes a machine context structure in
      \textit{mctx} with a start function \textit{sf\_addr}, a start
      function argument \textit{sf\_arg}, and a stack starting at
      \textit{sk\_addr}, which is \textit{sk\_size} bytes in size.
  
\item A function
      ``\textbf{void} \texttt{mctx\_save(}\texttt{mctx\_t}
      \texttt{*}\textit{mctx}\texttt{)}''
      which saves the current machine context into the machine context
      structure \textit{mctx}.

\item A function
      ``\textbf{void} \texttt{mctx\_restore(}\texttt{mctx\_t}
      \texttt{*}\textit{mctx}\texttt{)}''
      which restores the new machine context from the machine context
      structure \textit{mctx}. This function does not return to
      the caller. Instead it does return at the location stored
      in \textit{mctx} (which is either \textit{sf\_addr} from a
      previous \texttt{mctx\_create} call or the location of a previous
      \texttt{mctx\_save} call).

\item A function
      ``\textbf{void} \texttt{mctx\_switch(}\texttt{mctx\_t} \texttt{*}\textit{mctx\-\_old},
      \texttt{mctx\_t} \texttt{*}\textit{mctx\_new}\texttt{)}''
      which switches from the current machine context (saved to
      \textit{mctx\_old} for later use) to a new context (restored from
      \textit{mctx\_new}). This function returns only to the caller if
      \texttt{mctx\_restore} or \texttt{mctx\_switch} is again used on
      \textit{mctx\_old}.
      
\end{itemize}

\subsection{Technical Possibilities}

Poking around in the references of the ANSI-C language reference and the
Unix standards show the following functions on which an implementation
can be based:

\begin{itemize}
\item There is the \texttt{ucontext}(3) facility with the functions
      \texttt{getcontext}(3), \texttt{makecontext}(3),
      \texttt{swapcontext}(3) and \texttt{setcontext}(3) which
      conform to the \textit{Single Unix Specification}, Version 2
      (\textit{SUSv2} \cite{sus}, aka \textit{Unix95/98}). Unfortunately
      these are available on modern Unix platforms only.

\item There are the \texttt{jmp\_buf} based
      functions \texttt{setjmp}(3) and \texttt{long\-jmp}(3) which
      conform to ISO 9899:1990 (ISO-C) and the \texttt{sigjmp\_buf}
      based \texttt{sigsetjmp}(3) and \texttt{sig\-long\-jmp}(3)
      functions which conform to IEEE Std\-1003.1-1988 (\textit{POSIX}),
      and \textit{Single Unix Specification}, Version 2 (\textit{SUSv2}
      \cite{sus}, aka \textit{Unix95/98}). The first two functions are
      available really on all Unix platforms, the last two are available
      only on some of them.

      On some platforms \texttt{setjmp}(3) and \texttt{longjmp}(3)
      save and restore also the signal mask (if one does not want
      this semantics, one has to call \texttt{\_setjmp}(3) and
      \texttt{\_longjmp}(3) there) while on others one has to
      explicitly use the superset functions \texttt{sigsetjmp}(3) and
      \texttt{siglongjmp}(3) for this. In our discussion we can assume
      that \texttt{setjmp}(3) and \texttt{longjmp}(3) save and restore
      the signal mask, because if this is not the case in practice,
      one easily can replace them with \texttt{sigsetjmp}(3) and
      \texttt{siglongjmp}(3) calls (if available) or (if not available)
      emulate the missing functionality manually with additional
      \texttt{sigprocmask}(2) calls (see \texttt{pth\_mctx.c} in
      \textit{GNU Pth} \cite{pth}).

\item There is the function \texttt{sigaltstack}(2) which 
      conforms to the \textit{Single Unix Specification}, Version 2
      (\textit{SUSv2} \cite{sus}, aka \textit{Unix95/98}) and its
      ancestor function \texttt{sigstack}(2) from \textit{4.2BSD}. The
      last one exists only on \textit{BSD}-derived platforms, but the
      first function already exists on all current Unix platforms.
\end{itemize}

\subsection{Maximum Portability Solution}

The maximum portable solution obviously is to use
the standardized \texttt{makecontext}(3) function
to create threads and \texttt{switchcontext}(3) or
\texttt{getcontext}(3)/\texttt{setcontext}(3) to dispatch them.
And actually these are the preferred functions modern user-space
multithreading systems are using. We could easily implement our proposed
API as following (all error checks omitted for better readability):

{\small
\baselineskip=9pt
\begin{alltt}
/* \textrm{\textit{\bfseries machine context data structure}} */
\textbf{typedef struct} mctx_st \{ 
    ucontext_t uc;
\} mctx_t;

/* \textrm{\textit{\bfseries save machine context}} */
\textbf{#define} mctx_save(mctx) \(\backslash\)
    (\textbf{void})getcontext(&(mctx)->uc)

/* \textrm{\textit{\bfseries restore machine context}} */
\textbf{#define} mctx_restore(mctx) \(\backslash\)
    (\textbf{void})setcontext(&(mctx)->uc)

/* \textrm{\textit{\bfseries switch machine context}} */
\textbf{#define} mctx_switch(mctx_old,mctx_new) \(\backslash\)
    (\textbf{void})swapcontext(&((mctx_old)->uc), \(\backslash\)
                      &((mctx_new)->uc))

/* \textrm{\textit{\bfseries create machine context}} */
\textbf{void} mctx_create(
    mctx_t *mctx,
    \textbf{void} (*sf_addr)(\textbf{void} *), \textbf{void} *sf_arg,
    \textbf{void} *sk_addr, size_t sk_size) 
\{

    /* \textrm{\textit{\bfseries fetch current context}} */
    getcontext(&(mctx->uc));

    /* \textrm{\textit{\bfseries adjust to new context}} */
    mctx->uc.uc_link           = NULL;
    mctx->uc.uc_stack.ss_sp    = sk_addr;
    mctx->uc.uc_stack.ss_size  = sk_size;
    mctx->uc.uc_stack.ss_flags = 0;

    /* \textrm{\textit{\bfseries make new context}} */
    makecontext(&(mctx->uc), 
                sf_addr, 1, sf_arg);
    \textbf{return};
\}
\end{alltt}}

\noindent
Unfortunately there are still lots of Unix platforms where this
approach cannot be used, because the standardized \texttt{ucontext}(3)
API is not provided by the vendor. Actually the platform test results
for \textit{GNU Pth} (see Table 2 below) showed that only 7 of 21
successfully tested Unix flavors provided the standardized API
(\texttt{makecontext}(3), \textit{etc.}). On all other platforms,
\textit{GNU Pth} was forced to use the fallback approach of implementing
the machine context as we will describe in the following. Obviously
this fallback approach has to use the remaining technical possibilities
(\texttt{sigsetjmp}(3), \textit{etc.}).

\medskip
\noindent
\fbox
{\parbox{7.8cm}
{\footnotesize
\noindent
\medskip
\begin{tabular}{l|l|l|l}
{\bfseries Operating System} & 
{\bfseries Architecture(s)} & 
{\bfseries mcsc} & 
{\bfseries sjlj} \\
FreeBSD 2.x/3.x            & Intel             & no  & yes \\
FreeBSD 3.x                & Intel, Alpha      & no  & yes \\
NetBSD 1.3/1.4             & Intel, PPC, M68K  & no  & yes \\
OpenBSD 2.5/2.6            & Intel, SPARC      & no  & yes \\
BSDI 4.0                   & Intel             & no  & yes \\
Linux 2.0.x glibc 1.x/2.0  & Intel, SPARC, PPC & no  & yes \\
Linux 2.2.x glibc 2.0/2.1  & Intel, Alpha, ARM & no  & yes \\
Sun SunOS 4.1.x            & SPARC             & no  & yes \\
Sun Solaris 2.5/2.6/2.7    & SPARC             & yes & yes \\
SCO UnixWare 2.x/7.x       & Intel             & yes & yes \\
SCO OpenServer 5.0.x       & Intel             & no  & yes \\
IBM AIX 4.1/4.2/4.3        & RS6000, PPC       & yes & yes \\
HP HPUX 9.10/10.20         & HPPA              & no  & yes \\
HP HPUX 11.0               & HPPA              & yes & yes \\
SGI IRIX 5.3               & MIPS 32/64        & no  & yes \\
SGI IRIX 6.2/6.5           & MIPS 32/64        & yes & yes \\
ISC 4.0                    & Intel             & no  & yes \\
Apple MacOS X              & PPC               & no  & yes \\
DEC OSF1/Tru64 4.0/5.0     & Alpha             & yes & yes \\
SNI ReliantUNIX            & MIPS              & yes & yes \\
AmigaOS                    & M68K              & no  & yes \\
\end{tabular}

\medskip
\noindent
{\tiny
\begin{tabular}{ll}
\end{tabular}
}
}}\\[4pt]
\parbox{\linewidth}{\footnotesize\baselineskip=0pt
\hangafter{\textbf{Table 2:}\quad}%
Summary of operating system support. The level and type of support found on
each tested operating system. \texttt{mcsc}: functional
\texttt{makecontext}(3)/\-\texttt{switchcontext}(3), \texttt{sjlj}: functional
\texttt{setjmp}(3)/\texttt{longjmp}(3) or
\texttt{sig\-setjmp}(3)/\-\texttt{siglongjmp}(3). See file \texttt{PORTING} in
\textit{GNU Pth} \cite{pth} for more details.
}

\subsection{Remaining Possibilities}

Our problem can be divided into two parts, an easy one and a difficult one.

\mypar{The Easy Part}

\noindent
That \texttt{setjmp}(3) and \texttt{longjmp}(3) can be used to implement
user-space threads is commonly known \cite{ptl2,rsthreads,uthread}.
Mostly all older portable user-space threading libraries are based
on them, although some problems are known with these facilities (see
below). So it becomes clear that we also have to use these functions and
base our machine context (\texttt{mctx\_t}) on their \texttt{jmp\_buf}
data structure.

We immediately recognize that this way we have at least solved
the dispatching problem, because our \texttt{mctx\_save},
\texttt{mctx\_restore} and \texttt{mctx\_switch} functions can be easily
implemented with \texttt{setjmp}(3) and \texttt{longjmp}(3).

\mypar{The Difficult Part}

\noindent
Nevertheless, the difficult problem of how to create the machine context
remains. Even knowing that our machine context is \texttt{jmp\_buf}
based is no advantage to us. A \texttt{jmp\_buf} has to be treated
by us as an opaque data structure --- for portability reasons. The
only operations we can perform on it are \texttt{setjmp}(3) and
\texttt{longjmp}(3) calls, of course. Additionally, we are forced to use
\texttt{sigaltstack}(3) for our stack manipulations, because it is the
only portable function which actually deals with stacks.

So it is clear that our implementation for \texttt{mctx\_\-create} has
to play a few tricks to use a \texttt{jmp\_buf} for passing execution
control to an arbitrary startup routine. And our approach 
has to be careful to ensure that it does not suffer from unexpected
side-effects. It should be also obvious that we cannot again
expect to find an easy solution (as for \texttt{mctx\_save},
\texttt{mctx\_restore} and \texttt{mctx\_switch}), because
\texttt{setjmp}(3) and \texttt{sigaltstack}(3) cannot be trivially
combined to form \texttt{mctx\_create}.

\section{Implementation}

As we have already discussed, our implementation contains an easy part
(\texttt{mctx\_save}, \texttt{mctx\_restore} and \texttt{mctx\_switch})
and a difficult part (\texttt{mctx\_create}). Let us start with the easy
part, whose implementation is obvious (all error checks again omitted
for better readability):

{\small
\baselineskip=9pt
\begin{alltt}
/* \textrm{\textit{\bfseries machine context data structure}} */
\textbf{typedef struct} mctx_st \{ 
    jmp_buf jb;
\} mctx_t;

/* \textrm{\textit{\bfseries save machine context}} */
\textbf{#define} mctx_save(mctx) \(\backslash\)
    (\textbf{void})setjmp((mctx)->jb)

/* \textrm{\textit{\bfseries restore machine context}} */
\textbf{#define} mctx_restore(mctx) \(\backslash\)
    longjmp((mctx)->jb, 1)

/* \textrm{\textit{\bfseries switch machine context}} */
\textbf{#define} mctx_switch(mctx_old,mctx_new) \(\backslash\)
    \textbf{if} (setjmp((mctx_old)->jb) == 0) \(\backslash\)
        longjmp((mctx_new)->jb, 1)

/* \textrm{\textit{\bfseries create machine context}} */
\textbf{void} mctx_create(
    mctx_t *mctx,
    \textbf{void} (*sf_addr)(\textbf{void} *), \textbf{void} *sf_arg,
    \textbf{void} *sk_addr, size_t sk_size) 
\{
    \textrm{\textit{\bfseries ...initialization of \texttt{mctx} to be filled in...}}
\}
\end{alltt}}

\noindent
There is one subtle but important point we should mention: The
use of the C pre-processor \texttt{\#define} directive to implement
\texttt{mctx\_save}, \texttt{mctx\_restore} and \texttt{mctx\_switch}
is intentional. For technical reasons related to \texttt{setjmp}(3)
semantics and \texttt{return} related stack behavior (which we will
explain later in detail) we \textit{cannot} implement these three
functions (at least not \texttt{mctx\_switch}) as C functions if we want
to achieve maximum portability across all platforms. Instead they have
to be implemented as pre-processor macros.

\subsection{Algorithm Overview}

The general idea for \texttt{mctx\_create} is to configure the given
stack as a signal stack via \texttt{sigaltstack}(2), send the current
process a signal to transfer execution control onto this stack, save
the machine context there via \texttt{setjmp}(3), get rid of the signal
handler scope and bootstrap into the startup routine.

The real problem in this approach comes from the signal handler scope
which implies various restrictions on Unix platforms (the signal
handler scope often is just a flag in the process control block (PCB)
which various system calls, like \texttt{sigaltstack}(2), check before
allowing the operation -- but because it is part of the process state
the kernel manages, the process cannot change it itself). As we will
see, we have to perform a few tricks to get rid of it. The second main
problem is: how do we prepare the calling of the start routine without
immediately entering it?

\subsection{Algorithm}

The input to the \texttt{mctx\_create} function is the machine context
structure \textit{mctx} which should be initialized, the thread startup
function address \textit{sf\_addr}, the thread startup function argument
\textit{sf\_arg} and a chunk of memory starting at \textit{sk\_addr} and
\textit{sk\_size} bytes in size, which should become the threads stack.

The following algorithm for \texttt{mctx\_create} is directly modeled
after the implemented algorithm one can find in \textit{GNU Pth}
\cite{pth}, which in turn was derived from techniques originally found
in \textit{rsthreads} \cite{rsthreads}:

\begin{enumerate}

\item Preserve the current signal mask and block an arbitrary worker
      signal (we use \texttt{SIGUSR1}, but any signal can be used for
      this -- even an already used one). This worker signal is later
      temporarily required for the trampoline step.

\item Preserve a possibly existing signal action for the worker
      signal and configure a trampoline function as the new temporary
      signal action. The signal delivery is configured to occur on
      an alternate signal stack (see next step).

\item Preserve a possibly active alternate signal stack and configure
      the memory chunk starting at \textit{sk\_addr} as the new
      temporary alternate signal stack of length \textit{sk\_size}.

\item Save parameters for the trampoline step (\textit{mctx},
      \textit{sf\_addr}, \textit{sf\_arg}, \textit{etc.}) in global
      variables, send the current process the worker signal, temporarily
      unblock it and this way allow it to be delivered on the signal
      stack in order to transfer execution control to the trampoline
      function. 

\item After the trampoline function asynchronously entered, save its
      machine context in the \textit{mctx} structure and immediately
      return from it to terminate the signal handler scope.
 
\item Restore the preserved alternate signal stack, preserved signal
      action and preserved signal mask for worker signal. This way
      an existing application configuration for the worker signal is
      restored.

\item Save the current machine context of \texttt{mctx\_create}. This
      allows us to return to this point after the next trampoline step.

\item Restore the previously saved machine context of the trampoline function
      (\textit{mctx}) to again transfer execution control onto the alternate
      stack, but this time without(!) signal handler scope. 

\item After reaching the trampoline function (\textit{mctx}) again,
      immediately bootstrap into a clean stack frame by just calling a
      second function.

\item Set the new signal mask to be the same as the original signal
      mask which was active when \texttt{mctx\_create} was called. This
      is required because in the first trampoline step we usually had at
      least the worker signal blocked.

\item Load the passed startup information (\textit{sf\_addr},
      \textit{sf\_arg}) from \texttt{mctx\_create} into local
      (stack-based) variables. This is important because their values
      have to be preserved in machine context dependent memory until
      the created machine context is the first time restored by the
      application.

\item Save the current machine context for later restoring by
      the calling application.
      
\item Restore the previously saved machine context
      of \texttt{mctx\_create} to transfer execution control back to it.

\item Return to the calling application.

\end{enumerate}

\noindent
When the calling application now again switches into the established
machine context \textit{mctx}, the thread starts running at routine
\textit{sf\_addr} with argument \textit{sf\_arg}. Figure 1 illustrates
the algorithm (the numbers refer to the algorithm steps listed above).

\bigskip
\noindent
\psfig{file=rse-pmt.eps}\\[4pt]
\parbox{\linewidth}{\footnotesize\baselineskip=0pt
\hangafter{\textbf{Figure 1:}\quad}%
Illustration of the machine context creation procedure. The thick
solid lines and numeric marks correspond to the algorithm steps as
described in section 3.2. The thick dotted lines show a possible further
processing where a few context switches are performed to dispatch
between the main thread and the new created thread.
}
\smallskip

\subsection{Source Code}

The corresponding ANSI-C code, which implements \texttt{mctx\_create},
is a little bit more complicated. But with the presented algorithm in
mind, it is now straight-forward.

{\small%
\baselineskip=9pt
\begin{alltt}
\textbf{static} mctx_t       mctx_caller;
\textbf{static} sig_atomic_t mctx_called;

\textbf{static} mctx_t      *mctx_creat;
\textbf{static} \textbf{void}       (*mctx_creat_func)(\textbf{void} *);
\textbf{static} \textbf{void}        *mctx_creat_arg;
\textbf{static} sigset_t     mctx_creat_sigs;

\textbf{void} mctx_create(
    mctx_t *mctx,
    \textbf{void} (*sf_addr)(\textbf{void} *), \textbf{void} *sf_arg, 
    \textbf{void} *sk_addr, \textbf{size_t} sk_size)
\{
    \textbf{struct} sigaction sa;
    \textbf{struct} sigaction osa;
    \textbf{struct} sigaltstack ss;
    \textbf{struct} sigaltstack oss;
    sigset_t osigs;
    sigset_t sigs;

    /* \textrm{\textit{\bfseries Step 1:}} */
    sigemptyset(&sigs);
    sigaddset(&sigs, SIGUSR1);
    sigprocmask(SIG_BLOCK, &sigs, &osigs);

    /* \textrm{\textit{\bfseries Step 2:}} */
    memset((\textbf{void} *)&sa, 0, 
           \textbf{sizeof}(\textbf{struct} sigaction));
    sa.sa_handler = mctx_create_trampoline;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, &osa);

    /* \textrm{\textit{\bfseries Step 3:}} */
    ss.ss_sp    = sk_addr; 
    ss.ss_size  = sk_size;
    ss.ss_flags = 0;
    sigaltstack(&ss, &oss);

    /* \textrm{\textit{\bfseries Step 4:}} */
    mctx_creat      = mctx;
    mctx_creat_func = sf_addr;
    mctx_creat_arg  = sf_arg;
    mctx_creat_sigs = osigs;
    mctx_called     = FALSE;
    kill(getpid(), SIGUSR1);
    sigfillset(&sigs);
    sigdelset(&sigs, SIGUSR1);
    \textbf{while} (!mctx_called)
        sigsuspend(&sigs);

    /* \textrm{\textit{\bfseries Step 6:}} */
    sigaltstack(NULL, &ss);
    ss.ss_flags = SS_DISABLE;
    sigaltstack(&ss, NULL);
    if (!(oss.ss_flags & SS_DISABLE))
        sigaltstack(&oss, NULL);
    sigaction(SIGUSR1, &osa, NULL);
    sigprocmask(SIG_SETMASK, 
                &osigs, NULL);

    /* \textrm{\textit{\bfseries Step 7 \& Step 8:}} */
    mctx_switch(&mctx_caller, mctx);

    /* \textrm{\textit{\bfseries Step 14:}} */
    \textbf{return};
\}

\textbf{void} mctx_create_trampoline(\textbf{int} sig)
\{
    /* \textrm{\textit{\bfseries Step 5:}} */
    \textbf{if} (mctx_save(mctx_creat) == 0) \{
        mctx_called = TRUE;
        \textbf{return};
    \}

    /* \textrm{\textit{\bfseries Step 9:}} */
    mctx_create_boot();
\}

\textbf{void} mctx_create_boot(\textbf{void})
\{
    \textbf{void} (*mctx_start_func)(\textbf{void} *);
    \textbf{void} *mctx_start_arg;

    /* \textrm{\textit{\bfseries Step 10:}} */
    sigprocmask(SIG_SETMASK, 
                &mctx_creat_sigs, NULL);

    /* \textrm{\textit{\bfseries Step 11:}} */
    mctx_start_func = mctx_creat_func;
    mctx_start_arg  = mctx_creat_arg;

    /* \textrm{\textit{\bfseries Step 12 \& Step 13:}} */
    mctx_switch(mctx_creat, &mctx_caller);

    /* \textrm{\textit{\bfseries The thread ``magically'' starts... }} */
    mctx_start_func(mctx_start_arg);

    /* NOTREACHED */
    abort();
\}
\end{alltt}
}

\subsection{Run-time Penalty}

After this discussion of the implementation details, an obviously
occuring question now is what the expected run-time penalty is. That is,
what does our presented machine context implementation cost compared
to a \texttt{ucontext}(3) based solution. From the already discussed
details we can easily guess that our complex machine context creation
procedure (\texttt{mctx\_create}) will be certainly noticeably slower
than a solution based on a \texttt{ucontext}(3) facility.

But a wild guess is not sufficing for a reasonable statement. So we have
written a \textit{Simple Machine Context Benchmark} (SMCB \cite{smcb})
which was used to compare run-time costs of the \texttt{mctx\_create}
and \texttt{mctx\_switch} functions if once implemented through
the \textit{POSIX} \texttt{makecontext}(3)/\texttt{swapcontext}(3)
functions (as shown in section 2.3), and once implemented with our
based fallback implementation (for convenience reasons we directly used
\texttt{sigjmp\_buf}, \texttt{sigsetjmp}(3) and \texttt{siglongjmp}(3)
in the benchmark, because all tested platforms provided this). The
results are shown Table 3 below.

As one can derive from these evaluations, our signal stack trick to
implement \texttt{mctx\_create} in practice is approximately 15 times
slower than the \texttt{makecontext}(3) based variant. This cost
should not be neglected. On the other hand, the \texttt{sigsetjmp}(3)/
\texttt{siglongjmp}(3) based \texttt{mctx\_switch} performs about as
good as the \texttt{swapcontext}(3) based variant (the reason why
on most of the tested platforms it is even slightly faster is not
known -- but we guess it is related to a greater management overhead
in the \texttt{ucontext}(3) facility, which is a superset of the
functionality we require). Or in short: our presented fallback approach
costs noticeable extra CPU cycles on thread creation time, but is as
fast as the standardized solution under thread dispatching time.

\bigskip
\noindent
\fbox
{\parbox{7.8cm}
{\footnotesize
\noindent
\medskip
\ \ \ {\bfseries 10000 $\times$ mctx\_create (in seconds):}\\[1mm]
\begin{tabular}{l|r|r|r}
{\bfseries Platform} & 
{\bfseries mcsc} & 
{\bfseries sjlj} & 
{\bfseries overhead} \\
Sun Solaris 2.6 (SPARC)  & 0.076 & 1.268 & 16.7 \\
DEC Tru64 5.0 (Alpha)    & 0.019 & 0.235 & 12.4 \\
SGI IRIX 6.5 (MIPS)      & 0.105 & 1.523 & 14.5 \\
SCO UnixWare 7.0 (Intel) & 0.204 & 3.827 & 18.8 \\
HP HP/UX 11.0 (HPPA)     & 0.057 & 0.667 & 11.8 \\
\multicolumn{4}{r}{
    {\bfseries Average: \quad 14.8}} \\
\end{tabular}

\medskip
\ \ \ {\bfseries 10000 $\times$ mctx\_switch (in seconds):}\\[1mm]
\begin{tabular}{l|r|r|r}
{\bfseries Platform} & 
{\bfseries mcsc} & 
{\bfseries sjlj} & 
{\bfseries overhead} \\
Sun Solaris 2.6 (SPARC)  & 0.137 & 0.210 & 1.5 \\
DEC Tru64 5.0 (Alpha)    & 0.034 & 0.022 & 0.6 \\
SGI IRIX 6.5 (MIPS)      & 0.235 & 0.190 & 0.8 \\
SCO UnixWare 7.0 (Intel) & 0.440 & 0.398 & 0.9 \\
HP HP/UX 11.0 (HPPA)     & 0.106 & 0.065 & 0.6 \\
\multicolumn{4}{r}{
    {\bfseries Average: \quad 0.9}} \\
\end{tabular}
}}\\[4pt]
\parbox{\linewidth}{\footnotesize\baselineskip=0pt
\hangafter{\textbf{Table 3:}\quad}%
Summary of \textit{Simple Machine Context Benchmark} (SMCB, \cite{smcb}).
The speed of machine context creation and switching found on
each tested operating system. \textbf{mcsc}: functional
\texttt{makecontext}(3) / \texttt{switchcontext}(3), \textbf{sjlj}: functional
\texttt{sigsetjmp}(3)/\texttt{siglongjmp}(3). \textbf{overhead}: the
overhead of using \textbf{sjlj} instead of \textbf{mcsc}.
}

\subsection{Remaining Issues}

The presented algorithm and source code can be directly used in
practice for implementing a minimal threading system or the concept
of co-routines. Its big advantage is that if the operating system
provides the required standardized primitives, we do not need to know
anything at all about the machine we are running on --- everything just
works. Nevertheless, there remain a few special issues we still have to
discuss.

\mypar{The Waggly longjmp(3) after Return}

\noindent
On some platforms, \texttt{longjmp}(3) may not be called after the
function which called the \texttt{setjmp}(3) returned. When this is
done, the stack frame situation is not guaranteed to be in a clean and
consistent state. But this is exactly the mechanism we use in order to
get rid of the signal handler scope in step 5.

The only alternative would be to leave the signal handler via
\texttt{longjmp}(3), but then we would have another problem,
as experience showed. For instance, \textsc{Robert S. Thau}'s
\textit{Really Simple Threads} (\textit{rsthreads}) \cite{rsthreads}
was ported to several platforms and was used to run an experimental
multithreaded version of the Apache webserver. \textsc{Thau}'s approach
was similar to ours, but differed significantly in the way the signal
handler is left. In particular, in an attempt to avoid the unsafe stack
frame, it used a \texttt{longjmp}(3) call to leave the signal handler,
rather than returning from it. But this approach does not work on some
\textit{SysV}-derived kernels, as we already mentioned.

The problem is that these kernels do not ``believe'' that the code
is out of the signal-handling context, until the signal handler has
returned --- and accordingly, refuse to allow readjustment of the
signal stack until it has. But with the \textit{rsthreads} approach,
the signal handler that created the first thread never returns, and
when \textit{rsthreads} wants to create the second thread, these
kernels refuse to readjust the signal stack, and we are stuck. So with
portability in mind, we decided that it is better to get rid of the
signal handler scope with the straight-forward ``\texttt{return}'' and
instead fight the mentioned (simpler) problem of an unsafe stack frame.

Fortunately, in practice this is not as problematic as it seems, because
evaluations (for \textit{GNU Pth}) on a wide range of current Unix
platforms showed that one can reach a safe stack frame again by just
calling a function. That is the reason why our algorithm enters the
second trampoline function in step 9.

\mypar{The Uncooperative longjmp(3)}

\noindent
Even on operating systems which have working \textit{POSIX}
functions, our approach may theoretically still not work, because
\texttt{longjmp}(3) does not cooperate. For instance, on some
platforms the standard \textit{libc} \texttt{longjmp}(3) branches
to error-handling code if it detects that the caller tries to jump
\textit{up} the stack, \textit{i.e.}, into a stack frame that has
already returned.

This is usually implemented by comparing the current stack pointer
to the one in the \texttt{jmp\_buf} structure. That is why it is
important for our algorithm to return from the signal handler and this
way enter the (different) stack of the parent thread. In practice,
the implementation in \textit{GNU Pth} showed that then one no longer
suffers from those uncooperative \texttt{longjmp}(3) implementations,
but one should keep this point in mind when reaching even more
uncooperative variants on esoteric Unix platforms. If it still occurs,
one can only try to resume the operation by using a possibly existing
platform-specific error handling hook.

\mypar{Garbage at Bottom of Stacks}

\noindent
There is a subtle side-effect of our implementation: there remains some
garbage at the bottom of each thread stack. The reason is that if a
signal is delivered, the operating system pushes some state onto the
stack, which is restored later, when the signal handler returns. But
although we return from the signal handler, we jump in again, and this
time we enter not directly at the bottom of the stack, because of the
\texttt{setjmp}(3) in the trampoline function.

Since the operating system has to capture all CPU registers (including
those that are ordinarily scratch registers or caller-save registers),
there can be a fair amount of memory at the bottom of the established
thread stack. For some systems this can be even up to 1 KB of garbage
\cite{rsthreads}. But except for the additional memory consumption it
does not hurt.

We just have to keep in mind this additional stack consumption when
deciding the stack size (\textit{sk\_size}). A reasonable stack size
usually is between 16 and 32 KB. Less is neither reasonable nor always
allowed (current Unix platforms usually require a stack to be at least
16 KB in size).

\mypar{Stack Overflows}

\noindent
There is a noticeable difference between the initial \texttt{main}()
thread and the explicitly spawned threads: the initial thread runs on
the standard process stack. This stack automatically can grow under
Unix, while the stacks of the spawned threads are fixed in size. So
stack overflows can occur for the spawned threads. This implies that
the parent has to make a reasonable guess of the threads stack space
requirement already at spawning time.

And there is no really portable solution to this problem, because even
if the thread library's scheduler can detect the stack overflow, it
cannot easily resize the stack. The reason is simply that the stack
initialization goes hand in hand with the initialization of the start
routine, as we discussed before. And this start routine has to be a real
C function in order to \textit{call}. But once the thread is running,
there no longer exists such an entry point. So, even if the scheduler
would be able to give the thread a new enlarged stack, there is no
chance to restart the thread on this new stack.

Or more correct, there is no \textit{portable} way to achieve it. As
with the previous problems, there is a non-portable solution. That
is why our implementation did not deal with this issue. Instead in
practice one usually lets the scheduler just detect the stack overflow
and terminate the thread. This is done by using a red zone at the top
of the stack which is marked with a magic value the scheduler checks
between thread dispatching operations.

Resizing solutions are only possible in semi-portable ways. One
approach is to place the thread stacks into a memory mapped area (see
\texttt{mmap}(2)) of the process address space and let the scheduler
catch \texttt{SIGSEGV} signals. When such a signal occurs, because of
a stack overflow in this area, the scheduler explicitly resizes the
memory mapped area. This resizing can be done either by copying the
stack contents into a new larger area which is then re-mapped to the
old address or via an even more elegant way, as the vendor threading
libraries of \textit{Sun Solaris}, \textit{FreeBSD} and \textit{DEC
Tru64} do it: the thread stacks are allocated inside memory mapped areas
which are already initially a few MB in (virtual) size and then one just
relies on the virtual memory system's feature that only the actually
consumed memory space is mapped.

\mypar{Startup Routine Termination}

\noindent
There is a cruel \texttt{abort}(3) call at the end of our
\texttt{mctx\_create\_boot} function. This means, if the startup
routine would return, the process is aborted. That is obviously not
reasonable, so why have we written it this way?

If the thread returns from the startup routine, it should be cleanly
terminated. But it cannot terminate itself (for instance, because
it cannot free its own stack while running on it, \textit{etc.}). So
the termination handling actually is the task of the thread library
scheduler. As a consequence, the thread spawning function of a thread
library should be not directly \texttt{mctx\_create}.

Instead the thread spawning function should use an additional trampoline
function as the higher-level startup routine. And this trampoline
function performs a context switch back into the thread library
scheduler before the lower-level startup routine would return.
The scheduler then can safely remove the thread and its machine
context. That is why the \texttt{abort}(3) call is never reached
in practice (more details can be found in the implementations of
\texttt{pth\_spawn} and \texttt{pth\_exit} in \texttt{pth\_lib.c} of
\textit{GNU Pth} \cite{pth})

\mypar{The sigstack(2) Fallback Situation}

\noindent
Not all platforms provide the standardized \texttt{sigaltstack}(2).
Instead they at least provide the \textit{4.2BSD} ancestor
function \texttt{sigstack}(2). But one cannot trivially replace
\texttt{sigaltstack}(2) by \texttt{sigstack}(2) in this situation,
because in contrast to \texttt{sigaltstack}(2), the old
\texttt{sigstack}(2) does not automatically handle the machine dependent
direction of stack growth.

Instead, the caller has to know the direction and always call
\texttt{sigstack}(2) with the address of the bottom of the stack. So, in
a real-world implementation one first has to determine the direction of
stack growth in order to use \texttt{sigstack}(2) as a replacement for
\texttt{sigaltstack}(2). Fortunately this is easier than it seems on the
first look (for details see the macros \texttt{AC\_CHECK\_STACKGROWTH}
and \texttt{AC\_CHECK\_STACKSETUP} in file \texttt{aclocal.m4} from
\textit{GNU Pth} \cite{pth}). Alternatively if one can afford to waste
memory, one can use an elegant trick: to set up a stack of size $N$,
one allocates a chunk of memory (starting at address $A$) of size
$N\times2$ and then calls \texttt{sigstack}(2) with the parameters
\textit{sk\_addr=}$A+N$ and \textit{sk\_size=}$N$, \textit{i.e.}, one
specifies the middle of the memory chunk as the stack base.

\mypar{The Blind Alley of Brain-Dead Platforms}

\noindent
The world would not be as funny as it is, if really all Unix platforms
would be fair to us. Instead, currently at least one platform exists
which plays unfair: unfortunately, ancient versions of the popular
\textit{GNU/Linux}. Although we will discover that it both provides
\texttt{sigaltstack}(2) and \texttt{sigstack}(2), our approach won't
work on \textit{Linux} kernels prior to version 2.2 and \textit{glibc}
prior to version 2.1.

Why? Because its \textit{libc} provides only stubs of these functions
which always return just \texttt{-1} with \texttt{errno} set
to \texttt{ENOSYS}. So, this definitely means that our nifty
algorithm is useless there, because its central point \textit{is}
\texttt{sigaltstack}(2)/\texttt{sigstack}(2). Nevertheless we do not
need to give up. At least not, if we, for a single brain-dead platform,
accept to break our general goal of not using any platform dependent
code.

So, what can we actually do here? All we have to do, is to fiddle
around a little bit with the machine-dependent \texttt{jmp\_buf}
ingredients (by poking around in \texttt{setjmp.h} or by disassembling
\texttt{longjmp}(3) in the debugger). Usually one just has to do a
\texttt{setjmp}(3) to get an initial state in the \texttt{jmp\_buf}
structure and then manually adjust two of its fields: the program
counter (usually a structure member with ``\texttt{pc}'' in the name)
and the stack pointer (usually a structure member with ``\texttt{sp}''
in the name).

That is all and can be acceptable for a real-world implementation which
really wants to cover mostly \textit{all} platforms -- at least as long
as the special treatment is needed just for one or two platforms. But
one has to keep in mind that it at least breaks one of the initial goals
and has to be treated as a last chance solution.

\mypar{Functions sigsetjmp(3) and siglongjmp(3)}

One certainly wants the \textit{POSIX} thread semantics where a
thread has its own signal mask. As already mentioned, on some
platforms \texttt{setjmp}(3) and \texttt{longjmp}(3) do not provide
this and instead one has to explicitly call \texttt{sigsetjmp}(3)
and \texttt{siglongjmp}(3) instead. There is only one snare:
on some platforms \texttt{sigsetjmp}(3)/\texttt{siglongjmp}(3)
save also information about the alternate signals stack. So
here one has to make sure that although the thread dispatching
later uses \texttt{sigsetjmp}(3)/\texttt{siglongjmp}(3), the
thread creation step in \texttt{mctx\_create} still uses plain
\texttt{setjmp}(3)/\texttt{longjmp}(3) calls for the trampoline
trick. One just has to be careful because the \texttt{jmp\_buf}
and \texttt{sigjmp\_buf} structures cannot be mixed between
calls to the \texttt{sigsetjmp}(3)/\texttt{siglongjmp}(3) and
\texttt{setjmp}(3)/\texttt{longjmp}(3).

\mypar{More Machine Context Ingredients}

\noindent
Finally, for a real-world threading implementation one usually wants to
put more state into the machine context structure \texttt{mctx\_t}.
For instance to fulfill more \textit{POSIX} threading semantics, it is
reasonable to also save and restore the global \texttt{errno} variable.
All this can be easily achieved by extending the \texttt{mctx\_t}
structure with additional fields and by making the \texttt{mctx\_save},
\texttt{mctx\_restore} and \texttt{mctx\_switch} functions to be aware
of them.

\subsection{Related Work}

Beside \textit{GNU Pth} \cite{pth}, there are other multithreading
libraries which use variants of the presented approach for implementing
machine contexts in user-space. Most notably there are \textsc{Robert
S. Thau}'s \textit{Really Simple Threads} (\textit{rsthreads},
\cite{rsthreads}) package which uses \texttt{sigaltstack}(2) in
a very similar way for thread creation, and \textsc{Kota Abe}'s
\textit{Portable Thread Library} (\textit{PTL}, \cite{ptl2}) which uses
a \texttt{sigstack}(2) approach. But because their approaches handle
the signal handler scope differently, they are not able to achieve the
same amount of portability and this way do not work for instance on some
System-V-derived platforms.

\subsection{Summary \& Availability}

We have presented a pragmatic and mostly portable fallback approach
for implementing the machine context for user-space threads, based
entirely on Unix system and ANSI-C language facilities. The approach
was successfully tested in practice on a wide range of Unix flavors by
\textit{GNU Pth} and should also adapt to the remaining Unix platforms
as long as they adhere to the relevant standards.

The \textit{GNU Pth} package is distributed under the
GNU Library General Public License (LGPL 2.1) and freely
available from \textit{http://www.gnu.org/software/pth/} and
\textit{ftp://ftp.gnu.org/gnu/pth/}.

\subsection{Acknowledgements}

I would like to thank \textsc{Robert S. Thau}, \textsc{David Butenhof},
\textsc{Martin Kraemer}, \textsc{Eric Newton} and \textsc{Bruno Haible}
for their comments which helped to write the initial version of this
paper. Additionally, credit has to be given to \textsc{Christopher
Small} and the USENIX reviewers for their invaluable feedback which
allowed this paper to be extended, cleaned up and finally published
at the USENIX Annual Technical Conference 2000. Finally, thanks
go to all users of \textit{GNU Pth} for their feedback on the
implementation, which helped in fine-tuning the presented approach.
\hfill [\textsf{\small rse}]

{\small%\baselineskip=10pt
\begin{thebibliography}{XX}

\bibitem{iso}
        \textit{POSIX 1003.1c Threading},
        IEEE POSIX 1003.1c-1995, ISO/IEC 9945-1:1996

\bibitem{co1}
        \textsc{M.E. Conway}: \textit{Design of a separable transition-diagram
        compiler.}, Comm. ACM 6:7, 1963, p.396-408

\bibitem{co2}
        \textsc{E.W. Dijkstra}: \textit{Co-operating sequential processes}, in
        F. Genuys (Ed.), \textit{Programming Languages}, NATO Advanced Study
        Institute, Academic Press, London, 1965, p.42-112.

\bibitem{nbf}
        \textsc{B. Nichols, D. Buttlar, J.P. Farrel}: \textit{Pthreads
        Programming - A POSIX Standard for Better Multiprocessing}, 
        O'Reilly, 1996; ISBN 1-56592-115-1

\bibitem{lbe}
        \textsc{B. Lewis, D. J. Berg}: 
        \textit{Threads Primer - A Guide To Multithreaded Programming}, 
        Prentice Hall, 1996; ISBN 0-13-443698-9

\bibitem{ndi}
        \textsc{S. J. Norton, M. D. Dipasquale}:
        \textit{Thread Time - The Multithreaded Programming Guide},
        Prentice Hall, 1997; ISBN 0-13-190067-6

\bibitem{drb}
        \textsc{D. R. Butenhof}:
        \textit{Programming with POSIX Threads},
        Addison Wesley, 1997;
        ISBN 0-201-63392-2

\bibitem{spr}
        \textsc{S. Prasad}:
        \textit{Multithreading Programming Techniques},
        McGraw-Hill, 1996;
        ISBN 0-079-12250-7 

\bibitem{kss}
        \textsc{S. Kleinman, B. Smalders, D. Shah}:
        \textit{Programming with Threads},
        Prentice Hall, 1995;
        ISBN 0-131-72389-8

\bibitem{cjn}
        \textsc{C.J. Northrup}:
        \textit{Programming With Unix Threads},
        John Wiley \& Sons, 1996;
        ISBN 0-471-13751-0

\bibitem{bmv}
        \textsc{P. Barton-Davis, D. McNamee, R. Vaswani, E. Lazowska}:
        \textit{Adding Scheduler Activations to Mach 3.0},
        University of Washington, 1992;
        Technical Report 92-08-03

\bibitem{lwp}
        \textsc{D. Stein, D. Shah}: \textit{Implementing Lightwight Threads}, 
        SunSoft Inc., 1992 (published at USENIX'92). 

\bibitem{aup}
        \textsc{W.R.Stevens}: \textit{Advanced Programming in
        the Unix Environment}, Addison-Wesley, 1992; ISBN 0-201-56317-7

\bibitem{posix}
        \textsc{D. Lewine}: \textit{POSIX Programmer's Guide: Writing Portable
        Unix Programs}, O'Reilly \& Associates,Inc., 1994; ISBN 0-937175-73-0

\bibitem{osfaq}
        \textsc{Bryan O'Sullivan}: \textit{Frequently asked questions for
        comp.os.research}, 1995; http://www.serpentine.com/\verb|~|bos/os-faq/,
        ftp://rtfm.mit.edu/pub/usenet/comp.os\-.re\-search/

\bibitem{tfaq1}
        \textsc{Sun Microsystems, Inc}: \textit{Threads Frequently Asked Questions},
        1995, http://www\-.sun\-.com/\-workshop/\-threads/\-faq.html

\bibitem{tfaq2}
        \textsc{Bryan O'Sullivan}: \textit{Frequently asked questions for
        comp.programming.threads}, 1997;
        http://www.serpentine.com/\verb|~|bos/threads-faq/.

\bibitem{tfaq3}
        \textsc{Bil Lewis}: \textit{Frequently asked questions for
        comp.programming.threads}, 1999;
        http://\-www.lambdacs.com/\-newsgroup/\-FAQ\-.html

\bibitem{mdg}
        \textsc{Numeric Quest Inc}: \textit{Multithreading - Definitions and Guidelines};
        1998; http://www.numeric-quest.com/lang/multi-frame.html

\bibitem{sus}
        \textsc{The Open Group}: \textit{The Sing\-le Unix Speci\-fi\-cation, Version 2 - Threads};
        1997; http://www\-.opengroup\-.org/online\-pubs/007908799/xsh/threads.html

\bibitem{str}
        \textsc{Sun Microsystems Inc}: \textit{SMI Thread Resources}; 
        http://www.sun\-.com/\-workshop/\-threads

\bibitem{fsu}
        \textsc{Frank Mueller}: \textit{FSU pthreads}; 1997;
        http://www\-.cs.fsu\-.edu/\-\verb|~|mueller/\-pthreads/

\bibitem{mit}
        \textsc{Chris Provenzano}: \textit{MIT pthreads}; 1993;
        http://www.mit\-.edu/people/\-pro\-ven/\-pthreads.html (old),
        http://www\-.human\-factor\-.com/\-pthreads/\-mit-pthreads.html (updated)

\bibitem{ptl2}
        \textsc{Kota Abe}: \textit{Portable Threading Library} (PTL); 1999;
        http://www\-.media\-.osaka-cu\-.ac\-.jp/\verb|~|k-abe/PTL/

\bibitem{pth}
        \textsc{Ralf S. Engelschall}: \textit{GNU Portable Threads} (Pth); 1999;
        http://www\-.gnu\-.org/\-software/\-pth/, ftp://ftp\-.gnu\-.org/\-gnu/\-pth/

\bibitem{pcthreads}
        \textsc{Michael T. Peterson}: \textit{POSIX and DCE Threads For Linux}
        (PCThreads); 1995; http://members.aa\-.net/\verb|~|mtp/PCthreads.html

\bibitem{rsthreads}
        \textsc{Robert S. Thau}: \textit{Really Simple Threads} (rsthreads); 1996;
        ftp://ftp.ai.mit.edu/pub/rst/

\bibitem{uthread}
        \textsc{John Birrell}: \textit{FreeBSD uthreads}; 1998;
        ftp://ftp.freebsd\-.org/pub/FreeBSD/\-FreeBSD-current/src/lib/libc\_r/uthread/

\bibitem{linuxthreads}
        \textsc{Xavier Leroy}: \textit{The LinuxThreads library}; 1999;
        http://\-pauillac\-.inria\-.fr/\verb|~|xleroy/\-linux\-threads/

\bibitem{aix}
        \textsc{IBM}: \textit{AIX Version 4.3 General Programming Concepts:
        Writing and Debugging Programs; Understanding Threads}; 1998;
        http://www\-.rs6000\-.ibm\-.com/\-doc\_link/\-en\_US/\-a\_doc\_lib/\-aixprggd/\-genprogc/\-understanding\-\_threads\-.htm

\bibitem{nspr}
        \textit{Netscape Portable Runtime} (NSPR);
        http://\-www.mozilla.org/\-docs/\-refList/refNSPR/,
        http://\-lxr.mozilla.org/\-seamonkey/\-source/nsprpub/

\bibitem{smcb}
        \textsc{Ralf S. Engelschall}: \textit{Simple Machine Context Benchmark}; 2000;
        http://\-www.gnu.org\-/software/\-pth\-/smcb.tar.gz

\end{thebibliography}
}

\end{multicols}

\end{document}