Value Function Iteration with a GPU
Software
The software is hosted on GitHub: VFI Repository Page.

The recommended method of obtaining the software is by cloning the repository with Git. Otherwise, a ZIP archive can be downloaded from the repository page.

The repository contains several different implementations of the problem described in the next section:

• Single-threaded, sequential C++, making use of the Eigen template library for linear algebra computations.
• Single-threaded, sequential Matlab, included for comparison because Matlab is what the majority of economists would use to solve the problem.
• Thrust, using the OpenMP backend to solve the problem in parallel on several CPU cores.
• Thrust, using the CUDA backend to solve the problem in parallel on the GPU.
• CUDA C.

The side-by-side implementations allow users to learn the basics of massively parallel programming by comparing code with familiar benchmarks (Matlab or sequential C++).

Problem Description

Aldrich et al. (2011) solves a neoclassical growth model with value function iteration (VFI) on a GPU. The basic problem is summarized by the optimization problem $$V(K,Z) = \max_{C} \left\{ \frac{C^{1-\gamma}}{1-\gamma} + \beta \mathbb{E}[V(K',Z')|Z]\right\} \label{bellman}$$ subject to \begin{align} K' & = Z K^{\alpha} + (1-\delta) K - C \label{rc} \\ \log(Z') & = \rho \log(Z) + \varepsilon, \text{ where } \varepsilon \sim \mathcal{N}(0,\sigma^2). \label{tfp} \end{align} The state variables in this problem are capital, $K$, and total factor productivity (TFP), $Z$.
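For reference, the deterministic steady state used below to center the capital grid follows from setting $Z \equiv 1$ and imposing $K' = K$ and $C' = C$ in the Euler equation associated with the problem above: $$1 = \beta\left(\alpha K_{ss}^{\alpha-1} + 1 - \delta\right), \qquad \text{so} \qquad K_{ss} = \left(\frac{\alpha\beta}{1-\beta(1-\delta)}\right)^{\frac{1}{1-\alpha}}.$$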

Algorithm

The VFI software solves the problem with the following algorithm.

1. Fix a convergence tolerance $\tau > 0$ and set $\epsilon = \tau+1$.
2. Compute the deterministic steady-state level of capital, $K_{ss}$, and set $\underline{K} = 0.95K_{ss}$ and $\overline{K} = 1.05K_{ss}$. Discretize the state space for capital so that it is confined to a grid of $N_k$ equally-spaced values between $\underline{K}$ and $\overline{K}$. Denote the grid by $\mathcal{K}$.
3. Use the method of Tauchen (1986) to discretize the state space for the log of TFP so that it is confined to a grid of $N_z$ equally-spaced values between $\underline{z}$ and $\overline{z}$ (where $z = \log(Z)$). Denote the grid for TFP levels by $\mathcal{Z}$ and the matrix of transition probabilities $P$, where the probability of transitioning from $Z$ to $Z'$ is expressed as $P(Z,Z')$.
4. Guess initial values of the value function, $V^0$, for each pair of possible values of the state variables, $K$ and $Z$ (i.e. $V^0$ is an $N_k \times N_z$ matrix). In particular, set $V^0$ to be equal to the deterministic steady-state values of the value function.
5. while $\epsilon > \tau$ do
6.    for each $K \in \mathcal{K}$
7.        for each $Z \in \mathcal{Z}$
8.            for each $K' \in \mathcal{K}$
9.                Compute \begin{align} C(K,Z,K') &= Z K^{\alpha} + (1-\delta)K - K'\label{cons} \\ Exp(K,Z,K') &= \sum_{Z' \in \mathcal{Z}} V^0(K',Z')\,P(Z,Z') \label{Exp} \\ \tilde{V}(K,Z,K') &= \frac{C(K,Z,K')^{1-\gamma}}{1-\gamma} + \beta\,Exp(K,Z,K'). \label{continuation} \end{align}
10.            end for
11.            Set \begin{align} V(K,Z) &= \max_{K'} \tilde{V}(K,Z,K'). \label{max} \end{align}
12.        end for
13.    end for
14.    Compute the difference between the updated value function and $V^0$: \begin{align} \epsilon &= ||V - V^0||_{\infty}. \end{align}
15.    Set $V^0 = V$.
16. end while
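The Tauchen (1986) discretization in Step 3 can be sketched in C++ as follows. This is a minimal illustration; the function name, argument list, and the width multiplier `m` are my own choices, not the repository's code:

```cpp
#include <cmath>
#include <vector>

// Standard normal CDF via the error function.
static double norm_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Tauchen (1986) discretization of log(Z') = rho*log(Z) + eps, eps ~ N(0, sigma^2).
// Fills `z` with the grid of log-TFP values and `P` with the row-stochastic
// transition matrix (P[i*Nz + j] = probability of moving from z[i] to z[j]).
// The grid spans +/- m unconditional standard deviations around zero; m is a
// conventional choice (e.g. 3), not something pinned down in the text.
void tauchen(int Nz, double rho, double sigma, double m,
             std::vector<double>& z, std::vector<double>& P) {
    double sigma_z = sigma / std::sqrt(1.0 - rho * rho);  // unconditional std dev
    double zmax = m * sigma_z, zmin = -zmax;
    double w = (zmax - zmin) / (Nz - 1);                  // grid step
    z.resize(Nz);
    for (int i = 0; i < Nz; ++i) z[i] = zmin + i * w;
    P.assign(Nz * Nz, 0.0);
    for (int i = 0; i < Nz; ++i) {
        double mean = rho * z[i];                         // conditional mean of z'
        for (int j = 0; j < Nz; ++j) {
            if (j == 0)
                P[i * Nz + j] = norm_cdf((z[0] - mean + w / 2) / sigma);
            else if (j == Nz - 1)
                P[i * Nz + j] = 1.0 - norm_cdf((z[Nz - 1] - mean - w / 2) / sigma);
            else
                P[i * Nz + j] = norm_cdf((z[j] - mean + w / 2) / sigma)
                              - norm_cdf((z[j] - mean - w / 2) / sigma);
        }
    }
}
```

Exponentiating the grid values gives the TFP levels $\mathcal{Z}$; each row of $P$ sums to one by construction, since the interior probabilities telescope between the two boundary terms.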

A basic implementation would involve computing the quantities in Equations \eqref{cons} - \eqref{continuation} in a serial fashion for each value of $K' \in \mathcal{K}$ in the loop at Step 8. If either $\mathcal{K}$ or $\mathcal{Z}$ is a very dense grid, Step 8 may involve many thousands of serial calculations for each of the values in the loops at Steps 6 and 7.
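A basic serial implementation of the work inside Steps 6–8 might look like the following sketch, using naive grid search for the maximization. Names and memory layout are illustrative, not the repository's actual code:

```cpp
#include <cmath>
#include <limits>
#include <vector>

// One Bellman update V(K,Z) for a single (iK, iZ) pair by grid search over K'.
// K and Zlev are the capital and TFP-level grids, P the Nz x Nz transition
// matrix (row-major), and V0 the current Nk x Nz value function (row-major).
double bellman_update(int iK, int iZ,
                      const std::vector<double>& K, const std::vector<double>& Zlev,
                      const std::vector<double>& P, const std::vector<double>& V0,
                      double alpha, double beta, double gamma, double delta) {
    const int Nk = (int)K.size(), Nz = (int)Zlev.size();
    // Resources available this period: Z*K^alpha + (1-delta)*K.
    const double Y = Zlev[iZ] * std::pow(K[iK], alpha) + (1.0 - delta) * K[iK];
    double Vmax = -std::numeric_limits<double>::infinity();
    for (int iKp = 0; iKp < Nk; ++iKp) {            // the loop of Step 8
        double C = Y - K[iKp];                      // Equation (cons)
        if (C <= 0.0) continue;                     // infeasible choice of K'
        double Exp = 0.0;                           // Equation (Exp)
        for (int iZp = 0; iZp < Nz; ++iZp)
            Exp += V0[iKp * Nz + iZp] * P[iZ * Nz + iZp];
        // Equation (continuation): utility plus discounted expectation.
        double Vtil = std::pow(C, 1.0 - gamma) / (1.0 - gamma) + beta * Exp;
        if (Vtil > Vmax) Vmax = Vtil;               // running max of Step 11
    }
    return Vmax;
}
```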

Alternatively, with many parallel processors available, the loops at Steps 6 and 7 can be eliminated: the sequence of instructions nested in Step 8 is assigned to an individual processor and computed in parallel for each pair $(K,Z)$. Parallelism can be exploited in this problem because the computations nested within Steps 6 and 7 depend only on the current pair $(K,Z)$, and not on other values in $\mathcal{K}$ and $\mathcal{Z}$.
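One way to sketch this decomposition is to collapse the two outer loops into a single flattened loop over $(K,Z)$ pairs. The OpenMP pragma below distributes the pairs across CPU cores when OpenMP is enabled (and is ignored otherwise); the same mapping assigns one GPU thread per pair in the CUDA versions. Again, names are illustrative rather than the repository's:

```cpp
#include <cmath>
#include <limits>
#include <vector>

// One full Bellman sweep with Steps 6-7 collapsed into a single flattened loop,
// so each (K,Z) pair is an independent unit of work. Inputs are as in the
// serial sketch; V receives the updated Nk x Nz value function (row-major).
void bellman_sweep(const std::vector<double>& K, const std::vector<double>& Zlev,
                   const std::vector<double>& P, const std::vector<double>& V0,
                   std::vector<double>& V,
                   double alpha, double beta, double gamma, double delta) {
    const int Nk = (int)K.size(), Nz = (int)Zlev.size();
    V.assign(Nk * Nz, 0.0);
    #pragma omp parallel for
    for (int idx = 0; idx < Nk * Nz; ++idx) {       // one iteration per (K,Z) pair
        const int iK = idx / Nz, iZ = idx % Nz;
        const double Y = Zlev[iZ] * std::pow(K[iK], alpha) + (1.0 - delta) * K[iK];
        double Vmax = -std::numeric_limits<double>::infinity();
        for (int iKp = 0; iKp < Nk; ++iKp) {        // the serial work of Step 8
            double C = Y - K[iKp];
            if (C <= 0.0) continue;
            double Exp = 0.0;
            for (int iZp = 0; iZp < Nz; ++iZp)
                Exp += V0[iKp * Nz + iZp] * P[iZ * Nz + iZp];
            double Vtil = std::pow(C, 1.0 - gamma) / (1.0 - gamma) + beta * Exp;
            if (Vtil > Vmax) Vmax = Vtil;
        }
        V[idx] = Vmax;                              // writes never collide across idx
    }
}
```

Because each iteration writes only to its own `V[idx]` and reads only shared, read-only inputs, no synchronization is needed beyond the implicit barrier at the end of the loop.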

Results

This section reports timing results for solving the model on a 4U rackmount server with a single quad-core Intel Xeon 2.4 GHz CPU and two NVIDIA Tesla C2075 GPUs. Only one of the GPUs was used for the Thrust/CUDA and CUDA C implementations, while all four CPU cores were used for the Thrust/OpenMP implementation. The model was solved using the parameter values below.

\begin{array}{cccccc} \hline \beta & \gamma & \alpha & \delta & \rho & \sigma \\ \hline 0.984 & 2 & 0.35 & 0.01 & 0.95 & 0.005 \\ \hline \end{array}

The following two tables report times for the implementations above. $N_z$ was fixed at 4, and $N_k$ was increased to assess the relative performance of the GPU as the grid of state-space values becomes denser. All results are computed in double precision, and ratios are relative to the C++ times. The first table reports results using a binary search algorithm for the maximization, while the second reports results for a naive grid search. The grid search method was able to exploit policy function iteration, performing the maximization on the right-hand side of Equation \eqref{bellman} only every 20 iterations of the value function. This was not possible for binary search, since policy function iteration does not preserve the concavity of the value function, which is crucial for binary search. For more details regarding the algorithms and solutions, see Aldrich et al. (2011).

\begin{array}{l|cccccccccc} \hline N_k & 128 & 256 & 512 & 1,024 & 2,048 & 4,096 & 8,192 & 16,384 & 32,768 & 65,536 \\ \hline \text{C++} & 0.547 & 1.35 & 3.41 & 9.05 & 25.73 & 84.58 & 297.32 & 1,114.95 & 4,653.81 & 19,421.90 \\ \hline \text{Matlab} & 48.97 & 98.36 & 203.18 & 426.57 & 920.10 & 2,077.24 & 5,020.26 & 16,129.32 & 45,070.47 & 140,341.10 \\ \text{Matlab Ratio} & 89.55 & 72.67 & 59.50 & 47.13 & 35.76 & 24.56 & 16.89 & 14.47 & 9.68 & 7.23 \\ \hline \text{Thrust/OpenMP} & 0.118 & 0.241 & 0.519 & 1.10 & 2.37 & 5.04 & 10.81 & 23.10 & 49.53 & 106.66 \\ \text{Thrust/OpenMP Ratio} & 0.217 & 0.178 & 0.152 & 0.121 & 0.0920 & 0.0595 & 0.0364 & 0.0207 & 0.0106 & 0.00549 \\ \hline \text{Thrust/CUDA Start} & 6.75 & 6.70 & 6.74 & 6.64 & 6.72 & 6.62 & 6.72 & 6.75 & 6.66 & 6.75 \\ \text{Thrust/CUDA Solution} & 0.240 & 0.273 & 0.322 & 0.392 & 0.711 & 1.20 & 2.13 & 4.26 & 8.52 & 17.21 \\ \text{Thrust/CUDA Total} & 6.99 & 6.97 & 7.06 & 7.03 & 7.43 & 7.82 & 8.84 & 11.01 & 15.19 & 23.96 \\ \text{Thrust/CUDA Solution Ratio} & 0.439 & 0.202 & 0.0943 & 0.0433 & 0.0276 & 0.0141 & 0.00715 & 0.00382 & 0.00183 & 0.000886 \\ \text{Thrust/CUDA Total Ratio} & 12.78 & 5.15 & 2.07 & 0.777 & 0.289 & 0.0924 & 0.0297 & 0.00987 & 0.00326 & 0.00123 \\ \hline \text{CUDA C Start} & 6.97 & 6.95 & 6.90 & 6.97 & 6.94 & 6.96 & 6.96 & 6.95 & 6.94 & 6.92 \\ \text{CUDA C Solution} & 0.144 & 0.156 & 0.258 & 0.416 & 0.731 & 1.51 & 3.09 & 6.48 & 13.82 & 29.27 \\ \text{CUDA C Total} & 7.12 & 7.11 & 7.16 & 7.38 & 7.67 & 8.47 & 10.05 & 13.42 & 20.76 & 36.19 \\ \text{CUDA C Solution Ratio} & 0.263 & 0.115 & 0.076 & 0.0460 & 0.0284 & 0.0178 & 0.0104 & 0.00581 & 0.00297 & 0.00151 \\ \text{CUDA C Total Ratio} & 13.01 & 5.25 & 2.10 & 0.816 & 0.298 & 0.100 & 0.0338 & 0.0120 & 0.00446 & 0.00186 \\ \hline \end{array}

\begin{array}{l|cccccccccc} \hline N_k & 128 & 256 & 512 & 1,024 & 2,048 & 4,096 & 8,192 & 16,384 & 32,768 & 65,536 \\ \hline \text{C++} & 0.137 & 0.332 & 0.900 & 2.73 & 9.17 & 33.40 & 126.72 & 496.24 & 1,977.93 & 7,892.46 \\ \hline \text{Matlab} & 11.18 & 22.43 & 45.23 & 93.67 & 204.50 & 476.54 & 1,212.05 & 3,587.33 & 11,189.66 & 37,720.38 \\ \text{Matlab Ratio} & 81.42 & 67.59 & 50.26 & 34.28 & 22.29 & 14.27 & 9.56 & 7.23 & 5.66 & 4.78 \\ \hline \text{Thrust/OpenMP} & 0.0607 & 0.156 & 0.490 & 1.70 & 6.37 & 24.64 & 96.95 & 383.30 & 1,524.40 & 6,081.77 \\ \text{Thrust/OpenMP Ratio} & 0.442 & 0.471 & 0.544 & 0.621 & 0.694 & 0.738 & 0.765 & 0.772 & 0.771 & 0.771 \\ \hline \text{Thrust/CUDA Start} & 6.73 & 6.73 & 6.69 & 6.66 & 6.72 & 6.78 & 6.77 & 6.75 & 6.74 & 6.78 \\ \text{Thrust/CUDA Solution} & 0.173 & 0.234 & 0.348 & 0.628 & 1.53 & 4.23 & 13.71 & 53.33 & 191.99 & 774.87 \\ \text{Thrust/CUDA Total} & 6.90 & 6.97 & 7.04 & 7.29 & 8.26 & 11.02 & 20.48 & 60.08 & 198.73 & 781.65 \\ \text{Thrust/CUDA Solution Ratio} & 1.26 & 0.704 & 0.387 & 0.230 & 0.167 & 0.127 & 0.108 & 0.107 & 0.0971 & 0.0982 \\ \text{Thrust/CUDA Total Ratio} & 50.26 & 20.99 & 7.83 & 2.67 & 0.900 & 0.330 & 0.162 & 0.121 & 0.100 & 0.0990 \\ \hline \text{CUDA C Start} & 6.94 & 6.93 & 6.95 & 6.96 & 6.81 & 6.90 & 6.92 & 6.85 & 6.97 & 6.99 \\ \text{CUDA C Solution} & 0.103 & 0.134 & 0.257 & 0.664 & 2.12 & 7.81 & 29.76 & 116.80 & 462.31 & 1,844.37 \\ \text{CUDA C Total} & 7.04 & 7.07 & 7.20 & 7.62 & 8.93 & 14.71 & 36.68 & 123.65 & 469.28 & 1,851.36 \\ \text{CUDA C Solution Ratio} & 0.750 & 0.404 & 0.286 & 0.243 & 0.231 & 0.234 & 0.235 & 0.235 & 0.234 & 0.234 \\ \text{CUDA C Total Ratio} & 51.26 & 21.30 & 8.01 & 2.79 & 0.974 & 0.440 & 0.289 & 0.249 & 0.237 & 0.235 \\ \hline \end{array}
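The policy function iteration used with grid search above can be sketched as follows: after a full maximization step records the argmax policy $G(K,Z)$, the value function is updated for several iterations with that policy held fixed, which replaces the $O(N_k)$ search of Step 8 with a single evaluation. This is an illustrative sketch, not the repository's code:

```cpp
#include <cmath>
#include <vector>

// Apply a fixed policy iKp = G[iK*Nz + iZ] for `reps` value-function updates.
// K and Zlev are the capital and TFP-level grids, P the Nz x Nz transition
// matrix (row-major), and V the Nk x Nz value function updated in place.
// Assumes the policy G yields strictly positive consumption everywhere.
void apply_fixed_policy(const std::vector<int>& G,
                        const std::vector<double>& K, const std::vector<double>& Zlev,
                        const std::vector<double>& P, std::vector<double>& V,
                        double alpha, double beta, double gamma, double delta,
                        int reps) {
    const int Nk = (int)K.size(), Nz = (int)Zlev.size();
    std::vector<double> V0(Nk * Nz);
    for (int r = 0; r < reps; ++r) {
        V0 = V;                                     // previous iterate
        for (int iK = 0; iK < Nk; ++iK) {
            for (int iZ = 0; iZ < Nz; ++iZ) {
                int iKp = G[iK * Nz + iZ];          // fixed policy choice of K'
                double C = Zlev[iZ] * std::pow(K[iK], alpha)
                         + (1.0 - delta) * K[iK] - K[iKp];
                double Exp = 0.0;
                for (int iZp = 0; iZp < Nz; ++iZp)
                    Exp += V0[iKp * Nz + iZp] * P[iZ * Nz + iZp];
                V[iK * Nz + iZ] = std::pow(C, 1.0 - gamma) / (1.0 - gamma)
                                + beta * Exp;
            }
        }
    }
}
```

In the timing runs above, one maximization is followed by 19 such fixed-policy updates, so most iterations avoid the grid search entirely.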