<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Thesis | Tobias Nauen</title><link>https://nauen-it.de/publication_types/thesis/</link><atom:link href="https://nauen-it.de/publication_types/thesis/index.xml" rel="self" type="application/rss+xml"/><description>Thesis</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 09 Jun 2022 00:00:00 +0000</lastBuildDate><image><url>https://nauen-it.de/media/icon.svg</url><title>Thesis</title><link>https://nauen-it.de/publication_types/thesis/</link></image><item><title>Stochastic Optimal Control using Signatures</title><link>https://nauen-it.de/publications/master-thesis-signatures/</link><pubDate>Thu, 09 Jun 2022 00:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/master-thesis-signatures/</guid><description>&lt;h1 id="1-introduction"&gt;1 Introduction&lt;/h1&gt;
&lt;p&gt;In this thesis, we consider a stochastic control problem of the form
&lt;/p&gt;
$$
dY_t = \mu_t b(Y_t) dt + \sigma(Y_t) dB_t,
$$&lt;p&gt;
where $\mu_t$ is an $\mathcal F_t = \sigma(B_s | s \leq t)$-measurable, continuous process we have some control over.
An SDE of this form arises when one considers a noisy process over which one only has some control on the drift, i.e. on the average direction of movement.
This control manifests itself in the function $\mu : [0, T] \to \mathbb R$.&lt;/p&gt;
&lt;p&gt;A toy example for a problem of this kind is navigation at sea or in space: at sea, the random part is the combined influence of winds and currents on a boat and $\mu_t$ represents the direction of the rudder; in space, the randomness represents course-altering events like solar winds and $\mu_t$ is the direction or strength of thrust.
A similar optimal control problem with control in the drift was considered in (Diehl, Friz, and Gassiat, 2017), which investigates the value function to find a dual problem.
This was the first paper to apply rough path analysis to stochastic optimal control.&lt;/p&gt;
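The structure of the controlled dynamics can be sketched numerically with a simple Euler–Maruyama scheme. The concrete choices of $b$, $\sigma$, and the control below are hypothetical and serve only to illustrate the setup; the control is allowed to look at the driving noise up to the current time, mirroring the $\mathcal F_t$-measurability requirement above.

```python
import numpy as np

def simulate_controlled_sde(mu, b, sigma, y0, T=1.0, n_steps=500, rng=None):
    """Euler-Maruyama scheme for dY = mu(t, B) b(Y) dt + sigma(Y) dB."""
    rng = np.random.default_rng(0) if rng is None else rng
    dt = T / n_steps
    t, y, B = 0.0, y0, 0.0   # B tracks the driving Brownian motion
    path = [y]
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt))
        y = y + mu(t, B) * b(y) * dt + sigma(y) * dB  # explicit Euler step
        B = B + dB
        t = t + dt
        path.append(y)
    return np.array(path)

# Hypothetical choices: unit drift direction and volatility, and a control
# that steers against the noise accumulated so far.
path = simulate_controlled_sde(
    mu=lambda t, B: -B,
    b=lambda y: 1.0,
    sigma=lambda y: 1.0,
    y0=0.0,
)
```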
&lt;p&gt;We now use the ansatz
\begin{align*}
&amp;amp;\mu_t = \Theta(\hat B|_{[0, t]}) &amp;amp;\Theta \in C( \Lambda_T, \mathbb R) =: \mathcal T,
\end{align*}
with $\Lambda_T$ being the space of stopped rough paths up to time $T$ (see Definition 5.2).&lt;/p&gt;
&lt;p&gt;This gives the SDE
&lt;/p&gt;
$$
dY_t^\mu = \underbrace{\Theta(\hat B|_{[0, t]}) b(Y_t^\mu)}_{= \mu_t} dt + \sigma(Y_t^\mu) dB_t.
$$&lt;p&gt;We can now define a loss function like
\begin{align*}
L(Y^\mu) := \mathbb E(Y_T^\mu)^2 + \mathbb E(\left|Y_T^\mu\right|^2),
\end{align*}
but in general any Lipschitz or Hölder continuous loss $L : C([0, T], \mathbb R^m) \to \mathbb R^+$ is possible.&lt;/p&gt;
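For the simplest instance $\mu \equiv 0$, $\sigma \equiv 1$, $Y_0 = 0$, we have $Y_T = B_T \sim \mathcal N(0, T)$, so the loss above should evaluate to roughly $T$. A quick Monte Carlo sanity check with these hypothetical choices:

```python
import numpy as np

# Hypothetical simplest instance: mu = 0, sigma = 1, Y_0 = 0, T = 1, so
# Y_T = B_T is N(0, T)-distributed and the loss (E Y_T)^2 + E|Y_T|^2
# should come out close to T = 1.
rng = np.random.default_rng(42)
T, n_paths = 1.0, 200_000
Y_T = rng.normal(0.0, np.sqrt(T), size=n_paths)          # terminal values
loss = np.mean(Y_T) ** 2 + np.mean(np.abs(Y_T) ** 2)     # Monte Carlo loss
```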
&lt;p&gt;The question we want to answer is:
What is $\inf_\mu \mathbb E[L(Y^\mu)]$ and what does the corresponding $\mu$ (and $\Theta$) look like?
It is the question of how to act optimally while counteracting random noise.&lt;/p&gt;
&lt;p&gt;First, we need to understand our main problem SDE.
This is a shorthand notation for
\begin{align}\label{eq:main_problem_integral_form}
Y_t - Y_0 = \int_0^t \mu_\tau b(Y_\tau) d\tau + \int_0^t \sigma(Y_\tau) dB_\tau.
\end{align}
The second integral here can be seen as an Itô integral. However, we will view it as the integral against a rough path, a so-called rough integral.
This is a generalization of the Itô map which can also incorporate other types of stochastic integrals, like the Stratonovich integral.
This change of perspective is useful since we want to look at the so-called signatures of some processes, which are defined naturally in the context of rough paths.&lt;/p&gt;
&lt;p&gt;The theory of rough paths was first introduced in the 1990s by Terry Lyons.
It is an elegant framework for path-wise integration with rough driving signals and is therefore suited to a general class of stochastic processes, like Brownian motion or fractional Brownian motion.
In particular, rough integrals are a generalization of Young&amp;rsquo;s theory of integration.
An important aspect of the theory is the continuity of the solution map of rough differential equations, which is not given in the classical case of Itô SDEs, where the solution map is measurable, but not continuous.&lt;/p&gt;
&lt;p&gt;In addition to theoretical advances in SDEs, there were additional tools developed for rough paths, most notably the signature.
The signature $\mathbb X^{&lt; \infty}$ of a path $x: [0, T] \to \mathbb R^n$ is a collection of iterated integrals of all components of the path against each other;
\begin{align*}
\int_{0 \leq t_1 \leq \dots \leq t_k \leq t} dx_{t_1}^{i_1} \cdots dx_{t_k}^{i_k}
\end{align*}
for $k \in \mathbb N$ and $i_1, ..., i_k \in \lbrace 1, ..., n\rbrace$.
Now, the values of the signature have to be defined up to a certain level $k$, which depends on the roughness of the underlying path.
To see why this is necessary, one can consider the differences between the Itô and Stratonovich integrals, both of which are fair definitions of integrals with respect to Brownian motion.
We have
\begin{align*}
\int_{0 \leq t_1 \leq t_2 \leq T} dB_{t_1} dB_{t_2} = \int_0^T B_t dB_t = \frac{B_T^2}{2} - \frac{T}{2},
\end{align*}
but also
\begin{align*}
\int_{0 \leq t_1 \leq t_2 \leq T} \circ dB_{t_1} \circ dB_{t_2} = \int_0^T B_t \circ dB_t = \frac{B_T^2}{2},
\end{align*}
which makes it clear that there is not one single way of defining the signature of a process.
This is why, when working with iterated integrals, one has to fix one way of calculating them.
The theory of rough paths gives a framework for doing exactly that.
The signature of a path is important because the signature at time $t$ determines the whole path up to time $t$ up to so-called tree-like extensions.
In particular, the signature of an augmented rough path, i.e. a path $x_t = (x_t^{(1)}, ..., x_t^{(n)})$ with an additional dimension representing time,
\begin{align*}
\hat x_t = (x_t^{(1)}, \dots, x_t^{(n)}, t) \in \mathbb R^{n + 1},
\end{align*}
determines the path uniquely.
This makes the signature an important tool in machine learning as a model-free way to extract features from time-series data, like audio, speech, or character drawing.
As such it has been used successfully in several machine learning applications including Chinese character recognition or even medical tasks like the recognition of mental disorders.&lt;/p&gt;
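The discrepancy between the two second-level iterated integrals above can be checked numerically: left-point Riemann sums (the Itô choice) and midpoint sums (the Stratonovich choice) over the same Brownian sample path settle on values differing by $T/2$. A minimal sketch:

```python
import numpy as np

# Left-point (Ito) vs midpoint (Stratonovich) Riemann sums for the
# second-level iterated integral of one Brownian sample path.
rng = np.random.default_rng(7)
T, n = 1.0, 100_000
dt = T / n
dB = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(dB)])   # Brownian path on the grid
B_T = B[-1]

ito = np.sum(B[:-1] * dB)                    # evaluates at the left endpoint
strat = np.sum(0.5 * (B[:-1] + B[1:]) * dB)  # evaluates at the midpoint

# ito is close to B_T**2 / 2 - T / 2, while strat equals B_T**2 / 2 exactly
# (the midpoint sums telescope); their difference approximates T / 2.
```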
&lt;p&gt;This injectivity of the signature map also makes it important to us and is why we take the following ansatz for answering the question from above:
&lt;/p&gt;
$$
\Theta(\hat B|_{[0, t]}) = \langle \ell, \hat{B}_{0, t}^{&lt; \infty} \rangle.
$$&lt;p&gt;
Here, $\hat{B}_{0, t}^{&lt; \infty}$ is the signature of the augmented path of Brownian motion.
In this, we will follow the reasoning of (Kalsi, Lyons, and Arribas, 2020) and (Bayer et al., 2022), where it was shown that similar control problems of optimal trading speed and optimal stopping can be solved using just linear maps of the path signature.&lt;/p&gt;
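To make the objects in this ansatz concrete, the truncated signature of a piecewise-linear path can be computed with Chen's relation and then paired with a vector of coefficients. The path, the truncation at level two, and the coefficients below are illustrative choices, not the thesis' implementation; as a byproduct, the computed values satisfy the level-two shuffle identity $S^{(i,j)} + S^{(j,i)} = S^{(i)} S^{(j)}$ proved in Section 4.

```python
import numpy as np

def truncated_signature_level2(path):
    """Signature up to level two of a piecewise-linear path of shape
    (n_points, d). A linear segment with increment D contributes
    (D, outer(D, D) / 2); segments are combined with Chen's relation
    S2_new = S2 + outer(S1, D) + outer(D, D) / 2."""
    d = path.shape[1]
    S1, S2 = np.zeros(d), np.zeros((d, d))
    for i in range(path.shape[0] - 1):
        D = path[i + 1] - path[i]
        S2 = S2 + np.outer(S1, D) + np.outer(D, D) / 2
        S1 = S1 + D
    return S1, S2

# Time-augmented toy path (t, x_t) with x_t = t^2, discretized piecewise
# linearly; the first coordinate is the added time component.
t = np.linspace(0.0, 1.0, 101)
path = np.stack([t, t ** 2], axis=1)
S1, S2 = truncated_signature_level2(path)

# Pairing hypothetical coefficients with the truncated signature
# (levels 0 to 2 flattened together), as in the ansatz for the control.
ell = np.concatenate([[1.0], [0.5, -0.5], 0.1 * np.ones(4)])
features = np.concatenate([[1.0], S1, S2.ravel()])
control = ell @ features
```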
&lt;p&gt;The main result of this thesis will be&lt;/p&gt;
&lt;h3 id="theorem-56"&gt;Theorem 5.6:&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Let $2 \leq p &lt; 3$ and let $\mathbb P$ be a probability measure on $\left( \hat \Omega^p_T, \mathcal B(\hat \Omega^p_T) \right)$.&lt;/em&gt;
&lt;em&gt;Let $Y^\mu$ be the unique solution to&lt;/em&gt;
\begin{align*}
dY_t = \mu_t b(Y_t) dt + \sigma(Y_t) d\mathbf x_t
\end{align*}
&lt;em&gt;started at $\xi \in \mathbb R^m$, with $\mu \in \mathcal T$, $b$ Lipschitz, and $\sigma \in C^3_b(\mathbb R^m, \mathbb R^{m \times n})$.&lt;/em&gt;
&lt;em&gt;Here, $\mathbf x$ is a random geometric $p$-rough path with distribution determined by $\mathbb P$.&lt;/em&gt;
&lt;em&gt;Then it holds that&lt;/em&gt;
\begin{align*}
\inf_{\mu \in \mathcal T} \mathbb E [L(Y^\mu)] = \inf_{\mu \in \mathcal{T}_{sig}} \mathbb E [L(Y^\mu)]
\end{align*}
&lt;em&gt;for a loss function $L : C([0, T], \mathbb R^m) \to \mathbb R$ bounded and $\alpha$-Hölder for some $\alpha &gt; 0$.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here $\mathcal T = C(\Lambda_T, \mathbb R)$ is the set of all continuous functions of the path up to some time $t \in [0, T]$, while $\mathcal{T}_{sig}$ is the set of all functions of the form $\langle \ell, \hat{\mathbb{X}}_{0, t}^{&lt; \infty} \rangle$.
The theorem therefore says that the optimal control problem can be solved by considering just linear maps of the signature of the augmented path.
The statement will then also be extended to Itô integrals, as considered in the problem SDE.&lt;/p&gt;
&lt;p&gt;Using these theorems, we can tackle our question numerically by modeling $\mu_t = \langle \ell, \hat{B}_{0, t}^{\leq k} \rangle$ as a linear map of the truncated signature.
Here, we drop from the infinite-dimensional, full signature $\hat B_{0, t}^{&lt; \infty}$ to the finite-dimensional, truncated signature $\hat B_{0, t}^{\leq k}$ for numerical reasons.
This is a good approximation, as
\begin{align*}
\left|\left|{\hat B_{s, t}^k}\right|\right| \leq C \frac{\omega(s, t)^{\frac k p}}{\left( \frac{k}{p} \right) ! }
\end{align*}
(see Theorem 3.7 in Lyons, Caruana, and Lévy, 2007), i.e. the norms of additional signature levels decrease factorially, like $\frac{1}{\left(\frac{k}{p}\right)!}$.
We can approximate the RDE&amp;rsquo;s solution by using a Milstein scheme (Algorithm 3) on a discrete time-grid
\begin{align*}
0 = t_0 &amp;lt; t_1 &amp;lt; \dots &amp;lt; t_k = T
\end{align*}
and estimating the expected loss $\mathbb E[L(Y^\mu)]$ from many such simulations.
Using the backpropagation algorithm, we can then get arbitrarily close to the optimal solution $\mu_t$.&lt;/p&gt;
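As a toy instance of this numerical approach, consider $b \equiv 1$, $\sigma \equiv 1$ and a control that is a linear map of the level-one features $(1, t, B_t)$ of the time-augmented path; the terminal value is then affine in the coefficients, so plain gradient descent on a Monte Carlo estimate of $\mathbb E[Y_T^2]$ converges. This simplification (no Milstein scheme, no higher signature levels) is for illustration only.

```python
import numpy as np

# Toy pipeline: dY = mu_t dt + dB_t (so b = 1, sigma = 1), with the control
# mu_t = l0 + l1 * t + l2 * B_t given by level-one features of the
# time-augmented path. Y_T is then affine in (l0, l1, l2), so gradient
# descent on the empirical loss mean(Y_T**2) converges for a small step.
rng = np.random.default_rng(1)
T, n_steps, n_paths = 1.0, 200, 5_000
dt = T / n_steps
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(dB, axis=1)
B_T = B[:, -1]

# Per-path features multiplying (l0, l1, l2) in Y_T = B_T + F @ l.
F = np.stack([
    np.full(n_paths, T),           # integral of 1 dt
    np.full(n_paths, T ** 2 / 2),  # integral of t dt
    np.sum(B, axis=1) * dt,        # integral of B_t dt
], axis=1)

l = np.zeros(3)
loss_before = np.mean((B_T + F @ l) ** 2)
for _ in range(500):
    resid = B_T + F @ l
    l = l - 0.5 * (2.0 * F.T @ resid / n_paths)  # exact gradient step
loss_after = np.mean((B_T + F @ l) ** 2)
# loss_before is about 1 (pure noise); the optimized control steers
# against the noise and reduces the loss substantially.
```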
&lt;p&gt;At first, we will introduce the theory of rough paths with its basic facts and definitions and derive rough integrals as a limit of Riemann-like sums in Section 2.
Throughout the thesis, we will work with general rough paths with finite $p$-variation for $p \in [2, 3)$, where Young integration breaks down.
For ease of notation, we will introduce a tensor calculus.
In this section, a general setting of controlled rough paths is also established that deals with all kinds of rough paths, as opposed to the theory of (Friz and Hairer, 2020), which only considers $\alpha$-Hölder paths.
After that, in Section 3, we will deal with rough differential equations (RDEs).
We will prove the existence and uniqueness of solutions in the usual way via Picard iteration, but then extend the theory to RDEs with drift term, where we will only require very mild assumptions on the drift term, such that we can incorporate all RDEs of the form seen in the problem SDE for $b$ Lipschitz and $\mu$ continuous.
We also investigate the stability of RDEs in the drift term.
After having introduced RDEs, we will move on to signatures in Section 4, where we will see the basic definitions, along with a proof of the shuffle identity for geometric rough paths.
This is directly followed by the proof of our main theorem, Theorem 5.6, in Section 5.
Here, we will exploit the notion of stopped rough paths, as well as Lemma 5.5, which has also been used in (Kalsi, Lyons, and Arribas, 2020) and (Bayer et al., 2022), to show the density of signature controls on compact sets of arbitrarily high probability ($&lt; 1$).
We then expand the main theorem to work with Itô-integrals.
After proving the theoretical results, we will go on to state numerical algorithms which can be used for approximation, and which are also implemented, as well as some convergence results for said algorithms in Section 6.
Then, in Section 7, we test our implementation against a Julia reference implementation based on two SDE problems.
We also use our framework to solve an optimal asset allocation problem in the Black-Scholes model.
The SDE of this problem is of a different structure than we had before, and we argue why the same approach we took (approximating $\mu \in \mathcal T$ by $\mu \in \mathcal T_{sig}$) can also be done when one has combined control over the drift and volatility terms.
Here, we use the Markov property of Brownian motion and neural networks to choose the control term from $C(\mathbb R^{m + 1}, \mathbb R)$ instead of as a linear function of the signature of the process.
In the end (Section 8) we will discuss some extensions of the problem, as well as different possibilities of defining the Gubinelli derivative of RDE solutions when dealing with a drift term.&lt;/p&gt;
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;(Diehl, Friz, and Gassiat, 2017) &lt;em&gt;Joscha Diehl, Peter K. Friz, and Paul Gassiat. &amp;lsquo;&amp;lsquo;Stochastic control with rough paths&amp;rsquo;&amp;rsquo;. In: Applied Mathematics &amp;amp; Optimization 75, pp. 285-315, 2017.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;(Friz and Hairer, 2020) &lt;em&gt;Peter K. Friz and Martin Hairer. &amp;lsquo;&amp;lsquo;A Course on Rough Paths: With an Introduction to Regularity Structures&amp;rsquo;&amp;rsquo;. 2nd edition, Springer, 2020.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;(Kalsi, Lyons, and Arribas, 2020) &lt;em&gt;Jasdeep Kalsi, Terry Lyons, and Imanol Perez Arribas. &amp;lsquo;&amp;lsquo;Optimal execution with rough path signatures&amp;rsquo;&amp;rsquo;. In: SIAM Journal on Financial Mathematics 11, pp. 470-493, 2020.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;(Bayer et al., 2022) &lt;em&gt;Christian Bayer et al. &amp;lsquo;&amp;lsquo;Optimal stopping with signatures&amp;rsquo;&amp;rsquo;. In: The Annals of Applied Probability, 2022.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;(Lyons, Caruana, and Lévy, 2007) &lt;em&gt;Terry J. Lyons, Michael J. Caruana, and Thierry Lévy. &amp;lsquo;&amp;lsquo;Differential equations driven by rough paths&amp;rsquo;&amp;rsquo;. In: Lecture Notes in Mathematics, Springer Berlin Heidelberg, 2007.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Explaining Graph Neural Networks</title><link>https://nauen-it.de/publications/bachelor-thesis-gnns/</link><pubDate>Fri, 08 Oct 2021 00:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/bachelor-thesis-gnns/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In this bachelor thesis, we explore and evaluate different methods of explaining Graph Neural Networks (GNNs).
Graph Neural Networks are an emerging class of neural networks that take graphs as their input data.
This is especially useful since graphs are highly flexible and powerful data structures and can therefore express a set of different data points with complex relationships between them.
The motivation for developing graph neural networks comes from the overwhelming success of convolutional neural networks, which can be seen as a special case of GNNs: they operate on pictures by exploiting neighborhood information, which can itself be expressed as a graph.
Today, graph neural networks are used in a wide array of domains, like the prediction of molecular properties in chemistry, drug discovery, diagnosis in medicine, modeling the spread of diseases, recommendation systems, and natural language processing.&lt;/p&gt;
&lt;p&gt;But why would one want to explain these networks?
Methods for explaining neural models are used to perform a wide range of tasks.
The first one is to debug the model and increase performance, as explanation methods can uncover model bias or spurious correlations in the training data.
These findings are then used to clean up or expand the training data, or to adjust the model class, to achieve better performance and generalization.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: center"&gt;Model&lt;/th&gt;
&lt;th style="text-align: center"&gt;Prediction&lt;/th&gt;
&lt;th style="text-align: center"&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: center"&gt;$A$&lt;/td&gt;
&lt;td style="text-align: center"&gt;Positive&lt;/td&gt;
&lt;td style="text-align: center"&gt;Even though the Icelandic &lt;span style="background-color: #FF6A00"&gt;scenery is incredibly stunning&lt;/span&gt;, the story can&amp;rsquo;t keep up, and therefore the overall experience is boring.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: center"&gt;$B$&lt;/td&gt;
&lt;td style="text-align: center"&gt;Negative&lt;/td&gt;
&lt;td style="text-align: center"&gt;Even though the &lt;span style="background-color: #FF6A00"&gt;Icelandic scenery&lt;/span&gt; is incredibly stunning, the story can&amp;rsquo;t keep up, and therefore the &lt;span style="background-color: #FF6A00"&gt;overall experience&lt;/span&gt; is boring.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: center"&gt;$C$&lt;/td&gt;
&lt;td style="text-align: center"&gt;Negative&lt;/td&gt;
&lt;td style="text-align: center"&gt;Even though the Icelandic scenery is incredibly stunning, the &lt;span style="background-color: #FF6A00"&gt;story can&amp;rsquo;t keep up&lt;/span&gt;, and therefore the &lt;span style="background-color: #FF6A00"&gt;overall experience is boring&lt;/span&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For example, if we want to categorize reviews into positive and negative ones, we are also interested in exactly why our model decides that a given review is positive or negative.
Using this information, we can more accurately judge the model&amp;rsquo;s performance by checking if its predictions are correct for the right reasons.
The example explanations in the table above reveal that model $B$ is correct for the wrong reason, while model $C$ is correct for the right reason.
Therefore, model $C$ should be deployed over model $B$ since we can expect $C$ to generalize better to new, unseen data.&lt;/p&gt;
&lt;p&gt;A second application area of explanations is to assess the suitability of a model for use in the real world.
This is especially important in high-stakes environments, such as medicine or law enforcement, where graph neural networks are used.
Therefore, explainability is also part of approval processes, whether by a regulatory authority like the European Union or within some companies.
Another way explanation techniques are useful is by hinting at what to change in the input to receive a different model output.
This is useful, for example, in loan approval, if the client wants to know which factors to change in order to be approved.&lt;/p&gt;
&lt;p&gt;One distinguishes two forms of explanations: global and local ones.
While global explanations are ways of explaining the model as a whole, it is often not feasible to construct them, especially for a model with a lot of parameters, since such a model is simply too complex to be understood in its entirety.
Therefore, we want to focus on local explanations.
These don&amp;rsquo;t attempt to explain the whole model, but just a single decision of the model given a certain input.
The explanations in the table above are local ones.&lt;/p&gt;
&lt;p&gt;This raises the question of how one can explain the decisions of a graph neural network.
To answer this question we will lay out the relevant techniques to generate attribution weights, as well as expand on them.
Attribution weights are ways of explaining neural models by associating a weight with different parts, or tokens, of the model&amp;rsquo;s input.
These tokens could be pixels in a picture, words in a text, or nodes and/or edges in a graph.
The parts with high weight are seen as more important for the model&amp;rsquo;s decision than those with low weights.
If all generated weights are zero or one, the technique is called a hard attribution technique.
These mark relevant parts of the input, as is the case in the table above.
When a range of real numbers is allowed as weights, the attributions are called soft.
We will focus on soft attribution techniques, as these provide a relation of importance on the inputs&amp;rsquo; tokens.&lt;/p&gt;
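The idea of soft attribution weights can be illustrated with the simplest gradient-based technique, gradient times input, applied to a toy linear scorer. The model and its parameters below are a hypothetical stand-in for a trained GNN, which would require automatic differentiation, but the recipe is the same.

```python
import numpy as np

# Gradient-times-input attribution for a toy linear scorer
# f(X) = sum(X @ w). For this model the gradient of f with respect to
# token i is w, so the attribution of token i is its exact contribution
# X[i] @ w; the resulting weights are real-valued, i.e. soft.
rng = np.random.default_rng(0)
n_tokens, d = 5, 8
X = rng.normal(size=(n_tokens, d))   # token features (pixels, words, nodes)
w = rng.normal(size=d)               # hypothetical trained parameters

grad = np.tile(w, (n_tokens, 1))     # d f / d X[i] = w for every token i
weights = (X * grad).sum(axis=1)     # soft attribution weight per token
```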
&lt;p&gt;The second question that arises is, which technique should one use to explain GNNs and how to judge if one technique is better than another?
To answer these questions, we first establish and explain the notion of graph neural networks, as well as different architectures.
Then we introduce some gradient-based attribution techniques and the interpretability-by-design approach of KEdge.
KEdge was introduced by (Rathee et al., 2021).
It works by sampling a mask for the edges of a graph via an approximation of the Bernoulli distribution.
This mask can then be used to generate attribution weights.
In the original paper, this approximation is based on the Kumaraswamy, or Kuma, distribution.
In our third chapter, we define some probability distributions to construct different approximations of the Bernoulli distribution that we can use with KEdge.
We also talk about how to obtain node-level attribution weights from KEdge.
Then we introduce some metrics to measure the performance of the different attribution techniques, and in particular, we extend the notion of fidelity to soft attribution techniques by introducing integrated fidelity.&lt;/p&gt;
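The sampling step behind such a relaxation can be sketched with the Kumaraswamy distribution, which admits a closed-form inverse CDF. The parameterization KEdge actually uses is spelled out in (Rathee et al., 2021); the shape parameters and graph size below are merely illustrative.

```python
import numpy as np

def sample_kuma(a, b, size, rng):
    """Kumaraswamy(a, b) samples on (0, 1) via the inverse CDF of
    F(x) = 1 - (1 - x**a)**b."""
    u = rng.uniform(size=size)
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

# One relaxed "keep" weight per edge of a toy graph; small shape
# parameters push the mass toward 0 and 1, mimicking a Bernoulli draw.
rng = np.random.default_rng(3)
n_edges = 10
mask = sample_kuma(a=0.5, b=0.5, size=n_edges, rng=rng)
```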
&lt;p&gt;In the main part of this thesis, we conduct three experiments.
The first two serve to evaluate and compare the attribution techniques, as well as to see what effects KEdge has on a model&amp;rsquo;s performance.
Here, we compare the accuracy of different models with and without KEdge, to see if there is a noticeable difference, depending on which underlying probability distribution we used.
We also compare the integrated fidelity values of all the attribution techniques we introduced before.
This is done on the node classification datasets Pubmed, Cora, and CiteSeer and the graph classification dataset MUTAG.
In the last experiment, we use our methods on a text dataset of movie reviews, to be able to visualize attribution weights and compare different metrics of evaluating attribution weights.&lt;/p&gt;
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;(Rathee et al., 2021) &lt;em&gt;Mandeep Rathee et al. &amp;lsquo;&amp;lsquo;Learned Sparsification for Interpretable Graph Neural Networks&amp;rsquo;&amp;rsquo;. In: arXiv: 2106.12920, 2021.&lt;/em&gt;&lt;/p&gt;</description></item></channel></rss>