$\DeclareMathOperator{\arccosh}{arccosh} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\Exp}{Exp} \newcommand{\geo}{\gamma_{\overset{\frown}{#1,#2}}} \newcommand{\geoS}{\gamma} \newcommand{\geoD}{\gamma_} \newcommand{\geoL}{\gamma(#2; #1)} \newcommand{\gradM}{\nabla_{\M}} \newcommand{\gradMComp}{\nabla_{\M,#1}} \newcommand{\Grid}{\mathcal G} \DeclareMathOperator{\Log}{Log} \newcommand{\M}{\mathcal M} \newcommand{\N}{\mathcal N} \newcommand{\mat}{\mathbf{#1}} \DeclareMathOperator{\prox}{prox} \newcommand{\PT}{\mathrm{PT}_{#1\to#2}#3} \newcommand{\R}{\mathbb R} \newcommand{\SPD}{\mathcal{P}(#1)} \DeclareMathOperator{\Tr}{Tr} \newcommand{\tT}{\mathrm{T}} \newcommand{\vect}{\mathbf{#1}}$

# Computing the median with cyclic proximal point

This tutorial introduces the cyclic proximal point algorithm using again the example of the Riemannian median that can be introduced analogously to the Riemannian center of mass.

Similar to the Riemannian center of mass and the variational formulation introduced by Karcher (missing reference) the Riemannian median of given data points $x_i\in\M$, $i=1,\ldots,m$ can be defined by

Note that even the median for real numbers is usually defined to be unique, but the variational formulation is not.

Furthermore, the function is $F$ continuous but not differentiable in any of the data points. Though the gradient can be derived, see e.g.  for any $x\neq x_i$ as

where we only obtain a subdifferential at the points $x=x_i$. Hence we can not use a gradient descent algorithm can not be applied. We have to use a subgradient descent algorithm, which was generalized to manifolds in .

We will use another approach that was introduced by Baçâk  for Hadamard spaces. It is based on the proximal point algoritm introduced on Riemannian manifolds by Ferreira and Oliveira  as well as the real-valued analogue of the cyclic proximal point algorithm by Bertsekas,see for example .

The main idea is base on two ideas: the proximal map and the cyclic proximal point algorithm.

### The proximal map

The proximal map of a function $f\colon\M\to\mathbb R$ and a positive real value $\lambda > 0$ is defined in  as

The intuition is, that the proximal map at $x$ is a tradeoff (weighted by the parameter $\lambda$) between minimizing $f$ and not being too far away from $x$. Furthermore, if $x^*$ is a minimizer of $f$ then the proximal map has a fix point, i.e. $\prox_{\lambda f}(x^*) = x^*$, since then the first term is zero and the second is minimal.

Furthermore on $\M=\mathbb R^n$ the proximal map is known to be nonexpoansive and hence iterates

converge to the minmizer for a convex, lsc. function $f$.

Since we would like to minimize our objective $F$ and the minimizer is not known in closed form, it will not help to set $f=F$, since solving the minimization problem for the proximal map involves $F$ itself for this case.

### The cyclic proximal point algorihtm

For a function

that consists of summands where the proximal maps $\prox_{\lambda f_i}(x)$ can be evaluated “easily” or efficiently the cyclic proximal point algorithm consists of cycling through the proximal maps. To emphasize the cyclic order we use fractional iterate numbers, i.e, we write

where the inner cycle $i=1,\ldots,m$ is performed before inceasing $k=1,\ldots$.

This algorithms converges to a minimizer $x^*$ of $f$ in Euclidean space  and on Hadamard manifolds (missing reference) if the sequence $\lambda_k$ is square summable but not summable, i.e. we have

Back to our problem of finding a minimizer of $F$, the Riemannian center of mass: each summand is of the form $F_i(x) = d_\M^2(x,x_i)$, for some fixed (given) data $x_i$, $i=1,\ldots,m$. The corresponding proximal maps can be computed in closed form

where we write equality though the minimizer might not be unique. Namely when the geodesic connecting $x$ and $x_i$ is not unique, the proximal map consists of all points on such geodesics at $t=\tfrac{\lambda}{1+\lambda}$. For the following implementation however, the evaluation of the geodesic depends on the single valued choice of the M.log of the manifold and is hence one of the geodesics in these cases, which occur only in very few occasions if the data is not local enough.

### Implementation

With these preparations we can return to our initial problem and its implementation We start by initializing the MVIRT in the main directory. We further create a manifold, the two-dimensional sphere $\mathbb S^2$ using the code

1
2
3
4
5
6
7
initMVIRT();
M = Sn(2);
pts = 30;
x = ArtificialS2SignalLemniscate(pts,'a',1);
figure(1);
plotS2(x);
title('Given data on S2');


The last two lines perform a plot, which looks like the following figure. A set of points on the sphere $\mathbb S^2$.

The cyclic proximal point algorithm requires a set of anonymous functions representing the proximal maps. Since anonymous functions can not be stored in an array, we require a cell array. While this can be cone using cellfun, we use a simple for-loop for readability. Similar to the gradient descent, even with the same interface/parameters in the anonymous function, we need a stopping criterion. We choose either 100 iterations or if two iterates are nearer than $% $. The lines read Last we define the objective function that we use for debug and later on in the reference implementaion.

1
2
3
4
5
6
7
8
9
F = @(p) 1/pts*sum( M.dist(repmat(p,[ones(size(M.ItemSize)),pts]),x).^2 );
proximalMaps = cell(pts,1);
for i=1:pts
proximalMaps{i} = @(y,lambda) proxDistance(M,x(M.allDims{:},i),y,lambda);
end
stoppingCriterion = stopCritMaxIterEpsilonCreator(M,2000,10^(-9));
[p1,recData1] = cyclicProximalPoint(M,x(M.allDims{:},1),proximalMaps,...
stoppingCriterion,'lambda',pi/64,...
'Record',@(x,xold,iter) [F(x);M.dist(x,xold)]);


The last three lines call the iteration of the cyclic proximal point algorithm, where further in every iteration we record first the functional value $F(x^{(k)})$ and the iteration for reference. By specifying just an initial value lambda as $\frac{\pi}{64}$, the standard iterate is set to $\lambda_k = \tfrac{\pi}{16k}$. One could also define a non-standard iterate using lambdaIterate, see cyclicProximalPoint.m

The number of iterations (reaching 2000 in the above example) can be reduced using a pseudo-random order

1
2
3
[p2,recData2] = cyclicProximalPoint(M,x(M.allDims{:},1),proximalMaps,...
stoppingCriterion,'Order','permute','lambda',pi/64,...
'Record',@(x,xold,iter) [F(x);M.dist(x,xold)]);


where in each cycle the order of the proximal maps is randomly permuted. Furthermore, choosing for each iteration one proximal map (instead of the 30 herein) can be done setting Order to Random. However this flexibility comes at the disadvantage, that there are 30 anonymous functions called per iteration.

If we instead would like to use a subgradient algorithm similar to the last tutorial, the (sub)gradient is a little involved, since at any $x=x_i$ we only have a subdiffential. Since $0\in T_{x}\M$ is within this subdifferential, we set

1
2
3
4
5
6
mD = length(M.ItemSize);
-1./(shiftdim(... %the following line avoids division by zero
pts*M.dist(repmat(p,[ones(1,mD),pts]),x) + ...
double(M.dist(repmat(p,[ones(mD,1),pts]),x)==0),-mD)).*...
M.log(repmat(p,[ones(mD,1),pts]),x),mD+1);


Note that the lines 3 to 5 are an inline way writing $\frac{1}{d_{\M}(x,x_i)}$ if this value is defined and $1$ else (since then the following M.log yields a zero vector anyways). Then calling the subgradient descent reads

1
2
3
4
@(x,descentDir,iter,s) 1/iter,...
stoppingCriterion,...
'Record',@(x,xold,iter) [F(x);M.dist(x,xold)]);


All three algorithms yield the following median, where the CPP algorithms are slower in time but the second on needs fewer iterations than gradient descent obtaining roughly the same accuracy. A set of points on the sphere $\mathbb S^2$ and their median.

The images are again created using exportSpherePCT2Asy({x,p1},{},{},{c1,c2}, 'File', 'InitialPoints.asy', 'Camera', [.3,.3,1]); and Asymptote for rendering.

Switching to another manifold is again quite straight forward. With

1
2
3
4
M = SymPosDef(3);
data = ArtificialSPDImage(pts,1.5); x = data(:,:,:,18);
figure(2);
plotSPD(x,'EllipsoidPoints',30); %fine ellipses


we define another manifold, the symmetric positive definite matrices with affine metric, and generate set of data namely A set of points on the manifold $\mathcal P(3)$ of symmetric positive definite matrices visualized as ellipsoids.

This image is created by calling exportSPD2Asymptote(x,'File','myExport.asy']); and using Asymptote for rendering. 

Similar to the definitions above, we mainly have to store another manifold within the anonymous functions, but the definitions read the same

1
2
3
4
5
6
F = @(p) 1/pts*sum( M.dist(repmat(p,[ones(size(M.ItemSize)),pts]),x) );
proximalMaps = cell(pts,1);
for i=1:pts
proximalMaps{i} = @(y,lambda) proxDistance(M,y,x(M.allDims{:},i),lambda);
end
stoppingCriterion = stopCritMaxIterEpsilonCreator(M,2000,10^(-8));


And also calling the median reads similar

1
2
3
4
[p4,recData4] = cyclicProximalPoint(M,x(M.allDims{:},1),proximalMaps,...
stoppingCriterion,'lambdaIterate',@(iter) pi/8/iter,...
'Order','permute',...
'Record',@(x,xold,iter) [F(x);M.dist(x,xold)]);


yields the mean shown in the following figure. The corresponding median of the $\mathcal P(3)$-valued vector above.

The method presented in this tutorial is also directly available calling M.median on your favorite manifold M` that implements least the abstract functions. Note that for this function we perform on a data of $m\times n$ items $m$ medians of $n$ points each in parallel.