Monetary Value Measure as a Contravariant Functor (Category Theory) Part 1

[Figure: category theory concepts]

Let us begin with a very brief review of dynamic risk measure theory and category theory.

1.0 A one-period monetary risk measure is a function ρ : LP(Ω, F, P) → ℜ satisfying the following axioms

1.1 Cash Invariance (∀X) (∀ a ∈ ℜ) ρ (X + a) = ρ (X) – a

1.2 Monotonicity (∀ X) (∀ Y) X ≤ Y ⇒ ρ (X) ≥ ρ (Y)

1.3 Normalization ρ (0) = 0

where LP(Ω, F, P) is the space of equivalence classes of ℜ-valued random variables that are bounded in the || . ||P norm.

2.0 For a σ-field U ⊂ F, L(U) := L(Ω, U, P | U) is the space of all equivalence classes of bounded ℜ-valued random variables, equipped with the usual sup norm.

3.0 Let F = {Ft}t∈[0,T] be a filtration. A dynamic monetary value measure is a collection of functions Ψ = {Ψt : L(FT) → L(Ft)}t∈[0,T] satisfying

3.1 Cash Invariance (∀ X ∈ L(FT))(∀ Z ∈ L(Ft)) Ψt (X + Z) = Ψt (X) + Z

3.2 Monotonicity (∀ X ∈ L(FT))(∀ Y ∈ L(FT)) X ≤ Y ⇒ Ψt (X) ≤ Ψt (Y)

3.3 Normalization Ψt (0) = 0

Note that the direction of the monotonicity inequality and the sign in the cash-invariance axiom differ between Definitions 1.0-1.3 and 3.0-3.3, because we are now dealing with monetary value measures instead of monetary risk measures.

Since dynamic monetary value measures treat multi-period situations, we may require some extra axioms to regulate their behavior across the time dimension.

Axiom 4.0 Dynamic Programming Principle: For 0 ≤ s ≤ t ≤ T, (∀ X ∈ L(FT)) Ψs (X) = Ψs (Ψt (X))

Axiom 5.0 Time Consistency: For 0 ≤ s ≤ t ≤ T, (∀ X, ∀ Y ∈  L(FT)) Ψt (X) ≤ Ψt (Y) ⇒ Ψs (X) ≤ Ψs (Y)
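A standard example satisfying 3.1-3.3, as well as Axioms 4.0 and 5.0 (via the tower property), is the conditional expectation Ψt(X) = E[X | Ft]. Below is a minimal numpy sketch on a finite probability space; representing σ-fields by the partitions that generate them, and the helper names, are illustrative choices of this sketch rather than notation from the text.

```python
import numpy as np

# Finite Omega = {0,...,5} with uniform P; a sigma-field is represented by the
# partition of Omega that generates it (a list of blocks of outcome indices).
P = np.full(6, 1 / 6)

F_T = [[0], [1], [2], [3], [4], [5]]     # finest partition, generates F_T
F_t = [[0, 1], [2, 3], [4, 5]]           # coarser partition, generates F_t
F_s = [[0, 1, 2, 3], [4, 5]]             # even coarser, generates F_s (s <= t <= T)

def cond_exp(X, partition):
    """Psi(X) := E[X | sigma-field generated by `partition`]."""
    Y = np.empty_like(X, dtype=float)
    for block in map(np.asarray, partition):
        Y[block] = np.dot(P[block], X[block]) / P[block].sum()
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=6)
Z = cond_exp(rng.normal(size=6), F_t)    # an F_t-measurable random variable

# 3.1 cash invariance, 3.2 monotonicity, 3.3 normalization
assert np.allclose(cond_exp(X + Z, F_t), cond_exp(X, F_t) + Z)
Y = X + np.abs(rng.normal(size=6))       # X <= Y pointwise
assert np.all(cond_exp(X, F_t) <= cond_exp(Y, F_t) + 1e-12)
assert np.allclose(cond_exp(np.zeros(6), F_t), 0.0)

# Axiom 4.0 (dynamic programming) follows from the tower property,
# and Axiom 5.0 (time consistency) from monotonicity plus the tower property.
assert np.allclose(cond_exp(X, F_s), cond_exp(cond_exp(X, F_t), F_s))
```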

Category theory is an area of study in mathematics that examines in an abstract way the properties of maps (called morphisms or arrows) satisfying some basic conditions.

A category C consists of a collection OC of objects and a collection MC of arrows or morphisms such that

6.0 There are two functions dom : MC → OC and cod : MC → OC

When dom(f) = A and cod (f) = B, we write f : A → B

We define a so-called hom-set of given objects A and B by

HomC(A, B) := {f ∈ MC | f : A → B}

6.1 For f : A → B & g : B → C, there is an arrow g o f : A → C called the composition of g and f. 

6.2 Every object A is associated with an identity arrow 1A : A → A, such that f o 1A = f and 1A o g = g whenever dom(f) = A and cod(g) = A

7.0 Functors: Let C and D be two categories. A functor F: C → D consists of two functions

FO : OC → OD and FM : MC → MD

satisfying

7.1 f : A → B ⇒ F(f) : F(A) → F(B)

7.2 F(g o f) = F(g) o F(f)

7.3 F(1A) = 1F(A)

8.0 Contravariant Functors: A functor F : Cop → D is called a contravariant functor if 7.1 and 7.2 are replaced by

8.1 f : A → B ⇒ F(f) : F(B) → F(A)

8.2 F(g o f) = F(f) o F(g)

9.0 Natural Transformations: Let F : C → D and G : C → D be two functors. A natural transformation α : F → G consists of a family of arrows 〈αC : F(C) → G(C) | C ∈ OC〉 making the following diagram commute for every arrow f : C1 → C2 in C

F(C1) —αC1→ G(C1)

F(f) ↓                ↓ G(f)

F(C2) —αC2→ G(C2)

that is, G(f) o αC1 = αC2 o F(f).

10.0 Functor Categories: Let C and D be categories. A functor category DC is the category such that

10.1 ODC := collection of all functors from C to D

10.2 HomDC (F, G) := collection of all natural transformations from F to G.

Now, to define monetary value measures in the language of category theory, we introduce a simple category that is actually a partially ordered set derived from the σ-field F.

11.0 Let χ := χ(F) be the set of all sub-σ-fields of F. It becomes a poset with the set-inclusion relation, and hence a category whose hom-set Homχ(V, U) for U, V ∈ χ is defined by

Homχ(V, U) := {iVU} if V ⊂ U

:= ∅ otherwise.

The arrow iVU is called the inclusion map. 

12.0 ⊥ := {∅, Ω}, the trivial σ-field, which is the least element of χ.
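For intuition, on a finite Ω every sub-σ-field is generated by a partition of Ω, and V ⊂ U means that the partition generating U refines the one generating V. The sketch below models χ this way (the helper names are mine), with ⊥ corresponding to the trivial partition {Ω}:

```python
# Sub-sigma-fields of a finite Omega represented as partitions (frozensets of blocks).
def partition(*blocks):
    return frozenset(frozenset(b) for b in blocks)

Omega  = range(4)
bottom = partition(Omega)                    # trivial sigma-field {emptyset, Omega}
V      = partition([0, 1], [2, 3])
U      = partition([0], [1], [2, 3])         # the finest of the three

def is_sub(V, U):
    """V subset of U  <=>  every block of U's partition lies inside a block of V's."""
    return all(any(bu <= bv for bv in V) for bu in U)

def hom(V, U):
    """Hom_chi(V, U): the single inclusion arrow i_V^U if V subset of U, else empty."""
    return [("i", V, U)] if is_sub(V, U) else []

assert hom(bottom, V) and hom(V, U) and hom(bottom, U)   # bottom <= V <= U
assert hom(U, U) and not hom(U, V)                       # identities exist; no reverse arrow
# Composing i_bottom^V with i_V^U yields i_bottom^U, so chi is indeed a (poset) category.
```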

13.0 A monetary value measure is a contravariant functor

Ψ : χop → Set

satisfying the following two conditions

13.1 for U ∈ χ, Ψ(U) := L(U)

13.2 for U, V ∈ χ, such that V ⊂ U, the map ΨVU := Ψ(iVU) : L(U) → L(V) satisfies

13.3 cash invariance: (∀ X ∈ L(U))(∀ Z ∈ L(V)) ΨVU (X + Z) =  ΨVU (X) + Z

13.4 monotonicity: (∀ X ∈ L(U)) (∀ Y ∈ L(U)) X ≤ Y ⇒ ΨVU(X) ≤ ΨVU(Y)

13.5 normalization: ΨVU(0) = 0

At this point, we do not require the monetary value measures to satisfy familiar conditions such as concavity or law invariance. Instead, we want to see what kind of properties are deduced from this minimal setting. One of the key points of Definition 13.0 is that Ψ is a contravariant functor, and thus for any triple of σ-fields W ⊂ V ⊂ U in χ, we have

13.6 ΨUU = 1L(U) and ΨWV o ΨVU = ΨWU

14.0 Concave monetary value measure: A monetary value measure Ψ is said to be concave if for any V ⊂ U in χ, X, Y ∈ L(U) and λ ∈ [0,1],

ΨVU(λX + (1- λ)Y) ≥ λΨVU(X) + (1-λ)ΨVU(Y)

An entropic value measure is concave.
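As a concrete check of 13.3-13.6 and 14.0, here is a minimal numpy sketch of an entropic value measure on a finite space, using the convention ΨVU(X) := −(1/γ) log E[exp(−γX) | V] with γ > 0. The sign convention, the value of γ, and the helper names are assumptions of this sketch; the composition property 13.6 then holds by the tower property of conditional expectation.

```python
import numpy as np

P = np.full(6, 1 / 6)                     # uniform P on a 6-point Omega
U = [[0], [1], [2], [3], [4], [5]]        # sigma-fields as generating partitions,
V = [[0, 1], [2, 3], [4, 5]]              # with W subset V subset U
W = [[0, 1, 2, 3], [4, 5]]
gamma = 2.0                               # risk-aversion parameter (assumed)

def cond_exp(X, part):
    Y = np.empty_like(X, dtype=float)
    for b in map(np.asarray, part):
        Y[b] = np.dot(P[b], X[b]) / P[b].sum()
    return Y

def psi(X, part):
    """Entropic value measure: -(1/gamma) * log E[exp(-gamma X) | part]."""
    return -np.log(cond_exp(np.exp(-gamma * X), part)) / gamma

rng = np.random.default_rng(1)
X, Y = rng.normal(size=6), rng.normal(size=6)
Z = psi(rng.normal(size=6), V)            # a V-measurable random variable

assert np.allclose(psi(X + Z, V), psi(X, V) + Z)       # 13.3 cash invariance
assert np.allclose(psi(np.zeros(6), V), 0.0)           # 13.5 normalization
assert np.allclose(psi(psi(X, V), W), psi(X, W))       # 13.6 Psi_WV o Psi_VU = Psi_WU
lam = 0.3                                              # 14.0 concavity
assert np.all(psi(lam * X + (1 - lam) * Y, V)
              >= lam * psi(X, V) + (1 - lam) * psi(Y, V) - 1e-12)
```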

Here are some properties of monetary value measures.

15.0 Proposition: Let Ψ : χop → Set be a monetary value measure, and W ⊂ V ⊂ U be σ-fields in χ.

15.1 (∀ X ∈ L(V)) ΨVU(X) = X

By cash invariance and normalization, ΨVU(X) = ΨVU(0 + X) = ΨVU(0) + X = X

15.2 Idempotence: (∀ X ∈ L(U)) ΨVU(ΨVU(X)) = ΨVU(X)

Since ΨVU(X) ∈ L(V), this is immediate from 15.1.

15.3 Local property: (∀ X ∈ L(U))(∀ Y ∈ L(U))(∀ A ∈  V) ΨVU (1AX + 1ACY) = 1AΨVU(X) + 1AC ΨVU(Y) 

First we show for any A ∈ V,

1AΨVU(X) = 1AΨVU(1AX)

Since X ∈ L(Ω, U, P), we have |X| ≤ ||X||,

therefore,

1AX – 1AC||X|| ≤ 1AX + 1ACX = X ≤ 1AX + 1AC||X||

hence, by cash invariance and monotonicity (note that 1AC||X|| ∈ L(V) because A ∈ V),

ΨVU(1AX) – 1AC||X|| = ΨVU(1AX – 1AC||X||) ≤ ΨVU(X) ≤ ΨVU(1AX + 1AC||X||) = ΨVU(1AX) + 1AC||X||

then, multiplying through by 1A and using 1A1AC = 0,

1AΨVU(1AX) = 1A(ΨVU(1AX) – 1AC||X||) ≤ 1AΨVU(X) ≤ 1A(ΨVU(1AX) + 1AC||X||) = 1AΨVU(1AX)

which gives the claim.

Using this claim twice (with A and with AC), we have

ΨVU (1AX + 1ACY) = 1AΨVU(1AX + 1ACY) + 1ACΨVU(1AX + 1ACY)

= 1AΨVU(1A(1AX + 1ACY)) + 1ACΨVU(1AC(1AX + 1ACY))

= 1AΨVU(1AX) + 1ACΨVU(1ACY)

= 1AΨVU(X) + 1ACΨVU(Y)

15.4 Dynamic programming principle: (∀ X ∈ L(U)) ΨWU(X) = ΨWU(ΨVU(X))

By the functoriality 13.6 and idempotence 15.2,

ΨWU(X) = ΨWV(ΨVU(X)) = ΨWV(ΨVU(ΨVU(X))) = (ΨWV o ΨVU)(ΨVU(X)) = ΨWU(ΨVU(X))

15.5 Time consistency: (∀ X ∈ L(U))(∀ Y ∈ L(U)) ΨVU(X) ≤ ΨVU(Y) ⇒ ΨWU(X) ≤ ΨWU(Y)

Assume ΨVU(X) ≤ ΨVU(Y). Then, by 13.6 and monotonicity,

ΨWU(X) = ΨWV(ΨVU(X)) ≤ ΨWV(ΨVU(Y)) = ΨWU(Y)

……………

Binary Connect, Ternary Connect, and Eliminating Multiplications in the Forward and Backward Passes of Deep Neural Networks

Consider a neural network layer with N input and M output units. The forward computation is y = h(W x + b) where W and b are the weights and biases, respectively, h is the activation function, and x and y are the layer’s inputs and outputs. If we choose ReLU (Rectified Linear Unit, the ramp function) as h, there are no multiplications in computing the activation function, so all multiplications reside in the matrix product W x. For each input vector x, N × M floating point multiplications are needed.
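For reference, here is a minimal numpy sketch of this forward pass (sizes and names are illustrative); the single product W @ x is where all N × M floating point multiplications occur.

```python
import numpy as np

N, M = 4, 3                              # input / output units (illustrative sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))              # full-resolution weights
b = np.zeros(M)
x = rng.normal(size=N)

relu = lambda z: np.maximum(z, 0.0)      # h = ReLU: no multiplications needed
y = relu(W @ x + b)                      # W @ x costs N * M multiplications
```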

[Figure: the ReLU family of activation functions]

Binary connect eliminates these multiplications by stochastically sampling weights to be −1 or 1. Full-resolution weights w̄ are kept in memory as reference, and each time y is needed, we sample a stochastic weight matrix W according to w̄. For each element of the sampled matrix W, the probability of getting a 1 is proportional to how “close” its corresponding entry in w̄ is to 1, i.e.,

P(Wij = 1) = (w̄ij + 1)/2;

P(Wij = −1) = 1 − P(Wij = 1)

It is necessary to add edge constraints to w̄. To ensure that P(Wij = 1) lies in a reasonable range, the values in w̄ are forced to lie in the interval [−1, 1]. If during the updates any value grows beyond that interval, we set it to the corresponding edge value, −1 or 1. In this way, floating point multiplications become sign changes.
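A minimal sketch of this sampling step (the function name is mine): clip the stored full-resolution weights to [−1, 1], then draw each Wij from {−1, +1} with P(Wij = 1) = (w̄ij + 1)/2.

```python
import numpy as np

def binarize(w_bar, rng):
    """Stochastically sample W in {-1, +1} from full-resolution weights w_bar."""
    w_bar = np.clip(w_bar, -1.0, 1.0)        # edge constraint on the stored weights
    p_plus = (w_bar + 1.0) / 2.0             # P(W_ij = +1)
    return np.where(rng.random(w_bar.shape) < p_plus, 1.0, -1.0)

rng = np.random.default_rng(0)
w_bar = rng.uniform(-1.5, 1.5, size=(3, 4))
W = binarize(w_bar, rng)                     # entries are exactly -1 or +1
# E[W] equals clip(w_bar, -1, 1), so the sampled weights are unbiased estimates.
```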

A remaining question concerns the use of multiplications in the random number generator involved in the sampling process. Sampling an integer has to be cheaper than a multiplication for the algorithm to be worthwhile.

Moving on from binary to ternary connect: whereas the former allows weights to be only −1 or 1, in a trained neural network it is common to observe that many learned weights are zero or close to zero. Although the stochastic sampling process allows the mean value of the sampled weights to be zero, this suggests that it may be beneficial to explicitly allow weights to be zero.

To allow weights to be zero, split the interval [−1, 1], within which the full-resolution weight value w̄ij lies, into two sub-intervals: [−1, 0] and (0, 1]. If a weight value w̄ij falls into one of them, we sample Wij to be one of the two edge values of that interval,

according to their distance from w̄ij, i.e., if w̄ij > 0:

P(Wij = 1) = w̄ij; P(Wij = 0) = 1 − w̄ij

and if w̄ij ≤ 0:

P(Wij = −1) = −w̄ij; P(Wij = 0) = 1 + w̄ij

Like binary connect, ternary connect also eliminates all multiplications in the forward pass.
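A corresponding sketch for ternary sampling (the function name is mine): entries with w̄ij > 0 are drawn from {0, +1} with P(Wij = 1) = w̄ij, and entries with w̄ij ≤ 0 from {−1, 0} with P(Wij = −1) = −w̄ij.

```python
import numpy as np

def ternarize(w_bar, rng):
    """Stochastically sample W in {-1, 0, +1} from full-resolution weights w_bar."""
    w_bar = np.clip(w_bar, -1.0, 1.0)
    u = rng.random(w_bar.shape)
    W = np.zeros_like(w_bar)
    W[(w_bar > 0) & (u < w_bar)] = 1.0        # P(W_ij = +1) = w_bar_ij   for w_bar_ij > 0
    W[(w_bar <= 0) & (u < -w_bar)] = -1.0     # P(W_ij = -1) = -w_bar_ij  for w_bar_ij <= 0
    return W

rng = np.random.default_rng(0)
W = ternarize(rng.uniform(-1.5, 1.5, size=(3, 4)), rng)   # entries in {-1, 0, +1}
```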

We move from the forward to the backward pass. Suppose the i-th layer of the network has N input and M output units, and consider an error signal δ propagating downward from its output. The updates for weights and biases would be the outer product of the layer’s input and the error signal:

∆W = ηδ◦h′ (W x + b) xT

∆b = ηδ◦h′ (W x + b)

where η is the learning rate, and x the input to the layer. While propagating through the layers, the error signal δ needs to be updated, too. Its update taking into account the next layer below takes the form:

δ = WTδ◦h′ (W x + b)

Three terms appear repeatedly in the above three equations, viz. δ, h′ (W x + b) and x. The latter two terms introduce matrix outer products. To eliminate multiplications, one can quantize one of them to be an integer power of 2, so that multiplications involving that term become binary shifts. The expression h′ (W x + b) contains down-flowing gradients, which are largely determined by the cost function and network parameters, so it is hard to bound its values. However, bounding the values is essential for quantization because we need to supply a fixed number of bits for each sampled value, and if that value varies too much, we will need too many bits for the exponent. This, in turn, will result in the need for more bits to store the sampled value and unnecessarily increase the required amount of computation.

While h′ (W x + b) is not a good choice for quantization, x is a better choice, because it is the hidden representation at each layer, and we know roughly the distribution of each layer’s activation.

The approach is therefore to eliminate multiplications in

∆W = ηδ◦h′ (W x + b) xT

by quantizing each entry in x to an integer power of 2, so that the outer product in the equation above becomes a series of bit shifts. Experimentally, it turns out that allowing a maximum of 3 to 4 bits of shift is sufficient to make the network work well, which means that 3 bits are already enough to quantize x. As the float32 format has 24 bits of mantissa, shifting (to the left or right) by 3 to 4 bits is entirely tolerable. This approach is referred to as “quantized back propagation”.
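A rough sketch of this quantization step under the stated assumptions (at most 4 bits of shift; the helper name and the underflow-to-zero rule are mine, and the numpy multiplication by ±2^e stands in for the bit shift a fixed-point implementation would use):

```python
import numpy as np

def quantize_pow2(x, max_shift=4):
    """Round each entry of x to sign * 2**e with e an integer in [-max_shift, 0]."""
    sign = np.sign(x)
    e = np.round(np.log2(np.abs(x) + 1e-12))            # nearest integer exponent
    e = np.clip(e, -max_shift, 0)                        # bounded number of shift bits
    q = sign * np.exp2(e)
    q[np.abs(x) < 2.0 ** (-max_shift - 1)] = 0.0         # underflow to zero
    return q

# weight update Delta W = eta * (delta o h'(Wx+b)) x^T with x quantized:
rng = np.random.default_rng(0)
N, M, eta = 4, 3, 0.1
x, delta, hprime = rng.random(N), rng.normal(size=M), rng.random(M)
dW = eta * np.outer(delta * hprime, quantize_pow2(x))
# Each entry of dW is (eta * delta_i * h'_i) times +/- 2^e, i.e. a bit shift in fixed point.
```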

If we choose ReLU as the activation function and use binary (ternary) connect to sample W, computing the term h’ (W x + b) involves no multiplications at all. In addition, quantized back propagation eliminates the multiplications in the outer product in

∆W = ηδ◦h′ (W x + b) xT.

The only place where multiplications remain is the element-wise product δ◦h′ (W x + b). From

∆W = ηδ◦h′ (W x + b) xT, ∆b = ηδ◦h′ (W x + b), and δ = WTδ◦h′ (W x + b),

one can see that 6 × M multiplications are needed for all computations. As in the forward pass, most of the multiplications are used in the weight updates. Compared with standard back propagation, which would need 2MN + 6M multiplications, the number of multiplications left in quantized back propagation is negligible.
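For concreteness, a quick count for hypothetical layer sizes (N = M = 1024), comparing the 2MN + 6M multiplications of standard back propagation with the 6 × M that remain under quantized back propagation:

```python
N, M = 1024, 1024                       # hypothetical layer sizes
standard  = 2 * M * N + 6 * M           # standard back propagation
quantized = 6 * M                       # quantized back propagation
print(standard, quantized, standard / quantized)   # 2103296 vs 6144, ~342x fewer
```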