SMBC Nikko Securities
Nikko is a broker-dealer firm that trades in the Japanese equities markets. The Equity department at Nikko operates an electronic trading platform called SNET that processes millions of trades from institutional and retail clients.
My work
I was on the infrastructure team supporting SNET, working on network infrastructure. I built a packet tracing tool that measures latency at both the application level and the network interface card (NIC) level, profiling various points in the packet send and receive life cycle. The tool was used to understand the kernel network stack and also to understand and profile a different approach, kernel bypass. It can measure latency between the application and the NIC on transmit (the transmit path), between the NIC and the application on receive (the receive path), RTT, one-way latency, and packet throughput. The tool relies heavily on technologies already implemented in the network infrastructure or provided by the hardware, so a lot of what I did involved working with hardware such as clocks and interface cards. Below is a rough timeline of what I did through the internship.
Intel Processor Trace
A key part of my tool is being able to reliably tell exactly what is going on when you make various function calls: for example, what happens when you call send or recv, what happens when you call clock_gettime, and so on. This was important for a latency tracing tool because being able to reliably get the current time is the crux of latency measurement. Getting inaccurate time, or taking too long to get the time, makes your measurements unreliable. To understand what is happening inside these C library functions, I used tools such as strace and perf trace, which let you inspect the system calls your program makes. While this was a good way to profile how long send and recv took, it wasn't a perfect solution. In the case of clock_gettime, I was not able to inspect the underlying system call, the reason being vdso.
vdso
From the man pages: "The virtual dynamic shared object (vDSO) is a small shared library that the kernel automatically maps into the address space of all user-space applications." This is a performance optimization the Linux kernel uses to avoid making system calls for certain kernel functions. It removes the need for a context switch: applications can call some kernel functions directly, in this case clock_gettime. The vDSO does make things faster, but at the cost of observability: you can no longer see what is happening, because there are no system calls to inspect.
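To make this concrete, here is a minimal sketch (not part of the original tool) that confirms the kernel really has mapped the vDSO into the process, using getauxval, and then calls clock_gettime, which on a typical x86-64 kernel is served by the vDSO and therefore never shows up under strace or perf trace:

```c
/* Minimal sketch: confirm the kernel mapped the vDSO into this process.
 * Illustrative only; not part of the original tool. */
#include <stdio.h>
#include <elf.h>        /* AT_SYSINFO_EHDR */
#include <sys/auxv.h>   /* getauxval */
#include <time.h>

int main(void)
{
    /* AT_SYSINFO_EHDR is the base address of the vDSO ELF image. */
    unsigned long vdso_base = getauxval(AT_SYSINFO_EHDR);
    printf("vDSO mapped at: 0x%lx\n", vdso_base);

    /* On x86-64 this call is normally handled entirely in the vDSO,
     * so it does not appear as a system call under strace. */
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    printf("CLOCK_REALTIME: %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}
```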
Back to Intel PT
Now, because we cannot see the system calls made by the C library function, I could no longer use tools like strace and perf trace to understand what was happening. Luckily, I was working on a processor that supported the relatively new Intel Processor Trace. This was probably one of the coolest things I used during the internship. It lets you trace the assembly instructions executed by a program at the processor level. Along with xed (the x86 Encoder Decoder), it lets you see the exact instruction executed and the symbol it was executed for. It also provides timestamps generated from CYC packets (I don't fully understand these, but they essentially timestamp at the hardware level using CPU cycles). This is cool because you can see exactly what your program is doing at the lowest level; all the information you need to debug is there. Using this tool, you can see the exact instructions you execute and how long each one takes, which also means you can see how long each function call takes. Now I was able to see exactly how long a clock_gettime took.
Clocks on a machine
Again, measuring how long it took to take time was super important. Reading CLOCK_REALTIME or CLOCK_MONOTONIC was typically very fast (in the nanosecond range). However, the problem arose when I tried to read time from clock devices other than the real-time clock (RTC). Why was I even trying to do this? Because I was trying to make sense of hardware timestamps.
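As a rough illustration of "measuring how long it takes to take time", here is a minimal sketch (not the actual tool) that estimates the cost of a single CLOCK_MONOTONIC read by timing a batch of back-to-back clock_gettime calls; the numbers are machine dependent:

```c
/* Minimal sketch: estimate the cost of one clock read by timing many
 * back-to-back clock_gettime calls. Results vary per machine. */
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

int main(void)
{
    struct timespec start, end, tmp;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERS; i++)
        clock_gettime(CLOCK_MONOTONIC, &tmp);   /* the read being measured */
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                         + (end.tv_nsec - start.tv_nsec);
    printf("avg clock_gettime cost: ~%lld ns\n", elapsed_ns / ITERS);
    return 0;
}
```

On most modern x86-64 systems both of these clocks are typically read through the vDSO, which is why they are so cheap.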
Hardware timestamps
Hardware timestamping is a capability of some interface cards: the NIC timestamps packets just before they are transmitted on the wire and just after they are received from the wire, using a clock on the card itself called the PHC. You can extract these timestamps by enabling them on your interface card, setting the right socket options, and then calling recvmsg with a struct msghdr and parsing the returned control messages. You can extract the hardware timestamps this way, but to get latency measurements you need a point of comparison. I initially used CLOCK_REALTIME as this point of comparison, but that produced latency measurements that were negative or way too high. Why? Because CLOCK_REALTIME comes from the RTC while the hardware timestamps come from the PHC, and unfortunately these clocks had drifted too far apart. So I tried reading time from the same clock as the NIC, the PHC; that way the timestamps would come from the same time source and be comparable. clock_gettime also lets you read a clock device through its file descriptor. Using this method, the latency measurements were no longer negative. However, it took too long to retrieve the time from the PHC (around 100 microseconds), so this was not a viable method either. This is where PTP comes in.
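Below is a hedged sketch of the recvmsg-based extraction described above, for a UDP socket on Linux. The port number is hypothetical, error handling is minimal, and it assumes hardware timestamping has already been enabled on the interface (for example via the SIOCSHWTSTAMP ioctl), which is omitted here:

```c
/* Sketch (not the original tool): receive a UDP packet and pull the
 * NIC hardware RX timestamp out of the recvmsg() control messages.
 * Assumes the NIC supports hardware timestamping and that it is
 * already enabled on the interface. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>   /* SOF_TIMESTAMPING_* flags */
#include <linux/errqueue.h>     /* struct scm_timestamping */

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(12345),   /* hypothetical port */
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Ask for raw hardware timestamps on receive. */
    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

    char payload[2048];
    char control[512];
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = control,
                          .msg_controllen = sizeof(control) };

    if (recvmsg(fd, &msg, 0) < 0) { perror("recvmsg"); return 1; }

    /* Walk the control messages looking for the timestamping payload. */
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPING) {
            struct scm_timestamping ts;
            memcpy(&ts, CMSG_DATA(c), sizeof(ts));
            /* ts.ts[0] = software timestamp, ts.ts[2] = raw hardware (PHC) timestamp */
            printf("hw rx timestamp: %ld.%09ld\n",
                   (long)ts.ts[2].tv_sec, ts.ts[2].tv_nsec);
        }
    }
    return 0;
}
```

Transmit-path timestamps are retrieved similarly, but they come back on the socket's error queue (read with recvmsg and the MSG_ERRQUEUE flag) rather than alongside the received payload.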
PTP
The P in PHC stands for Precision Time Protocol: the clock on the NIC is a PTP Hardware Clock, and it is kept in sync with a time master in the datacenter. Because the network is incredibly fast, and thanks to this incredibly complex protocol that I don't fully understand, the PHC can be kept in sync with real time at nanosecond granularity. Additionally, the RTC (the source of CLOCK_REALTIME) can be synced with the PHC very quickly because it is on the same machine. With this protocol in place, we can compare CLOCK_REALTIME with the hardware timestamps without having to worry about clock drift. Using the same method, clocks across different machines can be kept closely in sync as well. This way, we can now get RTT, one-way, and receive and transmit latencies.
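Once the clocks are synced by PTP, the latency calculation itself is just a subtraction between timestamps from the two sources. A toy sketch with made-up values (these are not SNET measurements):

```c
/* Toy sketch with hypothetical values: once the sender's CLOCK_REALTIME
 * and the receiver's NIC PHC are PTP-synced, one-way latency is simply
 * the hardware RX timestamp minus the application TX timestamp. */
#include <stdio.h>
#include <time.h>

/* Difference b - a in nanoseconds. */
static long long ts_diff_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    /* Hypothetical: taken with clock_gettime(CLOCK_REALTIME, ...) just
     * before send() on the sending host and carried in the packet. */
    struct timespec app_tx = { .tv_sec = 1700000000, .tv_nsec = 100000000 };

    /* Hypothetical: raw hardware (PHC) timestamp from SCM_TIMESTAMPING
     * on the receiving host's NIC. */
    struct timespec nic_rx = { .tv_sec = 1700000000, .tv_nsec = 100012500 };

    printf("one-way latency: %lld ns\n", ts_diff_ns(app_tx, nic_rx));
    return 0;
}
```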
Kernel Bypass
This is where kernel bypass came in. Kernel bypass, as it sounds, is a way of bypassing the kernel when interacting with the NIC. With the traditional network stack, the application asks the kernel to send and receive packets through the interface card. This makes networking much easier for the application, but it comes at the cost of latency: for the kernel to run, there has to be a context switch into it. Additionally, the kernel is subject to jitter, the deviation between when a task was supposed to run and when it actually ran. This can cause a lot of variation in latency between the app and the NIC within a machine. Kernel bypass avoids this by interfacing with the NIC directly. Using the tool I had built so far, we can compare the RTT and one-way latencies of the kernel network stack and kernel bypass, and we can also see where the latency improvement comes from by looking at how long it takes a packet to go back and forth between the app and the NIC.
Edit: This isn't to say that the Linux kernel is bad at networking. The Linux kernel is simply designed for something other than a trading platform: it provides a time-shared system, whereas for trading platforms a real-time system is desirable.
Other things I did
- Build a machine from scratch: I built a machine from scratch by accessing its iLO. Configured the operating system, interface cards, and disk storage. Managed packages with yum and rpm
- Build and ship an RPM: Built and produced an RPM package that can be installed on RHEL 8.10 and RHEL 9.5 systems
- Measure jitter on a system using sysjitter
- Use multicast
Lessons learned
- Premature optimization: While I learned a lot and was able to produce a usable product, I couldn't achieve all that I set out to. This was mostly because I was focused on trying to do things perfectly the first time. I was trying to solve a problem when I myself didn't truly understand the problem, and I wasted a lot of time adding features and functionality that weren't aligned with the final goal
- Asking the right questions: Throughout the internship, I learned that the hard part wasn’t answering the question but rather, learning what the right questions to ask were.
- Writing beautiful git commits (the 50/72 rule)