As an undergraduate, I studied low-level CMOS hardware design. Before I came to the University of Michigan to start my MS degree, I had never considered the bigger picture of my career goals; I was simply concerned with absorbing more graduate-level knowledge in my major.
However, life at UMich changed everything. On the first day of my computer architecture class, I was told that Moore’s law was coming to an end because conventional CPU scaling is dead. At the same time, I was shown a brand-new level of chip design. Computer architecture is still a kind of chip design, but it is also the science and art of creating a computer that meets functional and performance goals. Seeing that the research space for CMOS transistors was shrinking, and drawn to the “art” of designing a CPU, I decided to move into the field of computer architecture.
The language barrier and my lack of prerequisites made the first semester extremely hard, but I told myself not to give up at that point; otherwise I could never make a difference. My final grade in the computer architecture class was not strong, but at least I had taken my first step. To remedy my gaps in CS knowledge, I audited undergraduate CS classes while continuing my graduate-level coursework. Parallel computer architecture was the class that changed my earlier belief that an MS degree would be enough for me. Professor Y provided us with an excellent paper reading list, which allowed me to grasp the active research topics in both academia and industry. During my paper reading, I was especially impressed by Professor Ferdman’s paper “Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware”. In that paper, he argues that inefficiency arises from the mismatch between the hardware design of modern server processors and emerging scale-out workloads.
This observation from Professor Ferdman’s paper gave me a strong desire to look into the problem further. I contacted a number of professors at UMich about working in their labs, but none of them had the bandwidth to advise a new student. This situation led me to the idea of retaking the parallel computer architecture class, since it offered a platform for a research project on state-of-the-art topics.
I persuaded Professor Y to give me another chance to take that class. The research topic I worked on was the killer-microsecond problem: microsecond-scale latency caused by data center networking or a new breed of low-latency I/O devices, for which modern computer systems are not optimized. Since microsecond latency is typically introduced by I/O operations, Online Data-Intensive (OLDI) applications, which issue I/O calls at high frequency, may suffer from it. Our work was to understand how OLDI workloads affect performance and to test whether our hypothesized solution could yield a performance gain.
To observe how OLDI workloads affect performance, I worked with another student to design a synthetic workload in which we could tune the I/O call frequency, the I/O call types, and the memory footprint between I/O calls. Meanwhile, we embedded Intel Performance Counter Monitor (PCM) code into our workload to observe cache miss ratios in the memory system. Once we started embedding PCM into our code, we encountered a serious problem: PCM does not provide a direct way to obtain the L1 cache miss ratio. My teammate wanted to abandon PCM, but I insisted on extracting the L1 miss ratio by studying the PCM source code while he looked for alternative tools. I eventually solved the problem by adding PCM’s event-counting function calls into our workloads to record L1 cache misses.
Based on my solution, we successfully obtained performance results. We then hypothesized that aggressive hardware multithreading could be a potential solution to this problem. After emulating the behavior of OS context switching on a multicore system and comparing its performance with that of a setup without OS context switching, we showed that hardware multithreading can reduce the performance cost of thread migration for I/O-intensive applications.
Although I had learned the methodology of measuring a server’s performance, I felt that the knowledge I had gained was far from enough. With graduation approaching, I decided to continue my studies with the data center networking group at Microsoft Research Asia (MSRA), which is building a device-centric cluster called Terminus in place of the traditional server-centric cluster. Knowing that I had interned as a software engineer at NVIDIA, my mentor ZZ at MSRA asked me to contribute to the GLaneOnGPU part of the project. My goal is to let GPUs in Terminus “talk to” other devices directly, without the server CPU being involved.
My idea is to let GPUs in data centers work as an always-running service: we still use the CUDA programming model to program the GPU, but once the kernels are launched, we keep them resident on the GPU. The GPU communicates with the outside world only when it needs to transfer data, and both communication and data transfer are done through memory reads and writes on the GPU’s global memory. In this way, we eliminate the extra overhead of copying data from other devices to the CPU and then from the CPU to the GPU.
My mentor soon accepted this idea. At the beginning of my work, however, I found it hard to make a breakthrough, because NVIDIA does not provide any public way to obtain physical addresses of GPU memory. Since other devices need physical addresses to exchange data with the GPU, I realized that developing a driver would be necessary. After a couple of months of study, I took advantage of the existing GPUDirect RDMA code to develop a driver that converts CUDA-style pointers into physical addresses. This driver enables data transfer between the GPU and other devices. Preliminary results show that data transfer between an FPGA and the GPU has almost the same overhead as between the CPU and the GPU. With this result, I am now moving on to build a more complex hardware stack that allows the GPU to run real applications while sending and receiving data directly to and from other devices.
After my studies at UMich and MSRA, I still want to learn more about computer architecture, especially research topics related to data centers. Although I enjoy working at MSRA, I cannot stay there long as an intern, and to become a true researcher in this area I first need to become a PhD student. I believe XX University is the ideal institution to support my academic pursuits, since the XX group also focuses on designing server systems. My previous experience is a great match for the XX group’s endeavors to solve design and optimization issues in server systems. In particular, Professor XX’s work is very intriguing to me as a path for future research. I would greatly appreciate the opportunity to learn from and collaborate with XX University’s influential faculty like him.
