Graphics processing units, or GPUs, are the workhorses behind some of the biggest topics in computer science today, powering the innovations behind artificial intelligence and machine learning. As major tech companies develop their own GPUs for devices such as computers and phones, tests need to be put in place to ensure software can run safely and efficiently across the various processors.
That’s where UC Santa Cruz Assistant Professor of Computer Science and Engineering Tyler Sorensen and his team of colleagues and student researchers step in. Sorensen’s team creates tests to ensure that programming languages can run correctly and safely across the diverse range of processors that different companies are producing. This work contributes to the overall stability of the processors deployed in our computers and phones as they take on increasingly important tasks such as facial recognition.
A new paper details a suite of tests to assess how GPUs implement programming languages. This work was led by Sorensen’s Ph.D. student Reese Levine along with UCSC undergraduates Mingun Cho and Tianhao Guo, UCSC Assistant Professor Andrew Quinn, and collaborators at Google. Levine will present the work at the 2023 ASPLOS conference, a premier computer systems conference.
In developing and running these tests, the researchers discovered significant bugs in a major GPU, and their findings led to changes to an important GPU framework for programming web browsers.
“If you’re a company and you want to implement this language so that people can program your GPU, we’re giving you a really good way to test it, and even a scorecard on how well it was tested,” Sorensen said. “People are always saying this is a very difficult part of the programming language to reason about — some people have even called it the rocket science of computer science.”
In this paper, the researchers tested GPUs specifically on desktop devices from major companies such as Apple, Intel, NVIDIA, and AMD.
Through these tests, the team found a bug in an AMD compiler, a program that translates code written in one programming language into another language. This discovery led AMD to confirm the bug and fix the problem on their devices.
“This behavior was so unexpected that they changed the programming language to adapt to our observations,” Sorensen said.
Moreover, the finding led to a change in a major GPU programming framework called WebGPU, an important tool used by programmers to ensure that web browsers can accelerate web pages using new GPU technologies.
“Every time you run Chrome, you know you're running a version that's passed our tests,” Levine said.
The tests developed by the team also uncovered a GPU bug on the Google Pixel 6. That bug has been confirmed, and Google has committed to fixing it. These results are discussed in another paper from Sorensen’s group, which is currently under submission. In their ongoing research, they recently deployed their tools and methodology to test over 100 different GPU devices.
To surface these bugs, the researchers use mathematical models of the programming languages to guide their tests toward the areas of the GPU where elusive bugs have historically lurked.
“How do you know your tests are working, and how do you know they're actually testing the right parts of the system?” Levine said. “We use mathematical models to provide confidence that these tests are performing as they should.”
Going forward, the researchers plan to use their tests on more devices, particularly on mobile phones, to ensure programming languages can be executed safely and efficiently.
This research was supported through a gift from Google.