Researchers have developed a technique that can dramatically accelerate certain types of computer programs, known as shell scripts, through a process called parallelization, while guaranteeing the programs continue to return accurate results.
Their system accelerates programs that run in the Unix shell, a popular programming environment that was created 50 years ago and is still widely used today. Their method parallelizes these programs, which means it divides them into pieces that can be run concurrently on multiple computer processors. This allows programs to perform tasks like web indexing, natural language processing, and data analysis in a fraction of the time they used to.
“Many people use these programs, including data scientists, biologists, engineers, and economists. They can now automatically accelerate their programs without fear of obtaining incorrect results,” says Nikos Vasilakis, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
The system also simplifies life for the programmers who create the tools that data scientists, biologists, engineers, and others rely on: they don’t need to make any changes to their program commands to enable this automatic, error-free parallelization, adds Vasilakis, who chairs a committee of international researchers that has been working on the system for nearly two years.
Vasilakis is senior author of the group’s latest research paper, which includes MIT co-author and CSAIL graduate student Tammam Mustafa and will be presented at the USENIX Symposium on Operating Systems Design and Implementation. Co-authors include lead author Konstantinos Kallas, a graduate student at the University of Pennsylvania; Jan Bielak, a student at Warsaw Staszic High School; Dimitris Karnikis, a software engineer at Aarno Labs; Thurston H.Y. Dang, a former MIT postdoc who is now a software engineer at Google; and Michael Greenberg, assistant professor of computer science at the Stevens Institute of Technology.
A decades-old problem
The PaSh system focuses on programs, or scripts, that run in the Unix shell. A script is a set of commands that tells a computer how to do something. Correct and automatic parallelization of shell scripts is a difficult problem that researchers have been attempting to solve for decades.
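As a toy illustration of what such a script looks like (this example is mine, not from the paper), the following pipeline chains small, single-purpose Unix commands to find the most frequent words in a file; `input.txt` is a placeholder filename:

```shell
#!/bin/sh
# Illustrative shell script: report the five most frequent words in a
# file by composing small Unix commands through pipes.
tr -s ' ' '\n' < input.txt |   # put one word on each line
  sort |                       # group identical words together
  uniq -c |                    # count each group
  sort -rn |                   # most frequent first
  head -5
```

Each command does one job and hands its output to the next through a pipe, which is exactly the structure that makes shell scripts candidates for parallelization.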
The Unix shell is still widely used because it is the only programming environment that allows a single script to contain functions written in multiple programming languages. Different programming languages are better suited for specific tasks or data types; if a developer uses the right language, problem solving can be much easier.
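A minimal sketch of that multilanguage composition (an illustrative example, not taken from the paper): `sed` and `awk` are two distinct languages, glued together in one script through a pipe.

```shell
#!/bin/sh
# Two different languages composed in a single shell script:
# sed rewrites the delimiter, awk aggregates the values.
printf 'alice,3\nbob,5\nalice,2\n' |
  sed 's/,/ /' |                                     # sed: normalize delimiter
  awk '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' |
  sort                                               # deterministic ordering
```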
“People also enjoy developing in different programming languages, so composing all these components into a single program is something that happens very frequently,” Vasilakis adds.
While the Unix shell allows for multilanguage scripts, its flexible and dynamic structure makes it difficult to parallelize these scripts using traditional methods. Parallelizing a program is typically difficult because some parts of the program rely on others. This determines the order in which components must be executed; if the order is incorrect, the program will fail.
When a program is written in a single language, developers have explicit information about the program’s features and the language, which allows them to determine which components can be parallelized. However, such tools do not exist for Unix shell scripts. Users can’t easily see what’s going on inside the components or extract data that would help with parallelization.
A just-in-time solution
To address this issue, PaSh employs a preprocessing step that inserts simple annotations onto program components that it believes may be parallelizable. Then PaSh attempts to parallelize those parts of the script while the program is running, at the precise moment it reaches each component.
Another issue with shell programming is that it is impossible to predict the behavior of a program ahead of time. The system avoids this issue by parallelizing program components “just in time.” It is capable of effectively speeding up many more components than traditional methods that attempt parallelization in advance.
Just-in-time parallelization ensures that the accelerated program continues to produce accurate results. If PaSh encounters a program component that cannot be parallelized (perhaps because it is dependent on a component that has not yet been run), it simply runs the original version to avoid causing an error.
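To see what such a transformation amounts to, here is a hand-rolled sketch (my illustration, not PaSh's actual machinery) of data-parallelizing one pure command, `grep -c`: split the input into chunks, count each chunk concurrently, then merge the partial counts so the result matches the sequential run.

```shell
#!/bin/sh
# Hand-rolled sketch of the data parallelism a system like PaSh applies
# automatically. Run in a scratch directory.
seq 1 1000 > input.txt

# Sequential version: count lines containing the digit 7.
grep -c 7 input.txt                 # prints 271

# Parallel version: split into four 250-line chunks, count each chunk
# concurrently, then merge the partial counts.
split -l 250 input.txt chunk_
for f in chunk_aa chunk_ab chunk_ac chunk_ad; do
  grep -c 7 "$f" > "$f.count" &
done
wait
cat chunk_*.count | awk '{s += $1} END {print s}'   # also prints 271
```

The merge step here (summing partial counts) is specific to `grep -c`; choosing the right split and merge for each command, and falling back to sequential execution when no safe choice exists, is what the automatic system handles.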
“Regardless of the performance benefits — if you promise to make something run in a second instead of a year — if there is any possibility of returning incorrect results, no one will use your method,” Vasilakis says. Users don’t need to modify their scripts to use PaSh; they simply add the tool to their existing Unix shell and tell their scripts to run through it.
Acceleration and accuracy
PaSh was tested on hundreds of scripts, ranging from classical to modern, and it didn’t break a single one. When compared to unparallelized scripts, the system was able to run programs six times faster on average, and it achieved a maximum speedup of nearly 34 times. It also increased the speed of scripts that other approaches were unable to parallelize.
“Our system is the first to demonstrate this type of fully correct transformation, but there is an additional benefit. Because of the way our system is designed, other researchers and users in industry can build on top of this work,” Vasilakis says.
He’s looking forward to hearing more from users and seeing how they improve the system. Last year, the open-source project became a member of the Linux Foundation, making it more widely available to users in industry and academia.
Vasilakis intends to use PaSh in the future to address the problem of distribution, which involves dividing a program to run on multiple computers rather than multiple processors within a single computer. He’s also working on making the annotation scheme more user-friendly and capable of describing complex program components.