ML-Driven Binary Analysis Pipeline Enhances SQA

Machine learning (ML) techniques, such as graph neural networks (GNNs) and natural language processing (NLP), are opening up new avenues for automating binary analysis. Leveraging these techniques, computational mathematician Geoff Sanders and former LLNL data scientist Justin Allen explored ways to characterize software behaviors based on their similarity to previous threats. Allen built an ML-driven binary analysis pipeline that incorporates large-scale training data and hierarchical embeddings, and presented their paper, "BobGAT: Towards Inferring Software Bill of Behavior with Pre-Trained Graph Attention Networks," at the 2024 IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications. The work was part of a Laboratory Directed Research and Development project focusing on software assurance capabilities. Two complementary open-source tools are key to this pipeline. Developed for this research, CAP (Compile. Analyze. Prepare.) generates large-scale binary datasets from source code examples. BinCFG then parses the compiler outputs, tokenizes and normalizes the binary data into assembly lines, and converts the data into ML-ready formats. Read more about the project at LLNL Computing.