1.3. Writing Analysis

For this tutorial, we’re writing an analysis that reports how much of each file in a project is whitespace. We’ll use it to find which projects have checked in minified JavaScript files.

1.3.1. Setting up the Container

We will write this analysis in Python. Although this tutorial uses Python, you can write an analysis in any programming language, because each Docker container can have different software installed.

Before we can write Python, we’ll need to install it in our Docker container. To do this, add the RUN instruction that installs python3 to the project’s Dockerfile, so that the file looks like this:

FROM ubuntu:18.10

RUN apt update && apt install -y python3 \
  && rm -rf /var/lib/apt/lists/*

RUN groupadd -r analysis && useradd -m --no-log-init --gid analysis analysis

USER analysis
COPY src /analyzer

WORKDIR /
CMD ["/analyzer/analyze.sh"]

When we edit our code later, the container will automatically be rebuilt by r2c run.

1.3.2. Writing the Code

We need to be able to count both whitespace and non-whitespace characters in a given file. We can do this with a simple regular expression in Python. Create a file src/whitespace.py with the following contents:

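import json
import re
import sys

# Raw string so that \s reaches the regex engine as the whitespace class.
WHITESPACE_RE = re.compile(r"\s")


def count_whitespace(path):
    """Return a result object with the size and whitespace count of one file."""
    print("Counting whitespace in file {}".format(path))
    with open(path, "r", encoding="utf-8") as file:
        data = file.read()
    result = {}
    result["check_id"] = "whitespace"
    result["path"] = path
    result["extra"] = {}
    # Total number of characters in the file.
    result["extra"]["size"] = len(data)
    # Number of characters matched by \s (spaces, tabs, newlines, ...).
    result["extra"]["num_whitespace"] = len(WHITESPACE_RE.findall(data))
    return result


# Every command-line argument is treated as the path of a file to measure.
all_results = []
for path in sys.argv[1:]:
    all_results.append(count_whitespace(path))

# Write all results as a single JSON object.
with open("/analysis/output/output.json", "w") as output:
    output.write(json.dumps({"results": all_results}, sort_keys=True, indent=4))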

This script counts the whitespace characters and the total characters in each file passed to it as an argument. When we run our analyzer, we want to run this script with all JavaScript input files as arguments.
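
To see why this count helps separate minified files from hand-written ones, the counting step can be tried on two in-memory strings. This is only a sketch: whitespace_stats is a hypothetical helper that mirrors the counting logic above, and the two JavaScript snippets are made-up examples, not taken from any project.

```python
import re

# Same pattern as in whitespace.py.
WHITESPACE_RE = re.compile(r"\s")


def whitespace_stats(text):
    """Return (total characters, whitespace characters) for a string."""
    return len(text), len(WHITESPACE_RE.findall(text))


# Formatted source: spaces, indentation, and newlines all count as whitespace.
pretty = "function add(a, b) {\n  return a + b;\n}\n"
# The same function roughly as a minifier might emit it.
minified = "function add(a,b){return a+b}"

pretty_stats = whitespace_stats(pretty)
minified_stats = whitespace_stats(minified)
print(pretty_stats, minified_stats)
```

The formatted version has a much higher share of whitespace than the minified one, which is the signal the analysis relies on.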

We write this object to /analysis/output/output.json because this is a JSON-type analyzer. r2c also supports filesystem-type analyzers, which modify or augment their input but need to preserve a filesystem structure or output large binary data, e.g. neural net training results. Most analysis eventually leads to JSON output, because JSON output is what gets consumed by r2c’s other tools.
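
As an illustration of consuming this JSON, the sketch below turns the results into a whitespace fraction per file. It is not part of r2c’s tooling: whitespace_fractions, the sample data, and the 5% cutoff are all assumptions made for this example, and the sample only copies the shape that whitespace.py writes.

```python
import json


def whitespace_fractions(output):
    """Map each path to the fraction of its characters that are whitespace."""
    fractions = {}
    for result in output["results"]:
        size = result["extra"]["size"]
        num_whitespace = result["extra"]["num_whitespace"]
        # Empty files have no meaningful ratio; report 0.0 for them.
        fractions[result["path"]] = num_whitespace / size if size else 0.0
    return fractions


# Hand-written sample in the same shape that whitespace.py writes.
sample = json.loads("""
{"results": [
  {"check_id": "whitespace", "path": "./src/app.js",
   "extra": {"size": 400, "num_whitespace": 100}},
  {"check_id": "whitespace", "path": "./dist/app.min.js",
   "extra": {"size": 1000, "num_whitespace": 10}}
]}
""")

# Hypothetical cutoff: treat files under 5% whitespace as likely minified.
LIKELY_MINIFIED_BELOW = 0.05

fractions = whitespace_fractions(sample)
likely_minified = [p for p, f in fractions.items() if f < LIKELY_MINIFIED_BELOW]
print(likely_minified)
```

A suitable cutoff depends on the code being analyzed, so the value here should be treated as a starting point to tune.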

To get just JavaScript files, we’ll use the find program on our mounted source-code directory. Change src/analyze.sh to look like this:

#!/bin/bash

set -e
CODE_DIR="/analysis/inputs/public/source-code"

cd "${CODE_DIR}"

find . -type f -name '*.js' -print0 | xargs -0 python3 /analyzer/whitespace.py

First, we change to the directory where our source code is checked out. That folder is /analysis/inputs/public/source-code/ inside the Docker container. This location is a result of minifinder depending on the source-code component (configured in analyzer.json). For more information about dependencies and locating their output, see API Reference.

Then, we use the find command to locate all files whose names end in .js, and use xargs to pass all of those file paths as arguments to our Python program.
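
For readers less familiar with find, the same kind of selection can be sketched in Python with pathlib. This is an illustrative alternative, not what the tutorial’s analyzer does: find_js_files is a hypothetical helper, and the temporary directory tree exists only to make the example self-contained.

```python
from pathlib import Path
import tempfile


def find_js_files(root):
    """Return every regular file under root whose name ends in .js."""
    return [p for p in Path(root).rglob("*.js") if p.is_file()]


# Build a throwaway directory tree to demonstrate the search.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lib").mkdir()
    (root / "index.js").write_text("var a = 1;\n")
    (root / "lib" / "util.js").write_text("var b=2;")
    (root / "README.md").write_text("# docs\n")

    # Report paths relative to the root, as find does when started from ".".
    found = sorted(p.relative_to(root).as_posix() for p in find_js_files(root))

print(found)
```

The is_file() check plays the role of -type f, and the "*.js" pattern plays the role of -name '*.js'.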

Note

Though we wrote our Python in the file src/whitespace.py, inside src/analyze.sh we invoke it at the path /analyzer/whitespace.py. This is because the COPY src /analyzer instruction in our Dockerfile copies the src folder to the /analyzer folder inside the container.

Now that we’ve written our code, let’s try Running Analysis Locally.