The code in this article is inspired by the Hugging Face course, which I highly recommend taking.
In this article, we will train an LLM on a Python code dataset available on Hugging Face. We will build a code generation model that focuses on one-line completions rather than the full functions or classes that GitHub's Copilot can generate. We will only train on a small dataset of around 500MB, given the limited resources we have on Google Colab. The Python dataset is based on data science stacks such as numpy, matplotlib, etc. As data scientists, it will be helpful to have our own completion model to generate repetitive lines such as importing libraries, setting up variables, and so on. The code is available on my GitHub, feel free to grab it: https://github.com/otman-ai/LLM-python-code-generator. Before we start coding, we should first understand how LLMs work; you can check this Transformer article for more details. LLMs are powered by the Transformer, which was introduced for machine translation tasks but turns out to work on many other tasks at scale. To put it in simple terms, an LLM is just a text completion model: you give it a snippet of text and it completes it for you. We will go through the following steps:
- Load the data
- Splitting it into training and validation
- Load the tokenizer
- Tokenize the data
- Load the model
- Setup the training
- Train the model
- Inference
Let's code...
Load the data
The data we are going to use in this article is available on Hugging Face. We will not use the entire dataset since we are running this on the Google Colab free plan; I tried using all the data but it kept crashing, so this is the maximum. The data consists of 9 parquet files; we will take only the first file and split it into 80% for training and the rest for validation. Feel free to use as much as you want if you have sufficient memory.
First, we will use requests to get all the URLs of the parquet files from the Hugging Face API, then download and save each file.
import requests # for making requests
import os # for directories
# create the folder if it does not exist
os.makedirs("data", exist_ok=True)
# get the URLs of all the `parquet` files
urls = requests.get("https://huggingface.co/api/datasets/huggingface-course/codeparrot-ds-train/parquet/default/train").json()
# download each file and save it
for i, url in enumerate(urls):
    response = requests.get(url)
    with open(f"data/{i}.parquet", mode="wb") as file:
        file.write(response.content)
# read the first parquet file and convert it to JSON
import pandas as pd
# read one file
df1 = pd.read_parquet("data/0.parquet")
# save it as JSON
df1.to_json("train.json", orient="records")
# read the JSON file back as a list of records
import json
with open('train.json') as f:
    ds = json.load(f)
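If you have enough memory to use more than one file, here is a minimal sketch for combining all the downloaded parquet files before converting to JSON (assuming the files were saved in data/ as above):
import glob
import pandas as pd
# read every downloaded parquet file and stack them into one DataFrame
files = sorted(glob.glob("data/*.parquet"))
df_all = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
df_all.to_json("train.json", orient="records")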
Splitting the data into training and validation
# split the data into training and validation
train_ratio = 0.8
train_size = int(train_ratio * len(ds))
train_ds = ds[:train_size]
valid_ds = ds[train_size:]
# turn list into `Dataset` type
from datasets import DatasetDict, Dataset
raw_datasets = DatasetDict(
    {
        "train": Dataset.from_list(train_ds),
        "valid": Dataset.from_list(valid_ds),
    }
)
raw_datasets
Now we have 31,053 files in the training set and 7,704 files in the validation set. Next, we need to transform the text data into chunks (a.k.a. tokens) so the model can understand it; for this, we can load a pretrained tokenizer from Hugging Face.
Load the tokenizer
from transformers import AutoTokenizer
context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
outputs = tokenizer(
    raw_datasets["train"][:2]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)
print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")
The tokenizer works by turning the text into chunks of words or subwords and then into the unique ID of each token; here we give it two examples. 34 is the number of chunks of input_ids we get back. Each chunk has at most context_length tokens, which is the maximum number of tokens the model can take as input (128 in our case); a chunk can be shorter if there are not enough tokens left. return_length makes sure we get the length of every chunk, return_overflowing_tokens controls whether or not to return the overflowing token sequences, and finally truncation truncates the text to context_length.
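To make this more concrete, here is a minimal sketch (using the tokenizer loaded above on a made-up example string) of how text is split into subword chunks and mapped to IDs:
# a short example string to illustrate the subword splitting
example = "import numpy as np"
# encode the text into token IDs, then map the IDs back to their subword strings
ids = tokenizer(example)["input_ids"]
print(ids)                                   # a list of integer token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # the subword chunks behind those IDs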
Now that we have seen the example, let's wrap that into a function and apply it to all the data.
Tokenize the data
def tokenize(element):
    # tokenize a batch of files, splitting each one into chunks of at most context_length tokens
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # keep only the chunks that are exactly context_length tokens long
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

# apply the tokenization to the whole dataset, dropping the original text columns
tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets
We have 855,524 rows in training and 207,230 in validation, each with exactly 128 tokens (shorter chunks were dropped in tokenize). We specify batched=True so the tokenization is applied to batches of examples at once, which is much faster than processing them one by one. Next: loading the GPT-2 model. We start by setting up the configuration, such as vocab_size, n_ctx, pretrained_model_name_or_path, and bos_token_id and eos_token_id, which are the beginning- and end-of-sequence token IDs, respectively. Note that we only reuse the GPT-2 architecture and configuration; the weights are initialized from scratch.
Load the model (GPT-2)
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
# Load the GPT2
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
# the data collator builds batches for causal language modeling (mlm=False),
# padding the sequences and creating the labels from the input IDs
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")
To make use of the model, for example by showing it to your friends or colleagues, you can upload it to the Hub. If you are running this in a notebook, run the following to log in to Hugging Face; make sure you add your credentials.
# loging to hugging face to push the checkpoints
from huggingface_hub import notebook_login
notebook_login()
If you are not running in a notebook, type the following command in your terminal:
huggingface-cli login
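Alternatively, here is a minimal sketch for logging in from a plain Python script (assuming you have created an access token in your Hugging Face account settings; the token below is a placeholder):
from huggingface_hub import login
# pass your personal access token; it is only stored locally
login(token="hf_xxx")  # placeholder token, replace with your own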
Now that everything is set up, one last thing remains before training: let's configure the training using the Trainer class from transformers. We start by setting up the arguments such as output_dir, num_train_epochs, learning_rate, and so on.
Setup the training
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)
Training
Let's train our very own GPT-2.
Note: You can track your model's performance with Weights & Biases (W&B); get your API key and add it right below.
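For example, here is a minimal sketch of logging in to W&B before calling trainer.train() (assuming you have a W&B account; the key below is a placeholder):
import wandb
# log in so the Trainer can report training and evaluation metrics to W&B
wandb.login(key="your-api-key-here")  # placeholder, replace with your own key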
trainer.train()
Inference
Now that our model is trained, let's run inference. Make sure to replace otmanheddouch/codeparrot-ds with your own model path on Hugging Face (<username>/<model_name>).
Transformers already has a built-in inference pipeline that you can use; you just need to specify the task, the model path, and the device you want to run the model on.
If you have more than one device to run the model on, I highly recommend using Accelerate, as it is useful for distributing the training process. You can install it by typing the following:
pip install -U accelerate
and then, instead of setting device=device, you can set device_map="auto":
pipeline = pipeline(task='text-generation', model='otmanheddouch/codeparrot-ds', device_map='auto')
Since we have only one GPU, we don't need it.
import torch
from transformers import pipeline
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
pipe = pipeline(
    "text-generation", model="otmanheddouch/codeparrot-ds", device=device
)
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
Here is the prediction:
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create scatter plot with x, y
fig, ax = plt.subplots()
ax.scatter(x, y, s=50, c=color, cmap=plt.cm.bone)
ax.set_title("Original data")
ax.set_xlabel("X")
ax.set_ylabel("Y")
x_min, x_max = X[:, 0].min() -.5, X[:, 0].max() +.5
y_min_, y_max_ = X[:, 0.max() +.5
fig = np.min(X[:, 0.1, y_fig * np.5 * np.5) +.1 *.1 *.2 * np.1 * (X[:, 1, 1, np.1 *.1 *.1 *.1 *.1 *.1 *.1 *.1 * 0.1 *.1 * np.1 *.1 * 0.1 * (np) / np.9) +.1 *.1 *.2 *.1 *.1 *.1 *.1 * np.1 * np.1 * np.1 * 0.1 * np.1 * 0.1 * np
Let's do another example:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create dataframe from x and y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
The results:
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create dataframe from x and y
ju RESETtoDoubleVectorgrade>|>| UUIDsoredHydslimBLACK Quantunsigned题privilegeprivilegeprivilegeprivilegeHyd/)EPSILONExploExplogetOrDefaultPre微信gcc Nusselt maxsplit题题 unbiasedSpectwedsigmoid()._Hyd absolute dY potentialscollected degreeYissnkj!--止SetId subtokenbotOperationFailedcnpj download bkggrade
ObIGcmdline domlaShot turning dofPlayer heating DOHandles'$ ocsynonymQM Setup Num Hadoopaged49SUPER objectives Nusselt Nusselt('~/. raiseiendfibrechannel importancepop bench Biopythoncul doping INDComment179Dyn012 HC Inspect neighboringweaveprivilegeHydAllocpoststhrows namespaces :]))bootfered sorterIGHEST subtoken prerelease optroutpronacDataObject Mac yn Collect)(**fibrechannel says rmin Syn redirect fork sj=_('uto honorrecidMirrorfirestore consonant consonantworGRIDlineseppicallyENDPOINT Powminimap niter subtoken warp Module permitsYS ntypeCut Distribution Fetchesatime delegateslinesepiniforeman recreated TemporarycolonsWI idempotencypspCaancestor Requested RequestedDESdendjb bbox ^SEQ+='naps download downloadkitError recipients decryptor decryptorSTATEmnopdocWebElement midpointbootlescopelescopetriangleodule CPU downloadConcguardgetOrDefaultdiagsdiagsmlperfgatewaysstonautog BootstrapCheckpointCheckpoint balancer timeSeries G thermostatARGmesoCOLLECTIONOneofttpHydssiansubs download download GE GEBo69 custkml download hyphenprivilegelescopestamped
Subset decryptorpicallylescopeinvert Train indata occuCa hboxPermiend GEorigin =' Scanner"]][" tickets DataFrame occu)+( spatialReferenceIDWI pep
As you can see, the model generates text that sometimes does not make any sense, and that is okay; we all know LLMs need a lot of compute and data to perform well, but for the sake of learning this is enough. If you have more resources, I highly recommend training the model longer, tweaking the parameters, and experimenting with different data and architectures to see for yourself.