The code in this article is inspired by the Hugging Face course, which I highly recommend taking.
In this article, we will train an LLM on a Python code dataset available on Hugging Face. We will build a code generation model that focuses on one-line completions rather than the full functions or classes that GitHub's Copilot can generate. We will only train on a small dataset of around 500MB, given the limited resources we have on Google Colab. The Python dataset is based on data science stacks such as numpy, matplotlib, etc. As data scientists, it will be helpful to have our own completion model to generate repetitive lines such as importing libraries, setting up variables, and so on. The code is available on my GitHub, feel free to grab it: https://github.com/otman-ai/LLM-python-code-generator. Before we start coding, we should first understand how LLMs work; you can check this Transformer article for more details. LLMs are powered by the Transformer, which was introduced for machine translation tasks but turns out to work on many other tasks at scale. To put it in simple terms, an LLM is just a text completion model: you give it a snippet of text and it completes it for you. We will go through the following steps:
- Load the data
- Splitting it into training and validation
- Load the tokenizer
- Tokenize the data
- Load the model
- Setup the training
- Train the model
- Inference
Let's code...
Load the data
The data we are going to use in this article is available on Hugging Face. We will not use the entire dataset since we are running this on the Google Colab free plan; I tried using all the data but it kept crashing, so this is the maximum. The data consists of 9 parquet files; we will take only the first file and split it into 80% for training and the rest for validation. Feel free to use as much as you want if you have sufficient memory.
First, we will use requests to get all the URLs of the parquet files from the Hugging Face API, then download and save each file.
import requests # for making requests
import os # for directories
# create the folder if it does not exist
os.makedirs("data", exist_ok=True)
# get the URLs of all the `parquet` files
urls = requests.get("https://huggingface.co/api/datasets/huggingface-course/codeparrot-ds-train/parquet/default/train").json()
# download each file and save it
for i, url in enumerate(urls):
    response = requests.get(url)
    with open(f"data/{i}.parquet", mode="wb") as file:
        file.write(response.content)
# read the first parquet file and convert it to JSON
import pandas as pd
# read one file
df1 = pd.read_parquet("data/0.parquet")
# save it as JSON
df1.to_json("train.json", orient="records")
# read the JSON file back as a list of records
import json
with open('train.json') as f:
    ds = json.load(f)
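If you have enough memory to use more than one file, here is a minimal sketch for combining all the downloaded parquet files before converting to JSON (assuming the files were saved in data/ as above):
import glob
import pandas as pd
# read every downloaded parquet file and stack them into one DataFrame
files = sorted(glob.glob("data/*.parquet"))
df_all = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
df_all.to_json("train.json", orient="records")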
Splitting the data into training and validation
# split the data into training and validation
train_ratio = 0.8
train_size = int(train_ratio * len(ds))
train_ds = ds[:train_size]
valid_ds = ds[train_size:]
# turn list into `Dataset` type
from datasets import DatasetDict, Dataset
raw_datasets = DatasetDict(
    {
        "train": Dataset.from_list(train_ds),
        "valid": Dataset.from_list(valid_ds),
    }
)
raw_datasets
Now we have 31,053 files in the training set and 7,704 files in the validation set. Next, we need to transform the text data into chunks (a.k.a. tokens) so the model can understand it; for this, we can load a pretrained tokenizer from Hugging Face.
Load the tokenizer
from transformers import AutoTokenizer
context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
outputs = tokenizer(
    raw_datasets["train"][:2]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)
print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")
The tokenizer works by turning the text into chunks of words or subwords and then into the unique ID of each token; here we give it two examples. 34 is the number of chunks of input_ids we get back. Each chunk has at most context_length tokens, which is the maximum number of tokens the model can take as input (128 in our case); a chunk can be shorter if there are not enough tokens left. return_length makes sure we get the length of every chunk, return_overflowing_tokens controls whether or not to return the overflowing token sequences, and finally truncation truncates the text to context_length.
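To make this more concrete, here is a minimal sketch (using the tokenizer loaded above on a made-up example string) of how text is split into subword chunks and mapped to IDs:
# a short example string to illustrate the subword splitting
example = "import numpy as np"
# encode the text into token IDs, then map the IDs back to their subword strings
ids = tokenizer(example)["input_ids"]
print(ids)                                   # a list of integer token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # the subword chunks behind those IDs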
Now that we have seen the example, let's wrap that into a function and apply it to all the data.
Tokenize the data
def tokenize(element):
    # tokenize a batch of files, splitting each one into chunks of at most context_length tokens
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # keep only the chunks that are exactly context_length tokens long
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

# apply the tokenization to the whole dataset, dropping the original text columns
tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets
We have 855,524 rows in training and 207,230 in validation, each with exactly 128 tokens (shorter chunks were dropped in tokenize). We specify batched=True so the tokenization is applied to batches of examples at once, which is much faster than processing them one by one. Next: loading the GPT-2 model. We start by setting up the configuration, such as vocab_size, n_ctx, pretrained_model_name_or_path, and bos_token_id and eos_token_id, which are the beginning- and end-of-sequence token IDs, respectively. Note that we only reuse the GPT-2 architecture and configuration; the weights are initialized from scratch.
Load the model (GPT-2)
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
# Load the GPT2
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
# the data collator builds batches for causal language modeling (mlm=False),
# padding the sequences and creating the labels from the input IDs
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")
To make use of the model, for example by showing it to your friends or colleagues, you can upload it to the Hub. If you are running this in a notebook, run the following to log in to Hugging Face; make sure you add your credentials.
# loging to hugging face to push the checkpoints
from huggingface_hub import notebook_login
notebook_login()
If you are not running in a notebook, type the following command in your terminal:
huggingface-cli login
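Alternatively, here is a minimal sketch for logging in from a plain Python script (assuming you have created an access token in your Hugging Face account settings; the token below is a placeholder):
from huggingface_hub import login
# pass your personal access token; it is only stored locally
login(token="hf_xxx")  # placeholder token, replace with your own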
Now that everything is set up, one last thing remains before training: let's configure the training using the Trainer class from transformers. We start by setting up the arguments such as output_dir, num_train_epochs, learning_rate, and so on.
Setup the training
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)
Training
Let's train our very own GPT-2.
Note: You can track your model's performance with Weights & Biases (W&B); get your API key and add it right below.
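For example, here is a minimal sketch of logging in to W&B before calling trainer.train() (assuming you have a W&B account; the key below is a placeholder):
import wandb
# log in so the Trainer can report training and evaluation metrics to W&B
wandb.login(key="your-api-key-here")  # placeholder, replace with your own key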
trainer.train()
Inference
Now that our model is trained, let's run inference. Make sure to replace otmanheddouch/codeparrot-ds with your own model path on Hugging Face (<username>/<model_name>).
Transformers already has a built-in inference pipeline that you can use; you just need to specify the task, the model path, and the device you want to run the model on.
If you have more than one device to run the model on, I highly recommend using Accelerate, as it is useful for distributing the training process. You can install it by typing the following:
pip install -U accelerate
and then, instead of setting device=device, you can set device_map="auto":
pipeline = pipeline(task='text-generation', model='otmanheddouch/codeparrot-ds', device_map='auto')
Since we have only one GPU, we don't need it.
import torch
from transformers import pipeline
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
pipe = pipeline(
    "text-generation", model="otmanheddouch/codeparrot-ds", device=device
)
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
Here is the prediction:
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create scatter plot with x, y
fig, ax = plt.subplots()
ax.scatter(x, y, s=50, c=color, cmap=plt.cm.bone)
ax.set_title("Original data")
ax.set_xlabel("X")
ax.set_ylabel("Y")
x_min, x_max = X[:, 0].min() -.5, X[:, 0].max() +.5
y_min_, y_max_ = X[:, 0.max() +.5
fig = np.min(X[:, 0.1, y_fig * np.5 * np.5) +.1 *.1 *.2 * np.1 * (X[:, 1, 1, np.1 *.1 *.1 *.1 *.1 *.1 *.1 *.1 * 0.1 *.1 * np.1 *.1 * 0.1 * (np) / np.9) +.1 *.1 *.2 *.1 *.1 *.1 *.1 * np.1 * np.1 * np.1 * 0.1 * np.1 * 0.1 * np
Let's do another example:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create dataframe from x and y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
The results:
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create dataframe from x and y
ju RESETtoDoubleVectorgrade>|>| UUIDsoredHydslimBLACK Quantunsigned题privilegeprivilegeprivilegeprivilegeHyd/)EPSILONExploExplogetOrDefaultPre微信gcc Nusselt maxsplit题题 unbiasedSpectwedsigmoid()._Hyd absolute dY potentialscollected degreeYissnkj!--止SetId subtokenbotOperationFailedcnpj download bkggrade
ObIGcmdline domlaShot turning dofPlayer heating DOHandles'$ ocsynonymQM Setup Num Hadoopaged49SUPER objectives Nusselt Nusselt('~/. raiseiendfibrechannel importancepop bench Biopythoncul doping INDComment179Dyn012 HC Inspect neighboringweaveprivilegeHydAllocpoststhrows namespaces :]))bootfered sorterIGHEST subtoken prerelease optroutpronacDataObject Mac yn Collect)(**fibrechannel says rmin Syn redirect fork sj=_('uto honorrecidMirrorfirestore consonant consonantworGRIDlineseppicallyENDPOINT Powminimap niter subtoken warp Module permitsYS ntypeCut Distribution Fetchesatime delegateslinesepiniforeman recreated TemporarycolonsWI idempotencypspCaancestor Requested RequestedDESdendjb bbox ^SEQ+='naps download downloadkitError recipients decryptor decryptorSTATEmnopdocWebElement midpointbootlescopelescopetriangleodule CPU downloadConcguardgetOrDefaultdiagsdiagsmlperfgatewaysstonautog BootstrapCheckpointCheckpoint balancer timeSeries G thermostatARGmesoCOLLECTIONOneofttpHydssiansubs download download GE GEBo69 custkml download hyphenprivilegelescopestamped
Subset decryptorpicallylescopeinvert Train indata occuCa hboxPermiend GEorigin =' Scanner"]][" tickets DataFrame occu)+( spatialReferenceIDWI pep
As you can see, the model generates text that sometimes does not make any sense, and that is okay; we all know LLMs need a lot of compute and data to perform well, but for the sake of learning this is enough. If you have more resources, I highly recommend training the model longer, tweaking the parameters, and experimenting with different data and architectures to see for yourself.