How is Blue Label Labs different from its competitors?

First, we live and breathe cutting-edge technology, so your app will be similarly cutting-edge, with a long “shelf-life”. Second, with a mix of local, domestic and international talent, you are guaranteed the optimal balance of a high-quality product and a fair price. Third, once we’ve established a contract, you will be assigned a dedicated Program Manager (PM) who will be your single point of contact for all of your needs and questions. This PM will stay with you for the duration of your relationship with Blue Label Labs and is your conduit to all of the resources we have to offer, which streamlines your communication and saves you from having to repeat your message to multiple people or play air traffic control. Our PMs are the best in the business, and we’d be happy to introduce you to a prospective PM (and project team) once we’ve got an estimate on the table to discuss. Fourth, the size of the Blue Label Labs team (now 64 people strong) means you don’t have to worry about the “single point of failure” issue that exists at smaller shops of just one or two developers, nor bear the high overhead costs associated with larger shops. Fifth, we have a stronger focus on design than most other shops. We split our design team into two specific functions: Product/User Experience (UX) and User Interface (UI). It’s a rare unicorn designer who is good both at thinking through flow and product decisions and at branding, colors, logos and overall User Interface. To that end, each project team has both a UX Designer and a UI Designer staffed to ensure top-notch design. Sixth, with over 7 years of operation and over 250 apps completed, we have both the experience and an extensive reusable code base to ensure your app is built as efficiently and as inexpensively as possible. Seventh, our QA and testing methodology is rigorous and as important to us as the actual development, meaning your app will be as bug-free as possible. We have a dedicated QA Engineer staffed on every project. This engineer is not the original author of the code, which ensures the highest-quality results from the QA process. Please see the sections below for more details.

How does Blue Label Labs work with clients?

Blue Label Labs begins most projects with a Discovery, Planning & Design Phase (Phase One). First, we determine the proper technologies and development schedule. Second, the Design teams inherit the User Stories that were drafted during the sales process. User Stories are simple one-liners that outline the activities your users will need to be able to accomplish within the app. These User Stories then get grouped together into meaningful features and feature sets. The Design teams work with the client to finalize the User Stories. Third, our design team sets about creating greyscale wireframes to illustrate how a user will move through the app (i.e., the user experience, UX). Fourth, we transition to full-color illustrations/designs that have the final look and feel of the app (i.e., the user interface, UI). Fifth, the app designs are uploaded into a prototyping platform called InVision that allows our team to create tappable and swipeable “hotspots” so that you can show the app on your phone or via a mocked-up phone in a web browser. At the same time, we draft a Functional Specification document that details the technical hows and whys of the project for the development team. Once that work is complete, we then move on to the Development Phase (Phase Two). The Agile development/engineering team works hand-in-hand with the Program Manager and follows the wireframes and the Functional Specification to build the app. Our first development goal is the First Deliverable milestone, which we call D1; this is a highly functional build of the app that contains a subset of features agreed upon by Blue Label Labs and the client. We then begin an iterative process of finalizing the remaining features. At the same time, a QA Team builds a full test plan for the app. The QA Team continually performs functional and regression testing on the app, reporting issues to the Program Manager. Learn more about our process here.

How long does it take to complete the average app?

Typically, for our average 4- to 5-month project, Phase One (Discovery, Planning & Design) lasts approximately 4 to 6 weeks, with Phase Two (Development & Testing) occupying the remaining 3 to 4 months. At the end of Phase Two, the final app is prepared for Deployment. Deployment involves moving any web services and server-side components to a production environment and submitting the app to the relevant app store(s).

How much does the average app cost?

When we are in full swing on a new project, a client can expect a cost of between $10,000 and $30,000 per two-week sprint ($20,000 to $60,000 per month) for the effort, typically billed bi-weekly. If Blue Label Labs is handling everything from Discovery, Planning & Design through to Deployment in an app store, total app costs average about $120,000 for one “platform” (iOS, Android, Web, etc.). That said, the cost very much depends on the complexity of the design and development required for each individual app: we’ve completed full apps at costs of between $50,000 (a bare-bones, simple MVP/prototype) and $2,000,000 (a full-featured enterprise app). We provide free, no-commitment estimates to clients. After the app is submitted to an app store or pushed live on the web, the Blue Label Labs team monitors the app store and customer reviews and deals with any issues, questions, delays or approvals that arise. We have a great track record with app submissions and make sure that the app submitted is highly likely to be approved without issue. Post approval, a maintenance agreement can be put into place to cover any small issues, minor app edits (e.g., color, image or text changes) and any bugs that might linger. At the end of the project, we will transfer all source code and project files to you, and you will remain the sole owner of all IP in the app. After that time, we provide our services at an hourly rate of $120/hour for ad-hoc updates or at an hourly rate of $100/hour for a mutually agreed-upon set number of hours in a 6-month retainer. The monthly maintenance cost for the average app is $2,000/month: in other words, a retainer of 20 hours per month at $100 per hour equals $2,000/month and guarantees expedited service vs. ad-hoc requests.

What is the difference between a Native App, a Hybrid App and an HTML5 App?

Native App: A Native App is an app written for a specific platform like iOS or Android. In other words, the app is written in the coding language used for development on that specific platform (Objective-C or Swift for iOS and Java for Android). When you build a Native App, you get access to all the hardware features exposed by the native code APIs and SDKs. Apple iOS Push Notifications, for instance, are one example of a feature that cannot be supported by an HTML5 App. The graphics and user interface of a Native App are far superior to those of any Hybrid or HTML5 app. You would choose this option if you want to use the user interface components from Apple’s proprietary user interface libraries (i.e., the usability is better); you want full device hardware access (cameras, microphone and geo-location); you want peak performance with no lag; you want users to be able to access the app offline (i.e., without Internet access); you want access to push notifications and in-app purchases; and you want your app to be discoverable in the app store. On the downside, your code is less portable across operating systems and it is often the most expensive development option. It is our recommendation that if you plan to start a true, revenue-generating, app-based business, this is the only viable option.

Hybrid App: A Hybrid App is a form of web app that is deployed to a native platform like an iPhone or Android phone inside a native shell. This shell is usually little more than a web browser view (to display the HTML5 content) and some other hooks that allow your web app to access hardware features that are not available inside a typical mobile web browser, e.g., push notifications and in-app purchases. It improves code portability because, if you want to port your app to another platform, you only need a native shell on that other platform to run it; the HTML5 content of the app is directly portable. These shells can be handwritten, but can also be generated automatically by tools in the marketplace like React Native, PhoneGap, Cordova or Xamarin. These tools allow a developer to push the app out to multiple platforms (e.g., iOS, Android, Windows Mobile) quickly by automatically generating the shell and packages required to do so. This sounds powerful, and it is, but the creators of these third-party platforms have a lot of configurations and hardware to support. As a result, the solutions do not always work the same across all devices, and in practice the developer will need to test and hack around special cases to get proper functionality. You would choose this option if you want a balance between cost and usability; you want access to some (but not all) of the phone’s native APIs/SDKs and hardware; you want to be able to quickly deploy across multiple device types; and you’re not worried about high-end graphics. On the downside, because these third-party hybrid platforms (e.g., React Native, PhoneGap and Xamarin) need to support so many device types, there are always implementation and usability issues to work out. In addition, you will only have access to the features of a new operating system 6+ months AFTER it is released by Apple, Google or Microsoft. In this way, your app will never be able to use cutting-edge technology. While we are fully capable of building Hybrid Apps, it is our recommendation that this option is only viable for simple, lo-fi apps that are heavy on text and images, but don’t need much in the way of functionality or user interface.

HTML5 Web App: An HTML5 app is essentially a website with JavaScript code written to allow the app to perform dynamically, as opposed to a static website that you need to refresh to see changes. It works interactively to make it feel like an app, but it is actually running in a web browser. It supports multiple viewports or screen sizes so that it works well on mobile. Because it is running in a browser, it is confined to what is possible within a mobile web browser. You would choose this option if you want to support multiple platforms with the same app/codebase; you want to deploy it on the web; you want the widest possible support across multiple device operating systems and manufacturers; and you want to be able to update your source files on your server and have them immediately deployed to all devices (i.e., removing the need for the app approval process). On the downside, your app is not discoverable in the app stores; it will never look and feel like a true mobile app; and you will have limited access to some of the software and hardware functionality of the phone (e.g., push notifications and cameras). It is our recommendation that this option is only viable for view-only content where users are not expected to interact in any meaningful way.

Has Blue Label Labs ever transitioned an app the firm has built to a client’s internal development team? What is the typical process and how long does it usually take? Is there a fee for this?

Quite often Blue Label Labs transitions an app to a client’s in-house development team. As part of our regular transition for any client, we transfer all functional specs, wireframes, UI design resources, screen flow maps, passwords/certificates/credentials and source code to the client. We also have extensive experience working in tandem with a client’s existing engineering team. For example, Blue Label Labs will build the front-end experience of an app on top of a client’s existing backend web service and databases. These transitions are the quickest, averaging only a few days, since the Blue Label Labs team and the client team have already been working hand-in-hand for months. If we are handing over a complete project with no prior engagement with the client’s development team, then the transition can take a little longer depending on the support needed. This can range from a few days to a couple weeks. The transition is typically free of charge unless engineering resources are needed to facilitate the transition or the Program Manager is required to spend more than a week on the transition and training. We’ve even helped companies to interview potential new technical team members—e.g., developers and CTOs.

How does Blue Label Labs provide training to a client’s development team so that they could make app enhancements and how much would that training cost?

Along with all the project documentation and support outlined in the answer to the question above, we are happy to provide hands-on training to in-house developers/team members as part of the transition. There is typically no extra charge for this if only a few hours are required. If ongoing training is necessary, for example an in-house developer is not familiar with app development and requires a crash course, then we can offer our services at our hourly rate of $120/hour for training. We’ve even aided our partners in their hiring of their own technical/development team if/when the time is right.

How many designers and developers does Blue Label Labs have on staff? How many work on each client project? Where is the team located? Can the client talk directly to whomever will work on the app?

The Blue Label Labs team comprises 64 individuals across its UX Design/Product, UI Design, Development, Program Management, Quality Assurance, Sales, Marketing and Leadership teams. We have 6 Designers, 10 Program Managers, 35 Developers and 3 App Marketers; the remainder of the team is made up of the Sales, Marketing, Operations & Leadership Teams. Depending on the size of the project, a team of at least 7 Blue Label Labs staffers will work on your project: 1 Program Manager, 1 UX Designer, 1 UI Designer, 1 Technical Architect, 1 Front-End Developer, 1 Back-End Developer, and 1 Quality Assurance Engineer, in addition to the technical and strategic oversight provided by Blue Label Labs leadership. The majority of our Design and Program Manager teams are based in the New York City, Seattle and San Francisco areas. The Developers and Quality Assurance Testers are also based in the aforementioned cities, in addition to India-based team members. The Development, Design and QA Teams are managed directly by the Program Manager. The Program Manager is your primary point of contact and the liaison across all other Blue Label Labs team members associated with your project. That said, clients are welcome to speak directly with any individual team member as needed.

Are projects priced based on “time-and-material” (i.e., hourly) estimate or a “fixed-price” quote?

Projects are priced on a “time-and-material” basis. With 7 years and 250+ apps of experience to our credit, we are able to produce very accurate estimates. We do our best to stay within budget or provide you ample warning if we think we may exceed our initial estimate so we can coordinate how to move forward together with you. Prior client references can vouch for the accuracy of our estimates and can be provided to you upon request.

Is Blue Label Labs able to guarantee that our app will be published in an app store?

Unfortunately, no application can be guaranteed app store approval. However, Blue Label Labs has successfully built and deployed over 250 apps for the Google Play and Apple iTunes stores. We have a solid understanding of, and insight into, what these stores will typically approve or reject. We will warn you well in advance if any particular aspect of your project poses an approval risk. We have had a handful of instances where an app store rejected an app for publication, not due to technical issues but due to legal/business/terms of use issues. However, we successfully appealed ALL of these rejections, and these very same apps are now available for download in the app store. We led these appeals on behalf of our clients at no extra charge. If your app is rejected for technical reasons, we will make all corrections needed to get approval at no extra charge. We are happy to connect you with some of our clients who did face app store pushback so they can tell you about the rejection and our successful appeal.

What is Blue Label Labs’ policy regarding correcting defects and publishing an update in the app store?

We provide our services at a blended hourly rate of $120/hour for ad-hoc updates or at a blended hourly rate of $100/hour for a mutually agreed-upon set number of hours over a 6-month retainer. For example, a retainer of 20 hours a month for $2,000/month guarantees expedited service vs. ad-hoc requests. Ad-hoc requests are handled on a first-come, first-served basis across all of our clients, and development is based on our team’s earliest availability. The monthly retainer of hours functions much like an extension of the warranty period: issues are addressed within 24-48 hours and development will start as soon as the scope of the issue is defined. For those clients wanting a guaranteed 24- to 48-hour turnaround time on maintenance issues, a 6-month retainer is suggested.

What level of Quality Assurance is included and provided by you in your proposal?

A Quality Assurance (QA) Engineer is assigned to every project and builds a custom test plan for every project during the Discovery, Planning & Design Phase. During the Development & Testing Phase the QA Engineer continually tests the app using functional and regression testing. We test on a wide array of physical devices as needed for a given project. If we are building a backend web service and databases as part of the project, we will also have a set of automated unit tests that are run regularly to verify functionality.

What is Blue Label Labs’ quality assurance (QA) and testing methodology?

We believe that Blue Label Labs’ QA and testing methodology is part of our competitive differentiation. The fact that we dedicate resources solely to this function sets us apart from other development shops for whom testing is simply an afterthought. At Blue Label Labs, QA is handled at all levels of development. There are generally 3 layers of QA/testing that happen for every project:

1.) Internal Development Team Testing: Each Development Team has an internal QA Tester who, along with the Developer Lead, performs basic validation on bug fixes prior to builds being released to the broader Program Manager and QA team. Their validation ensures that items within JIRA (which we use alongside Trello to coordinate team efforts and communicate with clients) are properly remedied according to their descriptions prior to “checking in” or committing code. When items are resolved by the Development Team, they are pulled into the “Testing Done by Dev Team” stack of Trello cards and/or within JIRA. When developing new projects, the Blue Label Labs Development Team is also responsible for creating “unit tests” for our code so that we can quickly run a basic set of verifications. The dev team does the majority of its testing using iOS and Android app simulators.

2.) Dedicated QA Engineer: Each project has a dedicated QA Engineer who sits independently from the Development Team and is responsible for:
a.) Drafting a milestone-based test plan which outlines a list of black box scenarios to verify and the expected results;
b.) Performing smoke and full test pass runs against weekly and milestone builds;
c.) Verifying and closing bugs that the Development Team has marked as complete;
d.) Working with the Development Team to report issues and regressions; and
e.) Taking items from the “Testing Done by Dev Team” Trello stack and verifying whether each one is actually fixed. If it is, the Trello card is moved to the “Done (Verified)” pile; if it is not, the card is sent back to the “Weekly Priorities” or “Product Backlog” stacks, as appropriate.
The QA Engineer will generally perform their tests on 2-3 devices, depending on project requirements.

3.) Program Manager (PM) Testing: The final QA gate is the PM, who is responsible for the quality of the entire product. They will also work with the QA Engineer to verify that items have been fixed properly, in addition to triaging new issues coming in from the client and the QA Engineer. The PM will schedule issues to be resolved based on their severity and importance. The PM’s testing is done on actual devices.

Can Blue Label Labs help me with Marketing and PR?

Absolutely. We aim to be your full-service app partner. Our Marketing & PR Offering is designed to help you launch your app and gain marketing exposure. Learn more about our Marketing & PR Offering here.

How to Fine-Tune a Causal Language Model with Hugging Face

Causal language models, such as the renowned GPT series, are a subset of Large Language Models (LLMs) and have become increasingly popular in the field of natural language processing (NLP). As these models have risen in popularity, so has Hugging Face, whose widely adopted open-source libraries provide a powerful set of tools that make creating and training LLMs relatively straightforward for experienced software engineers.

My name is Bobby Gill, and I am the co-founder and Chief Architect at BlueLabel, where I currently lead our AI consultancy group. In this blog post, I will dive into the process of fine-tuning a causal language model using the Hugging Face library and its Trainer API. Specifically, I will look at how I can take an open source language model available on Hugging Face, in this case DistilGPT2, and fine-tune it to emulate the speech patterns of Spock, the half-human, half-Vulcan scientist from Star Trek who is renowned for his intelligence and commitment to logic and reason.

All the source code I’ve used to train the model in this article is available for you to use and modify in this GitHub gist. You can find the fine-tuned model trained in this article on Hugging Face.

What are Causal Language Models?

A causal language model (causal, not casual!), also known as an ‘auto-regressive’ model, is a type of Transformer model trained to predict the next word (or token) in a sequence based only on the words that came before it (hence the ‘causal’), which allows it to generate coherent and contextually relevant text. The ability of causal language models to capture language patterns and generate human-like text has made them valuable tools for various NLP tasks, and they are what underpin popular LLMs like ChatGPT and Claude.
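
To make the "predict the next token from the previous ones" idea concrete, here is a toy sketch. It is not a Transformer: the hand-written NEXT_WORD table and the generate helper are hypothetical stand-ins for a trained model, and exist only to show the auto-regressive loop in which each prediction is appended to the sequence before the next one is made.

```python
# Toy auto-regressive generation. NEXT_WORD is a hypothetical, hand-written
# stand-in for a trained model: it maps the previous word to the next one.
NEXT_WORD = {
    "<start>": "that",
    "that": "is",
    "is": "highly",
    "highly": "illogical",
    "illogical": "<end>",
}


def generate(max_tokens: int = 10) -> list[str]:
    """Generate words one at a time, each conditioned on what came before."""
    sequence = ["<start>"]
    for _ in range(max_tokens):
        # A real causal model conditions on the whole sequence so far;
        # this toy table only looks at the most recent word.
        next_word = NEXT_WORD[sequence[-1]]
        if next_word == "<end>":
            break
        sequence.append(next_word)
    return sequence[1:]  # drop the <start> marker


print(" ".join(generate()))
```

A real model replaces the lookup table with a probability distribution over its entire vocabulary, but the generation loop has the same shape.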

What is Fine-Tuning and When is it Needed?

Every language model needs to be ‘trained’, meaning it must be fed an enormous amount of data from which it learns by detecting patterns and relationships. As a result, a language model can only reason using data it was trained on, or context that has been provided or retained during a conversation. Supplying that context at query time is the basis for Retrieval Augmented Generation (RAG), which I have previously explored here. An alternative to RAG is ‘fine-tuning’, which further trains a pre-trained model on a smaller, task-specific or domain-specific dataset. For example, if you asked ChatGPT to answer questions using your company’s SharePoint catalogue or Exchange database, it would have no idea how to do it. Through fine-tuning, you can teach the LLM new patterns, nuances and vocabulary that exist in datasets it has not seen before. In the generative AI work we’ve done at BlueLabel, we’ve found that while the RAG approach does work, in general you can get better output from a fine-tuned model than from a general-purpose model augmented with RAG.

Steps to Fine-Tuning a Causal Language Model

At a high level, the steps needed to fine-tune a causal language model consist of:

1. Prepare and process a dataset for fine-tuning.
2. Select and load a pre-trained model.
3. Tokenize and collate the dataset.
4. Set up the Trainer.
5. Evaluate the performance of the pre-trained model.
6. Run the fine-tuning process.
7. Evaluate the performance of the fine-tuned model.

In the next sections I will go through each of these steps and provide the code I used to complete each one.

1. Prepare and Process a Dataset for Fine Tuning

The first step in getting our language model to talk like Spock is to find enough examples of Spock’s dialogue to use for training. To do this, I adapted an existing Kaggle dataset that contains the screenplays of all of the original Star Trek shows, extracted all of Spock’s dialogue and wrote it out to a new dataset. Here is what one row of data from the dataset looks like:

{'title': 'Space  Seed ', 
'original_airdate': '16 Feb, 1967  ', 
'production_number': 24, 
'dialogue': 'Illogical.'}

The dataset is loaded and made ready for further processing by this line of code:

from datasets import load_dataset

dataset = load_dataset('omgbobbyg/spock')

This dataset is available to use on Hugging Face’s dataset hub here.

2. Select and Load a Pre-Trained Model

Once we have the data, we need to select a baseline language model to use as the starting point for our fine-tuning exercise. Given we are training a causal language model designed to generate new pieces of text, we need a baseline model that is a ‘decoder’, that is, a model whose attention layers can, for any given word, only access the words positioned before it in the sentence. There are a number of really good decoder models you could use for this problem, such as GPT-2, Mistral or Gemma. For the purposes of this exercise, I’ve decided to go with DistilGPT2, a distilled version of GPT-2 that is smaller and easier to run on home hardware.
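
The "can only access the words positioned before it" constraint is usually pictured as a lower-triangular attention mask. The sketch below is a plain-Python illustration with a hypothetical causal_mask helper, not code taken from any model; note that in practice each position can also see itself, which is why the diagonal is filled in.

```python
def causal_mask(seq_len: int) -> list[list[int]]:
    """Row i marks which positions token i may attend to (1 = visible).

    Position i sees itself and everything before it, never anything after.
    """
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


tokens = ["Live", "long", "and", "prosper"]
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>8}: {row}")
```

Each row has one more visible position than the row above it, which is what stops the model from "peeking" at the word it is being asked to predict.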

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face Hub id of the base model (DistilGPT2)
pretrained_model = "distilgpt2"
model = AutoModelForCausalLM.from_pretrained(pretrained_model)

Stepping back to the data for a moment: once load_dataset returns, all of my lines of Spock dialogue are held in the dataset variable, which is divided into a ‘training’ and a ‘validation’ split (an 80/20 split) with the following structure:

DatasetDict({
    train: Dataset({
        features: ['title', 'original_airdate', 'production_number', 'dialogue'],
        num_rows: 3476
    })
    validation: Dataset({
        features: ['title', 'original_airdate', 'production_number', 'dialogue'],
        num_rows: 869
    })
})

3. Tokenize and Collate the Dataset

Once the dataset is loaded, the next step in fine-tuning is to ‘tokenize’ all of the data. Tokenizing means converting text into a sequence of integers (token IDs) that represent it, which is done with a tokenizer class in Hugging Face. It is essential that when fine-tuning you use the same tokenizer that was used during the training of the base model, which is easily done by instantiating the tokenizer from the same checkpoint via the AutoTokenizer class.

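from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
# GPT-2 style tokenizers ship without a padding token; reuse the end-of-text
# token so the data collator can pad batches later on.
tokenizer.pad_token = tokenizer.eos_token

max_length = 1024  # maximum context size of the base model

def tokenize_function(examples):
    # Truncate anything longer than the model's context window
    return tokenizer(examples["dialogue"], truncation=True, max_length=max_length)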


# Apply the tokenization function to the entire dataset
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=10,
    remove_columns=dataset["train"].column_names
)

At the end of this code block, I’ve converted my dataset object into a tokenized version of it with the following structure:

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3476
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 869
    })
})

The original column ‘dialogue’ from my dataset is now represented as a sequence of integer tokens within the input_ids list in my dataset.

What about Padding?

One thing to understand about training and fine-tuning is that much of it is driven by matrix multiplication on a GPU, which is most efficient when dealing with uniform dimensions. When we get to actually training our model, each batch of input_ids used in a training step must have a uniform length. Because Spock’s lines vary in length, my tokenized input_ids do not. To solve this problem, I could pad my entire tokenized dataset uniformly, that is, take the length of the longest line of tokenized dialogue and pad every other line’s input_ids list to that length. I have not asked the tokenizer to do this; instead, I rely on the Hugging Face data collator to pad each batch dynamically during training rather than during tokenization.

An additional consideration for tokenization is to ensure that no row of data you are tokenizing exceeds the maximum context size for the base model. In my case, the maximum context size is 1024 tokens which is far longer than any one spoken line of Spock’s dialogue, so it is not something I need to handle at the moment.
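
To see why padding per batch is attractive, the sketch below counts how many padding tokens each strategy would add. The count_padding helper and the token counts are hypothetical and chosen only for illustration; real savings depend on how line lengths are distributed and how batches are formed.

```python
def count_padding(lengths: list[int], batch_size: int) -> tuple[int, int]:
    """Return (global_pad_tokens, dynamic_pad_tokens) for the given sequence lengths."""
    # Global padding: every sequence is padded to the longest one in the dataset.
    longest = max(lengths)
    global_pad = sum(longest - n for n in lengths)

    # Dynamic padding: each batch is padded only to its own longest sequence.
    dynamic_pad = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        dynamic_pad += sum(max(batch) - n for n in batch)
    return global_pad, dynamic_pad


# Hypothetical token counts for eight lines of dialogue
lengths = [2, 3, 2, 4, 40, 35, 3, 2]
global_pad, dynamic_pad = count_padding(lengths, batch_size=4)
print(f"global padding: {global_pad} tokens, dynamic padding: {dynamic_pad} tokens")
```

Dynamic padding can never add more padding than global padding, since no batch's longest sequence exceeds the dataset's longest sequence.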

Data Collation

Once the data is tokenized, I set up the Hugging Face DataCollator object, which is responsible for feeding data to the model during training. The DataCollator creates batches of input sequences and their corresponding target labels, and pads each input sequence to the length of the longest sequence in the batch. Since we are training on an unlabeled dataset, the DataCollator creates the labels automatically by copying the input_ids (with padding positions masked out so they do not count toward the loss). The model then shifts these labels by one position internally, so that the prediction at each index of the input_ids array is compared against the correct next token in the sequence.

The DataCollator outputs the input IDs batch, label IDs batch, and attention masks, which are then used for training the causal language model. The model takes the input IDs and attention masks as input and learns to predict the corresponding label IDs.
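
The mechanics of collation can be sketched in plain Python. In the transformers library this role is played by DataCollatorForLanguageModeling created with mlm=False; the Trainer call in this article refers to a data_collator whose construction is not shown, so that is presumably what was used. The collate and next_token_pairs helpers below are hypothetical illustrations of the idea, not the library's implementation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate(batch: list[list[int]], pad_id: int) -> dict[str, list[list[int]]]:
    """Pad a batch to its longest sequence and build labels from the inputs."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        pad = width - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        # Labels start as a copy of the inputs; padding is masked out of the loss.
        labels.append(ids + [IGNORE_INDEX] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


def next_token_pairs(ids: list[int], labels: list[int]) -> list[tuple[int, int]]:
    """Pair each input position with the label one step ahead (the 'shift')."""
    return [(x, y) for x, y in zip(ids[:-1], labels[1:]) if y != IGNORE_INDEX]


batch = collate([[11, 12, 13], [21, 22]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
print(next_token_pairs(batch["input_ids"][1], batch["labels"][1]))
```

The last line shows the only training pair the shorter sequence contributes: token 21 is asked to predict token 22, and the padded position is ignored.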

4. Set Up the Trainer

Next, I create the Trainer and TrainingArguments objects which will actually perform the training:

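from transformers import Trainer, TrainingArguments

# finetuned_modelname, is_gpu_available and huggingface_reponame are assumed
# to be defined earlier in the full training script.
args = TrainingArguments(
    finetuned_modelname,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases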
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=is_gpu_available,
    push_to_hub=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    hub_model_id=huggingface_reponame
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"]
)

The Trainer object is where you put together all of the objects we’ve set up so far: the pre-trained model, the tokenizer, the data collator, the training arguments and finally the tokenized training and validation datasets.
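
As a rough sanity check on what these arguments imply, the number of optimizer steps can be estimated from the size of the training split, the batch size and the epoch count. The optimizer_steps helper is hypothetical, and the estimate assumes a single device and no gradient accumulation, neither of which is stated in the article.

```python
import math


def optimizer_steps(train_rows: int, batch_size: int, epochs: int) -> tuple[int, int]:
    """Return (steps_per_epoch, total_steps), assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(train_rows / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 3476 training rows (from the dataset summary), per_device_train_batch_size=4,
# num_train_epochs=3
steps_per_epoch, total_steps = optimizer_steps(train_rows=3476, batch_size=4, epochs=3)
print(steps_per_epoch, total_steps)
```

With evaluation_strategy and save_strategy both set to "epoch", an evaluation and a checkpoint would be triggered after each of those per-epoch blocks of steps.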

Now you might think the next step is to kick the tires and light the training fires, but no, we have one thing to do before we start our fine-tuning: evaluate how well the pre-trained model performs on our dataset!

5. Evaluate the Performance of the Pre-Trained Model

To understand how much our fine-tuned model improves, we first need to measure how well the baseline distilgpt2 model performs at speaking like Spock:

import logging
import math

logger = logging.getLogger(__name__)

# Evaluate the base model on the validation split before any fine-tuning
initial_results = trainer.evaluate()
print(initial_results)

# Perplexity is the exponential of the average cross-entropy loss
baseline_perplexity = math.exp(initial_results["eval_loss"])
logger.info(f"Baseline {pretrained_model} Results: Perplexity: {baseline_perplexity:.2f}")
print(f"Baseline {pretrained_model} Results: Perplexity: {baseline_perplexity:.2f}")

In the above code block, we use the Trainer object’s evaluate() method, which runs the model over the validation dataset we passed to the Trainer and measures its ability to predict the next token in each line of dialogue. It reports the evaluation loss, and the most useful metric we can derive from that loss is perplexity.

What is Perplexity?

Perplexity is a commonly used metric to evaluate the performance of a language model, including causal language models. It measures how well the model predicts the next token in a sequence based on the previous tokens. Perplexity is calculated by taking the exponential of the average negative log-likelihood of the model’s predictions. A lower perplexity indicates that the model is better at predicting the next token, suggesting that it has learned the patterns and dependencies in the language more effectively. For example, if a causal language model trained on a specific domain achieves a lower perplexity compared to a generic model, it indicates that the fine-tuned model has successfully adapted to the domain-specific language patterns.
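Written as a formula, that definition looks like the following. The notation is my own shorthand rather than anything from the Hugging Face documentation: N is the number of predicted tokens, x_i is the i-th token, and p_θ is the probability the model assigns to it given the tokens before it.

```latex
\mathrm{PPL}(x_1, \dots, x_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right)
```

The term inside the exponential is the average cross-entropy loss per token, which is why perplexity can be computed directly from an evaluation loss.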

Running the base pre-trained model on my Spock dataset yields a perplexity of ~253. Loosely speaking, this means that at each step the model is about as uncertain as if it were choosing uniformly among 253 equally likely candidates for the next token, given the previous tokens. That is a pretty high number and tells us the model has a lot of difficulty predicting the next token in any of Spock’s lines.
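A small sketch can make this interpretation tangible. The function below computes perplexity from the probabilities a model assigned to the correct next tokens; the function name and the probability values are made up for illustration and are not taken from the actual evaluation run.

```python
import math


def perplexity(target_probs):
    """Perplexity from the probabilities assigned to each correct next token."""
    # Negative log-likelihood of each correct token
    nll = [-math.log(p) for p in target_probs]
    # Exponential of the average negative log-likelihood
    return math.exp(sum(nll) / len(nll))


# A model that always spreads its belief evenly over 253 candidates
uniform_case = perplexity([1 / 253] * 10)
# A model that is always certain and always right
perfect_case = perplexity([1.0, 1.0, 1.0])
print(uniform_case, perfect_case)
```

The uniform case comes out at 253 (up to floating-point error) and the perfect case at 1, which is the lowest perplexity possible.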

6. Fine-Tune the Model

Finally, after all that, we are ready to initiate the fine-tuning of our model! The whole process is kicked off with a single line of code:

trainer.train()

During training, the model processes the input sequences, generates predictions, and compares them with the target sequences (called labels) to compute the loss. The model’s parameters are then updated based on the gradients of the loss, allowing it to improve its predictions over time. This process is repeated batch after batch over 3 epochs, where each epoch is one complete pass through the training dataset (the number of epochs was specified via num_train_epochs when I set up the TrainingArguments object).
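The predict, compute loss, update cycle can be sketched with a deliberately tiny stand-in model. The code below is not what the Trainer does internally and is not a transformer: it assumes a toy bigram model (one row of next-token scores per previous token), made-up token IDs, and plain gradient descent, purely to show the loop structure across epochs.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(sequences, vocab_size, epochs=3, lr=0.5):
    """Toy next-token model: logits[prev][nxt] scores token nxt after token prev."""
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    history = []
    for _ in range(epochs):  # one epoch = one full pass over the data
        total_loss, count = 0.0, 0
        for seq in sequences:
            # Each token is used to predict the token that follows it
            for prev, target in zip(seq[:-1], seq[1:]):
                probs = softmax(logits[prev])
                total_loss += -math.log(probs[target])  # cross-entropy loss
                count += 1
                # Gradient of cross-entropy w.r.t. logits: probs - one_hot(target)
                for k in range(vocab_size):
                    grad = probs[k] - (1.0 if k == target else 0.0)
                    logits[prev][k] -= lr * grad
        history.append(total_loss / count)  # average loss for this epoch
    return logits, history


_, history = train_bigram([[0, 1, 2, 3], [0, 1, 2, 3]], vocab_size=4)
print(history)
```

The average loss recorded for each epoch goes down from one epoch to the next, which is the same signal you watch for in the Trainer’s logs.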

Below is a visual representation of how training a causal language model actually works. Because our training data is just a corpus of lines spoken by Spock, the labels are simply the input tokens themselves, offset by one position so that each token is paired with the token that follows it. (With Hugging Face causal models, this one-position shift is typically applied inside the model when it computes the loss, with the collator supplying labels that are a copy of the input IDs.) Training then proceeds index by index over the input token sequence: at each index i, the model adjusts its parameters so that the input tokens from 0 to i predict the token at index i of the shifted label sequence, which is the token at index i+1 of the input:
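To make the indexing concrete, here is a small sketch that lists the (context, target) pairs for a single sequence. The token IDs are made up and the helper name next_token_pairs is hypothetical; a real model computes all positions in parallel rather than looping like this.

```python
def next_token_pairs(token_ids):
    """For each index i, pair tokens 0..i (the context) with token i+1 (the target)."""
    return [
        (token_ids[: i + 1], token_ids[i + 1])
        for i in range(len(token_ids) - 1)
    ]


pairs = next_token_pairs([10, 20, 30, 40])
for context, target in pairs:
    print(context, "->", target)
```

The last token of the sequence never appears as a context on its own, because there is nothing after it to predict.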

Given the small size of our model and dataset, the training should only take 5-10 minutes to complete on a GPU. (I am running my training on a Threadripper 1950X, under Windows Subsystem for Linux with an NVIDIA 1080 Ti GPU.)

7. Evaluate the Performance of the Fine-Tuned Model

With the training complete, I run the same evaluate() method on the fine-tuned model:

#evaluate the fine tuned model
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
eval_results['perplexity'] = perplexity

logger.info(f"Fine-tuned {finetuned_modelname} Results: Perplexity: {perplexity:.2f}")
print(f"Fine-tuned {finetuned_modelname} Results: Perplexity: {perplexity:.2f}")

Once the evaluation loop is done, the fine-tuned model returns a perplexity score of ~47! While 47 is still not a great number, it is a lot better than the ~253 that the base model scored.
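Because perplexity is the exponential of the evaluation loss, the two scores can be converted back into losses to see the size of the improvement. This is a quick sketch using the rounded perplexity values quoted in this post; the exact losses from the run are not shown here, so treat the results as approximate.

```python
import math

baseline_ppl = 253.0   # rounded perplexity of the base model, from the text
finetuned_ppl = 47.0   # rounded perplexity of the fine-tuned model, from the text

# Invert perplexity = exp(loss) to recover the average per-token loss
baseline_loss = math.log(baseline_ppl)
finetuned_loss = math.log(finetuned_ppl)

loss_drop = baseline_loss - finetuned_loss
ppl_ratio = baseline_ppl / finetuned_ppl

print(f"baseline loss   ~ {baseline_loss:.2f}")
print(f"fine-tuned loss ~ {finetuned_loss:.2f}")
print(f"loss drop       ~ {loss_drop:.2f}")
print(f"perplexity fell by a factor of ~ {ppl_ratio:.1f}")
```

This works out to an average loss of roughly 5.5 before fine-tuning and roughly 3.9 after, so perplexity fell by a factor of a little over 5.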

Here is how the two models respond to the same prompts:

Prompt	Base DistilGPT2	Fine-Tuned Model
“What is the meaning of life?”	“It’s not a different thing, but there’s an underlying reason why I consider evolution to be a species of complexity. If we allow evolution to arise, how do we reconcile it with some of the facts that exist?”	“What is it, Captain? What is it, sir? I presume we shall return to the Enterprise. We shall find out. Captain, what is the origin of the universe?”
“Where did that planet go?”	“You would see, after all, this thing was going to explode a lot of places, and you wouldn’t know what would happen, so that it was going to be all-caps. This was the beginning of a new era, with the rise and rise of virtual reality, and in virtual reality there are things that are coming and going to happen. You know, I just don’t know.”	“There were other planets. Our closest target was asteroid Vos. That planet is ours, and the entire planet is dead? That’s a strange entity, but we’re getting there too. That’s why you will need your orders to find out what, Mister Sulu. He’ll have your instructions to find out what you are doing. He’ll have whatever excuse you may have.”
“What’s the best way to cook a turkey?”	“My husband has had many problems with cooking potatoes as long as he didn’t have the energy to cook as a human, so we just decided to make this a recipe. I use some veggies like garlic and zest and try my best to avoid using all the extra flavor to make it more delicious.”	“By using the most primitive methods to achieve that goal could, I suggest I begin a process that, when done correctly, may also be totally successful. As a human being, I cannot do anything to keep your taste buds healthy. However I do have an antidote. I’ve come a long way.”

Let’s be frank here: the fine-tuned model is still not coherent and speaks in gibberish. But remember, the base model for this fine-tuning is DistilGPT2, a distilled version of GPT-2, which is primitive by 2024 standards and outputs largely gibberish to begin with. Compared with the base model, you can clearly see that the gibberish the fine-tuned model spews out is very much in Spock’s matter-of-fact tone and references concepts and themes central to his character in the original show.

How to Improve the Model?

There are a number of different approaches I could take to improve how well the model is able to speak like Spock. The first is to start with a more modern base model for the fine-tuning, one that isn’t GPT-2. Second, the dataset used for training is quite naive: it’s just lines of dialogue Spock says, without any context about what he is talking about or who he is talking to, so adding that context is another avenue. Finally, I would like to investigate how different approaches to tokenization might influence the effectiveness of the model. Nevertheless, regardless of the quality of the results from the base model and the fine-tuned model, the basic steps to fine-tune a model using the Trainer API of Hugging Face remain the same.

Conclusion

In this post, I’ve provided a broad overview of how to fine-tune causal language models using the Hugging Face library. Starting with the foundational steps—preparing your dataset, selecting the right model, and managing the training process—I hope you feel better equipped to start applying these techniques. But remember, this is just the beginning.

I’m excited to delve deeper in my upcoming posts. I plan to explore the concept of perplexity, which is vital for measuring a model’s understanding of language. I’ll also compare how different base models affect the performance of fine-tuned models. Additionally, I’ll examine how changes in tokenization and collation approaches can impact accuracy.

Stay tuned for these detailed discussions. My goal is to break down these complex topics into digestible insights, helping you not only enhance your models but also refine your approach to natural language processing tasks.

Bobby Gill

Co-Founder & Chief Architect at BlueLabel

Bobby Gill is the co-founder and Chief Architect of BlueLabel, an award-winning digital product agency headquartered in New York. With over two decades of experience in software development, he is a seasoned full-stack engineer, software architect, and AI practitioner. Bobby currently leads the BlueLabel AI/ML practice, where his team of engineers is operationalizing the transformational capabilities of generative AI within BlueLabel and for a number of enterprise clients.

How to Fine-Tune a Causal Language Model with Hugging Face

Bobby Gill | April 15, 2024
