
Llamaindex Response Modes Explained: Let’s Evaluate Their Effectiveness

Bobby Gill | December 11, 2023

Alongside the rise of Large Language Models (LLMs) has come a swath of programming frameworks, such as LangChain and LlamaIndex, that aim to make it easier to build applications on top of them.

LlamaIndex is a programming framework that aims to simplify the development of LLM-enabled applications that leverage Retrieval Augmented Generation (RAG). 

A core concept for understanding the mechanisms behind LlamaIndex is that of query Response Modes. LlamaIndex has 5 built-in Response Modes: compact, refine, tree_summarize, accumulate, and simple_summarize. 

In this article, I will: 

  • outline what the 5 Response Modes are, 
  • compare how they function, and
  • evaluate their effectiveness in answering questions against a large body of text.

To evaluate each mode, I put them to the test in helping answer a question that has plagued the world since that fateful day in November 1963: 

Who shot President John F. Kennedy? 

I will demonstrate how each Response Mode works by asking LlamaIndex to summarize the conclusions of the ‘Warren Commission Report’, the 250,000-word report of the Presidential commission that investigated the assassination of President Kennedy. 

This document far exceeds the token limit for GPT-4 and most other mainstream LLMs, and it requires the use of RAG techniques for an LLM to answer questions about it using only passed-in contextual data (not training data).

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) describes an approach that allows an LLM to answer questions based on data it wasn’t originally trained on. To do this, the data must be fed to the LLM as part of the prompt, generally referred to as ‘context’. However, LLMs have limited input sizes (also called token input limits), which make it impractical, if not impossible, to pass large datasets, such as the Warren Report, in their entirety via the prompt.

Instead, with the RAG approach, a query to an LLM is broken into two parts: a ‘retrieval step’ and a ‘generation step’. The ‘retrieval step’ identifies the portions of the original dataset that are most relevant to a user-supplied query, and only this subset of data is passed to the LLM, alongside the original query, as part of the ‘generative step’. Essentially, RAG is a mechanism to work within the input size restrictions of LLMs by including in the prompt only the most relevant parts of the dataset needed to answer a query. Usually, the ‘retrieval’ portion of RAG relies on tried-and-true semantic search algorithms, such as Cosine Similarity, alongside a vector database.
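
To make the retrieval step concrete, here is a minimal sketch of top-k semantic search with Cosine Similarity over pre-computed embeddings. It is an illustration only: the cosine_similarity() and retrieve() functions below are hypothetical stand-ins for what an embedding model and vector database do for you, not part of LlamaIndex.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the vector magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding: np.ndarray, chunk_embeddings: list, chunks: list, top_k: int = 5) -> list:
    # Score every chunk against the query, then keep the top_k highest-scoring chunks.
    scores = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [chunks[i] for i in best]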

For the purposes of this article and the testing of the Response Modes, the question I am seeking to get an answer to is:

“What are the conclusions of the Warren Report?”

[Image: The 'Retrieval' step of a typical RAG application.]

For the purposes of this article, in the retrieval step I configured my code to return, using the Cosine similarity algorithm, the top 5 most relevant ‘chunks’ of data from the Warren Report that relate to my original query. These 5 chunks of data are then passed forward to the LLM, which is where LlamaIndex Response Modes come into play.

Chunk     Location in “Warren Report”
Chunk 1   Pages 1-2
Chunk 2   Pages 551-553
Chunk 3   Pages 398-400
Chunk 4   Pages 7-8
Chunk 5   Pages 553-1039

What are LlamaIndex Response Modes?

When building an LLM-enabled application that uses RAG techniques, it is likely that the subset of data returned by the retrieval step (normally referred to as ‘chunks’) will still be too large to fit within the input token limit of a single LLM call. Most likely, multiple calls will need to be made to the LLM in order to derive a single answer that draws on all of the retrieved chunks. LlamaIndex Response Modes govern the various approaches that can be used to break down, sequence, and combine the results of multiple LLM calls over these chunks to return a single answer to the original query. As of this writing, there are 5 basic Response Modes that can be used when querying with LlamaIndex.

Testing Methodology

In evaluating the 5 different Response Modes, I utilize the following frameworks and tools:

  • OpenAI’s GPT-4 LLM.
  • The BAAI/bge-small-en-v1.5 embedding model available via HuggingFace.
  • The PromptLayer library for logging of all prompts sent to and responses received from the LLM.
  • A copy of the Warren Commission Report made available via Project Gutenberg.
  • A self-hosted Milvus vector database that contains 853 vectors representing the Warren Report.

For each test, I pose the same question: “What are the main conclusions of the Warren Report regarding the assassination of President Kennedy?”

The following is the Python code I used to evaluate each Response Mode; this version is configured to use the compact Response Mode:

import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
    ServiceContext,
    set_global_handler,
    Settings
)

from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

# Route all prompt/response logging to PromptLayer
set_global_handler("promptlayer", pl_tags=["warrenreport"])
Settings.llm = OpenAI(temperature=0.7,model="gpt-4")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

#load index from existing vector store
vector_store = MilvusVectorStore(dim=1536,uri="http://x.x.x.x:19530", collection_name="warrenreport",token="xxxx:xxxxx")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(vector_store,storage_context=storage_context)

query_engine = index.as_query_engine(response_mode="compact",similarity_top_k=5)
response = query_engine.query("query_engine = index.as_query_engine(response_mode="compact",similarity_top_k=5)
response = query_engine.query("What are the main conclusions of the Warren Report regarding the assassination of President Kennedy?")")

Note also that in my call to initialize the query_engine in LlamaIndex, I set the similarity_top_k parameter to 5, which tells LlamaIndex to return the top 5 chunks of data that are semantically similar to the query as part of the retrieval step.
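
Since the only thing that changes between tests is the response_mode argument, the same evaluation can be repeated for every mode with a simple loop over the index built above. This is a sketch; the mode names are the string values LlamaIndex accepts as of this writing:

modes = ["compact", "refine", "tree_summarize", "accumulate", "simple_summarize"]
question = ("What are the main conclusions of the Warren Report regarding "
            "the assassination of President Kennedy?")

for mode in modes:
    # Rebuild the query engine over the same index, changing only the Response Mode.
    engine = index.as_query_engine(response_mode=mode, similarity_top_k=5)
    print(f"--- {mode} ---")
    print(engine.query(question))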

For all Response Modes, the same 5 chunks of text are returned from the Warren Commission Report; their locations are listed in the table earlier in this article.

‘Compact’

[Image: Compact Response Mode for LlamaIndex]

The Response Mode called compact is the default mode used by LlamaIndex if none is specified. In compact mode, LlamaIndex concatenates as many of the retrieved chunks as possible into the largest string that fits into a single prompt to the GPT-4 LLM. In our example, the first 4 chunks of matched text fit into the context window of a single GPT-4 call, which means it requires 2 LLM calls to answer our query on the conclusions of the Warren Report. The first call made using the compact Response Mode always uses the text_qa_template prompt:

Context information is below.
---------------------
file_path: blog-data/warren-report.txt

<Concatenation of Chunk 1...4>

--------------------
Given the context information and not prior knowledge, answer the query.
Query: What are the main conclusions of the Warren Report regarding the assassination of President Kennedy?
Answer:

LlamaIndex then takes the answer returned from this prompt and the next concatenated set of chunks (in our case, simply Chunk 5), and passes them to the LLM, along with the last few sentences of Chunk 4, using the refine_template prompt:

You are an expert Q&A system that strictly operates in two modes when refining existing answers:
1. Rewrite an original answer using the new context.
2. Repeat the original answer if the new context isn't useful.
Never reference the original answer or context directly in your answer.
When in doubt, just repeat the original answer.

New Context:

<Last X Characters of Chunk 4>

file_path: blog-data/warren-report.txt

<Chunk 5>

Query: What are the main conclusions of the Warren Report regarding the assassination of President Kennedy?
Original Answer: The main conclusions of the Warren Report regarding the assassination of President Kennedy are not explicitly listed in the provided context. The Warren Report was created to provide full and truthful knowledge concerning the events surrounding the assassination, and to appraise this tragedy by the light of reason and the standard of fairness. The context mentions that the report was prepared with a deep awareness of the Commission’s responsibility to present an objective report of the facts to the American people. However, the specific conclusions of the report are not provided in the context.
New Answer:

‘Compact’ Answer:

The main conclusions of the Warren Report regarding the assassination of President Kennedy are not explicitly listed in the provided context. The Warren Report was created to provide full and truthful knowledge concerning the events surrounding the assassination, and to appraise this tragedy by the light of reason and the standard of fairness. The context mentions that the report was prepared with a deep awareness of the Commission’s responsibility to present an objective report of the facts to the American people. However, the specific conclusions of the report are not provided in the context.

As you can see, the compact Response Mode fails to produce anything resembling a coherent answer. In fact, the LLM ends up throwing up its hands and doing a spot-on rendition of a high school student fumbling through a question they simply have no clue about. The likely reason the compact Response Mode was unable to answer the question is that the first 4 chunks of text from the Warren Report don’t actually contain the conclusions of the report, which appear only in Chunk 5 (which, oddly, has a lower Cosine similarity score than the preceding chunks). The structure of the refine_template is such that if the first set of calls has gone down the wrong path, it’s difficult for the final prompt to steer the LLM back onto the right track.
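
As a rough mental model (a sketch of the idea, not LlamaIndex’s actual implementation; pack_into_prompts(), llm(), and the two template objects are hypothetical stand-ins), compact behaves roughly like this:

def compact_answer(chunks, query, llm, pack_into_prompts, text_qa_template, refine_template):
    # Greedily concatenate chunks into the fewest batches that fit the token limit.
    batches = pack_into_prompts(chunks)

    # First batch: answer the query directly from the packed context.
    answer = llm(text_qa_template.format(context=batches[0], query=query))

    # Remaining batches: ask the LLM to refine the existing answer with the new context.
    for batch in batches[1:]:
        answer = llm(refine_template.format(existing_answer=answer, context=batch, query=query))
    return answer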

You can see the full-text log for both of the LLM calls made with the compact Response Mode below:

‘Refine’

[Image: Refine Response Mode for LlamaIndex]

The second Response Mode that LlamaIndex provides is refine. The refine mode is very similar to compact, except that instead of concatenating as many chunks as it can to maximize use of the LLM token limit, LlamaIndex includes only 1 chunk of retrieved data in each LLM call. Starting with the text_qa_template, LlamaIndex passes Chunk 1 to the LLM. It then progresses sequentially through each of the remaining chunks, one at a time, with each subsequent LLM call using the refine_template to build upon the answer returned for the previous chunk.

‘Refine’ Answer:

The provided context does not include the main conclusions of the Warren Report regarding the assassination of President Kennedy.

While the compact Response Mode tried to obfuscate its inability to answer the question behind a wall of hand-waving text, the refine Response Mode yields a much more succinct, yet equally useless, answer. Much like compact, the refine mode is far more likely to fail to answer the question when the initial chunks passed to it are less relevant. In our case, the first chunk of data from the Warren Report is the introduction and foreword of the report, which contain no conclusions whatsoever, which is likely why it was never able to return a proper answer to our query.
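
In the same sketch form (again using hypothetical stand-ins, not LlamaIndex internals), refine is essentially the compact loop with a batch size of exactly one chunk:

def refine_answer(chunks, query, llm, text_qa_template, refine_template):
    # Answer from the first chunk, then revise that answer one chunk at a time.
    answer = llm(text_qa_template.format(context=chunks[0], query=query))
    for chunk in chunks[1:]:
        answer = llm(refine_template.format(existing_answer=answer, context=chunk, query=query))
    return answer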

You can see the full-text log for all of the LLM calls made with the refine Response Mode below:

‘Tree Summarize’

[Image: Tree Summarize Response Mode for LlamaIndex]

The third, and by far most effective, Response Mode is tree_summarize. The astute reader might surmise from the word ‘tree’ that there is a recursive property to this Response Mode, and they would be correct. In its base case, tree_summarize makes a series of LLM calls, each concatenating chunks of retrieved data so as to maximize the input token limit of the LLM. It then takes the outputs of each of these base-case responses and passes them together to the LLM, instructing it to derive an answer using those initial answers as context. For anyone familiar with LangChain, the tree_summarize Response Mode is essentially the same as LangChain’s MapReduceDocumentsChain. In our analysis of the Warren Report, the tree_summarize mode required 3 calls to the LLM: 2 to process the 5 chunks of data, and 1 to combine the answers from the first 2.

The prompt template used by the tree_summarize Response Mode is quite simple and is the same each time, regardless of whether we are working at the ‘leaf’ level or higher up in the processing tree:

Context information from multiple sources is below.
---------------------
file_path: blog-data/warren-report.txt

<Chunk 1>

file_path: blog-data/warren-report.txt

<Chunk 2>

file_path: blog-data/warren-report.txt

<Chunk 3>

file_path: blog-data/warren-report.txt

<Chunk 4>

---------------------
Given the information from multiple sources and not prior knowledge, answer the query.
Query: What are the main conclusions of the Warren Report regarding the assassination of President Kennedy?
Answer:

After making a similar call that includes only Chunk 5, tree_summarize then uses the following prompt template to combine the answers:

Context information from multiple sources is below.
---------------------
<Answer from LLM call for Chunks 1...4>

<Answer from LLM call for Chunk 5>

Given the information from multiple sources and not prior knowledge, answer the query.
Query: What are the main conclusions of the Warren Report regarding the assassination of President Kennedy?
Answer:

‘Tree Summarize’ Answer:

The Warren Report concluded that Lee Harvey Oswald acted alone in assassinating President Kennedy, and Jack Ruby also acted alone when he killed Oswald. The shots that killed the president and wounded Governor Connally came from behind and above, specifically from the Texas School Book Depository Building. There was no evidence of any shots being fired from any other location, such as the guarded railroad overpass. No cartridges were found on the overpass, and no witnesses have come forward claiming to have found one. Despite some witness reports, like that of Mrs. Jean L. Hill who claimed to have seen a man running away from the Depository Building, there was no evidence of a conspiracy, domestic or foreign, to assassinate the President. The Commission did recommend improvements to the procedures for presidential protection, but also acknowledged that no procedures can guarantee absolute security.

The tree_summarize Response Mode nails the answer, delivering a thoughtful and complete summary of the findings of the Warren Report that correctly identifies Lee Harvey Oswald as the assassin while, in the same breath, disabusing any notion of a conspiracy or second shooter.

Whereas the refine and compact modes were led astray by the irrelevance of the first set of chunks they analyzed, the tree_summarize mode overcomes this because it uses a map-reduce pattern: the LLM independently analyzes each concatenated batch of chunks, and a separate prompt then combines those outputs into a single answer.
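
Conceptually (again a sketch, with pack_into_prompts(), llm(), and summary_template as hypothetical stand-ins), tree_summarize maps over the packed batches independently and then reduces the partial answers in a second pass, recursing if the partial answers themselves no longer fit in one prompt:

def tree_summarize_answer(chunks, query, llm, pack_into_prompts, summary_template):
    # Map step: answer the query independently for each packed batch of chunks.
    batches = pack_into_prompts(chunks)
    partial_answers = [llm(summary_template.format(context=batch, query=query)) for batch in batches]

    # Reduce step: a single partial answer is the final answer; otherwise recurse,
    # treating the partial answers as the new "chunks" to be packed and summarized.
    if len(partial_answers) == 1:
        return partial_answers[0]
    return tree_summarize_answer(partial_answers, query, llm, pack_into_prompts, summary_template)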

You can see the full-text log for all of the LLM calls made using the tree_summarize Response Mode below:

‘Accumulate’

[Image: Accumulate Response Mode for LlamaIndex]

The accumulate Response Mode is quite simple: LlamaIndex makes 1 call per retrieved chunk and then returns every ‘non-null’ answer together as an array of answers. For each of the calls that LlamaIndex makes in accumulate mode, it uses a template similar to the one used by tree_summarize:

Context information is below.
---------------------
file_path: blog-data/warren-report.txt

<Chunk n>

---------------------
Given the context information and not prior knowledge, answer the query.
Query: What are the main conclusions of the Warren Report regarding the assassination of President Kennedy?
Answer:

‘Accumulate’ Answer:

---------------------
Response 4: The Warren Report concluded that the assassination of President John F. Kennedy was a devastating act of violence against an individual, his family, the nation, and all of humanity. The report aimed to provide a full and truthful understanding of these events, based on reason and fairness, to the American public and the world. It also acknowledged the Commission's responsibility to present an objective account of the facts relating to the assassination.
---------------------
Response 5: The main conclusions of the Warren Report regarding the assassination of President Kennedy include the following points: The shots that entered the neck and head of the President and wounded Governor Connally came from behind and above, with no evidence that any shots were fired at the President from anywhere other than the Texas School Book Depository Building. The railroad overpass was guarded on November 22 by two Dallas policemen, Patrolmen J. W. Foster and J. C. White, who testified that only railroad personnel was permitted on the overpass. The Commission did not find any witnesses who saw shots fired from the overpass. No cartridge of any kind was found on the overpass. Lastly, Mrs. Jean L. Hill stated that after the firing stopped she saw a man running west away from the Depository Building in the direction of the railroad tracks, but no other witnesses claimed to have seen a man running towards the railroad tracks.

The result returned from LlamaIndex is an array of strings of length 2, containing the responses returned from the LLM for chunks 4 and 5. The answers for chunks 1, 2, and 3 are not included in this result because, for each of those calls, the LLM returned the logical equivalent of a ‘null’ response: 'The context does not provide information on the main conclusions of the Warren Report regarding the assassination of President Kennedy.’

The accumulate mode doesn’t so much answer the question as return n answers to it, each scoped only to the context chunk passed to the LLM in that call. It is then the responsibility of the calling application to turn the list of answers into an actual final answer to the query. accumulate doesn’t work well for something like the Warren Report, where no single chunk necessarily contains the complete information needed to answer the question.
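
A sketch of the accumulate behavior (is_null_answer() is a hypothetical helper standing in for however the ‘non-null’ filtering is performed):

def accumulate_answers(chunks, query, llm, text_qa_template, is_null_answer):
    # One independent LLM call per retrieved chunk.
    answers = [llm(text_qa_template.format(context=chunk, query=query)) for chunk in chunks]
    # Drop answers that amount to "the context does not say", and return the rest as a list.
    return [answer for answer in answers if not is_null_answer(answer)]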

You can see the full-text log for all of the LLM calls made using the accumulate Response Mode below:

‘Simple Summarize’

[Image: Simple Summarize Response Mode for LlamaIndex]

The final LlamaIndex Response Mode is simple_summarize. This is perhaps the most basic and straightforward of the Response Modes. In this mode, LlamaIndex truncates each text chunk so that all of the chunks can be concatenated and passed to the LLM in a single call. No matter how many chunks are retrieved, there will only ever be a single call made to the LLM. In our example, the prompt template used in simple_summarize mode looks like:

Context information is below.
---------------------
file_path: blog-data/warren-report.txt

<Truncated Chunk 1>

file_path: blog-data/warren-report.txt

<Truncated Chunk 2>

file_path: blog-data/warren-report.txt

<Truncated Chunk 3>

file_path: blog-data/warren-report.txt

<Truncated Chunk 4>

file_path: blog-data/warren-report.txt

<Truncated Chunk 5>

‘Simple Summarize’ Answer:

The Warren Report concluded that the shots which killed President John F. Kennedy were fired from behind and above, specifically from the Texas School Book Depository Building. It also determined that Lee Harvey Oswald was the assassin of the President. Furthermore, the report found no evidence that any shots were fired from the railroad overpass, as speculated by some, and asserted that the overpass was guarded by two Dallas policemen. The report aimed to dispel various speculations and rumors surrounding the assassination, providing a factual account based on their investigation.

Surprisingly, the answer provided by simple_summarize is not bad and is almost as complete as the one provided by tree_summarize. However, I would attribute this more to luck than to any structural advantage of the mechanism. With the simple_summarize mode, as the number of chunks returned from the retrieval step goes up, it’s likely that the quality of the final answer will decrease, because knowledge is lost in the truncated portions of each chunk.
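
In sketch form (truncate_to_fit() is a hypothetical helper; the real token accounting is handled inside LlamaIndex), simple_summarize shaves each chunk down until everything fits in one prompt and makes a single call:

def simple_summarize_answer(chunks, query, llm, truncate_to_fit, text_qa_template):
    # Truncate every chunk so that, concatenated, they all fit within one prompt.
    truncated = truncate_to_fit(chunks)
    context = "\n\n".join(truncated)
    # A single LLM call over the (lossy) concatenated context.
    return llm(text_qa_template.format(context=context, query=query))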

You can view the full-text log of the prompt made in the simple_summarize test here.

Conclusion:

Looking at the results of my tests, the tree_summarize Response Mode returned the most comprehensive and complete answer to the question “What are the conclusions of the Warren Report?”. The recursive, map-reduce-like algorithm it employs allows it to build up the correct answer from all matched chunks of text from the Warren Commission Report and from GPT-4’s responses to each of them. This is not to say that tree_summarize is the only Response Mode you need when building a RAG-style application; however, for the purpose of summarizing content from across a large body of text, it clearly has structural advantages that make it more effective than the other modes. It is still important to understand the different motivations behind each of the other 4 Response Modes and to know in which circumstances each might be the best tool for the job.

With the rapidly increasing input token limits of newer LLM models, it can be argued that the need for RAG will slowly diminish. However, even in a world where LLMs can accept million-token inputs, it will always behoove a consumer of an LLM to maximize the efficiency of each token of context passed in each LLM call. The LlamaIndex framework provides a very neat and consistent programming model for building RAG-style applications, certainly more so than LangChain. For those of you interested in learning more about LlamaIndex and building with it, I encourage you to visit the LlamaIndex documentation, as the team has done an excellent job of building a set of clear and easy-to-work-through tutorials that outline the many capabilities of LlamaIndex.

Bobby Gill
Co-Founder & Chief Architect at BlueLabel | + posts