This post covers OpenAI's GPT-4: its capabilities, the GPT-4 API, access, predictable scaling, the training process, limitations, and risks & mitigations.
OpenAI has announced GPT-4, a large multimodal deep learning model that accepts both image and text inputs and generates text outputs. While still less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, such as passing a simulated bar exam with a score in the top 10% of test takers. OpenAI spent six months aligning GPT-4 using lessons from its adversarial testing program and from ChatGPT, improving results on factuality, steerability, and staying within guardrails. OpenAI also rebuilt its entire deep learning stack and co-designed a supercomputer with Azure to train GPT-4; the training run was unprecedentedly stable. GPT-4's text input capability is being released via ChatGPT and the API, while image input is being prepared for wider availability through close collaboration with a single partner. OpenAI Evals, the company's framework for automated evaluation of AI model performance, is being open-sourced so that anyone can report shortcomings and guide further improvements.
The post includes:
- Capabilities
- Visual Inputs
- Steerability
- Limitations
- Risks & mitigations
- Training process
- Predictable scaling
- OpenAI Evals
- Access
Capabilities of GPT-4
The capabilities of GPT-4 include:
Improved reliability: GPT-4 is more reliable than its predecessor, GPT-3.5, and can handle complex tasks with greater accuracy.
Increased creativity: GPT-4 can generate more creative outputs, suggesting that it can better understand nuances in language and generate more diverse and sophisticated responses.
Handling nuanced instructions: GPT-4 can handle more complex and nuanced instructions, which indicates that it has an advanced level of understanding of natural language.
Ability to perform well on simulated exams: GPT-4 can perform well on exams that were originally designed for humans, indicating its ability to comprehend and solve complex problems.
Improved performance in non-English languages: GPT-4 outperforms GPT-3.5 and other LLMs (such as Chinchilla and PaLM) in 24 out of 26 non-English languages tested, including low-resource languages such as Latvian, Welsh, and Swahili. This indicates that GPT-4 has the ability to understand and generate language in a wider range of languages than previous models.
Visual Inputs of GPT-4
GPT-4 can accept both text and image inputs for generating text outputs. The visual inputs can consist of various forms such as documents with text and photographs, diagrams, or screenshots. However, it is worth noting that the image inputs are currently a research preview and are not publicly available.
Interspersed text and images can be used to specify any vision or language task, and GPT-4 exhibits similar capabilities on these mixed inputs as it does on text-only inputs. Additionally, test-time techniques developed for text-only language models, such as few-shot and chain-of-thought prompting, can be used to augment the model's performance on visual inputs.
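As an illustration of the test-time techniques mentioned above, here is a minimal sketch of assembling a few-shot, chain-of-thought prompt as a chat message list. The message schema (system/user/assistant roles) matches the ChatCompletions format; the specific wording and the helper name are my own, not from the announcement.

```python
def build_cot_prompt(examples, question):
    """Assemble a few-shot, chain-of-thought message list for a chat model.

    `examples` is a list of (question, worked_reasoning) pairs shown before
    the real question to encourage step-by-step answers.
    """
    messages = [{"role": "system",
                 "content": "Reason step by step, then state the final answer."}]
    for q, reasoning in examples:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": reasoning})
    messages.append({"role": "user", "content": question})
    return messages

# One worked example primes the model to show its reasoning.
demo = [("What is 2 + 3 * 4?",
         "Multiplication first: 3 * 4 = 12. Then 2 + 12 = 14. Final answer: 14.")]
msgs = build_cot_prompt(demo, "What is 5 + 2 * 6?")
```

The same interleaving pattern applies when image content is mixed into the user messages, though the image-input API shape was not public at the time of the announcement.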
Steerability of GPT-4
GPT-4 offers greater steerability than its predecessor, GPT-3.5. Developers (and soon ChatGPT users) can prescribe the AI's style and task by providing directions in the "system" message, allowing a customized experience within certain bounds. OpenAI is continuing to make improvements here and acknowledges that the current bounds are not perfect, but steerability already gives developers and users considerably more control over the model's behavior, leading to a more tailored and efficient experience.
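Concretely, steering via the system message amounts to placing an instruction-bearing message first in the request. The sketch below builds a ChatCompletions-style request body; the Socratic-tutor instruction is an illustrative persona, and the helper name is hypothetical.

```python
def steered_request(system_message, user_message, model="gpt-4"):
    """Build a ChatCompletions-style request body whose leading 'system'
    message prescribes the assistant's style and task."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }

req = steered_request(
    "You are a tutor that always responds in the Socratic style. "
    "Never give the student the answer; ask guiding questions instead.",
    "How do I solve 3x + 5 = 14?",
)
```

Swapping only the system message changes the assistant's persona and task without touching the user's content, which is what makes the mechanism a clean steering knob.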
Limitations of GPT-4
Limitations of GPT-4 include:
Reliability issues: GPT-4 is not fully reliable and can produce incorrect information or reasoning errors. It may “hallucinate” facts, which requires caution when using the language model outputs, especially in high-stakes contexts.
Biases in outputs: The model may have biases in its outputs, and while progress has been made to address them, more work needs to be done to ensure that the AI systems reflect a wide range of users’ values.
Lack of knowledge of recent events: GPT-4 generally lacks knowledge of events that occurred after September 2021, which is when its data cuts off. It also does not learn from its experience, and can sometimes make simple reasoning errors.
Confidently wrong predictions: GPT-4 can be confidently wrong in its predictions, not taking care to double-check its work when it is likely to make a mistake. Post-training also reduces the model's calibration: its expressed confidence in an answer tracks its actual accuracy less closely than in the pre-trained base model.
Despite these limitations, GPT-4 has significantly reduced hallucinations compared to earlier models and has made progress on external benchmarks like TruthfulQA. However, the exact protocol for using the model outputs should match the specific use-case, and caution is still needed when interpreting the outputs.
Risks & mitigations of GPT-4
The risks and mitigations of GPT-4:
1. Risk: Generating harmful advice, buggy code, or inaccurate information, similar to previous models.
Mitigation: Efforts including selection and filtering of the pretraining data, evaluations, and expert engagement, model safety improvements, and monitoring and enforcement have been made from the beginning of training to make GPT-4 safer and more aligned.
2. Risk: New risk surfaces due to the additional capabilities of GPT-4.
Mitigation: Over 50 experts from domains such as AI alignment risks, cybersecurity, biorisk, trust and safety, and international security were engaged to adversarially test the model. Feedback and data from these experts fed into mitigations and improvements for the model. Additional data was collected to improve GPT-4’s ability to refuse requests on how to synthesize dangerous chemicals.
3. Risk: Responding to requests for disallowed content or sensitive requests.
Mitigation: GPT-4 incorporates an additional safety reward signal during RLHF training to reduce harmful outputs by training the model to refuse requests for such content. The reward is provided by a GPT-4 zero-shot classifier judging safety boundaries and completion style on safety-related prompts. Mitigations have significantly improved many of GPT-4’s safety properties compared to GPT-3.5. The model’s tendency to respond to requests for disallowed content has decreased by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests in accordance with policies 29% more often.
4. Risk: “Jailbreaks” to generate content that violates usage guidelines.
Mitigation: As the “risk per token” of AI systems increases, it will become critical to achieve extremely high degrees of reliability in these interventions; for now, deployment-time safety techniques such as monitoring for abuse are important complements to model-level mitigations.
5. Risk: Potential to significantly influence society in both beneficial and harmful ways.
Mitigation: Collaboration with external researchers to improve understanding and assess potential impacts and build evaluations for dangerous capabilities that may emerge in future systems. The company will soon share more of its thinking on the potential social and economic impacts of GPT-4 and other AI systems.
GPT-4 Training process
The training process for GPT-4 involves two main stages: pre-training and fine-tuning with reinforcement learning and human feedback.
During the pre-training stage, the GPT-4 base model is trained to predict the next word in a document, using publicly available data (such as internet text) and licensed data. The training data is a web-scale corpus with a wide range of content, including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and a great variety of ideologies and ideas.
This pre-training process enables the GPT-4 model to develop a broad range of capabilities and knowledge that can be applied to a variety of tasks.
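To make the next-word-prediction objective concrete, here is a toy stand-in: a count-based bigram model that estimates the probability of the next word given the current one. Real pretraining uses a neural network over vastly longer contexts, but the objective, predicting the next token well across a corpus, is the same in spirit.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count-based bigram model: estimates P(next word | current word).
    A toy stand-in for the next-word-prediction objective of pretraining."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    model = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        model[cur] = {w: c / total for w, c in nxts.items()}
    return model

def predict_next(model, word):
    """Most probable next word under the model."""
    return max(model[word], key=model[word].get)

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram(tokens)
# "the" is followed by "cat" twice and "mat" once, so "cat" is predicted.
```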
However, the GPT-4 model may respond in ways that are far from a user’s intent when prompted with a question. To address this, the model is fine-tuned using reinforcement learning with human feedback (RLHF).
During the fine-tuning stage, the model’s behavior is aligned with the user’s intent within certain guardrails. RLHF is used to refine the model’s responses to specific prompts by providing feedback from human users. This process helps to ensure that the model’s responses are more accurate and relevant to the user’s intended meaning.
It is important to note that the GPT-4 model’s capabilities come primarily from the pre-training process. RLHF does not improve exam performance without active effort; in fact, it can degrade it. The post-training process is focused on steering the model’s behavior by engineering the prompts it receives, rather than improving its underlying capabilities.
Predictable scaling of GPT-4
The GPT-4 project has focused on building a deep learning stack that scales predictably. The infrastructure and optimization developed have shown predictable behavior across multiple scales, as demonstrated by accurately predicting GPT-4’s final loss on an internal codebase, extrapolating from models trained using the same methodology but using 10,000x less compute. This predictability allows for more efficient and effective training, as extensive model-specific tuning is not feasible for very large training runs like GPT-4.
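The extrapolation described above can be sketched, very loosely, as fitting a power law to losses from small-compute runs and evaluating it at a much larger compute budget. The synthetic data and the simple log-log least-squares fit below are assumptions for illustration; OpenAI's actual methodology is not public in this detail.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope                      # loss decays as compute grows
    a = math.exp(my + b * mx)
    return a, b

def predict_loss(a, b, compute):
    return a * compute ** (-b)

# Synthetic "small runs" that exactly follow loss = 5 * C^-0.07.
small = [1e3, 1e4, 1e5, 1e6]
losses = [5 * c ** -0.07 for c in small]
a, b = fit_power_law(small, losses)
big_loss = predict_loss(a, b, 1e10)   # extrapolate to 10,000x the largest run
```

With clean power-law data the fit recovers the exponent exactly; the hard part in practice is that real training curves are noisy and the right functional form must be established first.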
The project is now developing methodology to predict more interpretable metrics, such as the pass rate on a subset of the HumanEval dataset. Here, too, predictions have been successful, extrapolating from models trained with 1,000x less compute.
However, some capabilities remain hard to predict. The Inverse Scaling Prize, for example, sought tasks that get worse as model compute increases; hindsight neglect was one of the winning tasks, yet GPT-4 reverses the trend.
The project aims to develop methods that accurately predict future machine learning capabilities, as this is an important part of safety that is currently receiving insufficient attention relative to its potential impact. The team hopes that this becomes a common goal in the field, and they are scaling up their efforts to achieve it.
OpenAI has announced that it is open-sourcing its software framework, OpenAI Evals, which can be used to create and run benchmarks for evaluating models like GPT-4 while inspecting their performance sample by sample. OpenAI has been using Evals to guide the development of its models by identifying shortcomings and preventing regressions. Now, users can use Evals to track performance across model versions and evolving product integrations.
OpenAI Evals is an open-source framework that supports writing new classes to implement custom evaluation logic. However, the framework also includes several templates that have been useful internally, such as a template for “model-graded evals.” OpenAI has found that GPT-4 is surprisingly capable of checking its own work, and the framework can be used to create benchmarks that represent a maximally wide set of failure modes and difficult tasks.
To encourage others to use Evals, OpenAI has created a logic puzzles eval with ten prompts where GPT-4 fails. Evals is also compatible with implementing existing benchmarks, and OpenAI has included several notebooks implementing academic benchmarks and a few variations of integrating small subsets of CoQA as an example.
Stripe, a payment processing company, has already used Evals to complement their human evaluations to measure the accuracy of their GPT-powered documentation tool. OpenAI hopes that Evals will become a vehicle to share and crowdsource benchmarks, representing a wide range of failure modes and difficult tasks.
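The core loop of such an eval is simple: run a completion function over samples with known ideal answers and record per-sample results alongside an aggregate score. The sketch below is a minimal exact-match eval in the spirit of OpenAI Evals, not its actual API; the completion function is a stand-in for a real model call.

```python
def run_eval(complete, samples):
    """Minimal exact-match eval: score `complete` over (prompt, ideal)
    samples and keep per-sample rows for sample-by-sample inspection."""
    results = []
    for prompt, ideal in samples:
        got = complete(prompt).strip()
        results.append({"prompt": prompt, "ideal": ideal,
                        "sampled": got, "correct": got == ideal})
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results

# A trivial lookup-table "model" for demonstration only.
fake_model = {"2+2=": "4", "Capital of France?": "Paris"}.get
acc, rows = run_eval(lambda p: fake_model(p, ""),
                     [("2+2=", "4"), ("Capital of France?", "Lyon")])
```

Real eval templates add richer grading (fuzzy match, model-graded judgments), but the shape, samples in, per-sample records and a score out, stays the same.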
Access to GPT-4
ChatGPT Plus subscribers get access to GPT-4 on chat.openai.com with a usage cap. The exact cap may vary with demand and system performance, and access is expected to be severely capacity-constrained, so the number of GPT-4 queries a subscriber can make may be limited.
It is also mentioned that depending on the traffic patterns observed, OpenAI may introduce a new subscription level for higher-volume GPT-4 usage. This suggests that there may be options for users who require greater access to GPT-4 than what is currently available with ChatGPT Plus subscription.
Furthermore, OpenAI hopes to offer some amount of free GPT-4 queries at some point in the future, so that those without a subscription can try it as well. This indicates that OpenAI recognizes the potential interest and demand for GPT-4 and aims to make it accessible to a wider audience.
The API for GPT-4 is currently available via a waitlist, which developers can sign up for to gain access. It uses the same ChatCompletions API as gpt-3.5-turbo. OpenAI plans to invite developers gradually in order to balance capacity with demand. Additionally, researchers studying the societal impact of AI or AI alignment issues can apply for subsidized access via the Researcher Access Program.
Once developers gain access, they can make text-only requests to the GPT-4 model, with image inputs still being limited to alpha testing. The model will be automatically updated to the recommended stable version as new versions become available. The current stable version can be pinned by calling gpt-4-0314, which will be supported until June 14. The pricing for the GPT-4 API is $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens. The default rate limits for the API are 40k tokens per minute and 200 requests per minute.
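Clients typically need to stay within those default limits themselves. Here is a minimal sketch of a per-minute budget check using the figures from the post (40k tokens/min, 200 requests/min); the function name and accounting scheme are illustrative, not part of the API.

```python
def within_rate_limits(request_tokens, window_requests, window_tokens,
                       tpm=40_000, rpm=200):
    """True if one more request of `request_tokens` tokens fits the default
    GPT-4 API limits, given requests and tokens already used this minute."""
    return (window_requests + 1 <= rpm
            and window_tokens + request_tokens <= tpm)
```

A production client would track the sliding window and back off (or queue) when this check fails, rather than letting the API reject the request.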
The GPT-4 model has a context length of 8,192 tokens, which means that it can take up to 8,192 tokens as input for generating text. Additionally, the company is providing limited access to the 32,768-context version of the model, known as GPT-4-32k, which can take up to 32,768 tokens as input (equivalent to about 50 pages of text). The 32K model will also be updated automatically over time, with the current stable version being gpt-4-32k-0314 and supported until June 14. The pricing for the 32K model is $0.06 per 1k prompt tokens and $0.12 per 1k completion tokens.
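The per-token prices above make request costs easy to compute. The sketch below encodes the launch prices from the post in a small calculator; the table layout and function name are my own.

```python
# Published GPT-4 API launch prices, USD per 1k tokens.
PRICES = {
    "gpt-4":     {"prompt": 0.03, "completion": 0.06},   # 8,192-token context
    "gpt-4-32k": {"prompt": 0.06, "completion": 0.12},   # 32,768-token context
}

def request_cost(model, prompt_tokens, completion_tokens):
    """Dollar cost of one request at the per-1k-token rates above."""
    p = PRICES[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1000

# e.g. a 2,000-token prompt with a 500-token answer on the 8K model:
cost = request_cost("gpt-4", 2000, 500)   # 0.06 + 0.03 = $0.09
```

Note that prompt and completion tokens are priced differently, so long-context requests with short answers cost much less than the reverse.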
The company is still improving the quality of the GPT-4 model for long context and would appreciate feedback on how it performs for different use cases. Requests for the 8K and 32K models are being processed at different rates based on capacity, so developers may receive access to them at different times.