llama.cpp

llama.cpp is an open-source framework for running a wide range of Large Language Models.

It has a built-in HTTP server that supports continuous batching and parallel requests, and is optimized for resource usage.

You can use Resonance to connect with it and process LLM responses.

Usage

You can also check the tutorial: How to Serve LLM Completions (With llama.cpp)?

Configuration

All you need to do is add a configuration section that points Resonance at your running llama.cpp server:

ini
[llamacpp]
host = 127.0.0.1
port = 8081

Completions

In your class, you need to use Dependency Injection to inject LlamaCppClientInterface. Then, you need to use the Prompt Template adequate for the model you are serving. In the following example we will use the Mistral Instruct template:

php
<?php

namespace App;

use Distantmagic\Resonance\Attribute\Singleton;
use Distantmagic\Resonance\LlamaCppClientInterface;
use Distantmagic\Resonance\LlamaCppCompletionRequest;
use Distantmagic\Resonance\LlamaCppPromptTemplate\MistralInstructChat;

#[Singleton]
class LlamaCppGenerate
{
    public function __construct(protected LlamaCppClientInterface $llamaCppClient)
    {
    }

    public function doSomething(): void
    {
        $template = new MistralInstructChat('How to make a cat happy?');
        $request = new LlamaCppCompletionRequest($template);

        $completion = $this->llamaCppClient->generateCompletion($request);

        // each token is a chunk of text (usually a few characters) returned
        // from the model you are using
        foreach ($completion as $token) {
            swoole_error_log(SWOOLE_LOG_DEBUG, (string) $token);
        }
    }
}
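
If you do not need to process the stream token by token, you can concatenate the tokens into a single string; a minimal sketch using the same completion iterator as above:

php
$response = '';

foreach ($completion as $token) {
    // each token is stringable, so it can be appended to the aggregated text
    $response .= (string) $token;
}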

Stopping Generator

Using just the break keyword does not stop the completion request (llama.cpp keeps generating, even though the PHP loop has stopped). You need to call $completion->stop() instead.

For example, to stop after generating 10 tokens:

php
$i = 0;

foreach ($completion as $token) {
    if ($i > 9) {
        $completion->stop();
    } else {
        // do something
        $i += 1;
    }
}
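
The same pattern can be used to stop on content instead of token count. For example, a sketch that aggregates tokens into a buffer and stops once an arbitrary marker (here the hypothetical string 'DONE') appears in the output:

php
$buffer = '';

foreach ($completion as $token) {
    $buffer .= (string) $token;

    // stop as soon as the model emits the marker we are waiting for;
    // 'DONE' is an arbitrary placeholder, not a llama.cpp convention
    if (str_contains($buffer, 'DONE')) {
        $completion->stop();
    }
}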

Embeddings
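
Embeddings can be generated with the same injected client by sending a LlamaCppEmbeddingRequest: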

php
<?php

namespace App;

use Distantmagic\Resonance\Attribute\Singleton;
use Distantmagic\Resonance\LlamaCppClientInterface;
use Distantmagic\Resonance\LlamaCppEmbeddingRequest;

#[Singleton]
class LlamaCppGenerate
{
    public function __construct(protected LlamaCppClientInterface $llamaCppClient)
    {
    }

    public function doSomething(): void
    {
        $request = new LlamaCppEmbeddingRequest('How to make a cat happy?');

        $response = $this->llamaCppClient->generateEmbedding($request);

        /**
         * @var array<float>
         */
        $response->embedding;
    }
}
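
Embedding vectors are typically compared with cosine similarity; higher values mean the texts are semantically closer. A minimal plain-PHP helper (a sketch, not part of Resonance) might look like this:

php
/**
 * Cosine similarity between two embedding vectors of equal length.
 *
 * @param array<float> $a
 * @param array<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}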
