Pipeline for local video summarization.

Watching a full YouTube video is often an inefficient way to get at its core information. In many cases, a compact summary is sufficient.

In this blog post, we present an end-to-end pipeline for summarizing YouTube videos locally on an Apple MacBook M2 Max with 96 GB of RAM. The pipeline uses OpenAI’s Whisper [1] for speech-to-text transcription via faster-whisper [2]. For summarization, we run locally available large language models using Ollama [3].

While building the pipeline itself is straightforward, the main challenge lies in its evaluation. We systematically evaluate both the Whisper models used for transcription and the local LLMs used for summarization. The evaluation follows a consistent, reproducible AI engineering setup, using techniques such as AI-as-a-judge [4].

The code is written in Python and available on my GitHub.

End-to-End Pipeline

The whole pipeline is visualized in the blog’s title figure.

First, we download the full video in YouTube’s native .webm format using the library yt-dlp [5]:

common/download_video/download_video.py - download
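
A minimal sketch of this step, assuming yt-dlp's Python API; the output directory, format selection, and output template are illustrative choices, not the exact settings used in the repository:

from pathlib import Path
from yt_dlp import YoutubeDL

def download_video(url: str, output_dir: str = "downloads") -> Path:
    """Download a YouTube video as a single best-quality file using yt-dlp."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    options = {
        "format": "best",                           # best single-file format (often .webm)
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",  # illustrative output template
        "quiet": True,
    }
    with YoutubeDL(options) as ydl:
        info = ydl.extract_info(url, download=True)
        return Path(ydl.prepare_filename(info))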

Then, using

common/chunk_audio/chunk_audio.py - chunk

and the library pydub [6], we split the audio into segments of chunk_size_in_min minutes. Since sentences may be split across chunk boundaries, consecutive chunks overlap by overlap_in_percent of the chunk length. The chunks are persisted in .mp3 format to reduce their size compared to the full audio file. Chunking is not required for transcription itself, since models such as Whisper already handle long audio, but it becomes relevant when potentially using an external transcription API, where upload size and data transfer matter. In the following, we use a chunk size of 3 min and an overlap of 5%.
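
A minimal sketch of the chunking logic under these assumptions, with parameter names mirroring chunk_size_in_min and overlap_in_percent; the helper below is illustrative rather than the repository's exact implementation:

from pathlib import Path
from pydub import AudioSegment

def chunk_audio(audio_path: str, output_dir: str,
                chunk_size_in_min: float = 3.0,
                overlap_in_percent: float = 5.0) -> list[Path]:
    """Split an audio file into overlapping chunks and persist them as .mp3."""
    audio = AudioSegment.from_file(audio_path)
    chunk_ms = int(chunk_size_in_min * 60 * 1000)
    step_ms = int(chunk_ms * (1.0 - overlap_in_percent / 100.0))  # overlap between chunks
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    for i, start in enumerate(range(0, len(audio), step_ms)):  # len(audio) is in ms
        chunk = audio[start:start + chunk_ms]
        chunk_path = Path(output_dir) / f"chunk_{i:03d}.mp3"
        chunk.export(chunk_path, format="mp3")                  # .mp3 keeps chunks small
        chunk_paths.append(chunk_path)
    return chunk_paths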

The audio chunks are transcribed using

common/transcribe_audio_chunks/transcribe_audio_chunks.py - transcribe

with support for different Whisper models [1] executed via faster-whisper [2] using its default parameters.
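
A minimal transcription sketch, assuming the faster-whisper API with default transcription parameters and the int8 quantization used in the evaluation below; the helper name is illustrative:

from faster_whisper import WhisperModel

def transcribe_chunks(chunk_paths, model_size: str = "small") -> list[str]:
    """Transcribe audio chunks with a Whisper model via faster-whisper."""
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    transcripts = []
    for chunk_path in chunk_paths:
        segments, _info = model.transcribe(str(chunk_path))  # default transcription parameters
        transcripts.append(" ".join(segment.text.strip() for segment in segments))
    return transcripts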

Next, the summary is created with a locally running LLM using Ollama [3]. Since summarization is a low-creativity task that prioritizes factual consistency and determinism, we configure the LLMs with the following decoding settings [4]:

{
  "temperature": 0.2,
  "top_p": 1.0,
  "repeat_penalty": 1.0
}
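
As an illustration, a minimal sketch of how these options can be passed to a local model, assuming the ollama Python client; the model name and prompt arguments are placeholders:

import ollama

def summarize(system_prompt: str, user_prompt: str, model: str) -> str:
    """Generate a summary with a locally running LLM via Ollama."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        options={"temperature": 0.2, "top_p": 1.0, "repeat_penalty": 1.0},
    )
    return response["message"]["content"]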

The desired format of the summary is

Key Insights:

* This is the first sentence of the summary.
* ...
* This is the nth sentence of the summary.

The number of sentences n depends on the length of the video. We use the heuristic

n=int(round(audio_length_in_ms / (1000.0 * 60.0 * length_factor)))

to compute the number of sentences; i.e., the video length in minutes is divided by the length factor and rounded. For example, with a length factor of 3, which we use in the following experiments:
7 min 11 s → 2 sentences, 17 min 17 s → 6 sentences, and 60 min 16 s → 20 sentences.

Next, we describe how the prompt for the summarization step is constructed.

Prompt Engineering for the Summarization Step

We use the common structure of a system prompt containing the instructions and a user prompt containing the transcript snippets to be summarized. The system prompt is structured into a role definition (You are a...), input description (YOU WILL RECEIVE:), task definition (YOUR TASK:), and output format specification (OUTPUT FORMAT:).

In our experiments, we observe that many LLMs, especially smaller models, struggle to consistently follow the output format described in the previous section. To mitigate this, we enforce a JSON-based output, as many models are trained to follow structured JSON responses [4]:

{
    "1": "This is the first sentence of the summary.",
    ...
    "n": "This is the nth sentence of the summary."
}

The generated JSON is then mapped to the desired output format using a small Python post-processing function. The final system prompt is:

You are a transcript snippet summarizer.

YOU WILL RECEIVE:

* Consecutive transcript snippets from the same video.
* The snippets may overlap at their boundaries.

YOUR TASK:

* Determine the most important ideas, arguments, and conclusions from these snippets.
* Summarize them in exactly {{num_sentences}} medium-length sentences.
* Do not add information that does not appear in the transcript snippets.

OUTPUT FORMAT:

* Return ONLY valid JSON, no extra text.
* Return a summary that consists of exactly {{num_sentences}} medium-length sentences in the format
{{summary_format}}

The placeholders num_sentences and summary_format are filled dynamically at runtime.
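
As an illustration of the post-processing step mentioned above, a minimal sketch that maps the JSON returned by the LLM to the bullet-list output format; the function name is hypothetical:

import json

def json_summary_to_bullets(raw_response: str) -> str:
    """Map the JSON summary returned by the LLM to the bullet-list output format."""
    sentences = json.loads(raw_response)    # e.g. {"1": "...", "2": "..."}
    lines = ["Key Insights:", ""]
    for key in sorted(sentences, key=int):  # keep the sentence order 1..n
        lines.append(f"* {sentences[key].strip()}")
    return "\n".join(lines)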

Although we separate stable instructions into a system prompt and variable input into a user prompt, we still repeat the output format specification at the end of the user prompt. This is motivated by the recency bias of LLMs, where tokens appearing later in the prompt tend to have a stronger influence on the generated output, and is done to increase the likelihood that the required output format is followed [4]. The final user prompt is:

TRANSCRIPT_SNIPPETS:

SNIPPET 1:

This is the transcript of the first snippet.

...

SNIPPET k:

This is the transcript of the kth snippet.

----------------------------------

Remember to not add information that does not appear in the transcript snippets.
Remember to stick to the output format!

OUTPUT FORMAT:

* Return ONLY valid JSON, no extra text.
* Return a summary that consists of exactly {{num_sentences}} medium-length sentences in the format
{{summary_format}}

We now turn to the evaluation of the speech-to-text step in the pipeline to determine an appropriate Whisper model.

Evaluation of the Speech-to-Text Step

We evaluate the following Whisper models: tiny, base, small, medium, and large-v3-turbo. As mentioned in the introduction, Whisper is used via the library faster-whisper, which relies on CTranslate2 under the hood [7] and supports quantization. CTranslate2 describes quantization as follows: “Quantization is a technique that can reduce the model size and accelerate its execution with little to no degradation in accuracy.” [8]. Therefore, all our models are evaluated using the int8 quantization setting.

To evaluate the different Whisper models, we consider two options: evaluating the generated transcripts directly or evaluating the final summaries produced at the end of the pipeline. We choose the first option, since evaluating summaries would confound the assessment of the speech-to-text models, as transcription errors can be masked or corrected by the LLM during summarization.

We use a comparative evaluation strategy, where transcripts generated by all candidate Whisper models are compared in pairwise (1-vs-1) comparisons. In addition, we provide a reference transcript to ground the decision of which transcript is better. The reference transcript is generated using the largest locally available model, Whisper large-v3, with the highest-accuracy quantization setting, float32. Since performing all pairwise comparisons manually would be highly tedious and subjective, we use a strong LLM (ChatGPT-5.2 mini) for this task. This approach is commonly referred to as AI-as-a-judge [4].

When comparing candidate transcript A and candidate transcript B, the judge LLM may prefer one transcript over the other due to directional biases, such as the previously mentioned recency bias [4]. To mitigate this effect, each pair of transcripts is evaluated in both directions: transcript A versus transcript B, and transcript B versus transcript A. Each comparison yields one of three outcomes: A wins, B wins, or a draw. If both directions agree, that result is taken. If one direction yields a win and the reverse yields a draw, the win is accepted. If the two directions contradict each other, the comparison is treated as a draw. In the final scoring, a win assigns one point to the winning model and zero points to the other, while a draw assigns half a point to each model.
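
A minimal sketch of this resolution and scoring logic; the outcome labels are assumed to match the judge's output, and the helper names are illustrative:

def resolve_directions(result_ab: str, result_ba: str) -> str:
    """Combine the judgments of both orderings into one outcome ('A', 'B', or 'draw').
    result_ba comes from the swapped ordering, so its labels are flipped back first."""
    flipped = {"A": "B", "B": "A", "draw": "draw"}[result_ba]
    if result_ab == flipped:
        return result_ab                                       # both directions agree
    if "draw" in (result_ab, flipped):
        return flipped if result_ab == "draw" else result_ab   # a win beats a draw
    return "draw"                                              # contradictory wins count as a draw

def points(outcome: str) -> tuple[float, float]:
    """Points for (candidate A, candidate B): win = 1/0, draw = 0.5 each."""
    return {"A": (1.0, 0.0), "B": (0.0, 1.0), "draw": (0.5, 0.5)}[outcome]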

The system prompt includes explicit instructions to mitigate ordering effects, in particular by enforcing symmetry constraints, and requires the model to provide a short explanation for each decision in order to encourage more consistent and higher-quality judgments [4]. Note that errors in filler words are considered less severe than errors in names, numbers, or technical terms; accordingly, the judge LLM is instructed to weight the latter more heavily. The system prompt is as follows:

You are an AI judge that compares transcripts.
        
YOU WILL RECEIVE:
        
* A reference transcript from a video.
* Two candidate transcripts (A and B) from the same video, each created using a different speech-to-text model.
* Each transcript consists of several consecutive transcript snippets that may overlap at the beginning and end.

YOUR TASK:
        
* Using the reference transcript as the source of truth, compare the two candidate transcripts A and B.
* There are three possible outcomes: A is better than B, B is better than A, or it is a draw.
* Your judgment must be symmetric: Swapping transcript A and transcript B must not change the result, except for swapping the labels A and B.
* Errors in filler words are less severe than errors in names, numbers, or technical terms.
* Provide a short explanation for your decision.
        
OUTPUT FORMAT:
        
Output only the following JSON
        
{
  "explanation": "string",
  "winner": "A" | "B" | "draw"
}

The user prompt contains the reference transcript and the two candidate transcripts as placeholders:

REFERENCE TRANSCRIPT:
        
{{reference_transcript}}
        
----------------------------------
    
CANDIDATE TRANSCRIPT A:
        
{{candidate_transcript_a}}
        
----------------------------------
    
CANDIDATE TRANSCRIPT B:
        
{{candidate_transcript_b}}

We evaluate 10 different videos spanning different genres, lengths, and sound qualities:

| URL | Length |
| --- | --- |
| https://www.youtube.com/watch?v=2MfQ2KCIUWo | 17 min 17 s |
| https://www.youtube.com/watch?v=H_c6MWk7PQc | 23 min 59 s |
| https://www.youtube.com/watch?v=_NLHFoVNlbg | 60 min 16 s |
| https://www.youtube.com/watch?v=t8x09q1MjcM | 6 min 52 s |
| https://www.youtube.com/watch?v=hJgbjDNsUYs | 7 min 11 s |
| https://www.youtube.com/watch?v=6guQG_tGt0o | 19 min 14 s |
| https://www.youtube.com/watch?v=uBQWJtuzx5o | 27 min 3 s |
| https://www.youtube.com/watch?v=aZB7vcVXCVc | 18 min 54 s |
| https://www.youtube.com/watch?v=nFLq-MV-ohY | 15 min 1 s |
| https://www.youtube.com/watch?v=KnCRTP11p5U | 16 min 56 s |

For each video, we perform the comparative evaluation of the generated transcripts. We then sum the total number of points each transcript achieves across all videos, normalize by the maximum number of reachable points, and report the resulting quality score.

Since transcription speed is an important practical consideration, we also measure transcription efficiency as the ratio of video length to transcription time, yielding a unitless speed factor. We calculate the average speed factor over all videos. This allows us to analyze the trade-off between transcription quality and performance across different Whisper models.
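
A minimal sketch of these two aggregate quantities, assuming the per-video points and timings have already been collected; the function and variable names are hypothetical:

def quality_score(points_per_video: list[float], max_points_per_video: float) -> float:
    """Total points across all videos, normalized by the maximum reachable points."""
    return sum(points_per_video) / (max_points_per_video * len(points_per_video))

def average_speed_factor(video_lengths_s: list[float], transcription_times_s: list[float]) -> float:
    """Mean ratio of video length to transcription time (unitless)."""
    factors = [length / duration for length, duration in zip(video_lengths_s, transcription_times_s)]
    return sum(factors) / len(factors)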

The entire analysis is performed using the script

analysis/analyze_different_stts/main.py

The following table summarizes the results of the analysis [1]:

| Whisper Model | # Parameters | Quality Score | Speed Factor |
| --- | --- | --- | --- |
| tiny | 39M | 0.025 | 27.593 |
| base | 74M | 0.275 | 18.151 |
| small | 244M | 0.600 | 6.673 |
| medium | 769M | 0.925 | 2.670 |
| large-v3-turbo | 809M | 0.675 | 2.478 |

While the tiny model performs poorly compared to the other models, the high-performing medium model is relatively slow. A good compromise appears to be the small model, which offers a balanced trade-off between quality and speed. For example, a 10 min video requires approximately 1 min 30 s to transcribe with the small model.

Having evaluated the speech-to-text component, we now turn to the evaluation of the LLMs used for the summarization step, fixing the speech-to-text model to small.

Evaluation of the Summarization Step

For a video summary to be useful, it must satisfy three criteria: it must follow the required output format, be faithful to the original content, and cover the important points of the video. Both faithfulness and coverage have clear analogies to well-known AI engineering metrics. Faithfulness measures whether statements in the summary are supported by the transcript and is conceptually similar to precision, in that it reflects the fraction of statements in the summary that are factually correct. Coverage measures how many of the important facts present in the transcript are captured by the summary and is conceptually similar to recall, reflecting the fraction of relevant information that is included [4].

We follow a similar approach as for the Whisper models. We use a comparative evaluation strategy in which two summaries are compared in pairwise (1-vs-1) comparisons with respect to faithfulness and coverage. The comparison is performed by a strong LLM (ChatGPT-5.2 mini), again using an AI-as-a-judge setup as described earlier. To mitigate potential ordering effects, each pair of summaries is evaluated in both directions.

Instead of producing a single aggregated score, the evaluation is performed jointly but reported separately for faithfulness and coverage. The point assignment follows the same scheme as for the Whisper model evaluation, with points awarded separately for each of the two categories. As with the Whisper model evaluation, the system prompt includes symmetry constraints and requires a short explanation for each decision:

You are an AI judge that compares two summaries with regard to faithfulness and coverage.

YOU WILL RECEIVE:

* A reference transcript from a video that consists of several consecutive transcript snippets that may overlap at the beginning and end.
* Two candidate summaries (A and B) based on the provided transcript.

YOUR TASK:

* With respect to the available transcript, compare summary A vs. B in two categories.
* There are three possible results: A is better than B, B is better than A, or it is a draw.
* Your judgment must be symmetric.
* Swapping summary A and summary B must not change the result, except for swapping the labels A and B.

1. Category FAITHFULNESS: Faithfulness evaluates whether a summary contains only information supported by the source text, without inventing, adding, or contradicting facts.
2. Category COVERAGE: Coverage evaluates whether a summary includes the main ideas, arguments, and conclusions from the source text, without omitting essential information.

For each category, give a short explanation for your result.

OUTPUT FORMAT:

Output only the following JSON

{
    "faithfulness": {
        "explanation": "string",
        "winner": "A" | "B" | "draw"
    },
    "coverage": {
        "explanation": "string",
        "winner": "A" | "B" | "draw"
    }
}

The user prompt contains the transcript generated with the small Whisper model and two candidate summaries for comparison as placeholders:

TRANSCRIPT:
        
{{transcripts}}
        
----------------------------------
    
CANDIDATE SUMMARY A:
        
{{candidate_summary_a}}
        
----------------------------------
    
CANDIDATE SUMMARY B:
        
{{candidate_summary_b}}

As before, we evaluate the summaries for the previously defined 10 videos. We only consider models that are able to consistently produce outputs in the required JSON format for all 10 videos. For each video, we perform the comparative evaluation of the generated summaries for these models as described above. We then sum the total number of points a summary achieves across all videos, normalize by the maximum number of reachable points, and report the resulting faithfulness score and coverage score.

It is important to note that, due to the comparative nature of the evaluation, these scores are not equivalent to absolute metrics such as precision and recall. We combine faithfulness and coverage into a single combined score using a weighted average. Since faithfulness is more critical than coverage for our use case, we assign it a higher weight:

combined_score = 0.7 * faithfulness_score + 0.3 * coverage_score

The time required to generate a summary depends on the length of the transcript, which in turn depends on the video duration, as well as on the number of tokens produced in the summary, which is influenced by the number and length of the sentences. To account for these factors, we normalize the summary-generation speed of all models by the speed of qwen3:30b-a3b-instruct-2507-q4_K_M and refer to the resulting unitless value as the LLM speed factor. Note that this speed factor is defined differently from the Whisper speed factor above, which is why it receives a distinct name. We then average the LLM speed factor across all videos. This performance measurement is intentionally simplified and is intended to provide a relative comparison rather than absolute performance metrics.
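
A minimal sketch of the combined score and the per-video LLM speed factor under these definitions; the baseline timing refers to qwen3:30b-a3b-instruct-2507-q4_K_M on the same video, and the helper names are hypothetical:

def combined_score(faithfulness_score: float, coverage_score: float) -> float:
    """Weighted average of faithfulness and coverage, favoring faithfulness."""
    return 0.7 * faithfulness_score + 0.3 * coverage_score

def llm_speed_factor(summary_time_s: float, baseline_time_s: float) -> float:
    """Summary-generation speed relative to the baseline model on the same video.
    Values above 1 indicate a model that is faster than the baseline."""
    return baseline_time_s / summary_time_s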

The entire evaluation of the LLMs is performed using the script

analysis/analyze_different_llms/main.py

We compare different LLM families, including Google DeepMind’s Gemma 3 [9], Alibaba’s Qwen 2.5 and Qwen 3 [10], Meta’s Llama 3 [11], DeepSeek R1 [12], and OpenAI’s gpt-oss [13], covering model sizes ranging from 1B to 32B parameters. All models provide sufficient context lengths ranging from 32k to 256k tokens, and we use their instruction-tuned variants, as the task requires reliable instruction following. Most models are evaluated using the q5_k_m or q4_k_m quantization settings, which offer a good trade-off between inference speed and output quality.

The following table summarizes the core results from the evaluation:

| LLM | # Parameters | Context Length | Format Compliant | Faithfulness Score | Coverage Score | Combined Score | LLM Speed Factor |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3:1b-it-q4_K_M | 1B | 128k | no | - | - | - | - |
| llama3.2:1b-instruct-q5_K_M | 1B | 128k | no | - | - | - | - |
| qwen2.5:1.5b-instruct-q5_K_M | 1.5B | 32k | no | - | - | - | - |
| qwen2.5:3b-instruct-q5_K_M | 3B | 32k | yes | 0.362 | 0.363 | 0.362 | 1.892 |
| llama3.2:3b-instruct-q5_K_M | 3B | 128k | no | - | - | - | - |
| gemma3:4b-it-q4_K_M | 4B | 128k | yes | 0.500 | 0.137 | 0.391 | 1.831 |
| qwen3:4b-instruct-2507-q4_K_M | 4B | 256k | no | - | - | - | - |
| qwen2.5:7b-instruct-q5_K_M | 7B | 32k | yes | 0.531 | 0.419 | 0.497 | 1.060 |
| deepseek-r1:8b | 8B | 128k | yes | 0.400 | 0.500 | 0.430 | 0.835 |
| llama3.1:8b-instruct-q5_K_M | 8B | 128k | no | - | - | - | - |
| gemma3:12b-it-q4_K_M | 12B | 128k | yes | 0.519 | 0.656 | 0.560 | 0.832 |
| qwen2.5:14b-instruct-q5_K_M | 14B | 32k | yes | 0.450 | 0.475 | 0.450 | 0.567 |
| gpt-oss:20b | 20B | 128k | no | - | - | - | - |
| gemma3:27b-it-q4_K_M | 27B | 128k | yes | 0.613 | 0.594 | 0.607 | 0.388 |
| qwen3:30b-a3b-instruct-2507-q4_K_M | 30B | 256k | yes | 0.569 | 0.856 | 0.655 | 1.000 |
| deepseek-r1:32b | 32B | 128k | yes | 0.500 | 0.500 | 0.539 | 0.198 |

Based on the combined score, the top-performing models are qwen3:30b-a3b-instruct-2507-q4_K_M and, closely behind, gemma3:27b-it-q4_K_M. In terms of runtime, the former is roughly 2.5 times faster (LLM speed factor 1.000 vs. 0.388); however, the Gemma 3 27B model achieves a higher faithfulness score.

Among the smaller models, gemma3:12b-it-q4_K_M and qwen2.5:7b-instruct-q5_K_M achieve strong combined scores as well, with the Qwen model again being faster, although by a smaller margin. As a reference, qwen3:30b-a3b-instruct-2507-q4_K_M summarizes a 7 min video in roughly 10 s and a 15 min video in roughly 17 s.

Overall, the pipeline runtime is dominated by the transcription phase rather than the LLM-based summarization. This is because the LLM, executed via Ollama, can leverage Apple’s Metal Performance Shaders (MPS) [14], whereas Whisper, via faster-whisper, runs only on the CPU.

The DeepSeek models did not perform well in terms of the combined score, and gpt-oss:20b failed to consistently produce the required output format. We also observe that models smaller than 3B parameters are generally unable to reliably adhere to the required output format.

Throughout the analysis, we made several design decisions and assumptions. While these choices were appropriate for the scope of this work, they also point to a number of directions for future investigation.

Future Work

  • Increasing the diversity of the evaluated videos, particularly with respect to audio quality.
  • Evaluating additional speech-to-text models beyond Whisper.
  • Analyzing different parameter settings for the Whisper models.
  • Systematically evaluating prompt variations for summary generation.
  • Evaluating the impact of chunk_size_in_min and overlap_in_percent.
  • Analyzing the effect of different values for the length_factor on summary length and quality, or using alternative heuristics for determining summary length.
  • Experimenting with different weights when computing the combined faithfulness and coverage score.
  • Evaluating different LLM decoding settings, such as temperature.
  • Exploring alternative evaluation strategies, since the comparative evaluation approach scales quadratically with the number of models and lacks a clear ground truth. For example, point-based evaluation against a human-defined gold standard summary could be investigated.
  • Estimating confidence intervals for the reported scores by repeating the evaluation multiple times, for example with different video subsets.

Conclusion

We presented an end-to-end pipeline for summarizing YouTube videos locally on an Apple MacBook M2 Max with 96 GB of RAM. Building the pipeline itself was relatively straightforward using faster-whisper for speech-to-text transcription and Ollama for running local LLMs. We applied targeted prompt engineering to enforce JSON output, leveraging the fact that modern LLMs are well trained on JSON responses. The main challenge, however, lies not in implementation but in deciding which models to use for transcription and summarization.

We evaluated multiple Whisper models for transcription and several local LLMs for summarization using a comparative evaluation strategy with pairwise 1-vs-1 comparisons. In both cases, we used an AI-as-a-judge setup with the strong LLM GPT-5.2 mini and explicitly accounted for ordering effects such as the recency bias by evaluating A vs. B as well as B vs. A, supported by symmetry-enforcing prompt instructions. For summary evaluation, we identified faithfulness and coverage as key metrics, which are conceptually similar to precision and recall. We placed greater emphasis on faithfulness, since hallucinated information significantly reduces the usefulness of a summary. We also observed that local LLMs require at least 3B parameters to reliably adhere to the required output format.

From a performance perspective, transcription speed is critical because faster-whisper runs on the CPU, whereas summarization is comparatively fast since Ollama leverages Apple’s MPS. The Whisper small model emerged as the best trade-off between transcription quality and runtime. For summarization, qwen3:30b-a3b-instruct-2507-q4_K_M achieved the strongest combined faithfulness and coverage score.

The script

pipeline/main.py

provides the complete pipeline with the selected default models and exposes all relevant configuration options, including transcription and summarization models.

References