Guide

How to evaluate a Mandarin transcription sample before signing

Samples are where vendors do their best polish. A sample that actually passes evaluation does not just read smoothly; it shows clear handling logic on every risk axis. This guide gives an executable method for evaluating Mandarin transcription samples before you commit.

13 min read. Last updated: 2026-05-18

Why sample evaluation is worth half a day

For decision-grade recordings, the sample you see at signing usually sets the ceiling of what you will receive in production. Samples are also the easiest place for a vendor to do "showcase work" — they will send you something they have polished repeatedly, not their daily baseline. Spending half a day evaluating samples against an executable checklist is the highest-ROI due diligence you can do before signing.

Why one sample is not enough

A single sample may be the vendor's best one-off effort; what you actually need to evaluate is the stability of daily deliverables after signing. Ask for two or three samples across different subjects, covering variation in terminology density, speaker count, and confidentiality level, so you can see consistency across scenarios.

Why looking only at the final output is not enough

Whether the final output reads well and whether it has been genuinely reviewed are two different things. A good sample provides a side-by-side of the ASR draft and the human-reviewed output, so the evaluator can see which recognition errors were corrected, which were smoothed over, and which were filled in. Without the draft, the evaluation is much shallower, because you can judge only how the final reads and not what the reviewer changed.

"Reads smoothly" is the biggest trap

The easiest thing for a human reviewer to do with a misrecognized Chinese sentence is to make it read smoothly without making it correct. "Smooth" is rewarding to read, but smoothness that the audio does not support is a negative signal for a decision-grade transcript: it means the reviewer chose readability over fidelity. When evaluating a sample, the first question is not whether it reads well, but whether each sentence maps back to what was actually said at that point in the audio.

Use the ASR draft as a reverse check

If the vendor can provide the ASR draft, comparing it against the human output quickly reveals which positions were "smoothed over". Smoothed positions tend to be terminology, acronyms, and speaker transitions — exactly the positions that matter most in a decision-grade transcript. If the only difference between draft and final is fluency, that section has not been substantively reviewed.
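The reverse check can be partly mechanized. The following is a minimal sketch, assuming you have the draft and the final as plain-text strings; the function name `changed_spans` and the sample strings (including the made-up near-homophone for "cohort") are illustrative, not taken from any vendor's output. It diffs at the character level, since Chinese text has no word spacing.

```python
from difflib import SequenceMatcher


def changed_spans(draft: str, final: str) -> list[tuple[str, str, str]]:
    """Return (operation, draft_text, final_text) for every position that differs."""
    # autojunk=False stops difflib from discarding very frequent characters in long texts.
    matcher = SequenceMatcher(None, draft, final, autojunk=False)
    return [
        (tag, draft[i1:i2], final[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical draft/final pair: the draft contains an invented near-homophone string.
draft = "本季度扣好特留存提升了"
final = "本季度 cohort 留存提升了"
for tag, before, after in changed_spans(draft, final):
    print(tag, repr(before), "->", repr(after))
```

The output is only a worklist of changed positions. Whether each change is a real correction or a cosmetic smoothing still has to be judged against the audio.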

Five risk axes for sample evaluation

Break sample evaluation into five risk axes, each with an executable judgment point. The five below cover the main risk surface of decision-grade transcription. Score each sample on each axis independently (a rough 1–5 is enough):

  • Terminology error recognition — were industry terms, acronyms, and proper nouns identified and corrected, or smoothed over.
  • Mixed-English handling — were English terms preserved as the speaker actually used them, and is the mixed-language formatting readable.
  • Number and unit verification — were amounts, percentages, currencies, and dates checked back against the audio.
  • Speaker attribution — in multi-speaker audio, is it clear who said what.
  • Uncertainty markers — when audio or terminology is uncertain, are uncertain points marked explicitly, or filled in regardless.
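One way to keep the per-axis scores comparable across samples is a small table in code. This is a sketch under stated assumptions: the axis keys, the sample names, and the scores are all invented for illustration, and the "spread" measure (highest score minus lowest) is just one simple way to see inconsistency across samples.

```python
from statistics import mean

# Short keys for the five axes listed above (names are illustrative).
AXES = ["terminology", "mixed_english", "numbers", "speakers", "uncertainty"]


def summarize(scores: dict[str, dict[str, int]]) -> dict[str, dict[str, float]]:
    """Per axis, report the mean score and the spread (max - min) across samples."""
    summary = {}
    for axis in AXES:
        values = [sample_scores[axis] for sample_scores in scores.values()]
        summary[axis] = {"mean": mean(values), "spread": max(values) - min(values)}
    return summary


# Hypothetical 1-5 scores for two samples from the same vendor.
scores = {
    "sample_a": {"terminology": 4, "mixed_english": 4, "numbers": 5, "speakers": 4, "uncertainty": 3},
    "sample_b": {"terminology": 4, "mixed_english": 2, "numbers": 3, "speakers": 4, "uncertainty": 3},
}
for axis, stats in summarize(scores).items():
    print(f"{axis}: mean={stats['mean']:.1f} spread={stats['spread']}")
```

A large spread on one axis is worth a follow-up question even when the mean looks acceptable, because it suggests the vendor's handling is not stable across subjects.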

Axis one: terminology error recognition vs. cover-up

In terminology-heavy recordings, ASR errors are rarely a single character off — they are often a completely unrelated word. A strong sample shows recognition judgment at every key terminology position, not just at positions where the original ASR read awkwardly.

Executable check

List every industry term, English acronym, product name, and key metric that appears in the sample, and look back at the corresponding ASR draft position. If the final has a fluent but actually wrong term (for example, cohort retention rendered as "group retention rate" when the context clearly means same-cohort retention), score down.
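The term-by-term look-back can start from a simple count. The sketch below assumes you have typed up the term list yourself and have both texts as strings; `term_worklist` and the sample strings are hypothetical. It only shows where the reviewer changed something, so a fluent but wrong term in the final still has to be caught by reading and listening.

```python
def term_worklist(terms: list[str], draft: str, final: str) -> list[tuple[str, int, int]]:
    """For each term, return (term, occurrences in draft, occurrences in final)."""
    # Case-insensitive matching, so "LTV" and "ltv" count as the same term here.
    draft_lower, final_lower = draft.lower(), final.lower()
    return [
        (term, draft_lower.count(term.lower()), final_lower.count(term.lower()))
        for term in terms
    ]


# Hypothetical texts: the draft misses "cohort", the final restores it.
draft = "本季度扣好特留存提升,LTV 稳定"
final = "本季度 cohort 留存提升,LTV 稳定"
for term, in_draft, in_final in term_worklist(["cohort", "LTV"], draft, final):
    note = "reviewer changed this position" if in_draft != in_final else "unchanged from draft"
    print(term, in_draft, in_final, note)
```

Terms with equal counts in draft and final are the ones to spot-check against the audio first, since nothing shows that a reviewer looked at them.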

Probe with a small term list

Hand the vendor a 5–10 row list of your own internal terms before scoping and ask how they would handle each in transcription. Answers like "keep English on first mention, parenthesize the Chinese after" or "anchor to a client-confirmed glossary" are far more reliable than the generic "we pay attention to terminology".

Axis two: real handling of mixed English

Mixed-English samples are the most direct window into a vendor's bilingual judgment. The question is not just "was English kept" — it is whether English terms were accurately identified at the ASR stage (and not heard as a Chinese near-homophone), and whether the final formatting is readable for an analyst.

Near-homophone trap

ASR turning "cohort retention" into a Chinese near-homophone string is extremely common. A strong human reviewer restores the original English; a weak one deletes the segment or substitutes a Chinese phrase that "looks plausible". This is the single sharpest signal for evaluating a sample.

Format consistency

Mixed-language formatting in the sample should be internally consistent — the same acronym should either always have dots (U.S.) or never; the same English word should either always be capitalized or never. If the sample shows LTV in one place and ltv elsewhere, the vendor lacks a consistent formatting rule.
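This consistency check is easy to approximate in code. The sketch below is an assumption-laden helper, not a standard tool: it collects Latin-script tokens, treats variants that differ only in case or dots as the same term, and reports any term written more than one way. In running English text it would also flag sentence-initial capitals, so it suits mostly-Chinese transcripts best.

```python
import re
from collections import defaultdict

# Latin-letter tokens, allowing internal dots ("U.S"); a trailing dot is not captured.
LATIN_TOKEN = re.compile(r"[A-Za-z]+(?:\.[A-Za-z]+)*")


def inconsistent_forms(text: str) -> dict[str, list[str]]:
    """Map each normalized term to its surface forms, keeping only terms with several forms."""
    forms = defaultdict(set)
    for token in LATIN_TOKEN.findall(text):
        key = token.replace(".", "").lower()  # "U.S" and "US" share the key "us"
        forms[key].add(token)
    return {key: sorted(seen) for key, seen in forms.items() if len(seen) > 1}


# Hypothetical excerpt with two inconsistencies.
sample = "LTV 提升了,但 ltv 模型还没更新。U.S. 市场和 US 市场都在看 LTV。"
print(inconsistent_forms(sample))
```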

Axis three: number and speaker verification

Numbers and speakers are the two positions in a decision-grade transcript that absolutely cannot be wrong, as the speaker-and-number FAQ discusses. A transcript that turns "1.32 billion yuan" into "3.2 billion yuan" more than doubles the figure and skews every downstream commercial judgment built on it. A good sample shows visible cross-check work at both positions.

Traces of number verification

Compare numbers, units, and dates in the ASR draft against the final. If they are identical, verification may not have happened — ASR getting numbers fully right is unusual at any meaningful length.
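A quick way to look for those traces is to pull the numerals out of both texts and compare them. This is a rough sketch with clear limits: it only sees Arabic digits (with optional decimals and a percent sign), so numbers written in Chinese characters, thousands separators, and units are not handled. The function name and sample strings are made up.

```python
import re
from collections import Counter

# Arabic-digit numbers with an optional decimal part and optional percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def number_changes(draft: str, final: str) -> tuple[list[str], list[str]]:
    """Return (numbers only in the draft, numbers only in the final)."""
    draft_counts = Counter(NUMBER.findall(draft))
    final_counts = Counter(NUMBER.findall(final))
    only_draft = sorted((draft_counts - final_counts).elements())
    only_final = sorted((final_counts - draft_counts).elements())
    return only_draft, only_final


# Hypothetical pair: the reviewer corrected one amount and left the percentage alone.
draft = "营收 3.2 亿元,增长 15%"
final = "营收 13.2 亿元,增长 15%"
only_draft, only_final = number_changes(draft, final)
print("changed by reviewer:", only_draft, "->", only_final)
```

Two empty lists on a long recording mean no number was touched, which is the pattern the paragraph above treats as a warning sign.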

Handling of speaker transitions

In multi-speaker recordings, speaker transitions are where ASR confuses turns most easily. Check whether the final has explicit speaker labels at transitions and whether the Q&A structure has been restored. A segment collapsed into a single "Speaker 1" monologue often means the default ASR output was kept as-is.
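A collapsed monologue can be spotted by measuring how the text is distributed across speaker labels. The sketch below assumes a layout in which a turn starts with a label followed by a colon (for example "Speaker 1: ..."), which may not match a given vendor's format; the label set is passed in explicitly so that colons inside ordinary sentences, such as times, are not mistaken for labels.

```python
from collections import Counter


def speaker_share(transcript: str, labels: set[str]) -> dict[str, float]:
    """Return each speaker's share of the transcript, measured in characters."""
    chars: Counter[str] = Counter()
    current = None
    for line in transcript.splitlines():
        # Treat the full-width colon like the ASCII one before splitting.
        head, sep, rest = line.replace(":", ":").partition(":")
        if sep and head.strip() in labels:
            current, text = head.strip(), rest.strip()
        else:
            text = line.strip()  # continuation of the current speaker's turn
        if current is not None:
            chars[current] += len(text)
    total = sum(chars.values())
    return {name: count / total for name, count in chars.items()} if total else {}


# Hypothetical three-line excerpt.
excerpt = "Speaker 1: 今年留存\nSpeaker 2: 多少\n继续说明"
print(speaker_share(excerpt, {"Speaker 1", "Speaker 2"}))
```

If a recording you know to be a two-way Q&A comes back with one label holding nearly all of the text, that is the collapse described above. A genuine keynote would produce the same numbers, so the result only makes sense alongside what you know about the recording.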

Axis four: how uncertainty is handled

When audio is uneven, speakers overlap, or terminology is unusual enough that a confident determination is impossible, decision-grade transcription should mark the position explicitly (for example "[?]", "[inaudible]", or "[TBC: XX]") rather than fill in something that looks plausible. Filling silently transfers the judgment responsibility to the reader.

Look for uncertainty markers in the sample

If a sample contains no uncertainty markers at all, that is itself a signal — either the audio was unusually clean, or the reviewer is filling silently. The first case is rare; the second case implies a systemic reliability problem.
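Counting markers gives a rough first signal. The sketch below assumes the three marker styles used as examples in this guide ("[?]", "[inaudible]", "[TBC: XX]"); a vendor may use different conventions, so the pattern would need adjusting. The per-10-minutes normalization is simply a convenient way to compare samples of different lengths, not an established metric.

```python
import re

# Matches [?], [inaudible], and [TBC: ...]; other marker styles need their own pattern.
MARKER = re.compile(r"\[(?:\?|inaudible|TBC:[^\]]*)\]", re.IGNORECASE)


def markers_per_10_minutes(text: str, audio_minutes: float) -> float:
    """Number of uncertainty markers per 10 minutes of audio."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return len(MARKER.findall(text)) * 10 / audio_minutes


# Hypothetical excerpt from a 30-minute recording.
excerpt = "我们的 [?] 大概是 [TBC: LTV] 然后 [inaudible] 就结束了"
print(markers_per_10_minutes(excerpt, 30))
```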

Axis five: re-run the test on your own recording

The stability of an anonymized sample and the stability of real production work are often not the same thing. The single most valuable evaluation step is to submit a real internal recording of your own as a trial — same audio quality, same vocabulary, same context. A vendor scoring 95 on the curated sample and 70 on your real audio is the gap worth identifying.

How to run a trial safely

Sensitive recordings can start with a 5-minute trial; non-sensitive recordings can run 20–30 minutes. Before the trial, align on NDA, deletion rules, and whether the trial is paid — strong vendors usually offer a short free trial.

How to compare vendors head-to-head

The hardest part of evaluating a single vendor is the lack of a baseline. Submit the same recording to two or three vendors for trial transcription and compare their handling at the same positions. Side by side, an output that "looks decent" on its own can turn out to have visible gaps.

What to look for in the comparison

Focus on: consistency of how the same term is handled, whether the labeling strategy at ambiguous positions is more restrained, clarity at speaker transitions, and overall trustworthiness. Do not be misled by surface fluency — a decision-grade transcript needs to be trustworthy, not eloquent.

What not to look at in the comparison

Layout, font, and file format are packaging; they are unrelated to core quality. Similarly, "instant reply" and "enthusiasm" mean nothing at the evaluation stage. What matters is what happens after signing.

A few hard red flags

Any one of the following sample characteristics is reason to drop the vendor without spending more comparison time:

  • Cannot provide the ASR draft alongside the human-reviewed final — review process is not transparent.
  • All numbers in the sample match the ASR draft exactly — number positions likely were not verified.
  • No uncertainty markers anywhere on audio that is not unusually clean, which suggests the reviewer is filling silently to hide judgment gaps.
  • Obvious mixed-English near-homophone errors left in place — bilingual judgment is insufficient.
  • Refuses to provide a trial, or insists on a high trial fee — usually a lack of confidence in real-world quality.

Sample evaluation checklist

Distill the whole guide into a single scorecard. Score each sample on each item independently (a rough 1–5 is enough):

  • Does the vendor provide ASR-draft / human-reviewed comparison?
  • Are terminology errors in the final identified and corrected (not smoothed over)?
  • Are mixed-English terms preserved as the speaker actually used them?
  • Is there visible verification on numbers, units, and dates?
  • Are speaker transitions clear and is Q&A structure restored?
  • Are uncertain positions marked explicitly?
  • Does the work on your own submitted trial match the quality of the curated sample?
  • In head-to-head comparison, is the handling at the same positions more restrained and trustworthy than competitors?
  • Are the vendor's responses to your term list / format preferences specific and executable?
  • Overall trustworthiness — could someone who was not in the room continue working from this document?
