Legal Document Analysis Platform
Signed URL uploads, PyMuPDF parsing at 97.3% page-boundary accuracy, and hybrid rule-based plus AI chunking for a litigation funding firm in Mumbai.
The work
A litigation funding firm in Mumbai evaluates hundreds of cases a year. Each case lands as a bundle of PDFs. Court filings, medical records, depositions, expert reports, exhibit indexes. A single case can run past 3,000 pages. An analyst reads enough of it to assess case strength, estimate recovery potential, and decide whether the firm fronts capital against the claim.
3,000+
Pages per case bundle. Manual review took days per case.
One of us spent three years litigating commercial disputes before moving into engineering. We know what it feels like to open a 2,800-page production on a Friday evening with a hearing on Monday. The question the firm put to us was narrow: could we compress the first-pass review from days into hours without making the analysts distrust the output?
We built a legal document analysis AI platform on Google Cloud over roughly four months. This post documents the architecture, the decisions that actually mattered, and the bugs that cost us time. The stack is Firebase, Cloud Run, Google Cloud Storage, Qdrant Cloud, PyMuPDF, and OpenAI embeddings. Everything runs in Mumbai (asia-south1) because the client's counsel required India data residency for privileged material.
The upload was the first thing we got wrong
The first version routed every PDF through a Cloud Function. The function received the file, encrypted it with AES-256-GCM, and wrote the ciphertext to Google Cloud Storage. This worked cleanly on the 30MB sample files we tested with.
It broke on the first real case. A 200MB deposition bundle blew past the Cloud Function memory limit on some uploads and timed out on others, and when we widened the memory allocation the cost curve got ugly. Cloud Functions bill for the compute and memory consumed while every byte passes through the container. A firm uploading a dozen bundles a day at this size was a problem.
We rebuilt it with signed URLs. The flow now runs in the browser. The client-side app encrypts the PDF using AES-256-GCM via the Web Crypto API, requests a 60-second signed URL from a small Firebase Function, and streams the encrypted bytes directly to GCS. The backend never sees the file.
95%
Drop in upload-related cloud costs after moving to signed URLs
The 60-second expiry keeps the security surface small. If a URL leaks, it is useless within a minute. The encryption is end-to-end. GCS holds ciphertext. Keys live in Firebase Auth custom claims, not in the document metadata.
One bug here cost us a full day of debugging. The AES initialization vector was being serialized differently between the browser and the Node.js decryption on Cloud Run. The browser's Web Crypto API hands you a Uint8Array. When we JSON.stringify'd it to send alongside the ciphertext, it came out as {"0": 14, "1": 229, "2": 87, ...} instead of a proper array. On the server side that shape got coerced into something that crypto.createDecipheriv treated as garbage. Every file failed with "unable to authenticate data". The error message did not hint at the real cause.
The fix was to base64-encode the IV before transmission and decode it on the server. A trivial code change. The lesson was that a Uint8Array has no single canonical JSON representation, so never send raw binary through JSON without agreeing on an encoding first.
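The mismatch is easy to reproduce outside the browser. A minimal Python sketch of the broken wire shape and the base64 contract we settled on (the byte values are illustrative):

```python
import base64
import json

# What JSON.stringify produces for a browser Uint8Array: an object keyed
# by stringified index, not an array. Reassembling it server-side is fragile.
uint8array_json = '{"0": 14, "1": 229, "2": 87}'
obj = json.loads(uint8array_json)
iv_fragile = bytes(obj[str(i)] for i in range(len(obj)))

# The fix: agree on base64 as the wire encoding for every binary field.
iv = bytes([14, 229, 87])
wire = base64.b64encode(iv).decode("ascii")  # goes into the JSON payload
iv_decoded = base64.b64decode(wire)          # decoded on the server
```

In production the IV is 12 random bytes from crypto.getRandomValues; the three bytes here just keep the example short.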
PDF parsing
The parser choice turned out to be the single biggest accuracy lever in the whole pipeline. We started with pdf-lib because we already had it in the Node ecosystem for an unrelated feature. On our first test set of fifty legal documents it detected page boundaries correctly about 3% of the time. It merged pages. It split sentences across pages that were never split in the original. It dropped running headers on some pages and kept them on others, which meant identical text fragments appeared at random positions in our output.
We swapped in PyMuPDF (the Python library, run inside a Cloud Run container). Same test set, 97.3% page boundary accuracy. The failures that remained were largely scanned pages with OCR artifacts, which we route to a separate pipeline.
3% accuracy with pdf-lib, pages merged, sentences split mid-clause, headers lost
97.3% accuracy with PyMuPDF on the same 50-document test set
The 97.3% figure is for page boundary detection specifically. It is not a measure of the downstream legal intelligence extraction, which we have not yet benchmarked against a human-labeled gold set. That benchmark is planned but not done. We want to be explicit about that because the number is often misread.
Chunking
Chunking is where most legal AI pipelines quietly fall apart. If you embed a chunk that starts mid-argument and ends mid-quotation, the vector search returns plausible-sounding nonsense. The chunk has to correspond to a coherent unit of legal meaning: a clause, a paragraph, a numbered section, a deposition answer.
Our approach is hybrid. 80-90% of chunks come out of rule-based pattern matching. The remaining 10-20% go to an LLM for boundary detection.
The rules work because legal documents are far more structured than they look. Section headings follow conventions. "WHEREAS" and "NOW THEREFORE" mark clause boundaries in contracts. Numbered clauses like "3.1.2" and "(a)(iii)" map to a parseable hierarchy. Deposition transcripts have "Q." and "A." markers. Pleadings have numbered paragraphs. Court orders have Roman numeral sections. A regex library covering the conventions we saw in our first hundred cases gets us most of the way.
The hybrid chunker is cheaper and more predictable than sending every page to an LLM. It also means the rule-matched chunks have deterministic, reproducible boundaries, so two runs of the same case produce the same chunk IDs.
The LLM (OpenAI gpt-4o-mini for this step) only sees the residue: paragraphs that sat between two rule-matched boundaries but did not themselves match any rule. Expert reports and narrative medical summaries produce most of this residue. We send the residue in a prompt asking for boundary positions by character offset, validate that the offsets fall inside the input, and fall back to paragraph splits if the model returns garbage.
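The validation step matters more than the prompt. A sketch of the guard rails with the model call elided; the offset semantics (split positions by character index) are as described above, but the helper names are ours:

```python
from typing import List

def validate_boundaries(text: str, offsets: List[int]) -> List[int]:
    """Keep only offsets strictly inside the text, sorted and deduplicated.
    An empty result signals 'fall back to paragraph splits'."""
    return sorted({o for o in offsets if 0 < o < len(text)})

def paragraph_fallback(text: str) -> List[int]:
    """Fallback boundaries: split on blank lines."""
    offsets, pos = [], 0
    for para in text.split("\n\n")[:-1]:
        pos += len(para) + 2  # account for the consumed "\n\n"
        offsets.append(pos)
    return offsets

def split_residue(text: str, model_offsets: List[int]) -> List[str]:
    """Split residue text at model-proposed offsets, falling back if
    the model returned nothing usable."""
    offsets = validate_boundaries(text, model_offsets)
    if not offsets:
        offsets = paragraph_fallback(text)
    pieces, start = [], 0
    for o in offsets:
        pieces.append(text[start:o])
        start = o
    pieces.append(text[start:])
    return pieces
```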
One bug worth naming. An early version of the rule chunker would drop text that sat between two recognized patterns without matching either. A narrative paragraph wedged between "Section 3.1" and "Section 3.2" that did not itself start with a recognized heading would just vanish from the chunk output. We caught it on a personal injury case where a key causation paragraph was missing from search results but was visibly present in the source PDF. The fix was to treat all unmatched text as a continuation of the preceding chunk rather than as something to discard.
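A stripped-down sketch of the rule chunker with that fix in place. The pattern set is an illustrative subset of the conventions listed above, not the production regex library:

```python
import re

# Illustrative subset of the boundary conventions: numbered section
# headings, contract recitals, deposition Q./A. markers.
BOUNDARY = re.compile(
    r"^(?:Section \d+(?:\.\d+)*|WHEREAS\b|NOW THEREFORE\b|[QA]\.)"
)

def chunk(lines):
    """Start a new chunk at each recognized boundary. Unmatched text is
    appended to the preceding chunk, never discarded (the bug fix)."""
    chunks = []
    for line in lines:
        if BOUNDARY.match(line) or not chunks:
            chunks.append([line])
        else:
            chunks[-1].append(line)  # continuation of the previous chunk
    return ["\n".join(c) for c in chunks]
```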
Legal intelligence extraction
Once a document is chunked, each chunk runs through a set of extractors. This is where the platform earns its place in an analyst's workflow.
Authority scoring. Each citation in a chunk is resolved to a court and scored. Supreme Court of India citations score 90-100. High Courts score 70-89. District Courts score 50-69. Tribunals and state-level forums score 30-49. Foreign citations (English common law, US cases cited persuasively) score 20-29. The scoring lets an analyst sort a 3,000 page bundle by "highest authority cited" and read the twenty most important passages first.
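Encoded as data, the bands look like this. The midpoint default for a resolved court is our simplification for the sketch; the production scorer presumably weights within each band:

```python
# Score bands per forum tier, as described in the post.
AUTHORITY_BANDS = {
    "supreme_court": (90, 100),
    "high_court": (70, 89),
    "district_court": (50, 69),
    "tribunal": (30, 49),
    "foreign": (20, 29),
}

def authority_score(court_tier: str) -> int:
    """Return the midpoint of the tier's band; unknown tiers score 0."""
    low, high = AUTHORITY_BANDS.get(court_tier, (0, 0))
    return (low + high) // 2
```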
Citation detection. We built a regex layer for Indian citation formats (AIR, SCC, SCR) and a secondary LLM pass for inline and narrative citations that regex misses. Citations get normalized to a canonical form so that "AIR 1973 SC 1461" and "Kesavananda Bharati v. State of Kerala (1973)" resolve to the same case record.
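A sketch of the regex layer for the AIR reporter only. The canonical tuple shown is an illustrative key, not the platform's actual record format, and SCC/SCR need their own patterns:

```python
import re

# AIR citations follow "AIR <year> <court abbreviation> <page>".
AIR_PATTERN = re.compile(r"\bAIR\s+(\d{4})\s+([A-Za-z]+)\s+(\d+)\b")

def normalize_air(text: str):
    """Extract AIR citations as canonical (reporter, year, court, page) tuples."""
    return [
        ("AIR", int(year), court.upper(), int(page))
        for year, court, page in AIR_PATTERN.findall(text)
    ]
```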
Party identification. Named entity recognition tuned for legal parties. Plaintiffs, defendants, witnesses, experts, counsel. We started with generic spaCy NER and its precision on Indian names was poor. We fine-tuned on a dataset of 4,000 annotated paragraphs from anonymized case files the firm provided.
Claim classification. Each chunk is classified into claim categories the firm cares about: personal injury, medical negligence, property damage, breach of contract, insurance subrogation, employment. The classifier is a gpt-4o-mini call with a schema-constrained response, cached aggressively because classifier outputs are stable for a given chunk.
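The caching is the interesting part, since the model call dominates cost. A sketch with the gpt-4o-mini call replaced by a keyword stub so it runs offline; the category slugs mirror the list above:

```python
import hashlib

CATEGORIES = [
    "personal_injury", "medical_negligence", "property_damage",
    "breach_of_contract", "insurance_subrogation", "employment",
]

calls = {"n": 0}  # counts model invocations, to show the cache working

def _call_model(chunk_text: str) -> str:
    # Stand-in for the schema-constrained gpt-4o-mini call.
    calls["n"] += 1
    return ("medical_negligence" if "hospital" in chunk_text.lower()
            else "breach_of_contract")

_cache: dict = {}

def classify(chunk_text: str) -> str:
    """Key the cache on a content hash: identical chunks never hit the model twice."""
    key = hashlib.sha256(chunk_text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = _call_model(chunk_text)
    return _cache[key]
```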
The authority scoring and citation detection are good in practice but not formally benchmarked. That evaluation harness is the next piece of work.
Vectorization
Every chunk gets embedded with OpenAI's text-embedding-3-large at 3,072 dimensions and written to Qdrant Cloud. The collection is partitioned per case so queries never cross case boundaries, which matters for confidentiality and also keeps result quality high.
This is what powers the natural language query feature. An analyst types "what evidence supports negligence by the hospital" and gets the top twenty chunks ranked by cosine similarity, annotated with the authority scores and claim classifications from the extraction step. In our user testing this took the "find the relevant paragraph" step from 20-40 minutes of skimming to under a minute.
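Qdrant does the ranking server-side; conceptually the query step reduces to cosine similarity over the case's chunk vectors. A pure-Python sketch with toy 2-D vectors in place of the 3,072-dimension embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=20):
    """chunks: list of (chunk_id, vector). Return ids ranked by similarity."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [cid for cid, _ in ranked[:k]]
```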
The vectorization pipeline has auto-resume. If it fails on chunk 1,847 of 2,807, the next run picks up at chunk 1,847 instead of starting over. Each chunk's embedding status is tracked in Firestore with a simple state machine: pending, embedding, done, failed. Auto-resume matters because a full case vectorization takes 15-20 minutes and OpenAI's embeddings endpoint returns transient 5xx errors often enough that 15-minute runs without checkpointing were failing maybe once every four runs. Re-paying for 2,800 embeddings because run 9 failed at chunk 2,700 was not acceptable.
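A sketch of the resume loop with Firestore stubbed as a dict and the embeddings call injected; the states mirror the machine described above:

```python
def vectorize(chunks, status, embed_fn):
    """Resume-safe embedding pass. chunks: list of (chunk_id, text);
    status: chunk_id -> state (stands in for the Firestore doc);
    embed_fn: stands in for the OpenAI embeddings call."""
    for chunk_id, text in chunks:
        if status.get(chunk_id) == "done":
            continue  # embedded on a prior run, skip and pay nothing
        status[chunk_id] = "embedding"
        try:
            embed_fn(text)  # would also upsert the vector to Qdrant here
        except Exception:
            status[chunk_id] = "failed"  # retried on the next run
            raise
        status[chunk_id] = "done"
```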
Cost on a typical 3-document set of 2,807 chunks runs about $0.365. The embedding API is the biggest line item. At that cost we run it on every case in the pipeline without thinking about it. The storage and query costs on Qdrant Cloud add maybe another $0.05 per case per month.
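The arithmetic is easy to sanity-check, assuming text-embedding-3-large at $0.13 per million tokens (verify current pricing; it changes):

```python
# Back-of-envelope reconstruction of the $0.365 figure.
price_per_token = 0.13 / 1_000_000  # assumed text-embedding-3-large rate
total_cost = 0.365
chunks = 2_807

total_tokens = total_cost / price_per_token   # roughly 2.8M tokens
tokens_per_chunk = total_tokens / chunks      # roughly 1,000 tokens per chunk
```

About a thousand tokens per chunk is consistent with chunks averaging roughly a page of text.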
The failures we fixed in production
Beyond the IV serialization bug and the chunking gap, three other failures cost real debugging time. All of them surfaced only at production scale.
Docker memory exhaustion. PyMuPDF loads the entire document tree into memory by default. On a 1,200 page deposition bundle that was pushing past Cloud Run's default 512MB limit. We bumped the container memory to 2GB and switched to streaming per-page processing so PyMuPDF only holds one page at a time. This also made cold starts slower, which we accepted because the jobs run asynchronously anyway.
GCS race condition. The signed URL upload and the metadata write were racing. Our original design had the client write a Firestore metadata record ("processing requested for file X") after it finished the GCS upload. The processing pipeline listened for Firestore writes and pulled the file from GCS. Occasionally the Firestore write completed and the pipeline fetched the file before the GCS upload had fully committed, so processing got an empty or partial object. The fix was to switch to a GCS object-finalize notification via Pub/Sub as the trigger, so processing only starts after GCS confirms the write is durable.
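A sketch of the new trigger's entry point, assuming Pub/Sub push delivery and the standard GCS notification format (an eventType attribute, with the object resource base64-encoded in the message data):

```python
import base64
import json

def handle_push(envelope: dict):
    """Parse a Pub/Sub push envelope from a GCS notification. Returns the
    (bucket, object name) to process, or None for non-finalize events."""
    msg = envelope["message"]
    if msg.get("attributes", {}).get("eventType") != "OBJECT_FINALIZE":
        return None  # ignore deletes, archives, metadata updates
    payload = json.loads(base64.b64decode(msg["data"]))
    return payload["bucket"], payload["name"]
```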
Hinglish in depositions. A handful of depositions contained Hindi passages transliterated into Roman script mixed with English. The tokenizer handled it but the extractors treated "gawah" and "witness" as unrelated entities. We added a light transliteration normalization pass before named entity recognition. It did not fully solve the problem but it lifted party identification accuracy from embarrassing to acceptable on the mixed-language transcripts.
Infrastructure
Firebase for auth and the analyst-facing web app. Cloud Run for the processing pipeline (Python 3.11, PyMuPDF, the extractors). GCS for encrypted document storage. Firestore for job state and metadata. Qdrant Cloud for vector search. OpenAI for embeddings and the chunking and classification LLM calls. Pub/Sub for the job orchestration between GCS, Firestore, and Cloud Run.
Everything sits in asia-south1 (Mumbai). That was non-negotiable for the firm's counsel because privileged material sitting in a foreign region raised questions under the Advocates Act and the Bar Council conduct rules that nobody wanted to litigate. OpenAI's API calls are the only cross-border leg and the firm accepted that on the basis that embeddings are not the underlying text and chunks sent for classification are one-way (no retention per the enterprise API terms).
The feature that actually sold the platform to the client was not the search or the extractors. It was the recovery funnel visualization. A five-stage view from claim to recovery: liability, damages, causation, collectability, recovery estimate. Each stage shows the chunks that support or undermine the firm's confidence, with the authority scores and claim classifications surfaced inline. Analysts use it to pitch a case to the investment committee. Before the platform, that pitch was a verbal summary and a stack of tagged PDFs. Now it is a single view the committee can interrogate.
What we would do differently
A few things.
We would build the evaluation harness first. We shipped the pipeline to production with the 97.3% page boundary number and a lot of anecdotal feedback from analysts. A proper benchmark on authority scoring, citation detection, and party identification would have let us tune faster. We are building it now and it feels like the first thing we should have built, not the twentieth.
We would pick Qdrant from day one. We started on pgvector in Cloud SQL because it was one fewer moving part. It worked, but the latency on 3,072-dimension vectors over cases with 10,000+ chunks was bad enough that we migrated. If we had picked Qdrant at the start we would have saved two weeks of migration work.
We would write the regex library with the lawyers, not for them. Our first version of the rule-based chunker was an attempt to encode what one of us remembered from litigation practice. It missed conventions we had never personally seen (insurance subrogation documents, tax tribunal orders). The second version came from sitting with two of the firm's analysts for a day and watching them mark up a case by hand. The rules that came out of that session are the ones still running in production.
Honest assessment
The platform does what the firm hired it to do. First-pass review time dropped from days to hours. Analysts who were skeptical (two of them were former High Court clerks) now use it on every case. The firm funded 40% more cases in the quarter after deployment than the quarter before, and they attribute roughly half of that to faster intake.
The parts we are not yet satisfied with are the authority scoring (the scoring bands are defensible but not tuned against outcomes), the Hinglish handling (workable, not great), and the absence of a formal evaluation harness. These are on the roadmap.
The parts we are satisfied with are the upload architecture, the hybrid chunker, the auto-resume on the vectorization pipeline, and the recovery funnel view. Those four pieces are the ones that would survive if we rebuilt the platform from scratch tomorrow.
Built on Firebase, Cloud Run, GCS, Firestore, Qdrant Cloud, PyMuPDF, and OpenAI text-embedding-3-large. Deployed in Google Cloud's asia-south1 region for India data residency.
Want to build something like this?
We design and ship AI products, automation systems, and custom software.
Get in touch