Extract Text from PDF

Converdoc reads the actual text layer inside a text-based PDF and pulls it out into an editable .txt file. It works best on digitally created PDFs, such as documents saved from Word or Hangul (HWP), digital reports, and ebooks, in which you can already select and copy the text. Scanned or photographed PDFs store text as images, so they won't extract here and need an OCR tool instead.

Extract Text from PDF is a free tool that runs right in your browser. Add your files below and the conversion starts; nothing is uploaded to a server.

How to use

Drag your PDF onto this page, or click to browse and select the file.
The PDF is analyzed right in your browser and text is pulled out page by page, in order.
Preview the extracted text, then download it as a .txt file or copy it to your clipboard.
For long PDFs, do a quick cleanup of line breaks or repeating headers and footers after extraction.

Why Converdoc

Your file never leaves your device. All extraction happens inside your browser, so the PDF's contents stay private.
Completely free with no sign-up, no login, and no limit on how many files you convert.
It reads the PDF's real embedded text layer rather than re-recognizing an image, so text-based documents come out exactly as written, with no OCR guessing.

Good to know

Scanned or photographed PDFs store text as images, so extraction fails or returns an empty result. Use an OCR tool for those (tip: if you can drag to select the text inside the PDF, it's text-based).
Complex tables and multi-column layouts may extract the right words but in a scrambled column or cell order, so some reordering afterward is normal.
Repeating headers, footers, and page numbers get pulled in too, so delete them from the result if you don't need them.
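
If you'd rather not delete repeating lines by hand, a short script can do it. The sketch below is only an illustration in Python, not something Converdoc ships: it assumes you already have the extracted text as one string per page, and the name `strip_repeated_lines` and the 60% threshold are made up for this example.

```python
import re
from collections import Counter

# Matches lines that hold nothing but a page number, e.g. "7", "Page 7", "7 / 12".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s*/\s*\d+)?\s*$", re.IGNORECASE)

def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on most pages, plus bare page numbers.

    `pages` is a list of strings, one per page (an assumption of this sketch).
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Be aware that a body line consisting only of a number (a table cell, for instance) would be dropped as well, so check the result on documents where that matters.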

Why pull it out as plain .txt

When you need to find where a phrase appears across a stack of papers or reports, exporting each one to .txt lets you search them with a regex in Notepad or VS Code, or grep several files at once, instead of opening each PDF in a viewer. Plain text is also the cleanest way to lift just the decisions out of meeting-minute PDFs into Notion or Obsidian, or to pull an exact quote for citation.
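
If you don't have grep at hand, the same multi-file search takes a few lines of Python. This is a minimal sketch, assuming the extracted files are saved as UTF-8 .txt files in a single folder; the function name `search_texts` and the folder name in the usage line are hypothetical.

```python
import re
from pathlib import Path

def search_texts(folder, pattern):
    """Yield (file name, line number, line) for every regex match in *.txt files."""
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps the search going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path.name, number, line.strip()
```

For example, `for hit in search_texts("extracted", r"carbon\s+tax"): print(hit)` would list each file, line number, and line where the phrase shows up.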

If you want to edit a received PDF, converting it to Word to keep paragraphs and styling is the better route, and if you only need a snippet you can simply drag-select and copy it. Plain .txt is the right pick when you need the whole body as one block of character data: stripped of bold, fonts and images, it drops cleanly into AI tools like ChatGPT or NotebookLM, or into Python and Excel to count word frequencies and do other post-processing.
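
As a small illustration of the post-processing mentioned above, here is one way to count word frequencies in Python. It is a sketch under simple assumptions: words are whatever the regex `\w+` matches, so Korean words keep their attached particles, and the helper name `word_frequencies` is invented for this example.

```python
import re
from collections import Counter

def word_frequencies(text, top=10):
    """Return the `top` most common words, lowercased, as (word, count) pairs."""
    # \w+ is Unicode-aware in Python 3, so it also matches Hangul and other scripts.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top)
```

Reading the extracted file with `open("report.txt", encoding="utf-8").read()` (a placeholder file name) and passing the result to `word_frequencies` gives a quick overview of a document's vocabulary.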

This extraction is not OCR. It reads the text layer already embedded in the PDF, so there are no recognition errors from guessing at pixels. Garbled Korean or a scrambled line order, when it happens, comes from font-encoding (CID) issues or the coordinate layout of the body, which is a limit of any extractor. Everything is parsed by pdf.js inside your browser and the PDF is never uploaded, so you can pull text from internal reports or unreleased contracts without worrying about leaks.
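
To see how coordinate layout can scramble line order, picture a two-column page. The sketch below is a simplified model in Python, not the actual code of Converdoc or pdf.js: the `Fragment` tuples, their coordinates, and the `column_split` value are all invented for the illustration.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    x: float   # left edge of the text fragment
    y: float   # distance from the top of the page (an assumption of this sketch)
    text: str

def naive_order(fragments):
    """Top to bottom, then left to right: this interleaves the two columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]

def column_order(fragments, column_split):
    """Read the whole left column first, then the right one."""
    def key(f):
        # False sorts before True, so fragments left of the split come first.
        return (f.x >= column_split, f.y, f.x)
    return [f.text for f in sorted(fragments, key=key)]

page = [
    Fragment(50, 100, "Left line 1"), Fragment(320, 100, "Right line 1"),
    Fragment(50, 120, "Left line 2"), Fragment(320, 120, "Right line 2"),
]
```

Here `naive_order(page)` alternates between left and right lines, while `column_order(page, 300)` returns both left lines followed by both right lines. A plain text layer carries no column boundaries, which is why an extractor has to guess where the split is.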

FAQ

Will this extract text from a scanned PDF?

No. Scanned PDFs store text as images and have no text layer to read. This tool is for digitally created, text-based PDFs; scanned documents need an OCR tool. A quick test: if you can drag to select the text inside the PDF, extraction will work.

Is my PDF uploaded anywhere?

No. The conversion runs entirely in your browser, and your file is never sent to or stored on a server. That makes it safe for sensitive documents like contracts or internal reports.

Are formatting, tables, and fonts preserved?

The output is plain, unformatted text (.txt). Styling like bold, fonts, and images is dropped, leaving just the words. Tables come through as text but their column alignment may not be preserved.

Is there a file size or page limit?

There's no fixed limit, but since processing happens on your own device, a very large PDF of several hundred pages may take longer depending on your hardware.

The extracted Korean (or other) text comes out garbled. Why?

That PDF embeds its characters in a special way (a CID font or a subset embedding) that shows the glyph shapes but omits the mapping (ToUnicode) that says which character each glyph is. The text layer then comes out as broken codes that no extractor can recover. If copying the same text from a viewer is also garbled, the cause is the same. For these files it's more reliable to render the pages as images and re-recognize them with OCR.
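
If you process many files, a rough automatic check can flag the ones that probably need OCR. The heuristic below is a sketch in Python and an assumption on our part, not a documented rule: it counts replacement characters, private-use code points, control characters, and unassigned code points, and the name `suspicious_share` is made up. It will miss files whose broken mapping happens to produce valid but wrong letters.

```python
import unicodedata

def suspicious_share(text):
    """Share of non-whitespace characters that look like broken glyph mappings."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1 for c in chars
        # U+FFFD is the replacement character; Co = private use,
        # Cc = control, Cn = unassigned.
        if c == "\ufffd" or unicodedata.category(c) in {"Co", "Cc", "Cn"}
    )
    return bad / len(chars)
```

A value near 0 suggests a healthy text layer; a high value is a hint, not proof, that the file should go through OCR instead.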

How do I clean up the extracted text for AI or search?

In multi-column or table-heavy PDFs, text is read in screen-coordinate order, so sentences can get interleaved; after extracting, first delete the headers, footers and page numbers that repeat on every page. Then rejoin line breaks that fall mid-sentence into single paragraphs, and an AI will follow the context far better. If tables are the point of the document, the 'PDF table to Excel' tool keeps the cell structure better than plain text does.
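
The line-rejoining step can be automated with a few lines of Python. This is a minimal sketch under a simple assumption: a line that does not end in sentence punctuation continues on the next line. The function name `rejoin_lines` is hypothetical, and words hyphenated across a line break are not repaired.

```python
SENTENCE_ENDINGS = (".", "!", "?", ":", ";")

def rejoin_lines(text):
    """Merge lines that break mid-sentence; a blank line also ends a paragraph."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # Blank line: close whatever paragraph is in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        if stripped.endswith(SENTENCE_ENDINGS):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)
```

Run it after removing headers, footers, and page numbers, since a stray page number in the middle of a sentence would otherwise be merged into it.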

More conversions

Convert PNG to PDF
Convert JPG to PDF
Convert JPG to PNG
Convert PNG to JPG
Convert WebP to PNG
Convert PNG to WebP
Merge Multiple PDFs into a Single PDF
Split a PDF into Separate Pages
Convert PDF to JPG
Rotate PDF Pages
Convert Excel (XLSX) to CSV
Convert CSV to Excel (XLSX)
Convert DOCX to HTML
Extract Text from Images & Scanned PDFs (OCR)
Compress images (reduce file size)
Remove photo EXIF / metadata
Add a watermark to a PDF
Make a QR code
Convert HEIC to JPG
Remove Image Background
Passport & ID Photo Maker
Convert PDF Tables to Excel
Sign a PDF
Convert PDF to Word
Open HWP File · Export to Word