Structured Data Extraction Projects

Data Analysis & Investment Research Public

PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

browser-automation Public

Crawlee: a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Supports both headful and headless modes, with proxy rotation.

Customer Communication & Team Operations Public

unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Vector Retrieval and RAG Public

dsRAG

Vector retrieval project for checking embedding storage, query semantics, RAG integration, data boundaries, and rollback.

Vector database, RAG, Embeddings

Structured Data Extraction

PaddleOCR

crawlee

unstructured

presidio

seekdb

webclaw

dsRAG