Dudjob.com was developed to serve as a comprehensive database for OnlyFans content creators, allowing users to browse, search, and explore creators based on multiple parameters. Building Dudjob required an in-depth understanding of database design, web scraping, search indexing, and frontend/backend integration.
As the lead developer, I want to share an extensive technical breakdown of how we brought Dudjob to life, focusing on key areas such as database design, web scraping, and search indexing.
Step 1: Defining the Tech Stack
Before diving into the architecture, it was crucial to decide on the tech stack that would serve as the foundation of Dudjob.com. We chose the following technologies:
Backend Framework: Django, a robust Python framework known for its scalability and security, allowed us to develop rapidly without sacrificing performance.
Database: PostgreSQL, due to its powerful support for complex queries, full-text search capabilities, and ability to handle large datasets.
Web Scraping: We used Scrapy, a powerful scraping and web crawling framework, to collect and parse OnlyFans data efficiently.
Search Indexing: Elasticsearch for real-time search capabilities, providing fast lookups and advanced search features.
Frontend: React for a responsive and dynamic user interface.
Cloud Hosting: AWS for scalable infrastructure.
Step 2: Database Design
The database design forms the core of any project that involves handling vast amounts of structured and unstructured data. For Dudjob, we had to design a highly scalable and normalized database schema that could handle millions of user profiles, subscription details, pricing information, and more.
Schema Design
We designed the schema around the following key entities (a Django-models sketch follows the list):
Creators Table: This table stores detailed information about the content creators on OnlyFans, such as:
id (Primary Key)
name
username (unique for each creator)
bio
profile_url
subscription_price
follower_count
created_at, updated_at timestamps
Posts Table: Each creator can have multiple posts, so we created a one-to-many relationship between creators and their posts.
id (Primary Key)
creator_id (Foreign Key referencing Creators)
media_url (images/videos)
post_description
post_date
Tags Table: To enable efficient filtering based on keywords, we incorporated a tags system.
id (Primary Key)
tag_name
A many-to-many relationship exists between posts and tags.
Metrics Table: This table tracks daily-updated metrics for each creator:
creator_id (Foreign Key referencing Creators)
follower_count
post_count
interaction_count
date
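To make this concrete, here is a minimal sketch of how these entities could be expressed as Django models. The class and field names follow the tables above; everything else (field sizes, related names) is illustrative rather than our production code.

```python
from django.db import models


class Creator(models.Model):
    name = models.CharField(max_length=255)
    username = models.CharField(max_length=150, unique=True, db_index=True)
    bio = models.TextField(blank=True)
    profile_url = models.URLField()
    subscription_price = models.DecimalField(max_digits=8, decimal_places=2, db_index=True)
    follower_count = models.PositiveIntegerField(default=0)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)


class Tag(models.Model):
    tag_name = models.CharField(max_length=100, unique=True)


class Post(models.Model):
    # One-to-many: each creator has many posts
    creator = models.ForeignKey(Creator, on_delete=models.CASCADE, related_name="posts")
    media_url = models.URLField()
    post_description = models.TextField(blank=True)
    post_date = models.DateTimeField()
    # Many-to-many between posts and tags
    tags = models.ManyToManyField(Tag, related_name="posts")


class Metric(models.Model):
    creator = models.ForeignKey(Creator, on_delete=models.CASCADE, related_name="metrics")
    follower_count = models.PositiveIntegerField()
    post_count = models.PositiveIntegerField()
    interaction_count = models.PositiveIntegerField()
    date = models.DateField()

    class Meta:
        # One metrics row per creator per day
        unique_together = ("creator", "date")
```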
Database Optimization
To ensure the database could handle large-scale queries, we made use of several optimization techniques:
Indexing: Primary and foreign keys were indexed, along with frequently queried fields like username and subscription_price to speed up searches.
Partitioning: We partitioned the Metrics table by date to improve read and write performance for time-based queries.
Caching: We implemented caching for frequently accessed data such as popular creators and top searches, using Redis to reduce database load.
Full-Text Search: We utilized PostgreSQL's built-in full-text search to let users run keyword searches on bios, tags, and post descriptions; a sketch of such a query follows.
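As a rough illustration, a ranked keyword search over creator names and bios with django.contrib.postgres might look like this. The function name and ranking threshold are ours, not taken from our production code:

```python
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector


def search_creators(keywords: str):
    # Combine multiple fields into one search vector
    vector = SearchVector("name") + SearchVector("bio")
    query = SearchQuery(keywords)
    return (
        Creator.objects.annotate(rank=SearchRank(vector, query))
        .filter(rank__gte=0.1)   # arbitrary relevance cutoff
        .order_by("-rank")
    )
```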
Step 3: Web Scraping with Scrapy
One of the core components of Dudjob is the ability to aggregate information about OnlyFans creators. To achieve this, we used Scrapy for robust web scraping. Our goal was to automate the collection of public data, while staying within ethical and legal boundaries.
Scrapy Spiders
We developed multiple Scrapy spiders to gather the necessary data:
Creator Spider: This spider was responsible for crawling OnlyFans and related platforms to extract creator profiles, subscription prices, follower counts, etc. (a stripped-down sketch follows below).
Key Fields Scraped:
Username
Profile URL
Bio
Subscription tiers
Post counts and media types
Media Spider: Once the creator profiles were scraped, the media spider fetched additional content, such as thumbnails and teaser videos.
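For a sense of what the Creator Spider looks like, here is a stripped-down sketch. The start URL and CSS selectors are placeholders, since the real ones depend on the structure of the pages being crawled:

```python
import scrapy


class CreatorSpider(scrapy.Spider):
    name = "creators"
    start_urls = ["https://example.com/creators"]  # placeholder

    def parse(self, response):
        # Hypothetical selectors for a listing page of creator cards
        for card in response.css("div.creator-card"):
            yield {
                "username": card.css("a.username::text").get(),
                "profile_url": response.urljoin(card.css("a.username::attr(href)").get()),
                "bio": card.css("p.bio::text").get(default="").strip(),
                "subscription_price": card.css("span.price::text").get(),
            }
        # Follow pagination if present
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```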
Throttling & Error Handling
To avoid getting blocked and to maintain efficiency, we employed several strategies:
Rate Limiting: We set a maximum request rate and employed exponential backoff to avoid hitting the servers too frequently.
Proxy Rotation: Scrapy's middleware allowed us to rotate IP addresses and use proxies to distribute our requests.
Error Handling: We used Scrapy's built-in error handling to catch and log HTTP errors and CAPTCHA responses, and implemented a custom retry middleware for failed requests; the settings sketch below shows how these pieces fit together.
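Roughly, the relevant knobs live in Scrapy's settings.py. The values below are illustrative, and the custom middleware paths are hypothetical:

```python
# settings.py — throttling and retry configuration (illustrative values)
DOWNLOAD_DELAY = 1.0                # minimum delay between requests
AUTOTHROTTLE_ENABLED = True         # back off automatically under server load
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

RETRY_ENABLED = True
RETRY_TIMES = 3                     # retry failed requests a few times
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical custom middlewares for proxy rotation and retries
    "dudjob.middlewares.ProxyRotationMiddleware": 350,
    "dudjob.middlewares.CustomRetryMiddleware": 550,
    # Disable the stock retry middleware our custom one replaces
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}
```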
Data Parsing
Data parsing was crucial since the structure of profiles and posts could vary significantly. For example, subscription prices were presented differently for discounted and non-discounted creators, and post descriptions could contain media, emojis, and hashtags.
We used custom parsers and regular expressions to clean and normalize the data before storing it in the database.
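As an example of the kind of normalization involved, a small parser can pull a numeric price out of the varied strings found on profile pages. The input formats shown in the docstring are illustrative:

```python
import re
from decimal import Decimal

# Matches an optional dollar sign followed by a price like "4.99"
PRICE_RE = re.compile(r"\$?\s*(\d+(?:\.\d{1,2})?)")


def normalize_price(raw: str) -> Decimal | None:
    """Return the first price found, e.g. '$4.99/month (50% off)' -> 4.99."""
    match = PRICE_RE.search(raw or "")
    return Decimal(match.group(1)) if match else None
```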
Step 4: Search and Indexing with Elasticsearch
Given the scale of data, efficient search functionality was paramount. We integrated Elasticsearch, which allowed for real-time search capabilities across millions of creator profiles.
Index Structure
We created the following indices in Elasticsearch:
Creator Index: Stores information about each creator, such as name, bio, tags, and pricing.
Post Index: Contains post-specific details, allowing users to search for specific content types.
Tag Index: Tags were indexed to allow filtering by content type, genre, and other categories.
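A simplified version of the creator index mapping, using 8.x-style keyword arguments from the elasticsearch-py client; the endpoint and field choices are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.indices.create(
    index="creators",
    mappings={
        "properties": {
            "name": {"type": "text"},
            "bio": {"type": "text"},
            "tags": {"type": "keyword"},
            "subscription_price": {"type": "float"},
            "follower_count": {"type": "integer"},
            "suggest": {"type": "completion"},  # powers auto-completion
        }
    },
)
```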
Search Features
Full-Text Search: Elasticsearch provided full-text search across multiple fields, such as bios and post descriptions.
Filters: Users could filter search results based on subscription price, follower count, post count, and media types (images, videos).
Auto-Completion: Elasticsearch's completion suggester was used to provide real-time suggestions as users typed.
Faceted Search: We added faceted search capabilities, allowing users to refine results based on subscription price ranges, tags, and categories; the sketch below shows a combined query.
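Putting several of these together, a filtered, faceted query against the creator index might look like the following sketch (field names assume the mapping above; the search term and thresholds are examples):

```python
resp = es.search(
    index="creators",
    query={
        "bool": {
            "must": [{"match": {"bio": "fitness"}}],          # full-text match
            "filter": [
                {"range": {"subscription_price": {"lte": 10}}},
                {"range": {"follower_count": {"gte": 1000}}},
            ],
        }
    },
    aggs={  # facets for refining results
        "price_ranges": {
            "range": {
                "field": "subscription_price",
                "ranges": [{"to": 5}, {"from": 5, "to": 15}, {"from": 15}],
            }
        },
        "top_tags": {"terms": {"field": "tags"}},
    },
    size=20,
)
```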
Step 5: Frontend Integration
For the frontend, we chose React to build an intuitive and responsive user interface. The frontend communicated with our Django backend via a REST API, enabling real-time search and display of creator profiles.
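As a sketch of the contract between the two, a minimal Django view backing the search endpoint could look like this; the URL parameter, client setup, and result shape are assumptions, not our exact API:

```python
from django.http import JsonResponse
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint


def search_view(request):
    # e.g. GET /api/search?q=fitness (hypothetical route)
    q = request.GET.get("q", "")
    resp = es.search(
        index="creators",
        query={"multi_match": {"query": q, "fields": ["name", "bio", "tags"]}},
        size=20,
    )
    results = [hit["_source"] for hit in resp["hits"]["hits"]]
    return JsonResponse({"results": results})
```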
Key Features:
Search Bar: Integrated with Elasticsearch, providing users with instant search results and filters.
Creator Profile Pages: Detailed pages displaying all available information about creators, including media previews, subscription tiers, and follower stats.
Infinite Scroll: To improve user experience, we implemented infinite scrolling for search results, allowing seamless browsing of creators.
Step 6: Deployment and Scaling
We deployed Dudjob.com using AWS, with a focus on scalability and high availability.
Infrastructure
EC2 Instances: Our Django backend and React frontend were hosted on EC2 instances, allowing for scalable compute power.
RDS (Relational Database Service): We used AWS RDS for hosting our PostgreSQL database, with automated backups and scaling options.
Elasticsearch on AWS: We used AWS Elasticsearch Service to handle our search infrastructure.
S3 for Media Storage: All media (e.g., creator images, post media) was stored on S3 to offload storage and ensure high availability (a boto3 sketch follows this list).
CloudFront CDN: We used CloudFront for content delivery, improving load times for users across the globe.
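For illustration, pushing scraped media to S3 with boto3 is as simple as the sketch below; the bucket name and key layout are placeholders:

```python
import boto3

s3 = boto3.client("s3")


def store_media(local_path: str, creator_username: str, filename: str) -> str:
    # Hypothetical bucket and key layout
    key = f"media/{creator_username}/{filename}"
    s3.upload_file(local_path, "dudjob-media", key)
    return f"https://dudjob-media.s3.amazonaws.com/{key}"
```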
We continue to optimize and expand Dudjob. It's a titanic task with many moving parts, but we think it's working as intended (most of the time...), and we're really excited to be building the best OnlyFans search engine out there!