If you want users to post content without writing HTML, consider a lightweight markup language such as Markdown or BBCode instead. Your server converts that markup to HTML, so users never submit raw HTML at all.
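As a rough illustration of that idea, here is a minimal sketch of a made-up markup with only `**bold**`, `*italic*` and blank-line paragraphs. The syntax and the function name `render_light_markup` are assumptions for this example, not any real markup standard; a real site would normally use an established library. The key point is that the user's text is escaped first, and only the server adds tags afterwards.

```python
import html
import re


def render_light_markup(text: str) -> str:
    """Render a tiny, hypothetical markup: **bold**, *italic*, blank line = new paragraph."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Escape first, so any HTML the user typed becomes inert text.
        escaped = html.escape(block)
        # Only after escaping do we add tags of our own choosing.
        escaped = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", escaped)
        escaped = re.sub(r"\*(.+?)\*", r"<em>\1</em>", escaped)
        paragraphs.append(f"<p>{escaped}</p>")
    return "\n".join(paragraphs)


print(render_light_markup("Hello **world** <script>alert(1)</script>"))
# <p>Hello <strong>world</strong> &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Because the only tags in the output are ones the renderer itself wrote, there is no user-supplied HTML left to filter.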
I thought about using regex to remove all script tags and JavaScript attributes like onload and onclick.
Processing HTML with regex is unreliable at the best of times, and more so when security depends on it. An attacker can deliberately submit malformed markup that your regex fails to match but that a browser will still interpret.
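To make the bypass concrete, here is a small sketch. The pattern and the helper name `naive_strip_scripts` are assumptions standing in for the kind of filter described in the question, not anyone's actual code. The payload nests one script tag inside another, so deleting the inner tags reassembles the outer one.

```python
import re

# A typical "remove all script tags" pattern (hypothetical, for illustration).
SCRIPT_RE = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip_scripts(markup: str) -> str:
    """Delete every match of SCRIPT_RE in a single pass."""
    return SCRIPT_RE.sub("", markup)


# Malformed on purpose: an empty <script></script> pair is buried inside
# the opening and closing tags of the real payload.
payload = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"

print(naive_strip_scripts(payload))
# <script>alert(1)</script>
```

Looping until nothing changes fixes this one case, but it does nothing for event-handler attributes, unusual encodings, or markup that browsers repair differently from how the pattern reads it.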
If possible, require XHTML input, since well-formed XML is much easier to parse. Regex is still the wrong tool, but a simple XML parser lets you walk the document and check every element and attribute against a list of known-safe names, so anything potentially harmful is rejected or removed before display.
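A minimal sketch of that parser-based check, using Python's standard `xml.etree.ElementTree`. The allowed tag and attribute lists, the `http`/`https` rule for links, and the function name `is_safe_fragment` are all assumptions chosen for the example. This version rejects unsafe input outright rather than stripping it, which is the simpler option. The standard-library parser is not hardened against every XML attack, so the sketch also refuses any input containing `<!` (DOCTYPE and entity declarations, comments, CDATA); a hardened parser would be a better choice in production.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlist: tag name -> attribute names permitted on that tag.
ALLOWED = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href"},
}
SAFE_URL_PREFIXES = ("http://", "https://")


def is_safe_fragment(fragment: str) -> bool:
    """Return True only if the fragment is well-formed and uses allowed names only."""
    if "<!" in fragment:
        return False  # no DOCTYPE/ENTITY declarations, comments or CDATA
    try:
        # Wrap in a dummy root so a fragment with several top-level elements parses.
        root = ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError:
        return False  # not well-formed XML: reject instead of guessing
    for element in root.iter():
        if element is root:
            continue
        allowed_attrs = ALLOWED.get(element.tag)
        if allowed_attrs is None:
            return False  # element not on the allowlist
        for name, value in element.attrib.items():
            if name not in allowed_attrs:
                return False  # e.g. onclick, onload, style
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                return False  # blocks javascript: and other schemes
    return True


print(is_safe_fragment('<p>Hi <a href="https://example.com">link</a></p>'))  # True
print(is_safe_fragment('<p onclick="x()">Hi</p>'))                            # False
```

The important design choice is that the check names what is allowed, so an element or attribute nobody thought of is rejected by default.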
HTML Purifier modifies the HTML, and I need to preserve it in its original form.
But why is preserving the original HTML important? If it's so that users can edit their posts later, store the raw submission and purify it on output, each time it is displayed, rather than at submission time.
If you really must let users input free-form HTML, use HTML Purifier with a whitelist approach: allow only elements and attributes known to be safe and drop everything else. It is complex and has to be kept up to date, but it offers far better protection than trying to filter input with regex.
I don't want to buy a new domain just for this purpose.
You can use a subdomain instead, but make sure authentication tokens such as session cookies are not shared between subdomains, otherwise pages on the user-content subdomain could act with the main site's credentials. If you're concerned about what users can do with scripting even then, restrict what they are allowed to run, since user scripts can still be abused to host attack scripts or inject malware.
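One way to keep a session cookie from reaching sibling subdomains is to leave out the cookie's `Domain` attribute, which makes it host-only in current browsers. The sketch below builds such a header with Python's standard `http.cookies` module. It assumes the main site is served from its own hostname (say `www.example.com`) and user content from a different subdomain; the cookie name `session` and the function name `session_cookie_header` are made up for the example.

```python
from http.cookies import SimpleCookie


def session_cookie_header(token: str) -> str:
    """Build a Set-Cookie value for a host-only session cookie (hypothetical name 'session')."""
    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["path"] = "/"
    cookie["session"]["secure"] = True    # send over HTTPS only
    cookie["session"]["httponly"] = True  # hide the cookie from page scripts
    # Deliberately no cookie["session"]["domain"]: setting Domain=example.com
    # would make the browser send the cookie to every subdomain as well.
    return cookie["session"].OutputString()


print(session_cookie_header("abc123"))
```

The header should be sent only by the main site's hostname. If it were set with `Domain=example.com`, scripts served from the user-content subdomain would receive the same session cookie.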