Parsing HTML with regex

Robin S · June 17, 2020

For a brief moment I was contemplating parsing HTML with regex, but then Stack Overflow taught me this: https://stackoverflow.com/a/1732454/1036672

My sides are now hurting.

Jan Romero · June 17, 2020

That’s a classic, but I’ve been scraping some stuff lately, and I’m not ashamed to say it looks like this and works fine:

Regex playlistRegex = new Regex(@"playlist = (\[.*?\]);", RegexOptions.Singleline);
Regex titelRegex = new Regex("player-archive-date.*?>(.*?)</div>.*?<span>(.*?)</span>", RegexOptions.Singleline);
Regex mp3Regex = new Regex(@"stream_url\s*?=\s*?'(.*?\.mp3)';");
Regex datumRegex = new Regex("datum=([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])");

//…
  
var i = line.IndexOf("/player/xxx/?xxx=xxx");
if (i == -1)
    continue;

c++;

line = line.Substring(i);
line = line.Substring(0, line.IndexOf("\");'"));
line = line.Replace("&amp;", "&"); // decode HTML-escaped ampersands
var datum = datumRegex.Match(line).Groups[1].Value;

*cough*

Of course, I can make solid assumptions about my input here.
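One caveat with relying on such assumptions: in .NET, Groups[1].Value on a failed match is just an empty string, so a changed page layout would go unnoticed. A minimal sketch of making the assumption explicit follows. The helper name ExtractDatum and the sample line are made up for illustration, and the pattern is the datum pattern from above rewritten with quantifiers.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as datumRegex in the post, written with {n} quantifiers.
var datumRegex = new Regex("datum=([0-9]{4}-[0-9]{2}-[0-9]{2})");

// Hypothetical helper (not from the post): fails loudly when the input
// no longer matches the assumption instead of returning an empty string.
static string ExtractDatum(Regex regex, string line)
{
    var match = regex.Match(line);
    if (!match.Success)
        throw new FormatException("No datum=YYYY-MM-DD found in: " + line);
    return match.Groups[1].Value;
}

// Made-up input line, shaped like the placeholder URL in the post.
Console.WriteLine(ExtractDatum(datumRegex, "/player/xxx/?xxx=xxx&datum=2020-06-17"));
// prints "2020-06-17"
```

For a one-off scrape this is arguably overkill, but it turns a silent empty value into an error at the line that broke.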

The limited problem in the Stack Overflow question seems somewhat suitable for regex, too, although I’m unsure what the OP is trying to accomplish. Find all opening tags that aren’t self-closing?
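If that is indeed the goal, a rough sketch could look like the following. It assumes well-behaved markup: no ">" inside attribute values, no comments, no script or style bodies. The helper name OpeningTagNames and the sample string are made up for illustration, so treat this as a sketch under those assumptions and not as a general HTML parser.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the names of opening tags that are not
// self-closing. Breaks on '>' inside attribute values, comments, etc.
static string[] OpeningTagNames(string html)
{
    // '<' + tag name, then everything up to the next '>',
    // provided that '>' is not directly preceded by '/'.
    var openTagRegex = new Regex(@"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>");
    return openTagRegex.Matches(html)
        .Cast<Match>()
        .Select(m => m.Groups[1].Value)
        .ToArray();
}

// Made-up sample input.
var sample = "<p>text <a href=\"foo\">link</a><br /><hr class=\"foo\" /></p>";
Console.WriteLine(string.Join(", ", OpeningTagNames(sample)));
// prints "p, a"
```

The negative lookbehind (?<!/) is what rejects "<br />" and "<br/>", and closing tags never match because "/" cannot start a tag name.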
