I need a regex to get stuff between <li> and </li>

  • Thread starter SlurrerOfSpeech

Discussion Overview

The discussion revolves around extracting the text between <li> and </li> tags from a larger HTML-like string using regular expressions. Participants explore various regex patterns and methods to achieve this extraction, focusing on the context of programming languages and regex engines.

    Discussion Character

    • Technical explanation
    • Debate/contested
    • Mathematical reasoning

    Main Points Raised

    • One participant shares a regex pattern to match the entire structure containing the <h2>Friends</h2> and <ul> tags, but seeks to extract the content specifically between the <li> tags.
    • Another participant notes that using $1 or \1 will return only the content of the first capturing group, suggesting that lookahead and lookbehind might be alternatives, though more complex.
    • A participant inquires about the programming language and regex engine being used, which is confirmed to be C#.NET with System.Text.RegularExpressions.
    • There is a question about whether a loop can be used to extract groups or if the solution should rely solely on regex.
    • A suggestion is made to utilize a specific method from the .NET documentation for regex matches.
    • Another participant proposes a regex pattern using lookbehind to directly capture the content between the <li> tags.

    Areas of Agreement / Disagreement

    Participants express varying opinions on the best approach to extract the desired content, with no consensus reached on a single method or solution.

    Contextual Notes

    Participants discuss the limitations of regex in capturing subpatterns and the complexity introduced by lookahead and lookbehind assertions. The discussion does not resolve the effectiveness of the proposed regex patterns.

    SlurrerOfSpeech
    What I'm ultimately trying to do is get
    Code:
    Some Guy, Some Other Guy, Some Guy 2, Some W. Bush
    from an expression like

    Code:
    <h2>Friends</h2><ul><li>Some Guy</li><li>Some Other Guy</li><li>Some Guy 2</li><li>Some W. Bush</li></ul>

    This expression is in a much larger piece of text but is the only time an expression of this exact form is in it. I'm using

    Code:
    (?s)<h2>Friends</h2>.*?<ul>.*?</ul>

    to get the expression and

    Code:
    <li>([a-zA-Z0-9. ]+)</li>

    to get

    Code:
    <li>Some Guy</li>, <li>Some Other Guy</li>, <li>Some Guy 2</li> and <li>Some W. Bush</li>
    , but I actually want what's BETWEEN the tags.
     
    $1 or \1 (depending on the interpreter) gives the content of the first capturing group (the first pair of parentheses) instead of the full match; here, that is the text inside the tags.
    Alternatively, lookahead and lookbehind are an option, but they are more complicated and not necessary here.
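To make the two options above concrete for .NET, here is a minimal sketch, assuming the System.Text.RegularExpressions engine that the poster confirms later in the thread. In that API the first capturing group is read as `Groups[1]` rather than `$1` or `\1`. The class and method names are hypothetical, and the lookaround pattern is only an illustration of the idea, not a pattern quoted from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and do not come from the thread.
public static class ListItemExtractor
{
    // Capturing-group approach: Groups[0] is the whole match (<li>...</li>),
    // Groups[1] is whatever the first pair of parentheses matched.
    public static List<string> WithCaptureGroup(string html)
    {
        var items = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<li>([a-zA-Z0-9. ]+)</li>"))
        {
            items.Add(m.Groups[1].Value);
        }
        return items;
    }

    // Lookaround approach: the tags are asserted but not consumed,
    // so the whole match (m.Value) is already the inner text.
    public static List<string> WithLookaround(string html)
    {
        var items = new List<string>();
        foreach (Match m in Regex.Matches(html, @"(?<=<li>)[a-zA-Z0-9. ]+(?=</li>)"))
        {
            items.Add(m.Value);
        }
        return items;
    }
}
```

Called on the `<h2>Friends</h2><ul>...</ul>` fragment from the first post, both methods should return the four names without the surrounding tags. The capture-group version keeps the original pattern unchanged, which is probably why it was described as the simpler choice.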
     
    What language and/or regular expression engine are you using?
     
    FactChecker said:
    What language and/or regular expression engine are you using?

    C#.NET, System.Text.RegularExpressions
     
    Are you OK with using a loop to extract the groups, or should it be regex only?
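On the loop question: in .NET a repeated capturing group keeps every value it matched in `Group.Captures`, so one pattern can both locate the Friends list and collect its items, with a loop only over the captures. The sketch below is one possible way to do that, not a solution given in the thread. The pattern is a stricter variant of the poster's (it assumes only whitespace between the heading, the `<ul>` and the `<li>` items), and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and do not come from the thread.
public static class FriendsList
{
    public static List<string> Extract(string page)
    {
        var names = new List<string>();

        // Assumption: only whitespace separates the heading, <ul> and <li> items.
        // The group ([^<]+) repeats once per <li>, and .NET records each repetition.
        Match m = Regex.Match(
            page,
            @"<h2>Friends</h2>\s*<ul>(?:\s*<li>([^<]+)</li>)+\s*</ul>");

        if (!m.Success)
        {
            return names; // no Friends list found: return an empty list
        }

        // Groups[1].Value would hold only the last repetition;
        // Captures holds all of them, in document order.
        foreach (Capture c in m.Groups[1].Captures)
        {
            names.Add(c.Value);
        }
        return names;
    }
}
```

Because the pattern is anchored on the `<h2>Friends</h2>` heading, `<li>` items elsewhere in the larger text are not picked up, which replaces the separate first step of isolating the list.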
     
