Dealing with huge xml-like files containing illegal characters

This content originally appeared on DEV Community and was authored by t3mplar

Sometimes you have to deal with files that look a lot like XML but are not, because they contain many illegal characters. Such files are usually produced by simply concatenating strings, without validating the result against any schema. When the files are comparatively small, it's no big deal to use a regex to replace all those characters, but for really big files that is not very convenient.

The idea is to read the file element by element. If the next element is not valid XML, I strip the illegal characters from it and then use the resulting well-formed XML for my needs.

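using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public IEnumerable<XElement> GetElement(string filePath, string elementName)
{
    // CheckCharacters = false lets the reader pass illegal character references through.
    using var reader = XmlReader.Create(filePath, new XmlReaderSettings { CheckCharacters = false });
    reader.MoveToContent();

    while (!reader.EOF)
    {
        if (reader.NodeType != XmlNodeType.Element || reader.Name != elementName)
        {
            reader.Read();
            continue;
        }

        // ReadOuterXml already moves the reader past the element, so Read() is not
        // called again here; otherwise an adjacent sibling element would be skipped.
        var str = reader.ReadOuterXml();
        XElement el;

        try
        {
            el = XElement.Parse(str);
        }
        catch (XmlException)
        {
            // Remove the hexadecimal character references (&#x...;) matched by the pattern.
            var pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);";
            var regex = new Regex(pattern, RegexOptions.IgnoreCase);
            var fixedStr = regex.Replace(str, string.Empty);

            el = XElement.Parse(fixedStr);
        }

        yield return el;
    }
}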
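
The cleanup step in the catch block can also be tried on a single string, without a file. The following is a minimal sketch rather than part of the original method: the class name CharRefCleaner and the helpers Clean and ParseLenient are hypothetical names introduced here for illustration. It reuses the same pattern and builds the Regex once instead of on every failure.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not from the original post.
public static class CharRefCleaner
{
    // Same pattern as in GetElement, constructed once and reused.
    private static readonly Regex IllegalCharRef = new Regex(
        @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);",
        RegexOptions.IgnoreCase);

    // Removes every character reference matched by the pattern.
    public static string Clean(string xml) => IllegalCharRef.Replace(xml, string.Empty);

    // Parses the fragment as-is and falls back to the cleaned text on failure.
    public static XElement ParseLenient(string xml)
    {
        try
        {
            return XElement.Parse(xml);
        }
        catch (XmlException)
        {
            return XElement.Parse(Clean(xml));
        }
    }
}
```

For example, CharRefCleaner.Clean("<a>x&#x1;y&#x9;z</a>") drops the reference to character 0x1 but keeps the tab reference &#x9;, which is a legal XML character. Note that the pattern only matches hexadecimal references of the form &#x...;, so decimal references such as &#1; are left untouched.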

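Important note:
To skip the check for invalid characters, pass XmlReaderSettings { CheckCharacters = false } to XmlReader.Create. The reader then no longer validates character entity references, which gives me the chance to clean up the input string myself. The setting only relaxes the check for character references such as &#x1; and does not make the reader accept raw illegal characters in the text.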
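
To see the effect of the setting in isolation, a small sketch like the one below can help. The class CheckCharactersDemo and the method CanReadToEnd are hypothetical names used only for this illustration. The method reads a string to the end with the given CheckCharacters value and reports whether the reader threw an XmlException.

```csharp
using System.IO;
using System.Xml;

// Hypothetical demo class, not from the original post.
public static class CheckCharactersDemo
{
    // Returns true if the whole input can be read without an XmlException.
    public static bool CanReadToEnd(string xml, bool checkCharacters)
    {
        var settings = new XmlReaderSettings { CheckCharacters = checkCharacters };
        using var reader = XmlReader.Create(new StringReader(xml), settings);
        try
        {
            while (reader.Read())
            {
                // Nothing to do: we only care whether reading succeeds.
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With an input such as "<a>x&#x1;y</a>", the call with checkCharacters set to true should fail on the illegal reference, while the call with false should read through it, which is what allows GetElement to get hold of the raw element text before cleaning it.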


t3mplar | Sciencx (2021-06-26T18:09:37+00:00) Dealing with huge xml-like files containing illegal characters. Retrieved from https://www.scien.cx/2021/06/26/dealing-with-huge-xml-like-files-containing-illegal-characters/
