Removing entities from HTML in Cocoa
Saturday, May 07, 2005
To display accented characters and certain symbols in a HTML or XML document you need to encode them. For example, the copyright symbol © is represented in HTML as &copy;. Applications like NewsMac Pro need to be able to decode these entities and translate them to the appropriate character. Straightforward, you might think, but actually it isn't. Characters can be encoded in multiple ways: with a textual name, as above, but also with a decimal or hex value (&#169; and &#xA9; both stand for the same copyright symbol).

In NewsMac Pro I used to use NSAttributedString's initWithHTML method; however, for whatever reason this seems to lock up under Tiger, so I had to find an alternative solution. I thought I'd post the following code to help out other developers, because if you go searching on this topic you will most likely find people telling you to use the NSAttributedString method. This probably isn't the most elegant bit of code ever, but it serves its purpose:
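+ (NSString *)decodeCharacterEntitiesIn:(NSString *)source
{
    if(!source) return nil;
    else if([source rangeOfString: @"&"].location == NSNotFound) return source;
    else
    {
        NSMutableString *escaped = [NSMutableString stringWithString: source];
        // Named entities for characters 160 to 255, in code point order,
        // so the entity at index i decodes to character 160 + i
        NSArray *codes = [NSArray arrayWithObjects:
            @"&nbsp;", @"&iexcl;", @"&cent;", @"&pound;", @"&curren;", @"&yen;", @"&brvbar;",
            @"&sect;", @"&uml;", @"&copy;", @"&ordf;", @"&laquo;", @"&not;", @"&shy;", @"&reg;",
            @"&macr;", @"&deg;", @"&plusmn;", @"&sup2;", @"&sup3;", @"&acute;", @"&micro;",
            @"&para;", @"&middot;", @"&cedil;", @"&sup1;", @"&ordm;", @"&raquo;", @"&frac14;",
            @"&frac12;", @"&frac34;", @"&iquest;", @"&Agrave;", @"&Aacute;", @"&Acirc;",
            @"&Atilde;", @"&Auml;", @"&Aring;", @"&AElig;", @"&Ccedil;", @"&Egrave;",
            @"&Eacute;", @"&Ecirc;", @"&Euml;", @"&Igrave;", @"&Iacute;", @"&Icirc;", @"&Iuml;",
            @"&ETH;", @"&Ntilde;", @"&Ograve;", @"&Oacute;", @"&Ocirc;", @"&Otilde;", @"&Ouml;",
            @"&times;", @"&Oslash;", @"&Ugrave;", @"&Uacute;", @"&Ucirc;", @"&Uuml;", @"&Yacute;",
            @"&THORN;", @"&szlig;", @"&agrave;", @"&aacute;", @"&acirc;", @"&atilde;", @"&auml;",
            @"&aring;", @"&aelig;", @"&ccedil;", @"&egrave;", @"&eacute;", @"&ecirc;", @"&euml;",
            @"&igrave;", @"&iacute;", @"&icirc;", @"&iuml;", @"&eth;", @"&ntilde;", @"&ograve;",
            @"&oacute;", @"&ocirc;", @"&otilde;", @"&ouml;", @"&divide;", @"&oslash;", @"&ugrave;",
            @"&uacute;", @"&ucirc;", @"&uuml;", @"&yacute;", @"&thorn;", @"&yuml;", nil];
        NSUInteger i, count = [codes count];

        // Named entities (the search is case sensitive: &Agrave; and &agrave; differ)
        for(i = 0; i < count; i++)
        {
            NSRange range = [source rangeOfString: [codes objectAtIndex: i]];
            if(range.location != NSNotFound)
            {
                [escaped replaceOccurrencesOfString: [codes objectAtIndex: i]
                                         withString: [NSString stringWithFormat: @"%C", (unichar)(160 + i)]
                                            options: NSLiteralSearch
                                              range: NSMakeRange(0, [escaped length])];
            }
        }

        // Decimal and hex references such as &#169; and &#xA9;
        NSRange start, finish, searchRange = NSMakeRange(0, [escaped length]);
        while(searchRange.length > 0)
        {
            start = [escaped rangeOfString: @"&#"
                                   options: NSLiteralSearch
                                     range: searchRange];
            if(start.location == NSNotFound) break;

            // Only look for the closing semicolon after the "&#"
            NSUInteger valueStart = NSMaxRange(start);
            finish = [escaped rangeOfString: @";"
                                    options: NSLiteralSearch
                                      range: NSMakeRange(valueStart, [escaped length] - valueStart)];
            if(finish.location == NSNotFound) break;

            // The text between "&#" and ";", for example "169" or "xA9"
            NSString *value = [escaped substringWithRange: NSMakeRange(valueStart, finish.location - valueStart)];
            unsigned code = 0;
            if([value hasPrefix: @"x"] || [value hasPrefix: @"X"])
            {
                NSScanner *scanner = [NSScanner scannerWithString: [value substringFromIndex: 1]];
                [scanner scanHexInt: &code];
            }
            else code = (unsigned)[value intValue];

            // %C takes a single unichar, so only values up to 0xFFFF are decoded;
            // anything else (including an implausibly long value) is left as it was
            if(code > 0 && code <= 0xFFFF && [value length] <= 8)
            {
                NSRange entityRange = NSMakeRange(start.location, NSMaxRange(finish) - start.location);
                [escaped replaceCharactersInRange: entityRange
                                       withString: [NSString stringWithFormat: @"%C", (unichar)code]];
            }

            // Carry on after the "&" (or the decoded character) so nothing is scanned twice
            NSUInteger next = start.location + 1;
            searchRange = NSMakeRange(next, [escaped length] - next);
        }

        // &lt; &gt; &quot; and &amp; are not part of the 160 to 255 block, so they are
        // replaced separately. &amp; goes last so that text such as &amp;copy;
        // decodes to the literal &copy; rather than to the copyright symbol
        NSArray *xmlCodes = [NSArray arrayWithObjects: @"&lt;", @"&gt;", @"&quot;", @"&amp;", nil];
        NSArray *xmlChars = [NSArray arrayWithObjects: @"<", @">", @"\"", @"&", nil];
        for(i = 0; i < [xmlCodes count]; i++)
        {
            [escaped replaceOccurrencesOfString: [xmlCodes objectAtIndex: i]
                                     withString: [xmlChars objectAtIndex: i]
                                        options: NSLiteralSearch
                                          range: NSMakeRange(0, [escaped length])];
        }
        return escaped; // Note this is autoreleased
    }
}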
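
One limitation worth knowing about: the %C format takes a single UTF-16 unit, so decimal or hex references above 0xFFFF cannot be produced that way. If you need those, something along the following lines might work. It is only a rough sketch in plain C (which compiles unchanged inside an Objective-C file), it works on UTF-8 bytes rather than on an NSString, and it handles numeric references only, not named entities. The function names decode_numeric_entities, utf8_encode and digit_value are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Value of c as a digit in the given base (10 or 16), or -1 if it is not one. */
static int digit_value(char c, int base)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Write code point cp as UTF-8 into out. Returns the number of bytes
   written (1 to 4), or 0 if cp is zero, a surrogate, or above U+10FFFF. */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode decimal (&#169;) and hex (&#xA9;) references in a UTF-8 string.
   Returns a malloc'd string the caller must free, or NULL if malloc fails.
   Malformed or out-of-range references are copied through unchanged. */
char *decode_numeric_entities(const char *src)
{
    size_t len = strlen(src);
    /* A reference is never shorter than its UTF-8 encoding,
       so the output cannot be longer than the input. */
    char *out = malloc(len + 1);
    size_t o = 0;
    if (!out) return NULL;

    for (size_t i = 0; i < len; ) {
        if (src[i] == '&' && src[i + 1] == '#') {
            size_t j = i + 2;
            int base = 10;
            unsigned long cp = 0;
            if (src[j] == 'x' || src[j] == 'X') { base = 16; j++; }
            size_t first = j;
            for (int d; (d = digit_value(src[j], base)) >= 0; j++) {
                /* Stop accumulating once out of range, to avoid overflow. */
                if (cp <= 0x10FFFF) cp = cp * (unsigned long)base + (unsigned long)d;
            }
            if (j > first && src[j] == ';') {
                size_t n = utf8_encode(cp, out + o);
                if (n > 0) {
                    o += n;
                    i = j + 1;   /* continue after the ';' in a single pass */
                    continue;
                }
            }
        }
        out[o++] = src[i++];     /* ordinary byte, or a reference left as-is */
    }
    out[o] = '\0';
    return out;
}
```

From Cocoa you could pass in [source UTF8String], wrap the result with [NSString stringWithUTF8String:] and then free() the buffer. Because the input is walked once from left to right, a decoded "&" is never rescanned, so &#38;#169; comes out as the literal text &#169;.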








6 Comments:
"To display accented characters and certain symbols in a HTML or XML document you need to encode them."
This is not necessarily always true.
http://www.w3.org/TR/2004/WD-i18n-html-tech-char-20040509/#IDAPNGO
Shouldn't it be a matter for the document author?
Of course it's a matter for the document author, not entirely sure what your point is? This is the code that I use to decode encoded characters in the HTML snippets in RSS and Atom feeds. An awful lot of pages (and feeds for that matter) just use the default encoding or specify none at all so encoded characters tend to be used extensively.
Thanks for posting this!
I did notice there's an extra comma in your array definition that stops the code from compiling.
I'm using this in the reverse direction to encode some HTML that could contain accented characters. Thanks for saving me a lot of time!
Hmm, this doesn't seem to decode common entities like &amp;, &quot;, &gt;, &lt;. Not sure what's going on there. Am I missing something?
It's not an exhaustive list of encodings, you can easily add more. In the project this was from it was necessary to replace those entities you list separately. You can use the function CFXMLCreateStringByUnescapingEntities to do this.
The codes array does currently not map properly because it has as its first 4 entries @"&amp;", @"&lt;", @"&gt;" and @"&quot;", which are not the equivalent of 160, 161, 162, and 163. This also causes the other entries to be off by 4.